I'm sorry, but now I am totally confused. Are you saying that you are having problems with the default rsh component in the distributed 1.2.3 code??

Yes ...

Or are you having a problem with your customized version?

And yes. Each exhibited the same problem: a bus error.

What compiler are you using? If it's your customized version, did you make sure to change the names of the data structures and modules as I pointed out?

gcc 4.0.1, the default of Leopard. Yes, in the customized version, I did change the names of the data structures, subroutines, support file names, and where it says "rsh" just like you said.

We regularly work on Macs, both PPC and Intel based (I develop and test on both every day), and I have -never- seen this problem in our code base.
Hence my confusion.

I'm sorry for the confusion. I'm starting with the shipping Mac OS X 10.5.1 "Leopard", which contains its own build of Open MPI (v1.2.3 according to "orterun -version"). So I assumed that the v1.2.3 branch from svn.open-mpi.org was the same code Apple used to build the Open MPI that ships in Leopard.

My motivation was to build a new pls module based on pls_rsh module's source code, substituting the rsh with my own name like you said, but I encountered a bus error. So to be sure I didn't screw up somewhere in my custom module I rebuilt the unmodified pls_rsh module and discovered the same problem.

Then, suspecting that Apple's source was different, I downloaded Open MPI from opensource.apple.com and recompiled the pls_rsh module from that source code. I dropped just the resulting mca_pls_rsh.la and mca_pls_rsh.so into Leopard's existing /usr/lib/openmpi, overwriting Leopard's versions, and the bus error happened the same as before.

That's where I was with my first post to this list.

My last post concerned the discovery that rearranging the elements of orte_pls_rsh_component_t, without changing anything else about the pls_rsh code, affects the bus error outcome. I then padded out orte_pls_rsh_component_t and my "orte_pls_dean_component_t" by hand so that they would be "data alignment agnostic", if you will. As a result, the bus error no longer occurs and both pls modules now run as they should.
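To show what I mean by padding by hand, here is a minimal sketch. The field names are made up for illustration and are not the real members of orte_pls_rsh_component_t, and the sizes assume an 8-byte double and a 4-byte int. The first struct leaves gaps for the compiler to fill, so its offsets depend on the alignment rules in effect; the second orders members widest first and fills the tail explicitly.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical fields for illustration only; these are NOT the real
 * members of orte_pls_rsh_component_t. */

/* Declaration order leaves room for compiler-inserted padding, so the
 * member offsets depend on the alignment rules used at build time. */
typedef struct {
    char   debug;
    double timeout;
    char   reap;
    int    delay;
} demo_natural_t;

/* Widest members first, with the tail padded by hand
 * (assumes sizeof(double) == 8 and sizeof(int) == 4). */
typedef struct {
    double timeout;
    int    delay;
    char   debug;
    char   reap;
    char   pad_[2];
} demo_padded_t;

/* Bytes in demo_natural_t that belong to no declared member. */
size_t demo_natural_hidden_padding(void)
{
    return sizeof(demo_natural_t)
         - (sizeof(char) + sizeof(double) + sizeof(char) + sizeof(int));
}

/* Bytes in demo_padded_t that belong to no declared member. */
size_t demo_padded_hidden_padding(void)
{
    return sizeof(demo_padded_t)
         - (sizeof(double) + sizeof(int) + 4 * sizeof(char));
}

/* Print sizes and offsets so two builds can be compared by eye. */
void demo_print_layouts(void)
{
    printf("natural: size=%zu timeout@%zu delay@%zu hidden=%zu\n",
           sizeof(demo_natural_t),
           offsetof(demo_natural_t, timeout),
           offsetof(demo_natural_t, delay),
           demo_natural_hidden_padding());
    printf("padded:  size=%zu timeout@%zu delay@%zu hidden=%zu\n",
           sizeof(demo_padded_t),
           offsetof(demo_padded_t, timeout),
           offsetof(demo_padded_t, delay),
           demo_padded_hidden_padding());
}
```

Calling demo_print_layouts() from a small main() under each set of compiler flags shows whether the two builds agree. With no hidden padding, there is nothing left for the alignment rules to move around, which is the sense in which I call the layout "agnostic".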

My hypothesis: Apple's procedure for building Open MPI into Leopard had a side effect, so that structures shared across the shared-object boundary must follow a different data alignment than the one I get when I simply recompile Open MPI straight from its source.
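As a sketch of how that kind of mismatch could arise (I have not confirmed that this is what Apple's build actually does), the same declaration compiled under two alignment settings places its members at different offsets. Here the second setting is simulated with "#pragma pack", which gcc accepts; the struct and function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical fields; the same declaration as seen by two builds. */

/* Layout under this compiler's default alignment rules. */
typedef struct {
    char   flag;
    double value;
} host_view_t;

/* Layout when a build uses byte packing instead (simulated here). */
#pragma pack(push, 1)
typedef struct {
    char   flag;
    double value;
} module_view_t;
#pragma pack(pop)

/* Returns nonzero when the two builds disagree about where 'value'
 * lives, which is the situation that could lead to a misaligned or
 * wrong-address access when one side reads what the other wrote. */
int views_disagree(void)
{
    return offsetof(host_view_t, value) != offsetof(module_view_t, value);
}
```

If the library and a separately compiled module disagreed like this, each would read the shared structure at different offsets, which would be consistent with the bus error I saw.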

I'm not saying anyone is to blame, but I'm recognizing that those builds have different timelines. I predict that if I overwrite all of Leopard's Open MPI object code, then it would all run too.

For my needs, I have a sufficient workaround: realign my data structures to be "agnostic". I'm sharing this little discovery just in case it might help somebody else out there; for all I know it could happen on non-Macs too.

Thanks,
  Dean
