I'm sorry, but now I am totally confused. Are you saying that you are
having problems with the default rsh component in the distributed
1.2.3 code??
Yes ...
Or are you having a problem with your customized version?
and yes. Each exhibited the same problem, a bus error.
What compiler are you using? If it's your customized version, did you
make sure to change the names of the data structures and modules as I
pointed out?
gcc 4.0.1, the default of Leopard. Yes, in the customized version, I
did change the names of the data structures, subroutines, support
file names, and where it says "rsh" just like you said.
We regularly work on Macs, both PPC and Intel based (I develop and
test on both every day), and I have -never- seen this problem in our
code base. Hence my confusion.
I'm sorry for the confusion. I'm starting with the shipping Mac OS X 10.5.1
"Leopard", which contains its own build of Open MPI (v1.2.3 according
to "orterun -version"). So I assumed that the v1.2.3 branch from
svn.open-mpi.org was the same code Apple used to build the Open MPI
that ships in Leopard.
My goal was to build a new pls module based on the pls_rsh module's
source code, replacing "rsh" with my own name as you said, but I
encountered a bus error. So, to be sure I hadn't made a mistake
somewhere in my custom module, I rebuilt the unmodified pls_rsh module
and discovered the same problem.
Then, suspecting that Apple's code was different, I downloaded the
Open MPI source from opensource.apple.com, recompiled the pls_rsh
module from that source, and dropped just the resulting
mca_pls_rsh.la and mca_pls_rsh.so into Leopard's existing
/usr/lib/openmpi, overwriting Leopard's versions. The bus error
happened the same as before.
That's where I was with my first post to this list.
My last post reported the discovery that rearranging the elements of
orte_pls_rsh_component_t, without changing anything else about the
pls_rsh code, changes whether the bus error occurs. I then padded out
orte_pls_rsh_component_t and my "orte_pls_dean_component_t" by hand
so that they would be "data alignment agnostic", if you will.
As a result, the bus error no longer occurs and both pls modules now
run as they should.
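
To illustrate the kind of hand padding described above, here is a
minimal C sketch. The struct names, field names, and helper functions
are hypothetical and are not the real members of
orte_pls_rsh_component_t; the sketch also assumes an 8-byte double. It
only shows how member order and explicit padding can make a layout
independent of whether the compiler aligns double to 4 or to 8 bytes.

```c
#include <stddef.h>   /* offsetof */
#include <stdint.h>   /* uint8_t */
#include <stdio.h>    /* printf */

/* Hypothetical fields for illustration only; these are NOT the real
 * members of orte_pls_rsh_component_t. Assumes sizeof(double) == 8. */

/* Layout depends on the ABI: the compiler inserts hidden padding before
 * 'timeout' and after 'reap', and the amount depends on the alignment
 * it requires for double. */
struct natural_order {
    uint8_t debug;
    double  timeout;
    uint8_t reap;
};

/* Widest member first, then explicit padding bytes, so no hidden
 * padding is needed whether double is aligned to 4 or to 8 bytes. */
struct hand_padded {
    double  timeout;
    uint8_t debug;
    uint8_t reap;
    uint8_t pad[6];   /* fills the struct out to 16 bytes */
};

/* Returns 1 if hand_padded has exactly the layout spelled out above. */
int hand_padded_layout_ok(void)
{
    return offsetof(struct hand_padded, timeout) == 0
        && offsetof(struct hand_padded, debug)   == 8
        && offsetof(struct hand_padded, reap)    == 9
        && offsetof(struct hand_padded, pad)     == 10
        && sizeof(struct hand_padded)            == 16;
}

/* Prints both layouts so the output of two builds can be compared. */
void print_layouts(void)
{
    printf("natural_order: timeout@%zu reap@%zu size=%zu\n",
           offsetof(struct natural_order, timeout),
           offsetof(struct natural_order, reap),
           sizeof(struct natural_order));
    printf("hand_padded:   debug@%zu reap@%zu size=%zu\n",
           offsetof(struct hand_padded, debug),
           offsetof(struct hand_padded, reap),
           sizeof(struct hand_padded));
}
```

Calling print_layouts() from a small main() in each build environment
and comparing the output is one way to check whether two builds agree
on a structure's layout; the natural_order numbers may differ between
ABIs, while the hand_padded numbers should not.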
My hypothesis: Apple's procedure for building Open MPI into Leopard had
a side effect that requires structures in the shared object code to
follow a different data alignment than the one I get when I simply
recompile Open MPI straight from its source.
I'm not saying anyone is to blame; I'm just recognizing that those
builds have different timelines. I predict that if I overwrote all
of Leopard's Open MPI object code, it would all run too.
For my needs, I have a sufficient workaround: realign my data
structures to be "agnostic". I'm sharing this little discovery just
in case it might help somebody else out there; for all I know it
could happen on non-Macs too.
Thanks,
Dean