Hi All,
I think I have a possible explanation for this problem. Previously
orterun was jumping to 0x00000000:
[Rotarran-X-5:04475] Failing at address: 0x0
[ 1] [0xbffff828, 0x00000000] (-P-)
On a hunch I tried changing the number of bools in the
orte_pls_rsh_component_t data structure in pls_rsh.h. Another bus
error occurred, this time with orterun jumping to 0x80000000. So I
went further and changed the layout of the orte_pls_rsh_component_t
struct from something like this:
bool reap;
bool assume_same_shell;
bool force_rsh;
char** agent_argv;
int agent_argc;
char* agent_path;
to this:
char** agent_argv;
char* agent_path;
int agent_argc;
int unusedInt;
bool reap;
bool assume_same_shell;
bool force_rsh;
bool unusedB;
I then recompiled, dropped in the new .la and .so pieces, and it all
worked.
My hunch is that I'm having a data alignment problem. Perhaps the
pointer to _launch of the pls module is stored after the
orte_pls_rsh_component_t struct, but the alignment that the existing
build assumes is different from that of my newly compiled pls module.
Apple usually compiles with every type on its "natural" alignment in
memory (PowerPC always liked it that way and the habit has stuck) and
looking at 3 bools followed by a char** tells me there could be padding.
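
To make the padding question concrete, here is a minimal sketch in C. It is
not the actual Open MPI code: the struct names are hypothetical, only the
members quoted above are included, and `#pragma pack` is just one assumed
way two builds could end up disagreeing (I don't know which flags were
really used). It reports where agent_argv and force_rsh land in each
variant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Original member order, default ("natural") alignment. */
struct original_natural {
    bool reap;
    bool assume_same_shell;
    bool force_rsh;
    char **agent_argv;
    int agent_argc;
    char *agent_path;
};

/* Same members as a hypothetical second build would see them if it
 * were compiled with 1-byte packing. */
#pragma pack(push, 1)
struct original_packed {
    bool reap;
    bool assume_same_shell;
    bool force_rsh;
    char **agent_argv;
    int agent_argc;
    char *agent_path;
};
#pragma pack(pop)

/* Reordered members, default alignment. */
struct reordered_natural {
    char **agent_argv;
    char *agent_path;
    int agent_argc;
    int unusedInt;
    bool reap;
    bool assume_same_shell;
    bool force_rsh;
    bool unusedB;
};

/* Reordered members, 1-byte packing. */
#pragma pack(push, 1)
struct reordered_packed {
    char **agent_argv;
    char *agent_path;
    int agent_argc;
    int unusedInt;
    bool reap;
    bool assume_same_shell;
    bool force_rsh;
    bool unusedB;
};
#pragma pack(pop)

/* Print where agent_argv and force_rsh land in each variant. */
void print_layout_report(void)
{
    /* With packing there is never a gap after the three bools. */
    assert(offsetof(struct original_packed, agent_argv) == 3 * sizeof(bool));

    printf("original:  agent_argv at %zu (natural) vs %zu (packed)\n",
           offsetof(struct original_natural, agent_argv),
           offsetof(struct original_packed, agent_argv));
    printf("reordered: agent_argv at %zu (natural) vs %zu (packed)\n",
           offsetof(struct reordered_natural, agent_argv),
           offsetof(struct reordered_packed, agent_argv));
    printf("reordered: force_rsh  at %zu (natural) vs %zu (packed)\n",
           offsetof(struct reordered_natural, force_rsh),
           offsetof(struct reordered_packed, force_rsh));
}
```

Calling print_layout_report() from a small main() should show the two
"original" variants disagreeing about agent_argv on any ABI where pointers
need more than 1-byte alignment, while the "reordered" variants agree on
every member offset.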
The problem is not whether to have padding, but which alignment we
all agree on. I don't know who put which memory alignment compiler
flag in which makefile or ./configure line, but if I rearrange the
struct into the latter example above then there is no ambiguity, so
orterun calls _launch just fine in the rsh module and in my own.
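
One way to keep such a layout from silently drifting is a compile-time
check. The sketch below is only an illustration: the struct name, the
FOLLOWS macro and trailing_padding() are made up for this example and are
not part of Open MPI. It asserts that each member of the reordered layout
starts exactly where the previous one ends, so no alignment setting has
room to insert padding between members. It deliberately says nothing about
padding at the end of the struct, which can still vary.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the reordered members quoted above. */
struct rsh_fields_reordered {
    char **agent_argv;
    char *agent_path;
    int agent_argc;
    int unusedInt;
    bool reap;
    bool assume_same_shell;
    bool force_rsh;
    bool unusedB;
};

/* True when member `next` begins exactly where member `prev` ends. */
#define FOLLOWS(type, prev, next) \
    (offsetof(type, next) == \
     offsetof(type, prev) + sizeof(((type *)0)->prev))

static_assert(FOLLOWS(struct rsh_fields_reordered, agent_argv, agent_path),
              "padding before agent_path");
static_assert(FOLLOWS(struct rsh_fields_reordered, agent_path, agent_argc),
              "padding before agent_argc");
static_assert(FOLLOWS(struct rsh_fields_reordered, agent_argc, unusedInt),
              "padding before unusedInt");
static_assert(FOLLOWS(struct rsh_fields_reordered, unusedInt, reap),
              "padding before reap");
static_assert(FOLLOWS(struct rsh_fields_reordered, reap, assume_same_shell),
              "padding before assume_same_shell");
static_assert(FOLLOWS(struct rsh_fields_reordered, assume_same_shell, force_rsh),
              "padding before force_rsh");
static_assert(FOLLOWS(struct rsh_fields_reordered, force_rsh, unusedB),
              "padding before unusedB");

/* Bytes the compiler added after the last member; may be nonzero. */
size_t trailing_padding(void)
{
    return sizeof(struct rsh_fields_reordered)
         - (offsetof(struct rsh_fields_reordered, unusedB) + sizeof(bool));
}
```

If a later change to the struct lets the compiler insert padding between
members, the build fails at that point instead of orterun jumping to a bad
address at run time.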
Thanks for your help,
Dean