Hi All,

I think I have a possible explanation for this problem. Previously orterun was jumping to 0x00000000:

[Rotarran-X-5:04475] Failing at address: 0x0
[ 1] [0xbffff828, 0x00000000] (-P-)

On a hunch I tried changing the number of bools in the orte_pls_rsh_component_t data structure in pls_rsh.h. Another bus error occurred, this time with orterun jumping to 0x80000000 instead. So I went further and changed the layout of the orte_pls_rsh_component_t struct from something like this:

    bool reap;
    bool assume_same_shell;
    bool force_rsh;
    char** agent_argv;
    int agent_argc;
    char* agent_path;

to this:

    char** agent_argv;
    char* agent_path;
    int agent_argc;
    int unusedInt;
    bool reap;
    bool assume_same_shell;
    bool force_rsh;
    bool unusedB;

recompiled, dropped the new .la and .so pieces in, and then it all worked.

My hunch is that I'm having a data alignment problem. Perhaps the pointer to _launch in the pls module is stored after the orte_pls_rsh_component_t struct, and the alignment that a given build assumes differs from that of my newly compiled pls module. Apple usually compiles with every type on its "natural" alignment in memory (PowerPC always liked it that way and the habit has stuck), and seeing 3 bools followed by a char** tells me there could be padding.
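To check the padding guess on a given compiler, something like the following could be used. It is only a sketch: struct rsh_fields_old is a hypothetical stand-in holding just the six fields quoted above in their original order, not the real orte_pls_rsh_component_t, and the function names are made up for illustration. The numbers it prints depend on the platform's sizeof(bool) and pointer alignment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in: only the fields quoted in this mail, in the
 * original order. The real component struct has other members too. */
struct rsh_fields_old {
    bool reap;
    bool assume_same_shell;
    bool force_rsh;
    char **agent_argv;
    int agent_argc;
    char *agent_path;
};

/* The compiler places agent_argv on its natural alignment, whatever
 * that costs in padding after the three bools. */
static_assert(offsetof(struct rsh_fields_old, agent_argv) % _Alignof(char **) == 0,
              "agent_argv is not on its natural alignment");

/* Bytes inserted between the end of force_rsh and the start of agent_argv. */
size_t padding_before_agent_argv(void)
{
    return offsetof(struct rsh_fields_old, agent_argv)
         - (offsetof(struct rsh_fields_old, force_rsh) + sizeof(bool));
}

/* Print the offsets so two builds can be compared side by side. */
void print_old_layout(void)
{
    printf("force_rsh  at offset %zu\n", offsetof(struct rsh_fields_old, force_rsh));
    printf("agent_argv at offset %zu\n", offsetof(struct rsh_fields_old, agent_argv));
    printf("agent_argc at offset %zu\n", offsetof(struct rsh_fields_old, agent_argc));
    printf("agent_path at offset %zu\n", offsetof(struct rsh_fields_old, agent_path));
    printf("padding before agent_argv: %zu bytes\n", padding_before_agent_argv());
    printf("sizeof struct: %zu\n", sizeof(struct rsh_fields_old));
}
```

Calling print_old_layout() from a small main() in each build environment and comparing the output would show whether the two sides disagree about where agent_argv starts.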

The problem is not whether to have padding, but what we agree on. I don't know who put which memory alignment compiler flag in which makefile or ./configure line, but if I rearrange the struct into the latter layout above then there is no ambiguity, and orterun calls _launch just fine in both the rsh module and my own.
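One way to make that agreement explicit would be compile-time checks on the rearranged layout, so a build with different alignment settings fails to compile instead of crashing at run time. This is a sketch under assumptions: struct rsh_fields_new is a hypothetical stand-in containing only the eight fields listed above, and interior_padding_new() is an illustrative helper, not anything in the Open MPI tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the rearranged layout: widest members
 * first, bools last, with the two filler fields from the mail. */
struct rsh_fields_new {
    char **agent_argv;
    char *agent_path;
    int agent_argc;
    int unusedInt;
    bool reap;
    bool assume_same_shell;
    bool force_rsh;
    bool unusedB;
};

/* Each member must start exactly where the previous one ends. */
static_assert(offsetof(struct rsh_fields_new, agent_path) == sizeof(char **),
              "padding after agent_argv");
static_assert(offsetof(struct rsh_fields_new, agent_argc)
                  == sizeof(char **) + sizeof(char *),
              "padding after agent_path");
static_assert(offsetof(struct rsh_fields_new, reap)
                  == offsetof(struct rsh_fields_new, agent_argc) + 2 * sizeof(int),
              "padding after the ints");

/* Padding between members, from the first member through the last.
 * Trailing padding after unusedB (possible with 8-byte pointers) is
 * deliberately not counted. */
size_t interior_padding_new(void)
{
    size_t used = sizeof(char **) + sizeof(char *)
                + 2 * sizeof(int) + 4 * sizeof(bool);
    size_t end = offsetof(struct rsh_fields_new, unusedB) + sizeof(bool);
    return end - used;
}
```

The checks only cover the gaps between members. The compiler may still round the total size up to the pointer alignment, which matters if something is stored directly after the struct.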

Thanks for your help,
   Dean
