Hi Ralph,

Thanks for addressing this issue.

I tried downloading your fork from that pull request, and the segfault
appears to be gone.  However, I hadn't installed it on my remote machine
before testing, and I got this error:

bash: /opt/ompi-release-cmr-singlespawn/bin/orted: No such file or directory

(along with the usual complaints about ORTE not being able to start one of
the daemons)

On both machines I have Open MPI installed in a directory under /opt, and
/opt/openmpi is a symlink to whichever installation I want to use...then my
paths point to the symlink.  I went to the remote machine and simply renamed
the directory to match the one on the local machine, and then I got a
version mismatch error...a much more expected error.  I'm not familiar with
the OMPI source, but does this have to do with the prefix issue you mentioned
in the pull request?  Should it handle symlinks?  Apologies if I'm misguided.

Evan

On Thu, Feb 5, 2015 at 9:51 AM, Ralph Castain <r...@open-mpi.org> wrote:

> Okay, I tracked this down - thanks for your patience! I have a fix pending
> review. You can track it here:
>
> https://github.com/open-mpi/ompi-release/pull/179
>
>
> On Feb 4, 2015, at 5:14 PM, Evan Samanas <evan.sama...@gmail.com> wrote:
>
> Indeed, I simply commented out all the MPI_Info stuff, which you
> essentially did by passing a dummy argument.  I'm still not able to get it
> to succeed.
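>
> For reference, what I'm left with is essentially this minimal sketch (the
> child command name and the maxprocs value of 2 are just placeholders here,
> not my real values):
>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     MPI_Comm child;
>
>     MPI_Init(&argc, &argv);
>     /* No Info keys at all; "./spawned_child" and 2 are placeholders. */
>     MPI_Comm_spawn("./spawned_child", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
>                    0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
>     MPI_Finalize();
>     return 0;
> }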
>
> So here we go; my results defy logic.  I'm sure this could be my
> fault...I've only been an occasional user of Open MPI and MPI in general
> over the years, and I've never used MPI_Comm_spawn before this project.  I
> tested simple_spawn like so:
> mpicc simple_spawn.c -o simple_spawn
> ./simple_spawn
>
> When my default hostfile points to a file that just lists localhost, this
> test completes successfully.  If it points to my hostfile with localhost
> and 5 remote hosts, here's the output:
> evan@lasarti:~/devel/toy_progs/mpi_spawn$ mpicc simple_spawn.c -o
> simple_spawn
> evan@lasarti:~/devel/toy_progs/mpi_spawn$ ./simple_spawn
> [pid 5703] starting up!
> 0 completed MPI_Init
> Parent [pid 5703] about to spawn!
> [lasarti:05703] [[14661,1],0] FORKING HNP: orted --hnp --set-sid
> --report-uri 14 --singleton-died-pipe 15 -mca state_novm_select 1 -mca
> ess_base_jobid 960823296
> [lasarti:05705] *** Process received signal ***
> [lasarti:05705] Signal: Segmentation fault (11)
> [lasarti:05705] Signal code: Address not mapped (1)
> [lasarti:05705] Failing at address: (nil)
> [lasarti:05705] [ 0]
> /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7fc185dcf340]
> [lasarti:05705] [ 1]
> /opt/openmpi-v1.8.4-54-g07f735a/lib/libopen-rte.so.7(orte_rmaps_base_compute_bindings+0x650)[0x7fc186033bb0]
> [lasarti:05705] [ 2]
> /opt/openmpi-v1.8.4-54-g07f735a/lib/libopen-rte.so.7(orte_rmaps_base_map_job+0x939)[0x7fc18602fb99]
> [lasarti:05705] [ 3]
> /opt/openmpi-v1.8.4-54-g07f735a/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0x6e4)[0x7fc18577dcc4]
> [lasarti:05705] [ 4]
> /opt/openmpi-v1.8.4-54-g07f735a/lib/libopen-rte.so.7(orte_daemon+0xdf8)[0x7fc186010438]
> [lasarti:05705] [ 5] orted(main+0x47)[0x400887]
> [lasarti:05705] [ 6]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fc185a1aec5]
> [lasarti:05705] [ 7] orted[0x4008db]
> [lasarti:05705] *** End of error message ***
>
> You can see from the message that this particular run IS from the latest
> snapshot, though the failure happens on v1.8.4 as well.  I didn't bother
> installing the snapshot on the remote nodes though.  Should I do that?  It
> looked to me like this error happened well before we got to a remote node,
> so that's why I didn't.
>
> Your thoughts?
>
> Evan
>
>
>
> On Tue, Feb 3, 2015 at 7:40 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> I confess I am sorely puzzled. I replaced the Info key with MPI_INFO_NULL,
>> but I still had to pass a bogus argument to master since you still have the
>> Info_set code in there - otherwise, MPI_Info_set segfaults due to a NULL
>> argv[1]. Doing that (and replacing "hostname" with an MPI example program)
>> makes everything work just fine.
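>>
>> In other words, assuming your test code feeds argv[1] straight into
>> MPI_Info_set, a guard along these lines (just a sketch of that assumption)
>> would avoid the NULL-value crash when no argument is passed:
>>
>> MPI_Info info;
>> MPI_Info_create(&info);
>> /* Only set the key when a value was actually given on the command
>>  * line; handing a NULL value to MPI_Info_set is what segfaults. */
>> if (argc > 1 && argv[1] != NULL) {
>>     MPI_Info_set(info, "host", argv[1]);   /* or "hostfile" */
>> }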
>>
>> I've attached one of our example comm_spawn codes that we test against -
>> it also works fine with the current head of the 1.8 code base. I confess
>> that some changes have been made since 1.8.4 was released, and it is
>> entirely possible that this was a problem in 1.8.4 and has since been fixed.
>>
>> So I'd suggest trying with the nightly 1.8 tarball and seeing if it works
>> for you. You can download it from here:
>>
>> http://www.open-mpi.org/nightly/v1.8/
>>
>> HTH
>> Ralph
>>
>>
>> On Tue, Feb 3, 2015 at 6:20 PM, Evan Samanas <evan.sama...@gmail.com>
>> wrote:
>>
>>> Yes, I did.  I replaced the info argument of MPI_Comm_spawn with
>>> MPI_INFO_NULL.
>>>
>>> On Tue, Feb 3, 2015 at 5:54 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> When running your comm_spawn code, did you remove the Info key code?
>>>> You wouldn't need to provide a hostfile or hosts any more, which is why it
>>>> should resolve that problem.
>>>>
>>>> I agree that providing either hostfile or host as an Info key will
>>>> cause the program to segfault - I'm working on that issue.
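>>>>
>>>> That is, Info usage roughly like this (only a sketch; the values are
>>>> placeholders) is what currently hits the crash:
>>>>
>>>> MPI_Info info;
>>>> MPI_Info_create(&info);
>>>> /* Either key, later passed as the info argument of MPI_Comm_spawn,
>>>>  * currently leads to the segfault; the values are placeholders. */
>>>> MPI_Info_set(info, "hostfile", "my_hostfile");
>>>> /* or: MPI_Info_set(info, "host", "remote1,remote2"); */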
>>>>
>>>>
>>>> On Tue, Feb 3, 2015 at 3:46 PM, Evan Samanas <evan.sama...@gmail.com>
>>>> wrote:
>>>>
>>>>> Setting these environment variables did indeed change the way mpirun
>>>>> maps things, and I didn't have to specify a hostfile.  However, setting
>>>>> these for my MPI_Comm_spawn code still resulted in the same segmentation
>>>>> fault.
>>>>>
>>>>> Evan
>>>>>
>>>>> On Tue, Feb 3, 2015 at 10:09 AM, Ralph Castain <r...@open-mpi.org>
>>>>> wrote:
>>>>>
>>>>>> If you add the following to your environment, you should run on
>>>>>> multiple nodes:
>>>>>>
>>>>>> OMPI_MCA_rmaps_base_mapping_policy=node
>>>>>> OMPI_MCA_orte_default_hostfile=<your hostfile>
>>>>>>
>>>>>> The first tells OMPI to map-by node. The second passes in your
>>>>>> default hostfile so you don't need to specify it as an Info key.
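>>>>>>
>>>>>> If exporting them in the shell is awkward, I believe setting them from
>>>>>> the code itself before MPI_Init should work too, since the forked
>>>>>> daemon inherits the environment - just a sketch, with a placeholder
>>>>>> hostfile path:
>>>>>>
>>>>>> #include <stdlib.h>   /* for setenv */
>>>>>>
>>>>>> /* Meant to match exporting the variables in the shell (assumption);
>>>>>>  * the hostfile path is a placeholder. Must run before MPI_Init. */
>>>>>> setenv("OMPI_MCA_rmaps_base_mapping_policy", "node", 1);
>>>>>> setenv("OMPI_MCA_orte_default_hostfile", "/path/to/hostfile", 1);
>>>>>> MPI_Init(&argc, &argv);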
>>>>>>
>>>>>> HTH
>>>>>> Ralph
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 3, 2015 at 9:23 AM, Evan Samanas <evan.sama...@gmail.com>
>>>>>>  wrote:
>>>>>>
>>>>>>> Hi Ralph,
>>>>>>>
>>>>>>> Good to know you've reproduced it.  I was experiencing this using
>>>>>>> both the hostfile and host key.  A simple comm_spawn was working for me 
>>>>>>> as
>>>>>>> well, but it was only launching locally, and I'm pretty sure each node 
>>>>>>> only
>>>>>>> has 4 slots given past behavior (the mpirun -np 8 example I gave in my
>>>>>>> first email launches on both hosts).  Is there a way to specify the 
>>>>>>> hosts I
>>>>>>> want to launch on without the hostfile or host key so I can test remote
>>>>>>> launch?
>>>>>>>
>>>>>>> And to the "hostname" response...no wonder it was hanging!  I just
>>>>>>> constructed that as a basic example.  In my real use I'm launching
>>>>>>> something that calls MPI_Init.
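>>>>>>>
>>>>>>> The child side is essentially a skeleton like this (simplified, of
>>>>>>> course; the real program does more):
>>>>>>>
>>>>>>> #include <mpi.h>
>>>>>>>
>>>>>>> int main(int argc, char **argv)
>>>>>>> {
>>>>>>>     MPI_Comm parent;
>>>>>>>
>>>>>>>     MPI_Init(&argc, &argv);
>>>>>>>     /* In a spawned child this returns the intercommunicator back to
>>>>>>>      * the parent (MPI_COMM_NULL if the process wasn't spawned). */
>>>>>>>     MPI_Comm_get_parent(&parent);
>>>>>>>     /* ... the real work goes here ... */
>>>>>>>     MPI_Finalize();
>>>>>>>     return 0;
>>>>>>> }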
>>>>>>>
>>>>>>> Evan
>>>>>>>
>>>>>>>
>>>>>>>
