Dear Ralph,
This is the output I get when I run with the verbose option (-mca plm_base_verbose 5).
[grsacc20:21012] [[23526,0],0] plm:base:receive processing msg
[grsacc20:21012] [[23526,0],0] plm:base:receive job launch command from
[[23526,1],0]
[grsacc20:21012] [[23526,0],0] plm:base:receive adding hosts
[grsacc20:21012] [[23526,0],0] plm:base:receive calling spawn
[grsacc20:21012] [[23526,0],0] plm:base:receive done processing commands
[grsacc20:21012] [[23526,0],0] plm:base:setup_job
[grsacc20:21012] [[23526,0],0] plm:base:setup_vm
[grsacc20:21012] [[23526,0],0] plm:base:setup_vm add new daemon [[23526,0],2]
[grsacc20:21012] [[23526,0],0] plm:base:setup_vm assigning new daemon
[[23526,0],2] to node grsacc17/1-4
[grsacc20:21012] [[23526,0],0] plm:base:setup_vm add new daemon [[23526,0],3]
[grsacc20:21012] [[23526,0],0] plm:base:setup_vm assigning new daemon
[[23526,0],3] to node grsacc17/0-5
[grsacc20:21012] [[23526,0],0] plm:tm: launching vm
[grsacc20:21012] [[23526,0],0] plm:tm: final top-level argv:
orted -mca ess tm -mca orte_ess_jobid 1541799936 -mca orte_ess_vpid
<template> -mca orte_ess_num_procs 4 -mca orte_hnp_uri
"1541799936.0;tcp://192.168.222.20:49049" -mca plm_base_verbose 5
[warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one
event_base_loop can run on each event_base at once.
[grsacc20:21012] [[23526,0],0] plm:base:orted_cmd sending orted_exit commands
[grsacc20:21012] [[23526,0],0] plm:base:receive stop comm
Does this tell you anything?
Best,
Suraj
On Sep 22, 2013, at 9:45 PM, Ralph Castain wrote:
> I'll still need to look at the intercomm_create issue, but I just tested both
> the trunk and current 1.7.3 branch for "add-host" and both worked just fine.
> This was on my little test cluster which only has rsh available - no Torque.
>
> You might add "-mca plm_base_verbose 5" to your cmd line to get some debug
> output as to the problem.
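>
> For example, something like this (the application name here is only a
> placeholder):
>
>   mpirun -np 2 -mca plm_base_verbose 5 ./my_app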
>
>
> On Sep 21, 2013, at 5:48 PM, Ralph Castain <[email protected]> wrote:
>
>>
>> On Sep 21, 2013, at 4:54 PM, Suraj Prabhakaran <[email protected]>
>> wrote:
>>
>>> Dear all,
>>>
>>> Thanks a lot for your efforts. I also downloaded the trunk to check
>>> whether it works for my case, and as of revision 29215 it works for the
>>> original case I reported. Although it works, I still see the following in
>>> the output. Does it mean anything?
>>> [grsacc17][[13611,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13611,2],0]
>>
>> Yes - it means we don't quite have this right yet :-(
>>
>>>
>>> However, on another topic relevant to my use case, I have another problem
>>> to report. I am having trouble passing the "add-host" info key to
>>> MPI_Comm_spawn() when Open MPI is compiled with support for the Torque
>>> resource manager. This problem is new in the 1.7 series; everything worked
>>> fine through 1.6.5.
>>>
>>> Basically, I am working on implementing dynamic resource management
>>> facilities in the Torque/Maui batch system. Through a new tm call, an
>>> application can get new resources for a job.
>>
>> FWIW: you'll find that we added an API to the orte RAS framework to support
>> precisely that operation. It allows an application to request that we
>> dynamically obtain additional resources during execution (e.g., as part of a
>> Comm_spawn call via an info_key). We originally implemented this with Slurm,
>> but you could add the calls into the Torque component as well if you like.
>>
>> This is in the trunk now - will come over to 1.7.4
>>
>>
>>> I want to use MPI_Comm_spawn() to spawn new processes on the new hosts.
>>> With my extended Torque/Maui batch system, I was able to use the
>>> "add-host" info argument to MPI_Comm_spawn() to spawn new processes on
>>> those hosts. Since MPI and Torque refer to the hosts through node IDs, I
>>> made sure that Open MPI uses the correct node IDs for the new hosts. Up to
>>> 1.6.5 this worked fine, except that the Intercomm_merge problem kept me
>>> from running a real application to completion. The pattern I use is
>>> roughly the sketch below.
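>>>
>>> To be concrete, here is a minimal sketch of that pattern (the hostname,
>>> worker binary, and process count are placeholders, not my actual values):
>>>
>>> #include <mpi.h>
>>>
>>> int main(int argc, char **argv)
>>> {
>>>     MPI_Comm intercomm, merged;
>>>     MPI_Info info;
>>>
>>>     MPI_Init(&argc, &argv);
>>>
>>>     /* Ask the runtime to extend the job with a host obtained at
>>>      * run time (placeholder hostname). */
>>>     MPI_Info_create(&info);
>>>     MPI_Info_set(info, "add-host", "grsacc17");
>>>
>>>     /* Spawn two workers on the added host; "./worker" is a placeholder. */
>>>     MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 2, info, 0,
>>>                    MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
>>>
>>>     /* Merge parent and children into a single intracommunicator. */
>>>     MPI_Intercomm_merge(intercomm, 0, &merged);
>>>
>>>     MPI_Info_free(&info);
>>>     MPI_Comm_free(&merged);
>>>     MPI_Comm_free(&intercomm);
>>>     MPI_Finalize();
>>>     return 0;
>>> }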
>>>
>>> While this is now fixed in the trunk, I found that when I use the
>>> "add-host" info argument, everything collapses after printing the
>>> following error.
>>>
>>> [warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one
>>> event_base_loop can run on each event_base at once.
>>
>> I'll take a look - probably some stale code that hasn't been updated yet for
>> async ORTE operations
>>
>>>
>>> And because of this, I am still not able to run my application! I also
>>> compiled Open MPI without any Torque/PBS support and used the "add-host"
>>> argument normally. Again, this worked perfectly in 1.6.5, but in the 1.7
>>> series it runs only after printing the following error.
>>>
>>> [grsacc17][[13731,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]
>>> [grsacc17][[13731,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]
>>
>> Yeah, the 1.7 series doesn't have the reentrant test in it - so we
>> "illegally" re-enter libevent. The error again means we don't have
>> Intercomm_create correct just yet.
>>
>> I'll see what I can do about this and get back to you
>>
>>>
>>> In short, with PBS/Torque support it fails, and without PBS/Torque
>>> support it runs, but only after emitting the lines above.
>>>
>>> I would really appreciate some help on this, since I need these features
>>> to actually test my case, and (at least in my short experience) no other
>>> MPI implementation seems friendly to such dynamic scenarios.
>>>
>>> Thanks a lot!
>>>
>>> Best,
>>> Suraj
>>>
>>>
>>>
>>> On Sep 20, 2013, at 4:58 PM, Jeff Squyres (jsquyres) wrote:
>>>
>>>> Just to close my end of this loop: as of trunk r29213, it all works for
>>>> me. Thanks!
>>>>
>>>>
>>>> On Sep 18, 2013, at 12:52 PM, Ralph Castain <[email protected]> wrote:
>>>>
>>>>> Thanks George - much appreciated
>>>>>
>>>>> On Sep 18, 2013, at 9:49 AM, George Bosilca <[email protected]> wrote:
>>>>>
>>>>>> The test case was broken. I just pushed a fix.
>>>>>>
>>>>>> George.
>>>>>>
>>>>>> On Sep 18, 2013, at 16:49 , Ralph Castain <[email protected]> wrote:
>>>>>>
>>>>>>> Hangs with any np > 1
>>>>>>>
>>>>>>> However, I'm not sure whether that's an issue with the test or with
>>>>>>> the underlying implementation.
>>>>>>>
>>>>>>> On Sep 18, 2013, at 7:40 AM, "Jeff Squyres (jsquyres)"
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Does it hang when you run with -np 4?
>>>>>>>>
>>>>>>>> Sent from my phone. No type good.
>>>>>>>>
>>>>>>>> On Sep 18, 2013, at 4:10 PM, "Ralph Castain" <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Strange - it works fine for me on my Mac. However, I see one
>>>>>>>>> difference: I only run it with np=1.
>>>>>>>>>
>>>>>>>>> On Sep 18, 2013, at 2:22 AM, Jeff Squyres (jsquyres)
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> On Sep 18, 2013, at 9:33 AM, George Bosilca <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> 1. sm doesn't work between spawned processes. So you must have
>>>>>>>>>>> another network enabled.
>>>>>>>>>>
>>>>>>>>>> I know :-). I have tcp available as well (OMPI will abort if you
>>>>>>>>>> only run with sm,self because the comm_spawn will fail with
>>>>>>>>>> unreachable errors -- I just tested/proved this to myself).
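>>>>>>>>>>
>>>>>>>>>> For example (assuming the tcp BTL is built), explicitly enabling a
>>>>>>>>>> second transport looks like:
>>>>>>>>>>
>>>>>>>>>> ❯❯❯ mpirun -np 4 -mca btl tcp,sm,self intercomm_create
>>>>>>>>>>
>>>>>>>>>> whereas restricting to "-mca btl sm,self" aborts with unreachable
>>>>>>>>>> errors, as described above.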
>>>>>>>>>>
>>>>>>>>>>> 2. Don't use the test case attached to my email; I left in an
>>>>>>>>>>> xterm-based spawn and the debugging, so it can't work without
>>>>>>>>>>> xterm support. Instead, use the test case from the trunk, the one
>>>>>>>>>>> committed by Ralph.
>>>>>>>>>>
>>>>>>>>>> I didn't see any "xterm" strings in there, but ok. :-) I ran with
>>>>>>>>>> orte/test/mpi/intercomm_create.c, and that hangs for me as well:
>>>>>>>>>>
>>>>>>>>>> -----
>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201,
>>>>>>>>>> &inter) [rank 4]
>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201,
>>>>>>>>>> &inter) [rank 5]
>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201,
>>>>>>>>>> &inter) [rank 6]
>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201,
>>>>>>>>>> &inter) [rank 7]
>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter)
>>>>>>>>>> [rank 4]
>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter)
>>>>>>>>>> [rank 5]
>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter)
>>>>>>>>>> [rank 6]
>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter)
>>>>>>>>>> [rank 7]
>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>> [hang]
>>>>>>>>>> -----
>>>>>>>>>>
>>>>>>>>>> Similarly, on my Mac, it hangs with no output:
>>>>>>>>>>
>>>>>>>>>> -----
>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>> [hang]
>>>>>>>>>> -----
>>>>>>>>>>
>>>>>>>>>>> George.
>>>>>>>>>>>
>>>>>>>>>>> On Sep 18, 2013, at 07:53 , "Jeff Squyres (jsquyres)"
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> George --
>>>>>>>>>>>>
>>>>>>>>>>>> When I build the SVN trunk (r29201) on 64 bit linux, your attached
>>>>>>>>>>>> test case hangs:
>>>>>>>>>>>>
>>>>>>>>>>>> -----
>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201,
>>>>>>>>>>>> &inter) [rank 4]
>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201,
>>>>>>>>>>>> &inter) [rank 5]
>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201,
>>>>>>>>>>>> &inter) [rank 6]
>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201,
>>>>>>>>>>>> &inter) [rank 7]
>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter)
>>>>>>>>>>>> [rank 4]
>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter)
>>>>>>>>>>>> [rank 5]
>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter)
>>>>>>>>>>>> [rank 6]
>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter)
>>>>>>>>>>>> [rank 7]
>>>>>>>>>>>> [hang]
>>>>>>>>>>>> -----
>>>>>>>>>>>>
>>>>>>>>>>>> On my Mac, it hangs without printing anything:
>>>>>>>>>>>>
>>>>>>>>>>>> -----
>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>> [hang]
>>>>>>>>>>>> -----
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sep 18, 2013, at 1:48 AM, George Bosilca <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Here is a quick (and definitely not the cleanest) patch that
>>>>>>>>>>>>> addresses the MPI_Intercomm issue at the MPI level. It should be
>>>>>>>>>>>>> applied after removal of r29166.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I also added the corrected test case stressing the corner cases
>>>>>>>>>>>>> by doing barriers at every inter-comm creation and doing a clean
>>>>>>>>>>>>> disconnect.
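>>>>>>>>>>>>>
>>>>>>>>>>>>> In other words, each inter-comm creation in the test is now
>>>>>>>>>>>>> followed by something like this sketch ('intra' and 'peer' stand
>>>>>>>>>>>>> in for communicators set up earlier; 201 is the tag the test
>>>>>>>>>>>>> uses):
>>>>>>>>>>>>>
>>>>>>>>>>>>> MPI_Comm inter;
>>>>>>>>>>>>> MPI_Intercomm_create(intra, 0, peer, 0, 201, &inter);
>>>>>>>>>>>>> MPI_Barrier(inter);          /* barrier at every creation */
>>>>>>>>>>>>> MPI_Comm_disconnect(&inter); /* clean disconnect at the end */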
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Jeff Squyres
>>>> [email protected]
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>
>>>
>>
>
> _______________________________________________
> devel mailing list
> [email protected]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel