On Sep 21, 2013, at 4:54 PM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> 
wrote:

> Dear all,
> 
> Thanks a lot for your efforts. I too downloaded the trunk to check whether it 
> works for my case, and as of revision 29215 it works for the original case I 
> reported. Although it works, I still see the following in the output. Does it 
> mean anything?
> [grsacc17][[13611,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create] 
> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13611,2],0]

Yes - it means we don't quite have this right yet :-(

> 
> However, on another topic relevant to my use case, I have a further problem 
> to report. I am having problems using the "add-host" info key with 
> MPI_Comm_spawn() when MPI is compiled with support for the Torque resource 
> manager. This problem is new in the 1.7 series; everything worked perfectly 
> up to 1.6.5.
> 
> Basically, I am working on implementing dynamic resource management 
> facilities in the Torque/Maui batch system. Through a new tm call, an 
> application can get new resources for a job.

FWIW: you'll find that we added an API to the orte RAS framework to support 
precisely that operation. It allows an application to request that we 
dynamically obtain additional resources during execution (e.g., as part of a 
Comm_spawn call via an info_key). We originally implemented this with Slurm, 
but you could add the calls into the Torque component as well if you like.

This is in the trunk now - will come over to 1.7.4
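
For anyone following along, the parent-side call being discussed looks roughly 
like the sketch below. It is only an illustration - the host name ("newhost01") 
and child binary ("./child") are placeholders, and the "add-host" value just 
names the host(s) to add to the job:

-----
/* Parent-side sketch: spawn two children onto a host obtained at runtime.
 * "newhost01" and "./child" are placeholders for illustration only. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    /* Ask the runtime to extend the job with the new host(s). */
    MPI_Info_set(info, "add-host", "newhost01");

    /* Spawn 2 copies of the child binary onto the added host. */
    MPI_Comm_spawn("./child", MPI_ARGV_NULL, 2, info,
                   0 /* root */, MPI_COMM_WORLD, &intercomm,
                   MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
-----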


> I want to use MPI_Comm_spawn() to spawn new processes on the new hosts. With 
> my extended Torque/Maui batch system, I was able to use the "add-host" info 
> argument to MPI_Comm_spawn() to spawn new processes on these hosts. Since MPI 
> and Torque refer to the hosts through nodeids, I made sure that Open MPI uses 
> the correct nodeids for these new hosts. 
> Up to 1.6.5 this worked perfectly, except that, due to the Intercomm_merge 
> problem, I could not run a real application to completion.
> 
> While this is now fixed in the trunk, I found that when using the "add-host" 
> info argument, everything collapses after printing the following: 
> 
> [warn] opal_libevent2021_event_base_loop: reentrant invocation.  Only one 
> event_base_loop can run on each event_base at once.

I'll take a look - probably some stale code that hasn't been updated yet for 
async ORTE operations

> 
> And due to this, I am still not able to run my application! I also compiled 
> MPI without any Torque/PBS support and just used the "add-host" argument 
> normally. Again, this worked perfectly in 1.6.5, but in the 1.7 series it 
> works only after printing the following errors:
> 
> [grsacc17][[13731,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create] 
> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]
> [grsacc17][[13731,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create] 
> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]

Yeah, the 1.7 series doesn't have the reentrant test in it - so we "illegally" 
re-enter libevent. The error again means we don't have Intercomm_create correct 
just yet.

I'll see what I can do about this and get back to you.
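
For completeness, here is the child-side merge pattern that gets exercised in 
this scenario - a minimal sketch, with nothing in it specific to your setup:

-----
/* Child-side sketch: connect back to the parent job and merge the
 * inter-communicator into a single intracommunicator. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, merged;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent != MPI_COMM_NULL) {
        /* high = 1: the spawned processes rank after the parent's ranks
         * in the merged intracommunicator. */
        MPI_Intercomm_merge(parent, 1, &merged);

        /* ... application work on "merged" ... */

        MPI_Comm_free(&merged);
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}
-----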

> 
> In short, with PBS/Torque support it fails, and without PBS/Torque support it 
> runs after spitting out the above lines. 
> 
> I would really appreciate some help on this, since I need these features to 
> actually test my case, and (at least in my short experience) no other MPI 
> implementation seems friendly to such dynamic scenarios. 
> 
> Thanks a lot!
> 
> Best,
> Suraj
> 
> 
> 
> On Sep 20, 2013, at 4:58 PM, Jeff Squyres (jsquyres) wrote:
> 
>> Just to close my end of this loop: as of trunk r29213, it all works for me.  
>> Thanks!
>> 
>> 
>> On Sep 18, 2013, at 12:52 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> 
>>> Thanks George - much appreciated
>>> 
>>> On Sep 18, 2013, at 9:49 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>> 
>>>> The test case was broken. I just pushed a fix.
>>>> 
>>>> George.
>>>> 
>>>> On Sep 18, 2013, at 16:49 , Ralph Castain <r...@open-mpi.org> wrote:
>>>> 
>>>>> Hangs with any np > 1
>>>>> 
>>>>> However, I'm not sure if that's an issue with the test vs the underlying 
>>>>> implementation
>>>>> 
>>>>> On Sep 18, 2013, at 7:40 AM, "Jeff Squyres (jsquyres)" 
>>>>> <jsquy...@cisco.com> wrote:
>>>>> 
>>>>>> Does it hang when you run with -np 4?
>>>>>> 
>>>>>> Sent from my phone. No type good. 
>>>>>> 
>>>>>> On Sep 18, 2013, at 4:10 PM, "Ralph Castain" <r...@open-mpi.org> wrote:
>>>>>> 
>>>>>>> Strange - it works fine for me on my Mac. However, I see one difference 
>>>>>>> - I only run it with np=1
>>>>>>> 
>>>>>>> On Sep 18, 2013, at 2:22 AM, Jeff Squyres (jsquyres) 
>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>> 
>>>>>>>> On Sep 18, 2013, at 9:33 AM, George Bosilca <bosi...@icl.utk.edu> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> 1. sm doesn't work between spawned processes. So you must have 
>>>>>>>>> another network enabled.
>>>>>>>> 
>>>>>>>> I know :-).  I have tcp available as well (OMPI will abort if you only 
>>>>>>>> run with sm,self because the comm_spawn will fail with unreachable 
>>>>>>>> errors -- I just tested/proved this to myself).
>>>>>>>> 
>>>>>>>>> 2. Don't use the test case attached to my email; I left an 
>>>>>>>>> xterm-based spawn and the debugging in it, so it can't work without 
>>>>>>>>> xterm support. Instead, try the test case from the trunk, the one 
>>>>>>>>> committed by Ralph.
>>>>>>>> 
>>>>>>>> I didn't see any "xterm" strings in there, but ok.  :-)  I ran with 
>>>>>>>> orte/test/mpi/intercomm_create.c, and that hangs for me as well:
>>>>>>>> 
>>>>>>>> -----
>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) 
>>>>>>>> [rank 4]
>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) 
>>>>>>>> [rank 5]
>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) 
>>>>>>>> [rank 6]
>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) 
>>>>>>>> [rank 7]
>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) 
>>>>>>>> [rank 4]
>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) 
>>>>>>>> [rank 5]
>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) 
>>>>>>>> [rank 6]
>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) 
>>>>>>>> [rank 7]
>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>> [hang]
>>>>>>>> -----
>>>>>>>> 
>>>>>>>> Similarly, on my Mac, it hangs with no output:
>>>>>>>> 
>>>>>>>> -----
>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>> [hang]
>>>>>>>> -----
>>>>>>>> 
>>>>>>>>> George.
>>>>>>>>> 
>>>>>>>>> On Sep 18, 2013, at 07:53 , "Jeff Squyres (jsquyres)" 
>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>> 
>>>>>>>>>> George --
>>>>>>>>>> 
>>>>>>>>>> When I build the SVN trunk (r29201) on 64-bit Linux, your attached 
>>>>>>>>>> test case hangs:
>>>>>>>>>> 
>>>>>>>>>> -----
>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>> &inter) [rank 4]
>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>> &inter) [rank 5]
>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>> &inter) [rank 6]
>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>> &inter) [rank 7]
>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) 
>>>>>>>>>> [rank 4]
>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) 
>>>>>>>>>> [rank 5]
>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) 
>>>>>>>>>> [rank 6]
>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) 
>>>>>>>>>> [rank 7]
>>>>>>>>>> [hang]
>>>>>>>>>> -----
>>>>>>>>>> 
>>>>>>>>>> On my Mac, it hangs without printing anything:
>>>>>>>>>> 
>>>>>>>>>> -----
>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create   
>>>>>>>>>> [hang]
>>>>>>>>>> -----
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Sep 18, 2013, at 1:48 AM, George Bosilca <bosi...@icl.utk.edu> 
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Here is a quick (and definitely not the cleanest) patch that 
>>>>>>>>>>> addresses the MPI_Intercomm issue at the MPI level. It should be 
>>>>>>>>>>> applied after removal of r29166.
>>>>>>>>>>> 
>>>>>>>>>>> I also added the corrected test case stressing the corner cases by 
>>>>>>>>>>> doing barriers at every inter-comm creation and doing a clean 
>>>>>>>>>>> disconnect.