I'm going to need a little help here. The problem is that you launch two new daemons, and one of them exits immediately because it thinks it lost the connection back to mpirun - before it ever gets a chance to establish it.

Can you give me a little more info as to exactly what you are doing? Perhaps send me your test code?
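A minimal reproducer along the lines discussed below would be ideal - e.g., a sketch like the following (this is illustrative only; the hostname, process count, and child binary name are placeholders, not taken from the actual test code):

-----
/* spawn_test.c - sketch of MPI_Comm_spawn with the "add-host" info key.
 * "./child" stands in for any MPI program that calls MPI_Comm_get_parent()
 * and MPI_Comm_disconnect() on the parent communicator. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm inter;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    /* Ask the runtime to extend the job onto a node that was not part
     * of the original allocation, then spawn the children there. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "add-host", "grsacc18");

    MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, info, 0,
                   MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}
-----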
On Sep 24, 2013, at 7:48 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:

> Hi Ralph,
> 
> Output attached in a file.
> Thanks a lot!
> 
> Best,
> Suraj
> 
> <output.rtf>
> 
> On Sep 24, 2013, at 4:11 PM, Ralph Castain wrote:
> 
>> Afraid I don't see the problem offhand - can you add the following to your cmd line?
>> 
>> -mca state_base_verbose 10 -mca errmgr_base_verbose 10
>> 
>> Thanks
>> Ralph
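(Those switches go on an otherwise unchanged mpirun invocation; for example, something like

    mpirun -np 2 -mca state_base_verbose 10 -mca errmgr_base_verbose 10 ./spawn_test

where the process count and binary name are placeholders.)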
>> On Sep 24, 2013, at 6:35 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>> 
>>> Hi Ralph,
>>> 
>>> I have always gotten this output from any MPI job that ran on our nodes. There seems to be a problem somewhere, but it has never stopped the applications from running. Anyway, I ran it again now with only tcp, excluding InfiniBand, and I get the same output again - except that this time the openib-related error is gone. Printing out the log again:
>>> 
>>> [grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
>>> [grsacc20:04578] [[6160,0],0] plm:base:receive job launch command from [[6160,1],0]
>>> [grsacc20:04578] [[6160,0],0] plm:base:receive adding hosts
>>> [grsacc20:04578] [[6160,0],0] plm:base:receive calling spawn
>>> [grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
>>> [grsacc20:04578] [[6160,0],0] plm:base:setup_job
>>> [grsacc20:04578] [[6160,0],0] plm:base:setup_vm
>>> [grsacc20:04578] [[6160,0],0] plm:base:setup_vm add new daemon [[6160,0],2]
>>> [grsacc20:04578] [[6160,0],0] plm:base:setup_vm assigning new daemon [[6160,0],2] to node grsacc18
>>> [grsacc20:04578] [[6160,0],0] plm:tm: launching vm
>>> [grsacc20:04578] [[6160,0],0] plm:tm: final top-level argv:
>>>     orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 3 -mca orte_hnp_uri "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl tcp,sm,self
>>> [grsacc20:04578] [[6160,0],0] plm:tm: launching on node grsacc19
>>> [grsacc20:04578] [[6160,0],0] plm:tm: executing:
>>>     orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl tcp,sm,self
>>> [grsacc20:04578] [[6160,0],0] plm:tm: launching on node grsacc18
>>> [grsacc20:04578] [[6160,0],0] plm:tm: executing:
>>>     orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl tcp,sm,self
>>> [grsacc20:04578] [[6160,0],0] plm:tm:launch: finished spawning orteds
>>> [grsacc19:28821] mca:base:select:( plm) Querying component [rsh]
>>> [grsacc19:28821] [[6160,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
>>> [grsacc19:28821] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>> [grsacc19:28821] mca:base:select:( plm) Selected component [rsh]
>>> [grsacc19:28821] [[6160,0],1] plm:rsh_setup on agent ssh : rsh path NULL
>>> [grsacc19:28821] [[6160,0],1] plm:base:receive start comm
>>> [grsacc19:28821] [[6160,0],1] plm:base:receive stop comm
>>> [grsacc18:16717] mca:base:select:( plm) Querying component [rsh]
>>> [grsacc18:16717] [[6160,0],2] plm:rsh_lookup on agent ssh : rsh path NULL
>>> [grsacc18:16717] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>> [grsacc18:16717] mca:base:select:( plm) Selected component [rsh]
>>> [grsacc18:16717] [[6160,0],2] plm:rsh_setup on agent ssh : rsh path NULL
>>> [grsacc18:16717] [[6160,0],2] plm:base:receive start comm
>>> [grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch from daemon [[6160,0],2]
>>> [grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch from daemon [[6160,0],2] on node grsacc18
>>> [grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch completed for daemon [[6160,0],2] at contact 403701760.2;tcp://192.168.222.18:44229
>>> [grsacc20:04578] [[6160,0],0] plm:base:launch_apps for job [6160,2]
>>> [grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
>>> [grsacc20:04578] [[6160,0],0] plm:base:receive update proc state command from [[6160,0],2]
>>> [grsacc20:04578] [[6160,0],0] plm:base:receive got update_proc_state for job [6160,2]
>>> [grsacc20:04578] [[6160,0],0] plm:base:receive got update_proc_state for vpid 0 state RUNNING exit_code 0
>>> [grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
>>> [grsacc20:04578] [[6160,0],0] plm:base:launch wiring up iof for job [6160,2]
>>> [grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
>>> [grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
>>> [grsacc20:04578] [[6160,0],0] plm:base:launch registered event
>>> [grsacc20:04578] [[6160,0],0] plm:base:launch sending dyn release of job [6160,2] to [[6160,1],0]
>>> [grsacc20:04578] [[6160,0],0] plm:base:orted_cmd sending orted_exit commands
>>> [grsacc19:28815] [[6160,0],1] plm:base:receive stop comm
>>> [grsacc20:04578] [[6160,0],0] plm:base:receive stop comm
>>> -bash-4.1$ [grsacc18:16717] [[6160,0],2] plm:base:receive stop comm
>>> 
>>> Best,
>>> Suraj
>>> 
>>> On Sep 24, 2013, at 3:24 PM, Ralph Castain wrote:
>>> 
>>>> Your output shows that it launched your apps, but they exited. The error is reported here, though it appears we aren't flushing the message out before exiting, due to a race condition:
>>>> 
>>>>> [grsacc20:04511] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
>>>> 
>>>> Here is the full text:
>>>> 
>>>> [no active ports found]
>>>> WARNING: There is at least one non-excluded OpenFabrics device found, but there are no active ports detected (or Open MPI was unable to use them). This is most certainly not what you wanted. Check your cables, subnet manager configuration, etc. The openib BTL will be ignored for this job.
>>>> 
>>>> Local host: %s
>>>> 
>>>> Looks like at least one node being used doesn't have an active InfiniBand port on it?
>>>> 
>>>> On Sep 24, 2013, at 6:11 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>>>> 
>>>>> Hi Ralph,
>>>>> 
>>>>> I tested it with the trunk at r29228, and I still have the following problem. Now it even spawns the daemon on the new node through Torque, but then it suddenly quits. The following is the output. Can you please have a look?
>>>>> 
>>>>> Thanks
>>>>> Suraj
>>>>> 
>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive job launch command from [[6253,1],0]
>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive adding hosts
>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive calling spawn
>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
>>>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_job
>>>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm
>>>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm add new daemon [[6253,0],2]
>>>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm assigning new daemon [[6253,0],2] to node grsacc18
>>>>> [grsacc20:04511] [[6253,0],0] plm:tm: launching vm
>>>>> [grsacc20:04511] [[6253,0],0] plm:tm: final top-level argv:
>>>>>     orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 3 -mca orte_hnp_uri "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
>>>>> [grsacc20:04511] [[6253,0],0] plm:tm: launching on node grsacc19
>>>>> [grsacc20:04511] [[6253,0],0] plm:tm: executing:
>>>>>     orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
>>>>> [grsacc20:04511] [[6253,0],0] plm:tm: launching on node grsacc18
>>>>> [grsacc20:04511] [[6253,0],0] plm:tm: executing:
>>>>>     orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
>>>>> [grsacc20:04511] [[6253,0],0] plm:tm:launch: finished spawning orteds
>>>>> [grsacc19:28754] mca:base:select:( plm) Querying component [rsh]
>>>>> [grsacc19:28754] [[6253,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
>>>>> [grsacc19:28754] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>>> [grsacc19:28754] mca:base:select:( plm) Selected component [rsh]
>>>>> [grsacc19:28754] [[6253,0],1] plm:rsh_setup on agent ssh : rsh path NULL
>>>>> [grsacc19:28754] [[6253,0],1] plm:base:receive start comm
>>>>> [grsacc19:28754] [[6253,0],1] plm:base:receive stop comm
>>>>> [grsacc18:16648] mca:base:select:( plm) Querying component [rsh]
>>>>> [grsacc18:16648] [[6253,0],2] plm:rsh_lookup on agent ssh : rsh path NULL
>>>>> [grsacc18:16648] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>>> [grsacc18:16648] mca:base:select:( plm) Selected component [rsh]
>>>>> [grsacc18:16648] [[6253,0],2] plm:rsh_setup on agent ssh : rsh path NULL
>>>>> [grsacc18:16648] [[6253,0],2] plm:base:receive start comm
>>>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch from daemon [[6253,0],2]
>>>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch from daemon [[6253,0],2] on node grsacc18
>>>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch completed for daemon [[6253,0],2] at contact 409796608.2;tcp://192.168.222.18:47974
>>>>> [grsacc20:04511] [[6253,0],0] plm:base:launch_apps for job [6253,2]
>>>>> [grsacc20:04511] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
>>>>> [grsacc20:04511] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>> [grsacc20:04511] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive update proc state command from [[6253,0],2]
>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive got update_proc_state for job [6253,2]
>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive got update_proc_state for vpid 0 state RUNNING exit_code 0
>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
>>>>> [grsacc20:04511] [[6253,0],0] plm:base:launch wiring up iof for job [6253,2]
>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
>>>>> [grsacc20:04511] [[6253,0],0] plm:base:launch registered event
>>>>> [grsacc20:04511] [[6253,0],0] plm:base:launch sending dyn release of job [6253,2] to [[6253,1],0]
>>>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_cmd sending orted_exit commands
>>>>> [grsacc19:28747] [[6253,0],1] plm:base:receive stop comm
>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive stop comm
>>>>> -bash-4.1$ [grsacc18:16648] [[6253,0],2] plm:base:receive stop comm
>>>>> 
>>>>> On Sep 23, 2013, at 1:55 AM, Ralph Castain wrote:
>>>>> 
>>>>>> Found a bug in the Torque support - we were trying to connect to the MOM again, which would hang (I imagine). I pushed a fix to the trunk (r29227) and scheduled it to come over to 1.7.3, if you want to try it again.
>>>>>> 
>>>>>> On Sep 22, 2013, at 4:21 PM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>>>>>> 
>>>>>>> Dear Ralph,
>>>>>>> 
>>>>>>> This is the output I get when I execute with the verbose option:
>>>>>>> 
>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive processing msg
>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive job launch command from [[23526,1],0]
>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive adding hosts
>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive calling spawn
>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive done processing commands
>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_job
>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm
>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm add new daemon [[23526,0],2]
>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm assigning new daemon [[23526,0],2] to node grsacc17/1-4
>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm add new daemon [[23526,0],3]
>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm assigning new daemon [[23526,0],3] to node grsacc17/0-5
>>>>>>> [grsacc20:21012] [[23526,0],0] plm:tm: launching vm
>>>>>>> [grsacc20:21012] [[23526,0],0] plm:tm: final top-level argv:
>>>>>>>     orted -mca ess tm -mca orte_ess_jobid 1541799936 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 4 -mca orte_hnp_uri "1541799936.0;tcp://192.168.222.20:49049" -mca plm_base_verbose 5
>>>>>>> [warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one event_base_loop can run on each event_base at once.
>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:orted_cmd sending orted_exit commands
>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive stop comm
>>>>>>> 
>>>>>>> Does that say anything?
>>>>>>> 
>>>>>>> Best,
>>>>>>> Suraj
>>>>>>> 
>>>>>>> On Sep 22, 2013, at 9:45 PM, Ralph Castain wrote:
>>>>>>> 
>>>>>>>> I'll still need to look at the intercomm_create issue, but I just tested both the trunk and the current 1.7.3 branch for "add-host", and both worked just fine. This was on my little test cluster, which only has rsh available - no Torque.
>>>>>>>> 
>>>>>>>> You might add "-mca plm_base_verbose 5" to your cmd line to get some debug output as to the problem.
>>>>>>>> 
>>>>>>>> On Sep 21, 2013, at 5:48 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>> 
>>>>>>>>> On Sep 21, 2013, at 4:54 PM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Dear all,
>>>>>>>>>> 
>>>>>>>>>> Really, thanks a lot for your efforts. I too downloaded the trunk to check whether it works for my case, and as of revision 29215 it works for the original case I reported. Although it works, I still see the following in the output. Does it mean anything?
>>>>>>>>>> 
>>>>>>>>>> [grsacc17][[13611,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create] [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13611,2],0]
>>>>>>>>> 
>>>>>>>>> Yes - it means we don't quite have this right yet :-(
>>>>>>>>> 
>>>>>>>>>> However, on another topic relevant to my use case, I have another problem to report. I am having problems using the "add-host" info with MPI_Comm_spawn() when MPI is compiled with support for the Torque resource manager. This problem is totally new in the 1.7 series; it worked perfectly up to 1.6.5.
>>>>>>>>>> 
>>>>>>>>>> Basically, I am working on implementing dynamic resource management facilities in the Torque/Maui batch system. Through a new tm call, an application can get new resources for a job.
>>>>>>>>> 
>>>>>>>>> FWIW: you'll find that we added an API to the orte RAS framework to support precisely that operation. It allows an application to request that we dynamically obtain additional resources during execution (e.g., as part of a Comm_spawn call via an info_key). We originally implemented this with Slurm, but you could add the calls into the Torque component as well if you like.
>>>>>>>>> 
>>>>>>>>> This is in the trunk now - it will come over to 1.7.4.
>>>>>>>>> 
>>>>>>>>>> I want to use MPI_Comm_spawn() to spawn new processes on the new hosts. With my extended Torque/Maui batch system, I was able to use the "add-host" info argument to MPI_Comm_spawn() to spawn new processes on these hosts without trouble. Since MPI and Torque refer to the hosts through node IDs, I made sure that Open MPI uses the correct node IDs for these new hosts. Until 1.6.5 this worked perfectly fine, except that due to the Intercomm_merge problem I could not really run a real application to completion.
>>>>>>>>>> 
>>>>>>>>>> While this is now fixed in the trunk, I found, however, that when using the "add-host" info argument, everything collapses after printing out the following error:
>>>>>>>>>> 
>>>>>>>>>> [warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one event_base_loop can run on each event_base at once.
>>>>>>>>> 
>>>>>>>>> I'll take a look - probably some stale code that hasn't been updated yet for async ORTE operations.
>>>>>>>>> 
>>>>>>>>>> And due to this, I am still not really able to run my application! I also compiled MPI without any Torque/PBS support and just used the "add-host" argument normally. Again, this worked perfectly in 1.6.5; in the 1.7 series it works, but only after printing out the following error:
>>>>>>>>>> 
>>>>>>>>>> [grsacc17][[13731,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create] [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]
>>>>>>>>>> [grsacc17][[13731,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create] [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]
>>>>>>>>> 
>>>>>>>>> Yeah, the 1.7 series doesn't have the reentrant test in it - so we "illegally" re-enter libevent. The error again means we don't have Intercomm_create correct just yet.
>>>>>>>>> 
>>>>>>>>> I'll see what I can do about this and get back to you.
>>>>>>>>> 
>>>>>>>>>> In short, with PBS/Torque support it fails, and without PBS/Torque support it runs, after spitting out the lines above.
>>>>>>>>>> 
>>>>>>>>>> I would really appreciate some help on this, since I need these features to actually test my case, and (at least in my short experience) no other MPI implementation seems friendly to such dynamic scenarios.
>>>>>>>>>> 
>>>>>>>>>> Thanks a lot!
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> Suraj
>>>>>>>>>> 
>>>>>>>>>> On Sep 20, 2013, at 4:58 PM, Jeff Squyres (jsquyres) wrote:
>>>>>>>>>> 
>>>>>>>>>>> Just to close my end of this loop: as of trunk r29213, it all works for me. Thanks!
>>>>>>>>>>> 
>>>>>>>>>>> On Sep 18, 2013, at 12:52 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Thanks George - much appreciated
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sep 18, 2013, at 9:49 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> The test case was broken. I just pushed a fix.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> George.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sep 18, 2013, at 16:49, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hangs with any np > 1.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> However, I'm not sure whether that's an issue with the test or with the underlying implementation.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Sep 18, 2013, at 7:40 AM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Does it hang when you run with -np 4?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Sent from my phone. No type good.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Sep 18, 2013, at 4:10 PM, "Ralph Castain" <r...@open-mpi.org> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Strange - it works fine for me on my Mac. However, I see one difference - I only ran it with np=1.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Sep 18, 2013, at 2:22 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Sep 18, 2013, at 9:33 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 1. sm doesn't work between spawned processes. So you must have another network enabled.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I know :-).
>>>>>>>>>>>>>>>>> I have tcp available as well (OMPI will abort if you only run with sm,self, because the comm_spawn will fail with unreachable errors -- I just tested/proved this to myself).
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 2. Don't use the test case attached to my email - I left in an xterm-based spawn and the debugging, so it can't work without xterm support. Instead, try the test case from the trunk, the one committed by Ralph.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I didn't see any "xterm" strings in there, but ok. :-) I ran with orte/test/mpi/intercomm_create.c, and that hangs for me as well:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 4]
>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 5]
>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 6]
>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 7]
>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 4]
>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 5]
>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 6]
>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 7]
>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Similarly, on my Mac, it hangs with no output:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> George.
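(Note on George's point 1: this is why the runs earlier in this thread pass "-mca btl tcp,sm,self". Forcing tcp alongside sm/self on the same test would look something like

    mpirun -np 4 -mca btl tcp,sm,self ./intercomm_create

using the binary built above.)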
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Sep 18, 2013, at 07:53, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> George --
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> When I build the SVN trunk (r29201) on 64-bit Linux, your attached test case hangs:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 4]
>>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 5]
>>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 6]
>>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 7]
>>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 4]
>>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 5]
>>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 6]
>>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 7]
>>>>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On my Mac, it hangs without printing anything:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Sep 18, 2013, at 1:48 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Here is a quick (and definitely not the cleanest) patch that addresses the MPI_Intercomm issue at the MPI level. It should be applied after removal of 29166.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I also added the corrected test case, stressing the corner cases by doing barriers at every inter-comm creation and doing a clean disconnect.
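(In sketch form, the "barriers at every inter-comm creation and a clean disconnect" discipline George describes amounts to something like the following - illustrative C only, where "local", "peer", and the tag value stand in for whatever communicators and tag the actual test uses at each step:

-----
MPI_Comm inter;

/* Create the inter-communicator... */
MPI_Intercomm_create(local, 0, peer, 0, 201, &inter);

/* ...then synchronize across it immediately, so a failure or hang
 * surfaces at the creation step rather than later in the run... */
MPI_Barrier(inter);

/* ...and tear it down explicitly rather than leaving the cleanup
 * to MPI_Finalize. */
MPI_Comm_disconnect(&inter);
-----
)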
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>>> Jeff Squyres
>>>>>>>>>>>>>>>>>>> jsquy...@cisco.com
>>>>>>>>>>>>>>>>>>> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel