Hi Ralph,

So here is what I do. I spawn just a single process on a new node that is not in the $PBS_NODEFILE list. My $PBS_NODEFILE contains:

grsacc20
grsacc19

I start the app with just 2 processes, so each host gets one process, and they are successfully spawned through Torque (via tm_spawn()). Open MPI stores grsacc20 and grsacc19 in its list of hosts with launchids 0 and 1, respectively. I then use the add-host info key to spawn ONE new process on a new host, "grsacc18", through MPI_Comm_spawn. From what I saw in the code, the launchid of this new host is -1, since Open MPI does not know about it and it is not in the $PBS_NODEFILE. Without a valid launchid Torque would not know where to spawn, so I retrieve the correct launchid of this host from a file just before tm_spawn() and use that. This is the only modification I made to Open MPI. The host "grsacc18" thus gets a new launchid = 2, which is used to spawn the process through Torque. This worked perfectly up to 1.6.5.
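In case it helps to see the idea concretely, here is a rough sketch (not my actual diff). Only tm_spawn(), tm_node_id, tm_task_id, and tm_event_t are the real Torque TM API; the lookup file, its path/format, and both helper functions are placeholders made up for illustration:

    #include <stdio.h>
    #include <string.h>
    #include <tm.h>   /* Torque task-management API: tm_spawn(), tm_node_id, ... */

    /* Hypothetical helper: map a dynamically added hostname to the
     * launchid (tm_node_id) that my extended Torque assigned to it.
     * The file is written by the extended MOM; path/format are made up. */
    static tm_node_id lookup_launchid(const char *host)
    {
        char name[256];
        int id;
        FILE *fp = fopen("/tmp/added_hosts.map", "r");   /* placeholder path */

        if (NULL != fp) {
            while (2 == fscanf(fp, "%255s %d", name, &id)) {
                if (0 == strcmp(name, host)) {
                    fclose(fp);
                    return (tm_node_id)id;   /* e.g. grsacc18 -> 2 */
                }
            }
            fclose(fp);
        }
        return (tm_node_id)-1;   /* still unknown */
    }

    /* Sketch of where this hooks in: just before the launcher calls
     * tm_spawn() on the orted command line that Open MPI has built. */
    static int spawn_orted(const char *host, tm_node_id node_id,
                           int argc, char **argv, char **env,
                           tm_task_id *tid, tm_event_t *event)
    {
        if ((tm_node_id)-1 == node_id) {       /* host not in $PBS_NODEFILE */
            node_id = lookup_launchid(host);   /* my one modification */
        }
        return tm_spawn(argc, argv, env, node_id, tid, event);
    }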
As the attached outputs show, although I spawn only a single process on grsacc18, Open MPI also tries to spawn something on grsacc19, and I have no clue why. Of course, without PBS/Torque involved, everything works fine.

I have attached the simple test code (a rough sketch of its core follows below the attachment). Please modify the hostnames and executable path before use.

Best,
Suraj
addhosttest.c
Description: Binary data
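For anyone who cannot open the attachment, the core of the test is roughly the following (a sketch only; the actual file may differ, and as noted above the hostname and executable path are placeholders to adapt):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm parent, intercomm;

        MPI_Init(&argc, &argv);
        MPI_Comm_get_parent(&parent);

        if (MPI_COMM_NULL == parent) {
            /* Parent side: started under Torque with -np 2
             * (one process each on grsacc20 and grsacc19). */
            MPI_Info info;
            MPI_Info_create(&info);
            MPI_Info_set(info, "add-host", "grsacc18");  /* host outside $PBS_NODEFILE */
            MPI_Comm_spawn("./addhosttest", MPI_ARGV_NULL, 1 /* ONE process */,
                           info, 0, MPI_COMM_WORLD, &intercomm,
                           MPI_ERRCODES_IGNORE);
            MPI_Info_free(&info);
        } else {
            /* Child side: the single process spawned on grsacc18. */
            printf("child running\n");
        }

        MPI_Finalize();
        return 0;
    }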
On Sep 24, 2013, at 4:59 PM, Ralph Castain wrote:

> I'm going to need a little help here. The problem is that you launch two new daemons, and one of them exits immediately because it thinks it lost the connection back to mpirun - before it even gets a chance to create it.
>
> Can you give me a little more info as to exactly what you are doing? Perhaps send me your test code?
>
> On Sep 24, 2013, at 7:48 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>
>> Hi Ralph,
>>
>> Output attached in a file.
>> Thanks a lot!
>>
>> Best,
>> Suraj
>>
>> <output.rtf>
>>
>> On Sep 24, 2013, at 4:11 PM, Ralph Castain wrote:
>>
>>> Afraid I don't see the problem offhand - can you add the following to your cmd line?
>>>
>>> -mca state_base_verbose 10 -mca errmgr_base_verbose 10
>>>
>>> Thanks
>>> Ralph
>>>
>>> On Sep 24, 2013, at 6:35 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>>>
>>>> Hi Ralph,
>>>>
>>>> I have always gotten this output from any MPI job that ran on our nodes. There seems to be a problem somewhere, but it never stopped the applications from running. Anyway, I ran it again now with only tcp and excluded InfiniBand, and I get the same output - except that this time the openib-related error is gone. Printing the log again:
>>>>
>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive job launch command from [[6160,1],0]
>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive adding hosts
>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive calling spawn
>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
>>>> [grsacc20:04578] [[6160,0],0] plm:base:setup_job
>>>> [grsacc20:04578] [[6160,0],0] plm:base:setup_vm
>>>> [grsacc20:04578] [[6160,0],0] plm:base:setup_vm add new daemon [[6160,0],2]
>>>> [grsacc20:04578] [[6160,0],0] plm:base:setup_vm assigning new daemon [[6160,0],2] to node grsacc18
>>>> [grsacc20:04578] [[6160,0],0] plm:tm: launching vm
>>>> [grsacc20:04578] [[6160,0],0] plm:tm: final top-level argv:
>>>>     orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 3 -mca orte_hnp_uri "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl tcp,sm,self
>>>> [grsacc20:04578] [[6160,0],0] plm:tm: launching on node grsacc19
>>>> [grsacc20:04578] [[6160,0],0] plm:tm: executing:
>>>>     orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl tcp,sm,self
>>>> [grsacc20:04578] [[6160,0],0] plm:tm: launching on node grsacc18
>>>> [grsacc20:04578] [[6160,0],0] plm:tm: executing:
>>>>     orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl tcp,sm,self
>>>> [grsacc20:04578] [[6160,0],0] plm:tm:launch: finished spawning orteds
>>>> [grsacc19:28821] mca:base:select:( plm) Querying component [rsh]
>>>> [grsacc19:28821] [[6160,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
>>>> [grsacc19:28821] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>> [grsacc19:28821] mca:base:select:( plm) Selected component [rsh]
>>>> [grsacc19:28821] [[6160,0],1] plm:rsh_setup on agent ssh : rsh path NULL
>>>> [grsacc19:28821] [[6160,0],1] plm:base:receive start comm
>>>> [grsacc19:28821] [[6160,0],1] plm:base:receive stop comm
>>>> [grsacc18:16717] mca:base:select:( plm) Querying component [rsh]
>>>> [grsacc18:16717] [[6160,0],2] plm:rsh_lookup on agent ssh : rsh path NULL
>>>> [grsacc18:16717] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>> [grsacc18:16717] mca:base:select:( plm) Selected component [rsh]
>>>> [grsacc18:16717] [[6160,0],2] plm:rsh_setup on agent ssh : rsh path NULL
>>>> [grsacc18:16717] [[6160,0],2] plm:base:receive start comm
>>>> [grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch from daemon [[6160,0],2]
>>>> [grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch from daemon [[6160,0],2] on node grsacc18
>>>> [grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch completed for daemon [[6160,0],2] at contact 403701760.2;tcp://192.168.222.18:44229
>>>> [grsacc20:04578] [[6160,0],0] plm:base:launch_apps for job [6160,2]
>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive update proc state command from [[6160,0],2]
>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive got update_proc_state for job [6160,2]
>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive got update_proc_state for vpid 0 state RUNNING exit_code 0
>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
>>>> [grsacc20:04578] [[6160,0],0] plm:base:launch wiring up iof for job [6160,2]
>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
>>>> [grsacc20:04578] [[6160,0],0] plm:base:launch registered event
>>>> [grsacc20:04578] [[6160,0],0] plm:base:launch sending dyn release of job [6160,2] to [[6160,1],0]
>>>> [grsacc20:04578] [[6160,0],0] plm:base:orted_cmd sending orted_exit commands
>>>> [grsacc19:28815] [[6160,0],1] plm:base:receive stop comm
>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive stop comm
>>>> -bash-4.1$ [grsacc18:16717] [[6160,0],2] plm:base:receive stop comm
>>>>
>>>> Best,
>>>> Suraj
>>>>
>>>> On Sep 24, 2013, at 3:24 PM, Ralph Castain wrote:
>>>>
>>>>> Your output shows that it launched your apps, but they exited. The error is reported here, though it appears we aren't flushing the message out before exiting, due to a race condition:
>>>>>
>>>>>> [grsacc20:04511] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
>>>>>
>>>>> Here is the full text:
>>>>>
>>>>> [no active ports found]
>>>>> WARNING: There is at least non-excluded one OpenFabrics device found, but there are no active ports detected (or Open MPI was unable to use them). This is most certainly not what you wanted. Check your cables, subnet manager configuration, etc. The openib BTL will be ignored for this job.
>>>>>
>>>>> Local host: %s
>>>>>
>>>>> Looks like at least one node being used doesn't have an active InfiniBand port on it?
>>>>>
>>>>>
>>>>> On Sep 24, 2013, at 6:11 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>>>>>
>>>>>> Hi Ralph,
>>>>>>
>>>>>> I tested it with the trunk r29228 and I still have the following problem. Now it even spawns the daemon on the new node through Torque, but then suddenly quits. The following is the output. Can you please have a look?
>>>>>>
>>>>>> Thanks
>>>>>> Suraj
>>>>>>
>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive job launch command from [[6253,1],0]
>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive adding hosts
>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive calling spawn
>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_job
>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm
>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm add new daemon [[6253,0],2]
>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm assigning new daemon [[6253,0],2] to node grsacc18
>>>>>> [grsacc20:04511] [[6253,0],0] plm:tm: launching vm
>>>>>> [grsacc20:04511] [[6253,0],0] plm:tm: final top-level argv:
>>>>>>     orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 3 -mca orte_hnp_uri "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
>>>>>> [grsacc20:04511] [[6253,0],0] plm:tm: launching on node grsacc19
>>>>>> [grsacc20:04511] [[6253,0],0] plm:tm: executing:
>>>>>>     orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
>>>>>> [grsacc20:04511] [[6253,0],0] plm:tm: launching on node grsacc18
>>>>>> [grsacc20:04511] [[6253,0],0] plm:tm: executing:
>>>>>>     orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
>>>>>> [grsacc20:04511] [[6253,0],0] plm:tm:launch: finished spawning orteds
>>>>>> [grsacc19:28754] mca:base:select:( plm) Querying component [rsh]
>>>>>> [grsacc19:28754] [[6253,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
>>>>>> [grsacc19:28754] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>>>> [grsacc19:28754] mca:base:select:( plm) Selected component [rsh]
>>>>>> [grsacc19:28754] [[6253,0],1] plm:rsh_setup on agent ssh : rsh path NULL
>>>>>> [grsacc19:28754] [[6253,0],1] plm:base:receive start comm
>>>>>> [grsacc19:28754] [[6253,0],1] plm:base:receive stop comm
>>>>>> [grsacc18:16648] mca:base:select:( plm) Querying component [rsh]
>>>>>> [grsacc18:16648] [[6253,0],2] plm:rsh_lookup on agent ssh : rsh path NULL
>>>>>> [grsacc18:16648] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>>>> [grsacc18:16648] mca:base:select:( plm) Selected component [rsh]
>>>>>> [grsacc18:16648] [[6253,0],2] plm:rsh_setup on agent ssh : rsh path NULL
>>>>>> [grsacc18:16648] [[6253,0],2] plm:base:receive start comm
>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch from daemon [[6253,0],2]
>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch from daemon [[6253,0],2] on node grsacc18
>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch completed for daemon [[6253,0],2] at contact 409796608.2;tcp://192.168.222.18:47974
>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:launch_apps for job [6253,2]
>>>>>> [grsacc20:04511] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
>>>>>> [grsacc20:04511] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>> [grsacc20:04511] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive update proc state command from [[6253,0],2]
>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive got update_proc_state for job [6253,2]
>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive got update_proc_state for vpid 0 state RUNNING exit_code 0
>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:launch wiring up iof for job [6253,2]
>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:launch registered event
>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:launch sending dyn release of job [6253,2] to [[6253,1],0]
>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_cmd sending orted_exit commands
>>>>>> [grsacc19:28747] [[6253,0],1] plm:base:receive stop comm
>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive stop comm
>>>>>> -bash-4.1$ [grsacc18:16648] [[6253,0],2] plm:base:receive stop comm
>>>>>>
>>>>>> On Sep 23, 2013, at 1:55 AM, Ralph Castain wrote:
>>>>>>
>>>>>>> Found a bug in the Torque support - we were trying to connect to the MOM again, which would hang (I imagine). I pushed a fix to the trunk (r29227) and scheduled it to come to 1.7.3 if you want to try it again.
>>>>>>>
>>>>>>>
>>>>>>> On Sep 22, 2013, at 4:21 PM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Dear Ralph,
>>>>>>>>
>>>>>>>> This is the output I get when I execute with the verbose option.
>>>>>>>>
>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive processing msg
>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive job launch command from [[23526,1],0]
>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive adding hosts
>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive calling spawn
>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive done processing commands
>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_job
>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm
>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm add new daemon [[23526,0],2]
>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm assigning new daemon [[23526,0],2] to node grsacc17/1-4
>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm add new daemon [[23526,0],3]
>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm assigning new daemon [[23526,0],3] to node grsacc17/0-5
>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:tm: launching vm
>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:tm: final top-level argv:
>>>>>>>>     orted -mca ess tm -mca orte_ess_jobid 1541799936 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 4 -mca orte_hnp_uri "1541799936.0;tcp://192.168.222.20:49049" -mca plm_base_verbose 5
>>>>>>>> [warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one event_base_loop can run on each event_base at once.
>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:orted_cmd sending orted_exit commands
>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive stop comm
>>>>>>>>
>>>>>>>> Says something?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Suraj
>>>>>>>>
>>>>>>>> On Sep 22, 2013, at 9:45 PM, Ralph Castain wrote:
>>>>>>>>
>>>>>>>>> I'll still need to look at the intercomm_create issue, but I just tested both the trunk and the current 1.7.3 branch for "add-host", and both worked just fine. This was on my little test cluster, which only has rsh available - no Torque.
>>>>>>>>>
>>>>>>>>> You might add "-mca plm_base_verbose 5" to your cmd line to get some debug output as to the problem.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sep 21, 2013, at 5:48 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sep 21, 2013, at 4:54 PM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Dear all,
>>>>>>>>>>>
>>>>>>>>>>> Really, thanks a lot for your efforts. I too downloaded the trunk to check whether it works for my case, and as of revision 29215 it works for the original case I reported. Although it works, I still see the following in the output. Does it mean anything?
>>>>>>>>>>>
>>>>>>>>>>> [grsacc17][[13611,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create] [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13611,2],0]
>>>>>>>>>>
>>>>>>>>>> Yes - it means we don't quite have this right yet :-(
>>>>>>>>>>
>>>>>>>>>>> However, on another topic relevant to my use case, I have another problem to report. I am having problems using the "add-host" info with MPI_Comm_spawn() when MPI is compiled with support for the Torque resource manager. This problem is totally new in the 1.7 series; it worked perfectly up to 1.6.5.
>>>>>>>>>>>
>>>>>>>>>>> Basically, I am working on implementing dynamic resource management facilities in the Torque/Maui batch system. Through a new tm call, an application can get new resources for a job.
>>>>>>>>>>
>>>>>>>>>> FWIW: you'll find that we added an API to the orte RAS framework to support precisely that operation. It allows an application to request that we dynamically obtain additional resources during execution (e.g., as part of a Comm_spawn call via an info_key). We originally implemented this with Slurm, but you could add the calls into the Torque component as well if you like.
>>>>>>>>>>
>>>>>>>>>> This is in the trunk now - will come over to 1.7.4
>>>>>>>>>>
>>>>>>>>>>> I want to use MPI_Comm_spawn() to spawn new processes on the new hosts. With my extended Torque/Maui batch system, I was able to use the "add-host" info argument to MPI_Comm_spawn() to spawn new processes on these hosts. Since MPI and Torque refer to the hosts through nodeids, I made sure that Open MPI uses the correct nodeids for these new hosts.
>>>>>>>>>>> Until 1.6.5 this worked perfectly fine, except that due to the Intercomm_merge problem I could not really run a real application to completion.
>>>>>>>>>>>
>>>>>>>>>>> While this is now fixed in the trunk, I found, however, that when using the "add-host" info argument, everything collapses after printing out the following error.
>>>>>>>>>>>
>>>>>>>>>>> [warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one event_base_loop can run on each event_base at once.
>>>>>>>>>>
>>>>>>>>>> I'll take a look - probably some stale code that hasn't been updated yet for async ORTE operations
>>>>>>>>>>
>>>>>>>>>>> And due to this, I am still not really able to run my application! I also compiled MPI without any Torque/PBS support and just used the "add-host" argument normally. Again, this worked perfectly in 1.6.5. But in the 1.7 series, it works, but only after printing out the following error.
>>>>>>>>>>>
>>>>>>>>>>> [grsacc17][[13731,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create] [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]
>>>>>>>>>>> [grsacc17][[13731,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create] [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]
>>>>>>>>>>
>>>>>>>>>> Yeah, the 1.7 series doesn't have the reentrant test in it - so we "illegally" re-enter libevent. The error again means we don't have Intercomm_create correct just yet.
>>>>>>>>>>
>>>>>>>>>> I'll see what I can do about this and get back to you
>>>>>>>>>>
>>>>>>>>>>> In short, with pbs/torque support it fails, and without pbs/torque support it runs after spitting out the above lines.
>>>>>>>>>>>
>>>>>>>>>>> I would really appreciate some help on this, since I need these features to actually test my case and (at least in my short experience) no other MPI implementation seems friendly to such dynamic scenarios.
>>>>>>>>>>>
>>>>>>>>>>> Thanks a lot!
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Suraj
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sep 20, 2013, at 4:58 PM, Jeff Squyres (jsquyres) wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Just to close my end of this loop: as of trunk r29213, it all works for me. Thanks!
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sep 18, 2013, at 12:52 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks George - much appreciated
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sep 18, 2013, at 9:49 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The test case was broken. I just pushed a fix.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> George.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sep 18, 2013, at 16:49 , Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hangs with any np > 1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However, I'm not sure if that's an issue with the test vs the underlying implementation
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sep 18, 2013, at 7:40 AM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Does it hang when you run with -np 4?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sent from my phone. No type good.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sep 18, 2013, at 4:10 PM, "Ralph Castain" <r...@open-mpi.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Strange - it works fine for me on my Mac. However, I see one difference - I only run it with np=1
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sep 18, 2013, at 2:22 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sep 18, 2013, at 9:33 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1. sm doesn't work between spawned processes. So you must have another network enabled.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I know :-). I have tcp available as well (OMPI will abort if you only run with sm,self because the comm_spawn will fail with unreachable errors -- I just tested/proved this to myself).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 2. Don't use the test case attached to my email; I left in an xterm-based spawn and the debugging, so it can't work without xterm support. Instead, try the test case from the trunk, the one committed by Ralph.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I didn't see any "xterm" strings in there, but ok. :-) I ran with orte/test/mpi/intercomm_create.c, and that hangs for me as well:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 4]
>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 5]
>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 6]
>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 7]
>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 4]
>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 5]
>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 6]
>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 7]
>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Similarly, on my Mac, it hangs with no output:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> George.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sep 18, 2013, at 07:53 , "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> George --
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> When I build the SVN trunk (r29201) on 64-bit Linux, your attached test case hangs:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 4]
>>>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 5]
>>>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 6]
>>>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 7]
>>>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 4]
>>>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 5]
>>>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 6]
>>>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 7]
>>>>>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On my Mac, it hangs without printing anything:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Sep 18, 2013, at 1:48 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Here is a quick (and definitely not the cleanest) patch that addresses the MPI_Intercomm issue at the MPI level. It should be applied after removal of r29166.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I also added the corrected test case stressing the corner cases by doing barriers at every inter-comm creation and doing a clean disconnect.