What I find puzzling is that I don't see any output indicating that you went 
through the Torque launcher to launch the daemons - not a peep of debug output. 
This makes me suspicious that something else is going on. Are you sure you sent 
me all the output?

Try adding -novm to your mpirun command line and let's see if that mode works.

On Sep 24, 2013, at 9:06 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> 
wrote:

> Hi Ralph,
> 
> So here is what I do. I spawn just a "single" process on a new node which is 
> basically not in the $PBS_NODEFILE list. 
> My $PBS_NODEFILE list contains
> grsacc20
> grsacc19
> 
> I then start the app with just 2 processes. So each host gets one process, and 
> they are successfully spawned through Torque (via tm_spawn()). MPI 
> would have stored grsacc20 and grsacc19 in its list of hosts with launchid 0 
> and 1, respectively. 
> I then use the add-host info and spawn ONE new process on a new host 
> "grsacc18" through MPI_Comm_spawn. From what I saw in the code, the launchid 
> of this new host is -1 since openmpi does not know about this and it is not 
> available in the $PBS_NODEFILE. Since, without the launchid, Torque would not 
> know where to spawn, I just retrieve the correct launchid of this host from a 
> file just before tm_spawn() and use this launchid. This is the only 
> modification that I made to openmpi. 
> So, the host "grsacc18" will have a new launchid = 2 and will be used to 
> spawn the process through torque. This worked perfectly until 1.6.5. 
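
A minimal sketch of the launchid lookup described above, not the actual patch. It assumes a hypothetical mapping of whitespace-separated "hostname launchid" pairs (the real file format is not shown in this thread) and a hypothetical helper name, lookup_launchid. It returns -1 for an unlisted host, the same value reported above for a host Open MPI does not know about.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: scan a table of whitespace-separated
 * "hostname launchid" pairs (for example, the contents of the side file
 * read before tm_spawn()) and return the launchid for `host`.
 * Returns -1 when the host is not listed. */
int lookup_launchid(const char *table, const char *host)
{
    const char *p = table;
    char name[256];
    int id;
    int consumed;

    while (*p != '\0') {
        consumed = 0;
        if (sscanf(p, " %255s %d%n", name, &id, &consumed) == 2) {
            if (strcmp(name, host) == 0) {
                return id;
            }
            p += consumed;            /* move past this pair */
        } else {
            /* Entry did not parse: skip to the start of the next line. */
            const char *nl = strchr(p, '\n');
            if (nl == NULL) {
                break;
            }
            p = nl + 1;
        }
    }
    return -1;
}
```

With a table listing grsacc20 as 0, grsacc19 as 1 and grsacc18 as 2, looking up grsacc18 gives 2, which is the value that would then be handed to tm_spawn() in the setup described here.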
> 
> As we see here from the outputs, although I spawn only a single process on 
> grsacc18, I too have no clue why openmpi tries to spawn something on 
> grsacc19. Of course, without pbs/torque involved, everything works fine. 
> I have attached the simple test code. Please modify hostnames and executable 
> path before use. 
> 
> Best,
> Suraj
> 
> <addhosttest.c>
> 
> 
> On Sep 24, 2013, at 4:59 PM, Ralph Castain wrote:
> 
>> I'm going to need a little help here. The problem is that you launch two new 
>> daemons, and one of them exits immediately because it thinks it lost the 
>> connection back to mpirun - before it even gets a chance to create it.
>> 
>> Can you give me a little more info as to exactly what you are doing? Perhaps 
>> send me your test code?
>> 
>> On Sep 24, 2013, at 7:48 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> 
>> wrote:
>> 
>>> Hi Ralph,
>>> 
>>> Output attached in a file.
>>> Thanks a lot!
>>> 
>>> Best,
>>> Suraj
>>> 
>>> <output.rtf>
>>> 
>>> On Sep 24, 2013, at 4:11 PM, Ralph Castain wrote:
>>> 
>>>> Afraid I don't see the problem offhand - can you add the following to your 
>>>> cmd line?
>>>> 
>>>> -mca state_base_verbose 10 -mca errmgr_base_verbose 10
>>>> 
>>>> Thanks
>>>> Ralph
>>>> 
>>>> On Sep 24, 2013, at 6:35 AM, Suraj Prabhakaran 
>>>> <suraj.prabhaka...@gmail.com> wrote:
>>>> 
>>>>> Hi Ralph, 
>>>>> 
>>>>> I always got this output from any MPI job that ran on our nodes. There 
>>>>> seems to be a problem somewhere but it never stopped the applications 
>>>>> from running. But anyway, I ran it again now with only tcp and excluded 
>>>>> the infiniband and I get the same output again. Except that this time, 
>>>>> the error related to this openib is not there anymore. Printing out the 
>>>>> log again. 
>>>>> 
>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive job launch command from 
>>>>> [[6160,1],0]
>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive adding hosts
>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive calling spawn
>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
>>>>> [grsacc20:04578] [[6160,0],0] plm:base:setup_job
>>>>> [grsacc20:04578] [[6160,0],0] plm:base:setup_vm
>>>>> [grsacc20:04578] [[6160,0],0] plm:base:setup_vm add new daemon 
>>>>> [[6160,0],2]
>>>>> [grsacc20:04578] [[6160,0],0] plm:base:setup_vm assigning new daemon 
>>>>> [[6160,0],2] to node grsacc18
>>>>> [grsacc20:04578] [[6160,0],0] plm:tm: launching vm
>>>>> [grsacc20:04578] [[6160,0],0] plm:tm: final top-level argv:
>>>>>   orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid 
>>>>> <template> -mca orte_ess_num_procs 3 -mca orte_hnp_uri 
>>>>> "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl 
>>>>> tcp,sm,self
>>>>> [grsacc20:04578] [[6160,0],0] plm:tm: launching on node grsacc19
>>>>> [grsacc20:04578] [[6160,0],0] plm:tm: executing:
>>>>>   orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid 1 
>>>>> -mca orte_ess_num_procs 3 -mca orte_hnp_uri 
>>>>> "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl 
>>>>> tcp,sm,self
>>>>> [grsacc20:04578] [[6160,0],0] plm:tm: launching on node grsacc18
>>>>> [grsacc20:04578] [[6160,0],0] plm:tm: executing:
>>>>>   orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid 2 
>>>>> -mca orte_ess_num_procs 3 -mca orte_hnp_uri 
>>>>> "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl 
>>>>> tcp,sm,self
>>>>> [grsacc20:04578] [[6160,0],0] plm:tm:launch: finished spawning orteds
>>>>> [grsacc19:28821] mca:base:select:(  plm) Querying component [rsh]
>>>>> [grsacc19:28821] [[6160,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
>>>>> [grsacc19:28821] mca:base:select:(  plm) Query of component [rsh] set 
>>>>> priority to 10
>>>>> [grsacc19:28821] mca:base:select:(  plm) Selected component [rsh]
>>>>> [grsacc19:28821] [[6160,0],1] plm:rsh_setup on agent ssh : rsh path NULL
>>>>> [grsacc19:28821] [[6160,0],1] plm:base:receive start comm
>>>>> [grsacc19:28821] [[6160,0],1] plm:base:receive stop comm
>>>>> [grsacc18:16717] mca:base:select:(  plm) Querying component [rsh]
>>>>> [grsacc18:16717] [[6160,0],2] plm:rsh_lookup on agent ssh : rsh path NULL
>>>>> [grsacc18:16717] mca:base:select:(  plm) Query of component [rsh] set 
>>>>> priority to 10
>>>>> [grsacc18:16717] mca:base:select:(  plm) Selected component [rsh]
>>>>> [grsacc18:16717] [[6160,0],2] plm:rsh_setup on agent ssh : rsh path NULL
>>>>> [grsacc18:16717] [[6160,0],2] plm:base:receive start comm
>>>>> [grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch from daemon 
>>>>> [[6160,0],2]
>>>>> [grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch from daemon 
>>>>> [[6160,0],2] on node grsacc18
>>>>> [grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch completed for 
>>>>> daemon [[6160,0],2] at contact 403701760.2;tcp://192.168.222.18:44229
>>>>> [grsacc20:04578] [[6160,0],0] plm:base:launch_apps for job [6160,2]
>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive update proc state command 
>>>>> from [[6160,0],2]
>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive got update_proc_state for 
>>>>> job [6160,2]
>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive got update_proc_state for 
>>>>> vpid 0 state RUNNING exit_code 0
>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
>>>>> [grsacc20:04578] [[6160,0],0] plm:base:launch wiring up iof for job 
>>>>> [6160,2]
>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
>>>>> [grsacc20:04578] [[6160,0],0] plm:base:launch registered event
>>>>> [grsacc20:04578] [[6160,0],0] plm:base:launch sending dyn release of job 
>>>>> [6160,2] to [[6160,1],0]
>>>>> [grsacc20:04578] [[6160,0],0] plm:base:orted_cmd sending orted_exit 
>>>>> commands
>>>>> [grsacc19:28815] [[6160,0],1] plm:base:receive stop comm
>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive stop comm
>>>>> -bash-4.1$ [grsacc18:16717] [[6160,0],2] plm:base:receive stop comm
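
A reading aid for the log above: the orte_ess_jobid value and the bracketed process names look like the same identifier in two forms, since 403701760 = 6160 * 65536. The sketch below assumes, from these numbers alone and not from the ORTE source, that the upper 16 bits of the 32-bit jobid hold the job family and the lower 16 bits the local job number. The function names are hypothetical.

```c
#include <stdint.h>

/* Assumed layout (inferred from the log, not taken from ORTE headers):
 * jobid = (job family << 16) | local job number. */
uint32_t make_jobid(uint32_t family, uint32_t local)
{
    return (family << 16) | (local & 0xFFFFu);
}

/* Upper 16 bits: the first number in a name such as [[6160,0],0]. */
uint32_t job_family(uint32_t jobid)
{
    return jobid >> 16;
}

/* Lower 16 bits: 0 for the daemon job, 1 and 2 for the application jobs. */
uint32_t local_jobid(uint32_t jobid)
{
    return jobid & 0xFFFFu;
}
```

Under that assumption, the spawned job [6160,2] would correspond to the numeric jobid 403701762, and the daemons all belong to [6160,0], i.e. 403701760.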
>>>>> 
>>>>> Best,
>>>>> Suraj
>>>>> On Sep 24, 2013, at 3:24 PM, Ralph Castain wrote:
>>>>> 
>>>>>> Your output shows that it launched your apps, but they exited. The error 
>>>>>> is reported here, though it appears we aren't flushing the message out 
>>>>>> before exiting due to a race condition:
>>>>>> 
>>>>>>> [grsacc20:04511] 1 more process has sent help message 
>>>>>>> help-mpi-btl-openib.txt / no active ports found
>>>>>> 
>>>>>> Here is the full text:
>>>>>> [no active ports found]
>>>>>> WARNING: There is at least non-excluded one OpenFabrics device found,
>>>>>> but there are no active ports detected (or Open MPI was unable to use
>>>>>> them).  This is most certainly not what you wanted.  Check your
>>>>>> cables, subnet manager configuration, etc.  The openib BTL will be
>>>>>> ignored for this job.
>>>>>> 
>>>>>> Local host: %s
>>>>>> 
>>>>>> Looks like at least one node being used doesn't have an active 
>>>>>> Infiniband port on it?
>>>>>> 
>>>>>> 
>>>>>> On Sep 24, 2013, at 6:11 AM, Suraj Prabhakaran 
>>>>>> <suraj.prabhaka...@gmail.com> wrote:
>>>>>> 
>>>>>>> Hi Ralph,
>>>>>>> 
>>>>>>> I tested it with the trunk r29228. I still have the following problem. 
>>>>>>> Now, it even spawns the daemon on the new node through torque but then 
>>>>>>> suddenly quits. The following is the output. Can you please have a 
>>>>>>> look? 
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Suraj
>>>>>>> 
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive job launch command from 
>>>>>>> [[6253,1],0]
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive adding hosts
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive calling spawn
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_job
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm add new daemon 
>>>>>>> [[6253,0],2]
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm assigning new daemon 
>>>>>>> [[6253,0],2] to node grsacc18
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:tm: launching vm
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:tm: final top-level argv:
>>>>>>>         orted -mca ess tm -mca orte_ess_jobid 409796608 -mca 
>>>>>>> orte_ess_vpid <template> -mca orte_ess_num_procs 3 -mca orte_hnp_uri 
>>>>>>> "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:tm: launching on node grsacc19
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:tm: executing:
>>>>>>>         orted -mca ess tm -mca orte_ess_jobid 409796608 -mca 
>>>>>>> orte_ess_vpid 1 -mca orte_ess_num_procs 3 -mca orte_hnp_uri 
>>>>>>> "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:tm: launching on node grsacc18
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:tm: executing:
>>>>>>>         orted -mca ess tm -mca orte_ess_jobid 409796608 -mca 
>>>>>>> orte_ess_vpid 2 -mca orte_ess_num_procs 3 -mca orte_hnp_uri 
>>>>>>> "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:tm:launch: finished spawning orteds
>>>>>>> [grsacc19:28754] mca:base:select:(  plm) Querying component [rsh]
>>>>>>> [grsacc19:28754] [[6253,0],1] plm:rsh_lookup on agent ssh : rsh path 
>>>>>>> NULL
>>>>>>> [grsacc19:28754] mca:base:select:(  plm) Query of component [rsh] set 
>>>>>>> priority to 10
>>>>>>> [grsacc19:28754] mca:base:select:(  plm) Selected component [rsh]
>>>>>>> [grsacc19:28754] [[6253,0],1] plm:rsh_setup on agent ssh : rsh path NULL
>>>>>>> [grsacc19:28754] [[6253,0],1] plm:base:receive start comm
>>>>>>> [grsacc19:28754] [[6253,0],1] plm:base:receive stop comm
>>>>>>> [grsacc18:16648] mca:base:select:(  plm) Querying component [rsh]
>>>>>>> [grsacc18:16648] [[6253,0],2] plm:rsh_lookup on agent ssh : rsh path 
>>>>>>> NULL
>>>>>>> [grsacc18:16648] mca:base:select:(  plm) Query of component [rsh] set 
>>>>>>> priority to 10
>>>>>>> [grsacc18:16648] mca:base:select:(  plm) Selected component [rsh]
>>>>>>> [grsacc18:16648] [[6253,0],2] plm:rsh_setup on agent ssh : rsh path NULL
>>>>>>> [grsacc18:16648] [[6253,0],2] plm:base:receive start comm
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch from daemon 
>>>>>>> [[6253,0],2]
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch from daemon 
>>>>>>> [[6253,0],2] on node grsacc18
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch completed 
>>>>>>> for daemon [[6253,0],2] at contact 
>>>>>>> 409796608.2;tcp://192.168.222.18:47974
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:launch_apps for job [6253,2]
>>>>>>> [grsacc20:04511] 1 more process has sent help message 
>>>>>>> help-mpi-btl-openib.txt / no active ports found
>>>>>>> [grsacc20:04511] Set MCA parameter "orte_base_help_aggregate" to 0 to 
>>>>>>> see all help / error messages
>>>>>>> [grsacc20:04511] 1 more process has sent help message 
>>>>>>> help-mpi-btl-base.txt / btl:no-nics
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive update proc state 
>>>>>>> command from [[6253,0],2]
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive got update_proc_state 
>>>>>>> for job [6253,2]
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive got update_proc_state 
>>>>>>> for vpid 0 state RUNNING exit_code 0
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:launch wiring up iof for job 
>>>>>>> [6253,2]
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:launch registered event
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:launch sending dyn release of 
>>>>>>> job [6253,2] to [[6253,1],0]
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_cmd sending orted_exit 
>>>>>>> commands
>>>>>>> [grsacc19:28747] [[6253,0],1] plm:base:receive stop comm
>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive stop comm
>>>>>>> -bash-4.1$ [grsacc18:16648] [[6253,0],2] plm:base:receive stop comm
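
Since the daemons connect back to mpirun using the orte_hnp_uri value, it can help to pull the address and port out of it and check that they are reachable from the new node. The following is a minimal sketch that assumes only the shape visible in this log ("jobid.vpid;tcp://host:port", without the surrounding quotes). The struct and function names are hypothetical, and real URIs may carry additional fields that this does not handle.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the pieces of a contact string such as
 * 409796608.0;tcp://192.168.222.20:53097 */
struct hnp_uri {
    unsigned jobid;
    unsigned vpid;
    char host[64];
    int port;
};

/* Returns 0 if all four fields were parsed, -1 otherwise.
 * On failure the contents of *out are unspecified. */
int parse_hnp_uri(const char *uri, struct hnp_uri *out)
{
    int n = sscanf(uri, "%u.%u;tcp://%63[^:]:%d",
                   &out->jobid, &out->vpid, out->host, &out->port);
    return (n == 4) ? 0 : -1;
}
```

The same shape appears in the orted_report_launch line for the new daemon, so the helper could be applied to that contact string as well.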
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Sep 23, 2013, at 1:55 AM, Ralph Castain wrote:
>>>>>>> 
>>>>>>>> Found a bug in the Torque support - we were trying to connect to the 
>>>>>>>> MOM again, which would hang (I imagine). I pushed a fix to the trunk 
>>>>>>>> (r29227) and scheduled it to come to 1.7.3 if you want to try it again.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sep 22, 2013, at 4:21 PM, Suraj Prabhakaran 
>>>>>>>> <suraj.prabhaka...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Dear Ralph,
>>>>>>>>> 
>>>>>>>>> This is the output I get when I execute with the verbose option.
>>>>>>>>> 
>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive processing msg
>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive job launch command 
>>>>>>>>> from [[23526,1],0]
>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive adding hosts
>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive calling spawn
>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive done processing 
>>>>>>>>> commands
>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_job
>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm
>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm add new daemon 
>>>>>>>>> [[23526,0],2]
>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm assigning new daemon 
>>>>>>>>> [[23526,0],2] to node grsacc17/1-4
>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm add new daemon 
>>>>>>>>> [[23526,0],3]
>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm assigning new daemon 
>>>>>>>>> [[23526,0],3] to node grsacc17/0-5
>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:tm: launching vm
>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:tm: final top-level argv:
>>>>>>>>>       orted -mca ess tm -mca orte_ess_jobid 1541799936 -mca 
>>>>>>>>> orte_ess_vpid <template> -mca orte_ess_num_procs 4 -mca orte_hnp_uri 
>>>>>>>>> "1541799936.0;tcp://192.168.222.20:49049" -mca plm_base_verbose 5
>>>>>>>>> [warn] opal_libevent2021_event_base_loop: reentrant invocation.  Only 
>>>>>>>>> one event_base_loop can run on each event_base at once.
>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:orted_cmd sending orted_exit 
>>>>>>>>> commands
>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive stop comm
>>>>>>>>> 
>>>>>>>>> Says something?
>>>>>>>>> 
>>>>>>>>> Best,
>>>>>>>>> Suraj
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Sep 22, 2013, at 9:45 PM, Ralph Castain wrote:
>>>>>>>>> 
>>>>>>>>>> I'll still need to look at the intercomm_create issue, but I just 
>>>>>>>>>> tested both the trunk and current 1.7.3 branch for "add-host" and 
>>>>>>>>>> both worked just fine. This was on my little test cluster which only 
>>>>>>>>>> has rsh available - no Torque.
>>>>>>>>>> 
>>>>>>>>>> You might add "-mca plm_base_verbose 5" to your cmd line to get some 
>>>>>>>>>> debug output as to the problem.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Sep 21, 2013, at 5:48 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Sep 21, 2013, at 4:54 PM, Suraj Prabhakaran 
>>>>>>>>>>> <suraj.prabhaka...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Dear all,
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks a lot for your efforts. I too downloaded the trunk 
>>>>>>>>>>>> to check if it works for my case and as of revision 29215, it 
>>>>>>>>>>>> works for the original case I reported. Although it works, I still 
>>>>>>>>>>>> see the following in the output. Does it mean anything?
>>>>>>>>>>>> [grsacc17][[13611,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>>>>>>>>>>>  [btl_openib_proc.c:157] ompi_modex_recv failed for peer 
>>>>>>>>>>>> [[13611,2],0]
>>>>>>>>>>> 
>>>>>>>>>>> Yes - it means we don't quite have this right yet :-(
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> However, on another topic relevant to my use case, I have another 
>>>>>>>>>>>> problem to report. I am having problems using the "add-host" info 
>>>>>>>>>>>> to the MPI_Comm_spawn() when MPI is compiled with support for 
>>>>>>>>>>>> Torque resource manager. This problem is totally new in the 1.7 
>>>>>>>>>>>> series and it worked perfectly until 1.6.5 
>>>>>>>>>>>> 
>>>>>>>>>>>> Basically, I am working on implementing dynamic resource 
>>>>>>>>>>>> management facilities in the Torque/Maui batch system. Through a 
>>>>>>>>>>>> new tm call, an application can get new resources for a job.
>>>>>>>>>>> 
>>>>>>>>>>> FWIW: you'll find that we added an API to the orte RAS framework to 
>>>>>>>>>>> support precisely that operation. It allows an application to 
>>>>>>>>>>> request that we dynamically obtain additional resources during 
>>>>>>>>>>> execution (e.g., as part of a Comm_spawn call via an info_key). We 
>>>>>>>>>>> originally implemented this with Slurm, but you could add the calls 
>>>>>>>>>>> into the Torque component as well if you like.
>>>>>>>>>>> 
>>>>>>>>>>> This is in the trunk now - will come over to 1.7.4
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> I want to use MPI_Comm_spawn() to spawn new processes in the new 
>>>>>>>>>>>> hosts. With my extended torque/maui batch system, I was able to 
>>>>>>>>>>>> perfectly use the "add-host" info argument to MPI_Comm_spawn() to 
>>>>>>>>>>>> spawn new processes on these hosts. Since MPI and Torque refer to 
>>>>>>>>>>>> the hosts through the nodeids, I made sure that OpenMPI uses the 
>>>>>>>>>>>> correct nodeid's for these new hosts. 
>>>>>>>>>>>> Until 1.6.5, this worked perfectly fine, except that due to the 
>>>>>>>>>>>> Intercomm_merge problem, I could not really run a real application 
>>>>>>>>>>>> to its completion.
>>>>>>>>>>>> 
>>>>>>>>>>>> While this is now fixed in the trunk, I found that, however, when 
>>>>>>>>>>>> using the "add-host" info argument, everything collapses after 
>>>>>>>>>>>> printing out the following error. 
>>>>>>>>>>>> 
>>>>>>>>>>>> [warn] opal_libevent2021_event_base_loop: reentrant invocation.  
>>>>>>>>>>>> Only one event_base_loop can run on each event_base at once.
>>>>>>>>>>> 
>>>>>>>>>>> I'll take a look - probably some stale code that hasn't been 
>>>>>>>>>>> updated yet for async ORTE operations
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> And due to this, I am still not really able to run my application! 
>>>>>>>>>>>> I also compiled the MPI without any Torque/PBS support and just 
>>>>>>>>>>>> used the "add-host" argument normally. Again, this worked 
>>>>>>>>>>>> perfectly in 1.6.5. But in the 1.7 series, it works but after 
>>>>>>>>>>>> printing out the following error.
>>>>>>>>>>>> 
>>>>>>>>>>>> [grsacc17][[13731,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>>>>>>>>>>>  [btl_openib_proc.c:157] ompi_modex_recv failed for peer 
>>>>>>>>>>>> [[13731,2],0]
>>>>>>>>>>>> [grsacc17][[13731,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>>>>>>>>>>>  [btl_openib_proc.c:157] ompi_modex_recv failed for peer 
>>>>>>>>>>>> [[13731,2],0]
>>>>>>>>>>> 
>>>>>>>>>>> Yeah, the 1.7 series doesn't have the reentrant test in it - so we 
>>>>>>>>>>> "illegally" re-enter libevent. The error again means we don't have 
>>>>>>>>>>> Intercomm_create correct just yet.
>>>>>>>>>>> 
>>>>>>>>>>> I'll see what I can do about this and get back to you
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> In short, with pbs/torque support, it fails and without pbs/torque 
>>>>>>>>>>>> support, it runs after spitting the above lines. 
>>>>>>>>>>>> 
>>>>>>>>>>>> I would really appreciate some help on this, since I need these 
>>>>>>>>>>>> features to actually test my case and (at least in my short 
>>>>>>>>>>>> experience) no other MPI implementation seems friendly to such 
>>>>>>>>>>>> dynamic scenarios. 
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks a lot!
>>>>>>>>>>>> 
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Suraj
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sep 20, 2013, at 4:58 PM, Jeff Squyres (jsquyres) wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Just to close my end of this loop: as of trunk r29213, it all 
>>>>>>>>>>>>> works for me.  Thanks!
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sep 18, 2013, at 12:52 PM, Ralph Castain <r...@open-mpi.org> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks George - much appreciated
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Sep 18, 2013, at 9:49 AM, George Bosilca 
>>>>>>>>>>>>>> <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The test case was broken. I just pushed a fix.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> George.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Sep 18, 2013, at 16:49 , Ralph Castain <r...@open-mpi.org> 
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hangs with any np > 1
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> However, I'm not sure if that's an issue with the test vs the 
>>>>>>>>>>>>>>>> underlying implementation
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Sep 18, 2013, at 7:40 AM, "Jeff Squyres (jsquyres)" 
>>>>>>>>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Does it hang when you run with -np 4?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Sent from my phone. No type good. 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Sep 18, 2013, at 4:10 PM, "Ralph Castain" 
>>>>>>>>>>>>>>>>> <r...@open-mpi.org> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Strange - it works fine for me on my Mac. However, I see one 
>>>>>>>>>>>>>>>>>> difference - I only run it with np=1
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Sep 18, 2013, at 2:22 AM, Jeff Squyres (jsquyres) 
>>>>>>>>>>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Sep 18, 2013, at 9:33 AM, George Bosilca 
>>>>>>>>>>>>>>>>>>> <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 1. sm doesn't work between spawned processes. So you must 
>>>>>>>>>>>>>>>>>>>> have another network enabled.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I know :-).  I have tcp available as well (OMPI will abort 
>>>>>>>>>>>>>>>>>>> if you only run with sm,self because the comm_spawn will 
>>>>>>>>>>>>>>>>>>> fail with unreachable errors -- I just tested/proved this 
>>>>>>>>>>>>>>>>>>> to myself).
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 2. Don't use the test case attached to my email, I left an 
>>>>>>>>>>>>>>>>>>>> xterm based spawn and the debugging. It can't work without 
>>>>>>>>>>>>>>>>>>>> xterm support. Instead try using the test case from the 
>>>>>>>>>>>>>>>>>>>> trunk, the one committed by Ralph.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I didn't see any "xterm" strings in there, but ok.  :-)  I 
>>>>>>>>>>>>>>>>>>> ran with orte/test/mpi/intercomm_create.c, and that hangs 
>>>>>>>>>>>>>>>>>>> for me as well:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 
>>>>>>>>>>>>>>>>>>> 201, &inter) [rank 4]
>>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 
>>>>>>>>>>>>>>>>>>> 201, &inter) [rank 5]
>>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 
>>>>>>>>>>>>>>>>>>> 201, &inter) [rank 6]
>>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 
>>>>>>>>>>>>>>>>>>> 201, &inter) [rank 7]
>>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>>>>>>> &inter) [rank 4]
>>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>>>>>>> &inter) [rank 5]
>>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>>>>>>> &inter) [rank 6]
>>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>>>>>>> &inter) [rank 7]
>>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, 
>>>>>>>>>>>>>>>>>>> &inter) (0)
>>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, 
>>>>>>>>>>>>>>>>>>> &inter) (0)
>>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, 
>>>>>>>>>>>>>>>>>>> &inter) (0)
>>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, 
>>>>>>>>>>>>>>>>>>> &inter) (0)
>>>>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Similarly, on my Mac, it hangs with no output:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> George.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Sep 18, 2013, at 07:53 , "Jeff Squyres (jsquyres)" 
>>>>>>>>>>>>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> George --
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> When I build the SVN trunk (r29201) on 64 bit linux, your 
>>>>>>>>>>>>>>>>>>>>> attached test case hangs:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 
>>>>>>>>>>>>>>>>>>>>> 201, &inter) [rank 4]
>>>>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 
>>>>>>>>>>>>>>>>>>>>> 201, &inter) [rank 5]
>>>>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 
>>>>>>>>>>>>>>>>>>>>> 201, &inter) [rank 6]
>>>>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 
>>>>>>>>>>>>>>>>>>>>> 201, &inter) [rank 7]
>>>>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, 
>>>>>>>>>>>>>>>>>>>>> &inter) (0)
>>>>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, 
>>>>>>>>>>>>>>>>>>>>> &inter) (0)
>>>>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, 
>>>>>>>>>>>>>>>>>>>>> &inter) (0)
>>>>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, 
>>>>>>>>>>>>>>>>>>>>> &inter) (0)
>>>>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 
>>>>>>>>>>>>>>>>>>>>> 201, &inter) [rank 4]
>>>>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 
>>>>>>>>>>>>>>>>>>>>> 201, &inter) [rank 5]
>>>>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 
>>>>>>>>>>>>>>>>>>>>> 201, &inter) [rank 6]
>>>>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 
>>>>>>>>>>>>>>>>>>>>> 201, &inter) [rank 7]
>>>>>>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On my Mac, it hangs without printing anything:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create   
>>>>>>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Sep 18, 2013, at 1:48 AM, George Bosilca 
>>>>>>>>>>>>>>>>>>>>> <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Here is a quick (and definitely not the cleanest) 
>>>>>>>>>>>>>>>>>>>>>> patch that addresses the MPI_Intercomm issue at the MPI 
>>>>>>>>>>>>>>>>>>>>>> level. It should be applied after removal of 29166.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I also added the corrected test case stressing the 
>>>>>>>>>>>>>>>>>>>>>> corner cases by doing barriers at every inter-comm 
>>>>>>>>>>>>>>>>>>>>>> creation and doing a clean disconnect.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>>>>> Jeff Squyres
>>>>>>>>>>>>>>>>>>>>> jsquy...@cisco.com
>>>>>>>>>>>>>>>>>>>>> For corporate legal information go to: 
>>>>>>>>>>>>>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>>>>>> devel mailing list
>>>>>>>>>>>>>>>>>>>>> de...@open-mpi.org
>>>>>>>>>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>>> Jeff Squyres
>>>>>>>>>>>>>>>>>>> jsquy...@cisco.com
>>>>>>>>>>>>>>>>>>> For corporate legal information go to: 
>>>>>>>>>>>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -- 
>>>>>>>>>>>>> Jeff Squyres
>>>>>>>>>>>>> jsquy...@cisco.com
>>>>>>>>>>>>> For corporate legal information go to: 
>>>>>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>>>>>>> 