Re: [OMPI devel] Intercomm Merge
Hi Ralph,

I tested it with the trunk r29228. I still have the following problem. Now it even spawns the daemon on the new node through Torque, but then it suddenly quits. The following is the output. Can you please have a look?

Thanks
Suraj

[grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
[grsacc20:04511] [[6253,0],0] plm:base:receive job launch command from [[6253,1],0]
[grsacc20:04511] [[6253,0],0] plm:base:receive adding hosts
[grsacc20:04511] [[6253,0],0] plm:base:receive calling spawn
[grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
[grsacc20:04511] [[6253,0],0] plm:base:setup_job
[grsacc20:04511] [[6253,0],0] plm:base:setup_vm
[grsacc20:04511] [[6253,0],0] plm:base:setup_vm add new daemon [[6253,0],2]
[grsacc20:04511] [[6253,0],0] plm:base:setup_vm assigning new daemon [[6253,0],2] to node grsacc18
[grsacc20:04511] [[6253,0],0] plm:tm: launching vm
[grsacc20:04511] [[6253,0],0] plm:tm: final top-level argv:
    orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid  -mca orte_ess_num_procs 3 -mca orte_hnp_uri "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
[grsacc20:04511] [[6253,0],0] plm:tm: launching on node grsacc19
[grsacc20:04511] [[6253,0],0] plm:tm: executing:
    orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
[grsacc20:04511] [[6253,0],0] plm:tm: launching on node grsacc18
[grsacc20:04511] [[6253,0],0] plm:tm: executing:
    orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
[grsacc20:04511] [[6253,0],0] plm:tm:launch: finished spawning orteds
[grsacc19:28754] mca:base:select:( plm) Querying component [rsh]
[grsacc19:28754] [[6253,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
[grsacc19:28754] mca:base:select:( plm) Query of component [rsh] set priority to 10
[grsacc19:28754] mca:base:select:( plm) Selected component [rsh]
[grsacc19:28754] [[6253,0],1] plm:rsh_setup on agent ssh : rsh path NULL
[grsacc19:28754] [[6253,0],1] plm:base:receive start comm
[grsacc19:28754] [[6253,0],1] plm:base:receive stop comm
[grsacc18:16648] mca:base:select:( plm) Querying component [rsh]
[grsacc18:16648] [[6253,0],2] plm:rsh_lookup on agent ssh : rsh path NULL
[grsacc18:16648] mca:base:select:( plm) Query of component [rsh] set priority to 10
[grsacc18:16648] mca:base:select:( plm) Selected component [rsh]
[grsacc18:16648] [[6253,0],2] plm:rsh_setup on agent ssh : rsh path NULL
[grsacc18:16648] [[6253,0],2] plm:base:receive start comm
[grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch from daemon [[6253,0],2]
[grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch from daemon [[6253,0],2] on node grsacc18
[grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch completed for daemon [[6253,0],2] at contact 409796608.2;tcp://192.168.222.18:47974
[grsacc20:04511] [[6253,0],0] plm:base:launch_apps for job [6253,2]
[grsacc20:04511] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
[grsacc20:04511] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[grsacc20:04511] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
[grsacc20:04511] [[6253,0],0] plm:base:receive update proc state command from [[6253,0],2]
[grsacc20:04511] [[6253,0],0] plm:base:receive got update_proc_state for job [6253,2]
[grsacc20:04511] [[6253,0],0] plm:base:receive got update_proc_state for vpid 0 state RUNNING exit_code 0
[grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
[grsacc20:04511] [[6253,0],0] plm:base:launch wiring up iof for job [6253,2]
[grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
[grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
[grsacc20:04511] [[6253,0],0] plm:base:launch registered event
[grsacc20:04511] [[6253,0],0] plm:base:launch sending dyn release of job [6253,2] to [[6253,1],0]
[grsacc20:04511] [[6253,0],0] plm:base:orted_cmd sending orted_exit commands
[grsacc19:28747] [[6253,0],1] plm:base:receive stop comm
[grsacc20:04511] [[6253,0],0] plm:base:receive stop comm
-bash-4.1$ [grsacc18:16648] [[6253,0],2] plm:base:receive stop comm

On Sep 23, 2013, at 1:55 AM, Ralph Castain wrote:

> Found a bug in the Torque support - we were trying to connect to the MOM again, which would hang (I imagine). I pushed a fix to the trunk (r29227) and scheduled it to come to 1.7.3 if you want to try it again.
>
> On Sep 22, 2013, at 4:21 PM, Suraj Prabhakaran wrote:
>
>> Dear Ralph,
>>
>> This is the output I get when I execute with the verbose option.
>>
>> [grsacc20:21012] [[23526,0],0] plm:base:receive processing msg
>> [grsacc20:21012] [
Re: [OMPI devel] Intercomm Merge
Your output shows that it launched your apps, but they exited. The error is reported here, though it appears we aren't flushing the message out before exiting due to a race condition:

> [grsacc20:04511] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found

Here is the full text:

  [no active ports found]
  WARNING: There is at least non-excluded one OpenFabrics device found,
  but there are no active ports detected (or Open MPI was unable to use
  them). This is most certainly not what you wanted. Check your cables,
  subnet manager configuration, etc. The openib BTL will be ignored for
  this job.

  Local host: %s

Looks like at least one node being used doesn't have an active InfiniBand port on it?

On Sep 24, 2013, at 6:11 AM, Suraj Prabhakaran wrote:

> Hi Ralph,
>
> I tested it with the trunk r29228. I still have the following problem. Now it even spawns the daemon on the new node through Torque, but then it suddenly quits. The following is the output. Can you please have a look?
>
> Thanks
> Suraj
Re: [OMPI devel] Intercomm Merge
Hi Ralph,

I always got this output from any MPI job that ran on our nodes. There seems to be a problem somewhere, but it never stopped the applications from running. But anyway, I ran it again now with only TCP and excluded InfiniBand, and I get the same output again. Except that this time, the error related to openib is not there anymore. Printing out the log again.

[grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
[grsacc20:04578] [[6160,0],0] plm:base:receive job launch command from [[6160,1],0]
[grsacc20:04578] [[6160,0],0] plm:base:receive adding hosts
[grsacc20:04578] [[6160,0],0] plm:base:receive calling spawn
[grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
[grsacc20:04578] [[6160,0],0] plm:base:setup_job
[grsacc20:04578] [[6160,0],0] plm:base:setup_vm
[grsacc20:04578] [[6160,0],0] plm:base:setup_vm add new daemon [[6160,0],2]
[grsacc20:04578] [[6160,0],0] plm:base:setup_vm assigning new daemon [[6160,0],2] to node grsacc18
[grsacc20:04578] [[6160,0],0] plm:tm: launching vm
[grsacc20:04578] [[6160,0],0] plm:tm: final top-level argv:
    orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid  -mca orte_ess_num_procs 3 -mca orte_hnp_uri "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl tcp,sm,self
[grsacc20:04578] [[6160,0],0] plm:tm: launching on node grsacc19
[grsacc20:04578] [[6160,0],0] plm:tm: executing:
    orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl tcp,sm,self
[grsacc20:04578] [[6160,0],0] plm:tm: launching on node grsacc18
[grsacc20:04578] [[6160,0],0] plm:tm: executing:
    orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl tcp,sm,self
[grsacc20:04578] [[6160,0],0] plm:tm:launch: finished spawning orteds
[grsacc19:28821] mca:base:select:( plm) Querying component [rsh]
[grsacc19:28821] [[6160,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
[grsacc19:28821] mca:base:select:( plm) Query of component [rsh] set priority to 10
[grsacc19:28821] mca:base:select:( plm) Selected component [rsh]
[grsacc19:28821] [[6160,0],1] plm:rsh_setup on agent ssh : rsh path NULL
[grsacc19:28821] [[6160,0],1] plm:base:receive start comm
[grsacc19:28821] [[6160,0],1] plm:base:receive stop comm
[grsacc18:16717] mca:base:select:( plm) Querying component [rsh]
[grsacc18:16717] [[6160,0],2] plm:rsh_lookup on agent ssh : rsh path NULL
[grsacc18:16717] mca:base:select:( plm) Query of component [rsh] set priority to 10
[grsacc18:16717] mca:base:select:( plm) Selected component [rsh]
[grsacc18:16717] [[6160,0],2] plm:rsh_setup on agent ssh : rsh path NULL
[grsacc18:16717] [[6160,0],2] plm:base:receive start comm
[grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch from daemon [[6160,0],2]
[grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch from daemon [[6160,0],2] on node grsacc18
[grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch completed for daemon [[6160,0],2] at contact 403701760.2;tcp://192.168.222.18:44229
[grsacc20:04578] [[6160,0],0] plm:base:launch_apps for job [6160,2]
[grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
[grsacc20:04578] [[6160,0],0] plm:base:receive update proc state command from [[6160,0],2]
[grsacc20:04578] [[6160,0],0] plm:base:receive got update_proc_state for job [6160,2]
[grsacc20:04578] [[6160,0],0] plm:base:receive got update_proc_state for vpid 0 state RUNNING exit_code 0
[grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
[grsacc20:04578] [[6160,0],0] plm:base:launch wiring up iof for job [6160,2]
[grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
[grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
[grsacc20:04578] [[6160,0],0] plm:base:launch registered event
[grsacc20:04578] [[6160,0],0] plm:base:launch sending dyn release of job [6160,2] to [[6160,1],0]
[grsacc20:04578] [[6160,0],0] plm:base:orted_cmd sending orted_exit commands
[grsacc19:28815] [[6160,0],1] plm:base:receive stop comm
[grsacc20:04578] [[6160,0],0] plm:base:receive stop comm
-bash-4.1$ [grsacc18:16717] [[6160,0],2] plm:base:receive stop comm

Best,
Suraj

On Sep 24, 2013, at 3:24 PM, Ralph Castain wrote:

> Your output shows that it launched your apps, but they exited. The error is reported here, though it appears we aren't flushing the message out before exiting due to a race condition:
>
>> [grsacc20:04511] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
Re: [OMPI devel] Intercomm Merge
Afraid I don't see the problem offhand - can you add the following to your cmd line?

  -mca state_base_verbose 10 -mca errmgr_base_verbose 10

Thanks
Ralph

On Sep 24, 2013, at 6:35 AM, Suraj Prabhakaran wrote:

> Hi Ralph,
>
> I always got this output from any MPI job that ran on our nodes. There seems to be a problem somewhere, but it never stopped the applications from running. But anyway, I ran it again now with only TCP and excluded InfiniBand, and I get the same output again. Except that this time, the error related to openib is not there anymore. Printing out the log again.
Re: [OMPI devel] Intercomm Merge
Hi Ralph,

Output attached in a file.
Thanks a lot!

Best,
Suraj

[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [INVALID] STATE PENDING INIT AT plm_tm_module.c:157
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [INVALID] STATE PENDING INIT PRI 4
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE INIT_COMPLETE AT base/plm_base_launch_support.c:315
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE INIT_COMPLETE PRI 4
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE PENDING ALLOCATION AT base/plm_base_launch_support.c:326
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE PENDING ALLOCATION PRI 4
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE ALLOCATION COMPLETE AT base/ras_base_allocate.c:421
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE ALLOCATION COMPLETE PRI 4
[grsacc20:04946] [[6816,0],0] ACTIVATE JOB [6816,2] STATE PENDING DAEMON LAUNCH AT base/plm_base_launch_support.c:182
[grsacc20:04946] [[6816,0],0] ACTIVATING JOB [6816,2] STATE PENDING DAEMON LAUNCH PRI 4
[grsacc19:29071] mca: base: components_register: registering state components
[grsacc19:29071] mca: base: components_register: found loaded component app
[grsacc19:29071] mca: base: components_register: component app has no register or open function
[grsacc19:29071] mca: base: components_register: found loaded component hnp
[grsacc19:29071] mca: base: components_register: component hnp has no register or open function
[grsacc19:29071] mca: base: components_register: found loaded component novm
[grsacc19:29071] mca: base: components_register: component novm register function successful
[grsacc19:29071] mca: base: components_register: found loaded component orted
[grsacc19:29071] mca: base: components_register: component orted has no register or open function
[grsacc19:29071] mca: base: components_register: found loaded component staged_hnp
[grsacc19:29071] mca: base: components_register: component staged_hnp has no register or open function
[grsacc19:29071] mca: base: components_register: found loaded component staged_orted
[grsacc19:29071] mca: base: components_register: component staged_orted has no register or open function
[grsacc19:29071] mca: base: components_open: opening state components
[grsacc19:29071] mca: base: components_open: found loaded component app
[grsacc19:29071] mca: base: components_open: component app open function successful
[grsacc19:29071] mca: base: components_open: found loaded component hnp
[grsacc19:29071] mca: base: components_open: component hnp open function successful
[grsacc19:29071] mca: base: components_open: found loaded component novm
[grsacc19:29071] mca: base: components_open: component novm open function successful
[grsacc19:29071] mca: base: components_open: found loaded component orted
[grsacc19:29071] mca: base: components_open: component orted open function successful
[grsacc19:29071] mca: base: components_open: found loaded component staged_hnp
[grsacc19:29071] mca: base: components_open: component staged_hnp open function successful
[grsacc19:29071] mca: base: components_open: found loaded component staged_orted
[grsacc19:29071] mca: base: components_open: component staged_orted open function successful
[grsacc19:29071] mca:base:select: Auto-selecting state components
[grsacc19:29071] mca:base:select:(state) Querying component [app]
[grsacc19:29071] mca:base:select:(state) Skipping component [app]. Query failed to return a module
[grsacc19:29071] mca:base:select:(state) Querying component [hnp]
[grsacc19:29071] mca:base:select:(state) Skipping component [hnp]. Query failed to return a module
[grsacc19:29071] mca:base:select:(state) Querying component [novm]
[grsacc19:29071] mca:base:select:(state) Skipping component [novm]. Query failed to return a module
[grsacc19:29071] mca:base:select:(state) Querying component [orted]
[grsacc19:29071] mca:base:select:(state) Query of component [orted] set priority to 100
[grsacc19:29071] mca:base:select:(state) Querying component [staged_hnp]
[grsacc19:29071] mca:base:select:(state) Skipping component [staged_hnp]. Query failed to return a module
[grsacc19:29071] mca:base:select:(state) Querying component [staged_orted]
[grsacc19:29071] mca:base:select:(state) Skipping component [staged_orted]. Query failed to return a module
[grsacc19:29071] mca:base:select:(state) Selected component [orted]
[grsacc19:29071] mca: base: close: component app closed
[grsacc19:29071] mca: base: close: unloading component app
[grsacc19:29071] mca: base: close: component hnp closed
[grsacc19:29071] mca: base: close: unloading component hnp
[grsac
Re: [OMPI devel] Intercomm Merge
I'm going to need a little help here. The problem is that you launch two new daemons, and one of them exits immediately because it thinks it lost the connection back to mpirun - before it even gets a chance to create it.

Can you give me a little more info as to exactly what you are doing? Perhaps send me your test code?

On Sep 24, 2013, at 7:48 AM, Suraj Prabhakaran wrote:

> Hi Ralph,
>
> Output attached in a file.
> Thanks a lot!
>
> Best,
> Suraj
>
> On Sep 24, 2013, at 4:11 PM, Ralph Castain wrote:
>
>> Afraid I don't see the problem offhand - can you add the following to your cmd line?
>>
>> -mca state_base_verbose 10 -mca errmgr_base_verbose 10
>>
>> Thanks
>> Ralph
>>
>> On Sep 24, 2013, at 6:35 AM, Suraj Prabhakaran wrote:
>>
>>> Hi Ralph,
>>>
>>> I always got this output from any MPI job that ran on our nodes. There seems to be a problem somewhere, but it never stopped the applications from running. But anyway, I ran it again now with only TCP and excluded InfiniBand, and I get the same output again. Except that this time, the error related to openib is not there anymore. Printing out the log again.
Re: [OMPI devel] Intercomm Merge
Hi Ralph,

So here is what I do. I spawn just a "single" process on a new node which is basically not in the $PBS_NODEFILE list.
My $PBS_NODEFILE list contains
grsacc20
grsacc19

I then start the app with just 2 processes, so each host gets one process and they are successfully spawned through Torque (through tm_spawn()). Open MPI stores grsacc20 and grsacc19 in its list of hosts with launchid 0 and 1, respectively.

I then use the add-host info and spawn ONE new process on a new host "grsacc18" through MPI_Comm_spawn. From what I saw in the code, the launchid of this new host is -1, since Open MPI does not know about it and it is not available in the $PBS_NODEFILE. Since Torque would not know where to spawn without the launchid, I retrieve the correct launchid of this host from a file just before tm_spawn() and use it. This is the only modification that I made to Open MPI. So the host "grsacc18" gets a new launchid = 2 and is used to spawn the process through Torque. This worked perfectly until 1.6.5.

As we see here from the outputs, although I spawn only a single process on grsacc18, I too have no clue why Open MPI tries to spawn something on grsacc19. Of course, without PBS/Torque involved, everything works fine.
I have attached the simple test code. Please modify the hostnames and executable path before use.

Best,
Suraj

addhosttest.c
Description: Binary data

On Sep 24, 2013, at 4:59 PM, Ralph Castain wrote:

> I'm going to need a little help here. The problem is that you launch two new daemons, and one of them exits immediately because it thinks it lost the connection back to mpirun - before it even gets a chance to create it.
>
> Can you give me a little more info as to exactly what you are doing? Perhaps send me your test code?
>
> On Sep 24, 2013, at 7:48 AM, Suraj Prabhakaran wrote:
>
>> Hi Ralph,
>>
>> Output attached in a file.
>> Thanks a lot!
>>
>> Best,
>> Suraj
>>
>> On Sep 24, 2013, at 4:11 PM, Ralph Castain wrote:
>>
>>> Afraid I don't see the problem offhand - can you add the following to your cmd line?
>>>
>>> -mca state_base_verbose 10 -mca errmgr_base_verbose 10
>>>
>>> Thanks
>>> Ralph
>>>
>>> On Sep 24, 2013, at 6:35 AM, Suraj Prabhakaran wrote:
>>>
>>>> Hi Ralph,
>>>>
>>>> I always got this output from any MPI job that ran on our nodes. There seems to be a problem somewhere, but it never stopped the applications from running. But anyway, I ran it again now with only TCP and excluded InfiniBand, and I get the same output again. Except that this time, the error related to openib is not there anymore. Printing out the log again.
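Since the addhosttest.c attachment survives only as binary data in the archive, the following is a minimal sketch of the kind of add-host spawn-and-merge test described above. It is a reconstruction, not the attached code: the hostname (grsacc18), the executable path, and the single spawned process simply mirror the description in the message, and Suraj's local launchid modification inside the Torque PLM is not represented here at all.

/*
 * Minimal sketch of an "add-host" dynamic spawn test (a reconstruction,
 * not the actual addhosttest.c attachment; hostname and binary path are
 * placeholders).  The original ranks collectively spawn one child on a
 * node that is not in $PBS_NODEFILE, then merge the intercommunicator.
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, intercomm, merged;
    int rank, size, high;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* Parent side: ask the runtime to extend the job to a new host. */
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "add-host", "grsacc18");   /* placeholder hostname */

        MPI_Comm_spawn("./addhosttest",               /* placeholder path: respawn this binary */
                       MPI_ARGV_NULL, 1, info, 0,
                       MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        MPI_Info_free(&info);
        high = 0;    /* parent ranks ordered first in the merged comm */
    } else {
        /* Child side: the intercommunicator to the parents already exists. */
        intercomm = parent;
        high = 1;    /* spawned rank ordered last */
    }

    MPI_Intercomm_merge(intercomm, high, &merged);
    MPI_Comm_rank(merged, &rank);
    MPI_Comm_size(merged, &size);
    printf("merged comm: rank %d of %d\n", rank, size);

    MPI_Comm_free(&merged);
    MPI_Finalize();
    return 0;
}

The "add-host" info key is an Open MPI extension to MPI_Comm_spawn: it asks the runtime to add the named node to the job's virtual machine before placing the spawned process, which is why the logs above show a new daemon being assigned to grsacc18 and launched through the Torque tm interface.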
Re: [OMPI devel] Intercomm Merge
What I find puzzling is that I don't see any output indicating that you went through the Torque launcher to launch the daemons - not a peep of debug output. This makes me suspicious that something else is going on. Are you sure you sent me all the output?

Try adding -novm to your mpirun cmd line and let's see if that mode works.

On Sep 24, 2013, at 9:06 AM, Suraj Prabhakaran wrote:

> Hi Ralph,
>
> So here is what I do. I spawn just a "single" process on a new node which is basically not in the $PBS_NODEFILE list.
> My $PBS_NODEFILE list contains
> grsacc20
> grsacc19
>
> I then start the app with just 2 processes, so each host gets one process and they are successfully spawned through Torque (through tm_spawn()). Open MPI stores grsacc20 and grsacc19 in its list of hosts with launchid 0 and 1, respectively.
> I then use the add-host info and spawn ONE new process on a new host "grsacc18" through MPI_Comm_spawn. From what I saw in the code, the launchid of this new host is -1, since Open MPI does not know about it and it is not available in the $PBS_NODEFILE. Since Torque would not know where to spawn without the launchid, I retrieve the correct launchid of this host from a file just before tm_spawn() and use it. This is the only modification that I made to Open MPI. So the host "grsacc18" gets a new launchid = 2 and is used to spawn the process through Torque. This worked perfectly until 1.6.5.
>
> As we see here from the outputs, although I spawn only a single process on grsacc18, I too have no clue why Open MPI tries to spawn something on grsacc19. Of course, without PBS/Torque involved, everything works fine.
> I have attached the simple test code. Please modify the hostnames and executable path before use.
>
> Best,
> Suraj