On Nov 9, 2009, at 5:33 AM, Sergio Díaz wrote:

> Hi Josh,
> 
> The OpenMPI version is 1.3.3.
> 
> The command ompi-ps doesn't work.
> 
> [root@compute-3-18 ~]# ompi-ps -j 2726959 -p 16241
> [root@compute-3-18 ~]# ompi-ps -v -j 2726959 -p 16241
> [compute-3-18.local:16254] orte_ps: Acquiring list of HNPs and setting 
> contact info into RML...
> [root@compute-3-18 ~]# ompi-ps -v -j 2726959
> [compute-3-18.local:16255] orte_ps: Acquiring list of HNPs and setting 
> contact info into RML...
> 
> [root@compute-3-18 ~]# ps uaxf | grep sdiaz
> root     16260  0.0  0.0 51084  680 pts/0    S+   13:38   0:00          \_ 
> grep sdiaz
> sdiaz    16203  0.0  0.0 53164 1220 ?        Ss   13:37   0:00      \_ -bash 
> /opt/cesga/sge62/default/spool/compute-3-18/job_scripts/2726959
> sdiaz    16241  0.0  0.0 41028 2480 ?        S    13:37   0:00          \_ 
> mpirun -np 2 -am ft-enable-cr ./pi3
> sdiaz    16242  0.0  0.0 36484 1840 ?        Sl   13:37   0:00              
> \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit -nostdin -V compute-3-17.local 
>  orted -mca ess env -mca orte_ess_jobid 2769879040 -mca orte_ess_vpid 1 -mca 
> orte_ess_num_procs 2 --hnp-uri "2769879040.0;tcp://192.168.4.143:57010" -mca 
> mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path 
> /opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test
>  -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz/mpi_test
> sdiaz    16245  0.1  0.0 99464 4616 ?        Sl   13:37   0:00              
> \_ ./pi3
> 
> [root@compute-3-18 ~]# ompi-ps -n c3-18
> [root@compute-3-18 ~]# ompi-ps -n compute-3-18
> [root@compute-3-18 ~]# ompi-ps -n
> 
> There is no such directory in /tmp on the node. However, if the application 
> is run without SGE, the directory is created.

This may be the core of the problem. ompi-ps and the other command-line tools 
(e.g., ompi-checkpoint) look for the Open MPI session directory in /tmp in 
order to find the connection information needed to contact the mpirun process 
(internally called the HNP, or Head Node Process).

Can you change the location of the temporary directory in SGE? The temporary 
directory is usually set via an environment variable (e.g., TMPDIR or TMP), so 
removing that variable or setting it to /tmp might help.
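
For example, something along these lines in the SGE job script is a minimal 
sketch of that idea (the orte_tmpdir_base MCA parameter name is from memory, 
so please verify it for 1.3.3 with 'ompi_info --all | grep tmpdir'):

  # Put the Open MPI session directory back under /tmp, regardless
  # of the per-job TMPDIR that SGE normally sets.
  export TMPDIR=/tmp
  mpirun -np 2 -am ft-enable-cr -mca orte_tmpdir_base /tmp ./pi3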


> but if I do ompi-ps -j MPIRUN_PID, it seems to hang and I have to interrupt 
> it. Does it take a long time?

It should not take a long time. It is just querying the mpirun process for 
state information.
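
A quick sanity check, since both tools rely on the session directory mentioned 
above, is to look for it on the node where mpirun is running (a sketch):

  # If this prints nothing, the command-line tools have no way to find
  # the HNP, which would explain the hang and the "Not found" errors.
  ls -ld /tmp/openmpi-sessions-* 2>/dev/null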

> What does the -j option of the ompi-ps command mean? It isn't related to the 
> batch system (like SGE, Condor, ...), is it?

The '-j' option allows the user to specify the Open MPI jobid. This is 
completely different from the jobid provided by the batch system. In general, 
users should not need to specify the -j option; it is useful when you have 
multiple Open MPI jobs running and want a summary of just one of them.
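
In practice (a sketch): run ompi-ps with no options first and let it report the 
Open MPI jobid itself, then pass that value back if you need to narrow the 
output; the SGE job number (2726959 here) is not what -j expects.

  # List every Open MPI job this user has running; the output
  # includes the Open MPI jobid for each one.
  ompi-ps
  # Then, if needed, summarize just one of them using the jobid
  # taken from that output (not the SGE job number).
  ompi-ps -j <jobid_from_ompi-ps_output>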

> 
> Thanks for the ticket. I will follow it.
> 
> Talking with Alan, I realized that only a few transport protocols are 
> supported, and maybe that is the problem. Currently, SGE is using qrsh to 
> launch the MPI processes. I can change this and use ssh instead. So, I'm going 
> to test it this afternoon and I will let you know the results.

Try 'ssh' and see if that helps, though I suspect the problem is the session 
directory location.
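
If you want to test it from the mpirun side rather than the SGE side, a rough 
sketch would be to point the rsh launcher at ssh explicitly; I have not tried 
this under SGE, so please verify the parameter name for 1.3.3 first 
('ompi_info --all | grep plm_rsh'):

  # Ask the rsh/ssh launcher to use ssh instead of qrsh.
  mpirun -np 2 -am ft-enable-cr -mca plm_rsh_agent ssh ./pi3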

> 
> Regards,
> Sergio
> 
> 
> Josh Hursey wrote:
>> 
>> On Oct 28, 2009, at 7:41 AM, Sergio Díaz wrote: 
>> 
>>> Hello, 
>>> 
>>> I have managed to checkpoint a simple program without SGE. Now I'm trying to 
>>> integrate Open MPI with SGE, but I have some problems... When I try to 
>>> checkpoint the mpirun PID, I get an error similar to the one I get when the 
>>> PID doesn't exist. See the example below. 
>> 
>> I do not have any experience with the SGE environment, so I suspect that 
>> there may be something 'special' about the environment that is tripping up 
>> the ompi-checkpoint tool. 
>> 
>> First of all, what version of Open MPI are you using? 
>> 
>> Somethings to check: 
>>  - Does 'ompi-ps' work when your application is running? 
>>  - Is there a /tmp/openmpi-sessions-* directory on the node where mpirun is 
>> currently running? This directory contains information on how to connect to 
>> the mpirun process from an external tool; if it is missing, then this could be 
>> the cause of the problem. 
>> 
>>> 
>>> Any ideas? 
>>> Does somebody have a script to do this automatically with SGE? For example, I 
>>> have one that checkpoints every X seconds with BLCR for non-MPI jobs. It is 
>>> launched by SGE if you have configured the queue and the ckpt environment. 
>> 
>> I do not know of any integration of the Open MPI checkpointing work with SGE 
>> at the moment. 
>> 
>> As far as time-triggered checkpointing goes, I have a feature ticket open 
>> about this: 
>>   https://svn.open-mpi.org/trac/ompi/ticket/1961 
>> 
>> It is not available yet, but it is in the works. 
>> 
>> 
>>> 
>>> Is it possible to choose the name of the checkpoint folder when you do the 
>>> ompi-checkpoint? I can't find an option for it. 
>> 
>> Not at this time, though I could see it being a useful feature, and it 
>> shouldn't be too hard to implement. I filed a ticket if you want to follow 
>> the progress: 
>>   https://svn.open-mpi.org/trac/ompi/ticket/2098 
>> 
>> -- Josh 
>> 
>>> 
>>> 
>>> Regards, 
>>> Sergio 
>>> 
>>> 
>>> -------------------------------- 
>>> 
>>> [sdiaz@compute-3-17 ~]$ ps auxf 
>>> .... 
>>> root     20044  0.0  0.0  4468 1224 ?        S    13:28   0:00  \_ 
>>> sge_shepherd-2645150 -bg 
>>> sdiaz    20072  0.0  0.0 53172 1212 ?        Ss   13:28   0:00      \_ 
>>> -bash /opt/cesga/sge62/default/spool/compute-3-17/job_scripts/2645150 
>>> sdiaz    20112  0.2  0.0 41028 2480 ?        S    13:28   0:00          \_ 
>>> mpirun -np 2 -am ft-enable-cr pi3 
>>> sdiaz    20113  0.0  0.0 36484 1824 ?        Sl   13:28   0:00              
>>> \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit -nostdin -V 
>>> compute-3-18.......... 
>>> sdiaz    20116  1.2  0.0 99464 4616 ?        Sl   13:28   0:00              
>>> \_ pi3 
>>> 
>>> 
>>> [sdiaz@compute-3-17 ~]$ ompi-checkpoint 20112 
>>> [compute-3-17.local:20124] HNP with PID 20112 Not found! 
>>> 
>>> [sdiaz@compute-3-17 ~]$ ompi-checkpoint -s 20112 
>>> [compute-3-17.local:20135] HNP with PID 20112 Not found! 
>>> 
>>> [sdiaz@compute-3-17 ~]$ ompi-checkpoint -s --term 20112 
>>> [compute-3-17.local:20136] HNP with PID 20112 Not found! 
>>> 
>>> [sdiaz@compute-3-17 ~]$ ompi-checkpoint --hnp-pid 20112 
>>> -------------------------------------------------------------------------- 
>>> ompi-checkpoint PID_OF_MPIRUN 
>>>   Open MPI Checkpoint Tool 
>>> 
>>>    -am <arg0>            Aggregate MCA parameter set file list 
>>>    -gmca|--gmca <arg0> <arg1> 
>>>                          Pass global MCA parameters that are applicable to 
>>>                          all contexts (arg0 is the parameter name; arg1 is 
>>>                          the parameter value) 
>>> -h|--help                This help message 
>>>    --hnp-jobid <arg0>    This should be the jobid of the HNP whose 
>>>                          applications you wish to checkpoint. 
>>>    --hnp-pid <arg0>      This should be the pid of the mpirun whose 
>>>                          applications you wish to checkpoint. 
>>>    -mca|--mca <arg0> <arg1> 
>>>                          Pass context-specific MCA parameters; they are 
>>>                          considered global if --gmca is not used and only 
>>>                          one context is specified (arg0 is the parameter 
>>>                          name; arg1 is the parameter value) 
>>> -s|--status              Display status messages describing the progression 
>>>                          of the checkpoint 
>>>    --term                Terminate the application after checkpoint 
>>> -v|--verbose             Be Verbose 
>>> -w|--nowait              Do not wait for the application to finish 
>>>                          checkpointing before returning 
>>> 
>>> -------------------------------------------------------------------------- 
>>> [sdiaz@compute-3-17 ~]$ exit 
>>> logout 
>>> Connection to c3-17 closed. 
>>> [sdiaz@svgd mpi_test]$ ssh c3-18 
>>> Last login: Wed Oct 28 13:24:12 2009 from svgd.local 
>>> -bash-3.00$ ps auxf |grep sdiaz 
>>> 
>>> sdiaz    14412  0.0  0.0  1888  560 ?        Ss   13:28   0:00      \_ 
>>> /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter 
>>> /opt/cesga/sge62/default/spool/compute-3-18/active_jobs/2645150.1/1.compute-3-18
>>>  
>>> sdiaz    14419  0.0  0.0 35728 2260 ?        S    13:28   0:00          \_ 
>>> orted -mca ess env -mca orte_ess_jobid 2295267328 -mca orte_ess_vpid 1 -mca 
>>> orte_ess_num_procs 2 --hnp-uri 2295267328.0;tcp://192.168.4.144:36596 -mca 
>>> mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path 
>>> /opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test
>>>  -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz/mpi_test 
>>> sdiaz    14420  0.0  0.0 99452 4596 ?        Sl   13:28   0:00              
>>> \_ pi3 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Sergio Díaz Montes 
>>> Centro de Supercomputacion de Galicia 
>>> Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain) 
>>> Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16 
>>> email: sd...@cesga.es ; http://www.cesga.es/ 
>>> ------------------------------------------------ 
>> 
>> 
>> 
> 
> 
> -- 
> Sergio Díaz Montes
> Centro de Supercomputacion de Galicia
> Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
> Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16 
> email: sd...@cesga.es ; http://www.cesga.es/
> ------------------------------------------------ 

