Dear OpenMPI developers,
I'm testing checkpoint and restart with an OpenMPI 1.4 nightly build. The test
machine is an IBM Blade System over InfiniBand, with 4 processors per node.
At the moment I have some problems. My application is a simple
communication ring between processes, with a parametric loop.

*First case:* 8 procs over 2 nodes.

Start command:

$ mpirun -machinefile machinefile -am ft-enable-cr ./ring -t 5000000

The output is:

[node0316:20037] mca: base: components_open: Looking for filem components
[node0316:20037] mca: base: components_open: including only filem components that are checkpoint enabled
[node0316:20037] mca: base: components_open: (filem) Component rsh is Checkpointable
[node0316:20037] mca: base: components_open: opening filem components
[node0316:20037] mca: base: components_open: found loaded component rsh
[node0316:20037] mca: base: components_open: component rsh has no register function
[node0316:20037] filem:rsh: open()
[node0316:20037] filem:rsh: open: priority   = 20
[node0316:20037] filem:rsh: open: verbosity  = 0
[node0316:20037] filem:rsh: open: cp command  = scp
[node0316:20037] filem:rsh: open: rsh command  = ssh
[node0316:20037] mca: base: components_open: component rsh open function successful
[node0316:20037] mca:base:select: Auto-selecting filem components
[node0316:20037] mca:base:select:(filem) Querying component [rsh]
[node0316:20037] mca:base:select:(filem) Query of component [rsh] set priority to 20
[node0316:20037] mca:base:select:(filem) Selected component [rsh]
[node0316:20037] mca: base: components_open: Looking for snapc components
[node0316:20037] mca: base: components_open: including only snapc components that are checkpoint enabled
[node0316:20037] mca: base: components_open: (snapc) Component full is Checkpointable
[node0316:20037] mca: base: components_open: opening snapc components
[node0316:20037] mca: base: components_open: found loaded component full
[node0316:20037] mca: base: components_open: component full has no register function
[node0316:20037] snapc:full: open()
[node0316:20037] snapc:full: open: priority    = 20
[node0316:20037] snapc:full: open: verbosity   = 100
[node0316:20037] snapc:full: open: skip_filem  = False
[node0316:20037] mca: base: components_open: component full open function successful
[node0316:20037] mca:base:select: Auto-selecting snapc components
[node0316:20037] mca:base:select:(snapc) Querying component [full]
[node0316:20037] snapc:full: component_query()
[node0316:20037] mca:base:select:(snapc) Query of component [full] set priority to 20
[node0316:20037] mca:base:select:(snapc) Selected component [full]
[node0316:20037] snapc:full: module_init(1, 1)
[node0316:20037] snapc:full: module_init: Global Snapshot Coordinator
**HANG**

The application doesn't start and appears to hang.

Running mpirun under strace shows the following:

poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}], 3, 1000) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}], 3, 1000) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}], 3, 1000) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}], 3, 1000) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}], 3, 1000) = 0
poll(...

So mpirun keeps polling with a one-second timeout and does nothing else.

*Second case:* 1 node, 4 processes (no inter-node communication over
InfiniBand).

In this case, mpirun works well, but the checkpoint procedure fails:

$ ompi-checkpoint 20109
[node0316:20134] Error: Unable to get the current working directory
[node0316:20134] [[42404,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 395
[node0316:20134] HNP with PID 20109 Not found!

I don't understand why OpenMPI doesn't find that log file.

Any ideas?

Thanks in advance.





-- 
Gabriele Fatigati

CINECA Systems & Technologies Department

Supercomputing Group

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

www.cineca.it Tel: +39 051 6171722

g.fatig...@cineca.it
