Dear OpenMPI developers,

I'm testing checkpoint and restart with an OpenMPI 1.4 nightly build. The test machine is an IBM Blade System over Infiniband, with 4 processors per node. At the moment I am running into some problems. My application is a simple communication ring between processes, with a parametric loop.

*First case:* 8 procs over 2 nodes.

Start command:

$ mpirun -machinefile machinefile -am ft-enable-cr ./ring -t 5000000

The output is:

[node0316:20037] mca: base: components_open: Looking for filem components
[node0316:20037] mca: base: components_open: including only filem components that are checkpoint enabled
[node0316:20037] mca: base: components_open: (filem) Component rsh is Checkpointable
[node0316:20037] mca: base: components_open: opening filem components
[node0316:20037] mca: base: components_open: found loaded component rsh
[node0316:20037] mca: base: components_open: component rsh has no register function
[node0316:20037] filem:rsh: open()
[node0316:20037] filem:rsh: open: priority = 20
[node0316:20037] filem:rsh: open: verbosity = 0
[node0316:20037] filem:rsh: open: cp command = scp
[node0316:20037] filem:rsh: open: rsh command = ssh
[node0316:20037] mca: base: components_open: component rsh open function successful
[node0316:20037] mca:base:select: Auto-selecting filem components
[node0316:20037] mca:base:select:(filem) Querying component [rsh]
[node0316:20037] mca:base:select:(filem) Query of component [rsh] set priority to 20
[node0316:20037] mca:base:select:(filem) Selected component [rsh]
[node0316:20037] mca: base: components_open: Looking for snapc components
[node0316:20037] mca: base: components_open: including only snapc components that are checkpoint enabled
[node0316:20037] mca: base: components_open: (snapc) Component full is Checkpointable
[node0316:20037] mca: base: components_open: opening snapc components
[node0316:20037] mca: base: components_open: found loaded component full
[node0316:20037] mca: base: components_open: component full has no register function
[node0316:20037] snapc:full: open()
[node0316:20037] snapc:full: open: priority = 20
[node0316:20037] snapc:full: open: verbosity = 100
[node0316:20037] snapc:full: open: skip_filem = False
[node0316:20037] mca: base: components_open: component full open function successful
[node0316:20037] mca:base:select: Auto-selecting snapc components
[node0316:20037] mca:base:select:(snapc) Querying component [full]
[node0316:20037] snapc:full: component_query()
[node0316:20037] mca:base:select:(snapc) Query of component [full] set priority to 20
[node0316:20037] mca:base:select:(snapc) Selected component [full]
[node0316:20037] snapc:full: module_init(1, 1)
[node0316:20037] snapc:full: module_init: Global Snapshot Coordinator
**HANG**

The application doesn't start and appears to be locked. Running mpirun under strace shows the following:

poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}], 3, 1000) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}], 3, 1000) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}], 3, 1000) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}], 3, 1000) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}], 3, 1000) = 0
poll(...

and so on, doing nothing.

*Second case:* 1 node, 4 processors (without intercommunication over Infiniband).

In this case mpirun works well, but the checkpoint procedure fails:

$ ompi-checkpoint 20109
[node0316:20134] Error: Unable to get the current working directory
[node0316:20134] [[42404,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 395
[node0316:20134] HNP with PID 20109 Not found!

I don't understand why OpenMPI doesn't find that log file. Any ideas?

Thanks in advance.

--
Gabriele Fatigati

CINECA Systems & Technologies Department
Supercomputing Group
Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
www.cineca.it
Tel: +39 051 6171722
g.fatig...@cineca.it