I am running DMTCP on an HPC cluster and am hoping to run a distributed
application across several nodes and checkpoint/restart it.

When I run with the --rm plugin in the Torque environment, the application
hits the following error (I believe directly after dmtcp_launch --rm):
[40000] ERROR at fileconnection.cpp:619 in preCkpt;
REASON='JASSERT(Util::createDirectoryTree(savedFilePath)) failed'
     savedFilePath = /oasis/scratch/<checkpoint_directory>
Message: Unable to create directory in File Path
python2.7 (40000): Terminating...

If I run without --rm, the host names are not updated on restart and the
application fails.

A hello-world counting program that reports the host names every 60 seconds
via Open MPI runs smoothly with the --rm and --infiniband plugins on the
same system.  I have tried to track down similar errors in the forums but
failed to find reports involving distributed applications within an HPC
environment.  The checkpoints are written to the same directory for both
the Open MPI program and the distributed application, so it should not be a
folder-permissions problem.
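
To double-check the permissions point from inside the job, a probe along
these lines could be run on every allocated node. This is only a sketch:
can_create_tree is a hypothetical helper, not part of DMTCP, and the
directory passed on the command line stands in for the real checkpoint
directory. It tests plain nested-directory creation as seen from that node;
it does not reproduce what DMTCP itself does in preCkpt.

```python
import os
import shutil
import socket
import sys
import tempfile


def can_create_tree(base_dir):
    """Return True if a nested directory can be created and written under base_dir."""
    try:
        # Work inside a unique scratch directory so existing files are untouched.
        probe_root = tempfile.mkdtemp(prefix="ckpt_probe_", dir=base_dir)
    except OSError:
        # base_dir is missing, not a directory, or not writable from this node.
        return False
    nested = os.path.join(probe_root, "a", "b")
    try:
        os.makedirs(nested)
        with open(os.path.join(nested, "probe.txt"), "w") as handle:
            handle.write("ok\n")
        return True
    except (OSError, IOError):
        return False
    finally:
        # Always clean up the scratch directory, whether or not the probe passed.
        shutil.rmtree(probe_root, ignore_errors=True)


if __name__ == "__main__":
    # Placeholder default; pass the real checkpoint directory as the first argument.
    ckpt_dir = sys.argv[1] if len(sys.argv) > 1 else tempfile.gettempdir()
    print("%s: %s -> %s" % (socket.gethostname(), ckpt_dir, can_create_tree(ckpt_dir)))
```

If every node prints True for the checkpoint directory, ordinary permissions
and mount visibility are unlikely to be the cause.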

Any ideas?
_______________________________________________
Dmtcp-forum mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
