Hi:

I run a Condor cluster and am hoping to use DMTCP for checkpointing of jobs
that cannot be checkpointed within the relatively limited scope offered by
Condor itself. So thanks for building this general and useful tool.

We are a Debian 6 house and I have tried dmtcp 1.2.7 on two different
systems:

1. An extremely simple headless install VM
2. A bare metal "testing" login node of our cluster

Both are patched to the latest kernel, security patches, etc. Further, on
system #2, I have tried using the Debian backports copy (version #
1.2.5-1). The results are outwardly the same as using a compiled 1.2.7 so I
won't differentiate between them.

The short version is that system #1 passes the autotest.py test with flying
colors and also succeeds when I run things by hand. System #2 fails all
tests.

On system #2, I run dmtcp_checkpoint on the dmtcp1 test program and
issue a c command in the dmtcp_coordinator window:

/*
c
[29435] TRACE at dmtcp_coordinator.cpp:488 in handleUserCommand;
REASON='checkpointing...'
[29435] NOTE at dmtcp_coordinator.cpp:1315 in startCheckpoint;
REASON='starting checkpoint, suspending all nodes'
     s.numPeers = 1
[29435] NOTE at dmtcp_coordinator.cpp:1317 in startCheckpoint;
REASON='Incremented Generation'
     UniquePid::ComputationId().generation() = 1
*/

In the dmtcp_checkpoint window, the counting freezes at the point where the
checkpoint command was issued. No scripts or checkpoint images are
generated and the system will just sit there until I ^C out of
dmtcp_coordinator and dmtcp_checkpoint.

System #1 behaves as expected: the counter keeps counting and scripts/ckpt
images are generated. The restart script resumes counting from the point at
which the checkpoint command was issued.

My instinct is that it is an environment variable problem or some other
subtle issue. System #2 sets a number of env vars including TMPDIR. No
firewalls are installed and localhost maps to 127.0.0.1.
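
To narrow down the environment-variable theory, the two environments can be
compared directly. Below is a minimal sketch, assuming the output of `env` has
been captured as text on each system. The function names and the sample values
are hypothetical and not taken from the actual machines.

```python
def parse_env(text):
    """Parse the text output of `env` into a dict.

    Lines without '=' are skipped, so multi-line values are only
    partially captured; good enough for a first comparison.
    """
    env = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            env[key] = value
    return env


def diff_env(working, failing):
    """Return {name: (working_value, failing_value)} for variables that differ.

    A value of None means the variable is unset on that system.
    """
    names = set(working) | set(failing)
    return {
        name: (working.get(name), failing.get(name))
        for name in sorted(names)
        if working.get(name) != failing.get(name)
    }


# Hypothetical sample data standing in for the real `env` dumps.
system1 = parse_env("HOME=/home/user\nTMPDIR=/tmp\n")
system2 = parse_env("HOME=/home/user\nTMPDIR=/scratch/tmp\nEXTRA_VAR=1\n")

for name, (good, bad) in diff_env(system1, system2).items():
    print(f"{name}: system1={good!r} system2={bad!r}")
```

Any variable that shows up in the diff (TMPDIR being an obvious candidate) can
then be unset or changed on system #2, one at a time, before rerunning the test.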

I recompiled with --enable-debug and I have attached the logs of the above
exercise on the failing system.

NB: Compilation fails when I use --enable-condor-support, with several
errors about undeclared TCP/IP-related objects.

I note also that system #1 does not have any of the openmpi
libraries/binaries installed. I appreciate any thoughts you may have.

Yours,

--
Tom Downes
Associate Scientist and Data Center Manager
Center for Gravitation, Cosmology and Astrophysics
University of Wisconsin-Milwaukee
414.229.2678

Attachment: dmtcp-logs.tar.bz2
Description: BZip2 compressed data

