Hi Tom,
Since you're running Debian 6.0, you are probably running into a documented
bug in that version of Debian. The short story is that this bug report:
https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3094
describes both the problem and the fix.
For more information, please look at:
http://dmtcp.sourceforge.net/condor.html
Specifically, please look at the pdf there:
http://dmtcp.sourceforge.net/docs/condor-dmtcp-overview.pdf
That in turn leads you to a Condor bug report:
https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3094
which points to the Debian bug report that was filed:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=679630
Sorry for the long list of pointers, but it was a weird bug that had
several of us from DMTCP, Condor, and Neurodebian working together to track it
down.
I just added a pointer to this on our Sourceforge site (Condor Integration),
to make this bug more obvious.
Best,
- Gene
On Fri, Mar 22, 2013 at 12:23:34PM -0400, Tom Downes wrote:
> Hi:
>
> I run a Condor cluster and am hoping to use DMTCP for checkpointing of jobs
> that cannot be checkpointed within the relatively limited scope offered by
> Condor itself. So thanks for building this general and useful tool.
>
> We are a Debian 6 house and I have tried dmtcp 1.2.7 on two different
> systems:
>
> 1. An extremely simple headless install VM
> 2. A bare metal "testing" login node of our cluster
>
> Both are patched to the latest kernel, security patches, etc. Further, on
> system #2, I have tried using the Debian backports copy (version #
> 1.2.5-1). The results are outwardly the same as using a compiled 1.2.7 so I
> won't differentiate between them.
>
> The short version is that system #1 passes the autotest.py test with flying
> colors and also succeeds when I run things by hand. System #2 fails all
> tests.
>
> On system #2, I run dmtcp_checkpoint on the dmtcp1 test program and
> issue a c command in the dmtcp_coordinator window:
>
> /*
> c
> [29435] TRACE at dmtcp_coordinator.cpp:488 in handleUserCommand;
> REASON='checkpointing...'
> [29435] NOTE at dmtcp_coordinator.cpp:1315 in startCheckpoint;
> REASON='starting checkpoint, suspending all nodes'
> s.numPeers = 1
> [29435] NOTE at dmtcp_coordinator.cpp:1317 in startCheckpoint;
> REASON='Incremented Generation'
> UniquePid::ComputationId().generation() = 1
> */
>
> In the dmtcp_checkpoint window, the counting freezes at the point where the
> checkpoint command was issued. No scripts or checkpoint images are
> generated and the system will just sit there until I ^C out of
> dmtcp_coordinator and dmtcp_checkpoint.
>
> System #1 behaves as expected: the counter keeps counting and scripts/ckpt
> images are generated. The restart script resumes counting from the point
> at which the checkpoint command was issued.
>
> My instinct is that it is an environment variable problem or some other
> subtle issue. System #2 sets a number of env vars including TMPDIR. No
> firewalls are installed and localhost maps to 127.0.0.1.
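
One way to test the environment-variable theory above is to save the output
of `env` on both systems and compare the two dumps. The following is only a
minimal sketch under stated assumptions: the function names (parse_env,
diff_env) are hypothetical, nothing here is part of DMTCP or Condor, and it
assumes each variable fits on one NAME=value line (multi-line values would be
misparsed).

```python
# Hypothetical helper, not part of DMTCP: compare two saved `env` dumps,
# one taken on the working host and one on the failing host.

def parse_env(text):
    """Parse `env`-style output (one NAME=value per line) into a dict."""
    result = {}
    for line in text.splitlines():
        # Split on the first "=" only, since values may contain "=" too.
        name, sep, value = line.partition("=")
        if sep and name:  # skip lines that carry no NAME=value pair
            result[name] = value
    return result


def diff_env(good, bad):
    """Return the variables that differ as {name: (good_value, bad_value)}.

    A value of None means the variable is unset on that host.
    """
    names = sorted(set(good) | set(bad))
    return {
        name: (good.get(name), bad.get(name))
        for name in names
        if good.get(name) != bad.get(name)
    }
```

Reading the two dump files, passing their contents through parse_env, and
printing the result of diff_env gives a short list of candidates (TMPDIR
among them, if it is set on only one host) to unset one at a time before
rerunning the test.
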
>
> I recompiled with --enable-debug and I have attached the logs of the above
> exercise on the failing system.
>
> NB: Compilation fails when I use --enable-condor-support, giving several
> errors about undeclared TCP/IP-related objects.
>
> I note also that system #1 does not have any of the openmpi
> libraries/binaries installed. I appreciate any thoughts you may have.
>
> Yours,
>
> --
> Tom Downes
> Associate Scientist and Data Center Manager
> Center for Gravitation, Cosmology and Astrophysics
> University of Wisconsin-Milwaukee
> 414.229.2678
> ------------------------------------------------------------------------------
> Everyone hates slow websites. So do we.
> Make your web apps faster with AppDynamics
> Download AppDynamics Lite for free today:
> http://p.sf.net/sfu/appdyn_d2d_mar
> _______________________________________________
> Dmtcp-forum mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum