Hi Tom,
    Since you're running Debian 6.0, you could be running into a documented
bug in that version of Debian.  The short story is that this bug report:
  https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3094
describes both the problem and the bug fix.

For more information, please look at:
  http://dmtcp.sourceforge.net/condor.html
Specifically, please look at the pdf there:
  http://dmtcp.sourceforge.net/docs/condor-dmtcp-overview.pdf
That in turn leads you to a Condor bug report:
  https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3094
which points to the Debian bug report that was filed:
  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=679630
Sorry for the long list of pointers, but it was a weird bug that had
several of us from DMTCP, Condor, and NeuroDebian working together to track it
down.
I just added a pointer to this on our SourceForge site (Condor Integration),
to make this bug easier to find.

Best,
- Gene

On Fri, Mar 22, 2013 at 12:23:34PM -0400, Tom Downes wrote:
> Hi:
> 
> I run a Condor cluster and am hoping to use DMTCP for checkpointing of jobs
> that cannot be checkpointed within the relatively limited scope offered by
> Condor itself. So thanks for building this general and useful tool.
> 
> We are a Debian 6 house and I have tried dmtcp 1.2.7 on two different
> systems:
> 
> 1. An extremely simple headless install VM
> 2. A bare metal "testing" login node of our cluster
> 
> Both are patched to the latest kernel, security patches, etc. Further, on
> system #2, I have tried using the Debian backports copy (version #
> 1.2.5-1). The results are outwardly the same as using a compiled 1.2.7 so I
> won't differentiate between them.
> 
> The short version is that system #1 passes the autotest.py test with flying
> colors and also succeeds when I run things by hand. System #2 fails all
> tests.
> 
> On system #2, I run dmtcp_checkpoint on the dmtcp1 test program and
> issue a c command in the dmtcp_coordinator window:
> 
> /*
> c
> [29435] TRACE at dmtcp_coordinator.cpp:488 in handleUserCommand;
> REASON='checkpointing...'
> [29435] NOTE at dmtcp_coordinator.cpp:1315 in startCheckpoint;
> REASON='starting checkpoint, suspending all nodes'
>      s.numPeers = 1
> [29435] NOTE at dmtcp_coordinator.cpp:1317 in startCheckpoint;
> REASON='Incremented Generation'
>      UniquePid::ComputationId().generation() = 1
> */
> 
> In the dmtcp_checkpoint window, the counting freezes at the point where the
> checkpoint command was issued. No scripts or checkpoint images are
> generated and the system will just sit there until I ^C out of
> dmtcp_coordinator and dmtcp_checkpoint.
> 
> System #1 behaves as expected: the counter keeps counting and scripts/ckpt
> images are generated. The restart script resumes counting from the point at
> which the checkpoint command was issued.
> 
> My instinct is that it is an environment variable problem or some other
> subtle issue. System #2 sets a number of env vars including TMPDIR. No
> firewalls are installed and localhost maps to 127.0.0.1.
> 
> I recompiled with --enable-debug and I have attached the logs of the above
> exercise on the failing system.
> 
> NB: Compilation fails when I use --enable-condor-support, giving several
> errors about undeclared TCP/IP-related objects.
> 
> I note also that system #1 does not have any of the openmpi
> libraries/binaries installed. I appreciate any thoughts you may have.
> 
> Yours,
> 
> --
> Tom Downes
> Associate Scientist and Data Center Manager
> Center for Gravitation, Cosmology and Astrophysics
> University of Wisconsin-Milwaukee
> 414.229.2678


> ------------------------------------------------------------------------------
> Everyone hates slow websites. So do we.
> Make your web apps faster with AppDynamics
> Download AppDynamics Lite for free today:
> http://p.sf.net/sfu/appdyn_d2d_mar

> _______________________________________________
> Dmtcp-forum mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum

