Hi Tom,

Thanks for the bug report and the log files. Unfortunately, the logs did not contain enough information to diagnose the failure on system #2. Could you tell us how to reproduce this bug locally? Alternatively, if you can provide a guest account on the machine where the failure is observed, that would speed things up. Otherwise, we could use something like VNC to share a screen and diagnose/fix the problem.
Please let me know what works best for you.

Kapil

On Fri, Mar 22, 2013 at 12:23 PM, Tom Downes <[email protected]> wrote:
> Hi:
>
> I run a Condor cluster and am hoping to use DMTCP for checkpointing of jobs
> that cannot be checkpointed within the relatively limited scope offered by
> Condor itself. So thanks for building this general and useful tool.
>
> We are a Debian 6 house and I have tried dmtcp 1.2.7 on two different
> systems:
>
> 1. An extremely simple headless install VM
> 2. A bare metal "testing" login node of our cluster
>
> Both are patched to the latest kernel, security patches, etc. Further, on
> system #2, I have tried using the Debian backports copy (version # 1.2.5-1).
> The results are outwardly the same as using a compiled 1.2.7 so I won't
> differentiate between them.
>
> The short version is that system #1 passes the autotest.py test with flying
> colors and also succeeds when I run things by hand. System #2 fails all
> tests.
>
> On system #2, I run dmtcp_checkpoint on the dmtcp1 test program and
> issue a c command in the dmtcp_coordinator window:
>
> /*
> c
> [29435] TRACE at dmtcp_coordinator.cpp:488 in handleUserCommand;
> REASON='checkpointing...'
> [29435] NOTE at dmtcp_coordinator.cpp:1315 in startCheckpoint;
> REASON='starting checkpoint, suspending all nodes'
> s.numPeers = 1
> [29435] NOTE at dmtcp_coordinator.cpp:1317 in startCheckpoint;
> REASON='Incremented Generation'
> UniquePid::ComputationId().generation() = 1
> */
>
> In the dmtcp_checkpoint window, the counting freezes at the point where the
> checkpoint command was issued. No scripts or checkpoint images are generated
> and the system will just sit there until I ^C out of dmtcp_coordinator and
> dmtcp_checkpoint.
>
> System #1 behaves as expected: the counter keeps counting and scripts/ckpt
> images are generated. The restart script starts counting when the checkpoint
> command was issued.
>
> My instinct is that it is an environment variable problem or some other
> subtle issue. System #2 sets a number of env vars including TMPDIR. No
> firewalls are installed and localhost maps to 127.0.0.1.
>
> I recompiled with --enable-debug and I have attached the logs of the above
> exercise on the failing system.
>
> NB: Compilation fails when I use --enable-condor-support giving several
> errors about having failed to declare TCP/IP-related objects.
>
> I note also that system #1 does not have any of the openmpi
> libraries/binaries installed. I appreciate any thoughts you may have.
>
> Yours,
>
> --
> Tom Downes
> Associate Scientist and Data Center Manager
> Center for Gravitation, Cosmology and Astrophysics
> University of Wisconsin-Milwaukee
> 414.229.2678
>
> ------------------------------------------------------------------------------
> Everyone hates slow websites. So do we.
> Make your web apps faster with AppDynamics
> Download AppDynamics Lite for free today:
> http://p.sf.net/sfu/appdyn_d2d_mar
> _______________________________________________
> Dmtcp-forum mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
