Hi Moritz,

It sounds like there are some external sockets involved here. (For details, please refer to the previous few threads on the forum, where I described the terms external and internal sockets.) Could you please share your launch script/command? I could then try to reproduce this locally.

Thanks,
Rohan

On Fri, Aug 26, 2016 at 10:22:34AM +0000, Eilfort, Moritz Emanuel Christoph wrote:
> Dear DMTCP-Team,
>
> I am trying to find a way to use dmtcp to migrate after checkpointing.
> Unfortunately I encountered the first problems with running DMTCP and
> MPICH without any third-party plugin or changes of any kind.
>
> The problem is as follows:
> I start a dmtcp_coordinator on the localhost and then launch my mpi
> application. The mpi application is just sending messages from one
> process to another for a specified time. I use mpich-3.2 and mpirun
> with four processes on two hosts. All runs as expected until a
> checkpoint is initiated. As soon as a checkpoint is initiated dmtcp and
> my mpi application are stuck. I have to kill all connected processes
> manually. Ckpt images are not written to the specified directory. If I
> print out the process list using the coordinator the processes are
> sometimes listed as checkpointing and sometimes as suspended. If I do
> not initiate a checkpoint the application runs until it is finished.
>
> Often but not always dmtcp prints the following message upon getting
> stuck:
>
> [42000] WARNING at kernelbufferdrainer.cpp:125 in onTimeoutInterval;
> REASON='JWARNING(false) failed'
> _dataSockets[i]->socket().sockfd() = 10
> buffer.size() = 129
> WARN_INTERVAL_SEC = 10
> Message: Still draining socket... perhaps remote host is not running
> under DMTCP?
> [40000] WARNING at kernelbufferdrainer.cpp:125 in onTimeoutInterval;
> REASON='JWARNING(false) failed'
> _dataSockets[i]->socket().sockfd() = 7
> buffer.size() = 129
> WARN_INTERVAL_SEC = 10
> Message: Still draining socket... perhaps remote host is not running
> under DMTCP?
> [43000] WARNING at kernelbufferdrainer.cpp:125 in onTimeoutInterval;
> REASON='JWARNING(false) failed'
> _dataSockets[i]->socket().sockfd() = 16
> buffer.size() = 177
> WARN_INTERVAL_SEC = 10
> Message: Still draining socket... perhaps remote host is not running
> under DMTCP?
> [44000] WARNING at kernelbufferdrainer.cpp:125 in onTimeoutInterval;
> REASON='JWARNING(false) failed'
> _dataSockets[i]->socket().sockfd() = 16
> buffer.size() = 177
> WARN_INTERVAL_SEC = 10
> Message: Still draining socket... perhaps remote host is not running
> under DMTCP?
>
> A weird thing is that out of 20+ times trying to run it exactly as
> described above, I actually managed to run a checkpoint on two or three
> occasions before it crashed at the next initiated checkpoint.
> I did not change anything and the end result stayed the same.
> Although I then had a checkpoint image from which to try a restart, I
> then encountered another problem. If I restart from the restart_script,
> not all processes are restarted. The dmtcp_ssh and dmtcp_sshd processes
> and the mpich process-manager processes hydra and mpiexec are not
> restarted. If I use dmtcp_restart and specify all images the
> application restarts without any problems, although it now only
> restarts on a single host. If I try to checkpoint now the situation is
> the same as above (it freezes).
>
> DMTCP runs smoothly on a single host. I can checkpoint, restart as
> often as I want to. The restart_script still seems to be swallowing a
> process. Initially six processes were connected to the coordinator,
> after restart with the restart_script only five processes are connected
> and after restart with dmtcp_restart six processes are connected to the
> coordinator.
>
> I am working on a local cluster at my university. I use two nodes
> connected via Ethernet.
>
> I would be very grateful if you could give me a hint as to how I can
> solve these problems.
>
> Kind regards,
> Moritz
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Dmtcp-forum mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
