Hi Moritz,

It sounds like there are some external sockets involved here. (For
details, please refer to the previous few threads on the forum, where I
described the terms "external" and "internal" sockets.) Could you please
share your launch script/command? I could try to reproduce this locally.
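
In the meantime, the drain warnings quoted below name the socket
descriptor that is still being drained, so it may help to tally them per
process. Here is a minimal Python sketch. It is not part of DMTCP: the
function name summarize_drain_warnings is made up for illustration, and
it assumes the warning layout is exactly as in your message. The
bracketed number is taken as printed, without assuming what kind of id
it is.

```python
import re
from collections import defaultdict

# Header of one warning, e.g.
# "[42000] WARNING at kernelbufferdrainer.cpp:125 in onTimeoutInterval;"
HEADER = re.compile(r"^\[(\d+)\] WARNING at kernelbufferdrainer\.cpp")
# Detail lines: "...sockfd() = N" and "buffer.size() = N"
SOCKFD = re.compile(r"sockfd\(\)\s*=\s*(\d+)")
BUFSIZE = re.compile(r"buffer\.size\(\)\s*=\s*(\d+)")


def summarize_drain_warnings(log_text):
    """Return {(bracketed_id, sockfd): [buffer sizes seen]} for drain warnings.

    Hypothetical helper: the layout is assumed from the quoted log only.
    """
    summary = defaultdict(list)
    current_id = None
    current_fd = None
    for raw in log_text.splitlines():
        line = raw.lstrip("> ").strip()  # tolerate mail-style "> " quoting
        match = HEADER.match(line)
        if match:
            current_id, current_fd = match.group(1), None
            continue
        if current_id is None:
            continue  # not inside a drain warning
        match = SOCKFD.search(line)
        if match:
            current_fd = int(match.group(1))
            continue
        match = BUFSIZE.search(line)
        if match and current_fd is not None:
            summary[(current_id, current_fd)].append(int(match.group(1)))
            current_id = current_fd = None  # this warning is complete
    return dict(summary)


# One warning copied from the quoted message, used as sample input.
SAMPLE = """\
[42000] WARNING at kernelbufferdrainer.cpp:125 in onTimeoutInterval;
REASON='JWARNING(false) failed'
     _dataSockets[i]->socket().sockfd() = 10
     buffer.size() = 129
     WARN_INTERVAL_SEC = 10
Message: Still draining socket... perhaps remote host is not running
under DMTCP?
"""

if __name__ == "__main__":
    for (proc, fd), sizes in sorted(summarize_drain_warnings(SAMPLE).items()):
        print(f"[{proc}] fd {fd}: buffer sizes seen {sizes}")
```

If the same (id, fd) pair keeps showing up with an unchanged buffer
size, that descriptor is a good candidate for the socket whose peer is
not responding to the drain.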

Thanks,
Rohan

On Fri, Aug 26, 2016 at 10:22:34AM +0000, Eilfort, Moritz Emanuel Christoph 
wrote:
> Dear DMTCP-Team,
> 
> I am trying to find a way to use DMTCP to migrate after checkpointing.
> Unfortunately I encountered the first problems with running DMTCP and
> MPICH without any third-party plugin or changes of any kind.
> 
> The problem is as follows:
> I start a dmtcp_coordinator on the localhost and then launch my mpi
> application. The mpi application is just sending messages from one
> process to another for a specified time. I use mpich-3.2 and mpirun
> with four processes on two hosts. All runs as expected until a
> checkpoint is initiated. As soon as a checkpoint is initiated dmtcp and
> my mpi application are stuck. I have to kill all connected processes
> manually. Ckpt images are not written to the specified directory. If I
> print out the process list using the coordinator the processes are
> sometimes listed as checkpointing and sometimes as suspended. If I do
> not initiate a checkpoint, the application runs until it is finished.
> 
> Often but not always dmtcp prints the following message upon getting
> stuck:
> 
> [42000] WARNING at kernelbufferdrainer.cpp:125 in onTimeoutInterval;
> REASON='JWARNING(false) failed'
>      _dataSockets[i]->socket().sockfd() = 10
>      buffer.size() = 129
>      WARN_INTERVAL_SEC = 10
> Message: Still draining socket... perhaps remote host is not running
> under DMTCP?
> [40000] WARNING at kernelbufferdrainer.cpp:125 in onTimeoutInterval;
> REASON='JWARNING(false) failed'
>      _dataSockets[i]->socket().sockfd() = 7
>      buffer.size() = 129
>      WARN_INTERVAL_SEC = 10
> Message: Still draining socket... perhaps remote host is not running
> under DMTCP?
> [43000] WARNING at kernelbufferdrainer.cpp:125 in onTimeoutInterval;
> REASON='JWARNING(false) failed'
>      _dataSockets[i]->socket().sockfd() = 16
>      buffer.size() = 177
>      WARN_INTERVAL_SEC = 10
> Message: Still draining socket... perhaps remote host is not running
> under DMTCP?
> [44000] WARNING at kernelbufferdrainer.cpp:125 in onTimeoutInterval;
> REASON='JWARNING(false) failed'
>      _dataSockets[i]->socket().sockfd() = 16
>      buffer.size() = 177
>      WARN_INTERVAL_SEC = 10
> Message: Still draining socket... perhaps remote host is not running
> under DMTCP?
> 
> A weird thing is that out of 20+ attempts to run it exactly as
> described above, I actually managed to take a checkpoint on two or three
> occasions before it crashed at the next initiated checkpoint.
> I did not change anything, and the end result stayed the same.
> However, I then had a checkpoint image from which to try a restart, and
> I encountered another problem. If I restart from the restart_script,
> not all processes are restarted. The dmtcp_ssh and dmtcp_sshd processes
> and the MPICH process-manager processes hydra and mpiexec are not
> restarted. If I use dmtcp_restart and specify all images, the
> application restarts without any problems, although it now only
> restarts on a single host. If I try to checkpoint now, the situation is
> the same as above (it freezes).
> 
> DMTCP runs smoothly on a single host. I can checkpoint, restart as
> often as I want to. The restart_script still seems to be swallowing a
> process. Initially six processes were connected to the coordinator,
> after restart with the restart_script only five processes are connected
> and after restart with dmtcp_restart six processes are connected to the
> coordinator. 
> 
> I am working on a local cluster at my university. I use two nodes
> connected via Ethernet. 
> 
> I would be very grateful if you could give me a hint as to how I can
> solve these problems. 
> 
> Kind regards,
> Moritz
> 
> ------------------------------------------------------------------------------
> _______________________________________________
> Dmtcp-forum mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
