From: [email protected]
To: [email protected]
Subject: ver 1.2.5 - restart fail for 16 processes on a 4 nodes cluster ?
Date: Fri, 7 Feb 2014 04:37:58 +0200
I am trying DMTCP version 1.2.5 with Open MPI on a 4-node cluster.
When I checkpoint and restart an executable that was compiled for 4 processes,
the checkpoint works fine. At restart it prints "Segmentation fault
(core dumped)", but the restart then continues and runs correctly:
ubuntu@master:~$ dmtcp-1.2.5/bin/dmtcp_checkpoint mpirun -np 4 -H master,node001,node002,node003 /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.4
But when I checkpoint and restart an executable that was compiled for 16
processes, the checkpoint works fine, but the restart prints the output below
and then hangs forever:
ubuntu@master:~$ dmtcp-1.2.5/bin/dmtcp_checkpoint mpirun -np 16 -H master,node001,node002,node003 /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.16
It looks like I am missing a simple detail.
Here is the output I got:
-------------------------------------------------------
dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and
Gene Cooperman
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING file for details.
(Use flag "-q" to hide this message.)
dmtcp_coordinator starting...
Port: 7779
Checkpoint Interval: disabled (checkpoint manually instead)
Exit on last client: 1
Backgrounding...
[6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( (
sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 18af1fad8d756-6416-52f43ea3(99072)
Message: Bind failed.
[6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( (
sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 18af1fad8d756-6419-52f43ea3(99092)
Message: Bind failed.
[6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( (
sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 18af1fad8d756-6422-52f43ea3(99112)
Message: Bind failed.
dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and
Gene Cooperman
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING file for details.
(Use flag "-q" to hide this message.)
dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and
Gene Cooperman
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING file for details.
(Use flag "-q" to hide this message.)
dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and
Gene Cooperman
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING file for details.
(Use flag "-q" to hide this message.)
[3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( (
sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 20385667ca0e707-3257-52f43ea3(99074)
Message: Bind failed.
[3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( (
sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 20385667ca0e707-3261-52f43ea3(99094)
Message: Bind failed.
[3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( (
sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 20385667ca0e707-3265-52f43ea3(99114)
Message: Bind failed.
[2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( (
sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 20385667ca0e708-2483-52f43ea3(99074)
Message: Bind failed.
[2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( (
sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 20385667ca0e708-2487-52f43ea3(99094)
Message: Bind failed.
[2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( (
sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 20385667ca0e708-2491-52f43ea3(99114)
Message: Bind failed.
[2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( (
sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 20385667ca0e709-2475-52f43ea3(99076)
Message: Bind failed.
[2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( (
sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 20385667ca0e709-2479-52f43ea3(99096)
Message: Bind failed.
[2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( (
sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 20385667ca0e709-2483-52f43ea3(99116)
Message: Bind failed.
Segmentation fault (core dumped)
Segmentation fault (core dumped)
Segmentation fault (core dumped)
[6422] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
mapping
/tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master with
data from ckpt image
[6416] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
mapping
/tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master with
data from ckpt image
[6419] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
mapping
/tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master with
data from ckpt image
[6416] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
mapping
/tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master with
data from ckpt image
[6422] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
mapping
/tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master with
data from ckpt image
[6419] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
mapping
/tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master with
data from ckpt image