Hi Basma,
    Would you mind re-doing this experiment with DMTCP 2.1 (the latest version)?
You'll find it at:  http://sourceforge.net/projects/dmtcp/files/dmtcp-2.x/2.1/
Building it should be easy:  ./configure && make
We renamed the launch command (dmtcp_checkpoint is now dmtcp_launch), so it
will now be:
  bin/dmtcp_launch mpirun -np 4 -H master,node001,node002,node003 \
      /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.4
Then to restart, it should be the same as before:
  ./dmtcp_restart_script.sh
(Is this the way that you were invoking restart for dmtcp-1.2.5?)
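
In case it helps, here is a rough sketch of a wrapper script for the launch
step. It only assembles the launch command from the pieces quoted in this
thread and prints it. The names DMTCP_DIR, NPROCS, DRY_RUN, and
build_launch_cmd are made up for illustration and are not part of DMTCP, and
the unpack location is an assumption, so adjust it to your setup.

```shell
#!/bin/sh
# Sketch only: builds the DMTCP 2.1 launch command discussed in this thread.
# DMTCP_DIR, NPROCS, DRY_RUN, and build_launch_cmd are illustrative names,
# not anything provided by DMTCP itself.

DMTCP_DIR="${DMTCP_DIR:-$HOME/dmtcp-2.1}"   # assumed unpack/build location
HOSTS="master,node001,node002,node003"      # host list from the original report
NPROCS="${NPROCS:-4}"                       # 4 or 16 in the original report
BENCH="/home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.$NPROCS"

# Print the full launch command on a single line.
build_launch_cmd() {
    echo "$DMTCP_DIR/bin/dmtcp_launch mpirun -np $NPROCS -H $HOSTS $BENCH"
}

LAUNCH_CMD="$(build_launch_cmd)"

if [ "${DRY_RUN:-1}" = "1" ]; then
    # Default: only show what would be run.
    echo "$LAUNCH_CMD"
else
    # Actually launch under DMTCP (word splitting of $LAUNCH_CMD is intended).
    $LAUNCH_CMD
fi

# After a checkpoint has been taken, restart as before with:
#   ./dmtcp_restart_script.sh
```

By default the script just prints the command, so you can compare it against
what you typed by hand. Setting DRY_RUN=0 would run it, and NPROCS=16 would
select the lu.A.16 binary.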

If this still gives you any problems, please do write back.

Best wishes,
- Gene

----- Original Message -----
From: basma a.azeem <[email protected]>
To: [email protected]
Sent: Thu, 6 Feb 2014 21:39:17 -0500 (EST)
Subject: [Dmtcp-forum] FW: ver 1.2.5 - restart fail for 16 processes on a 4 
nodes cluster ?


From: [email protected]
To: [email protected]
Subject: ver 1.2.5 - restart fail for 16 processes on a 4 nodes cluster ?
Date: Fri, 7 Feb 2014 04:37:58 +0200




I am trying DMTCP version 1.2.5 with Open MPI.
I use a 4-node cluster.

When I checkpoint and restart an executable that was compiled for 4 processes,
the checkpoint works fine. At restart it prints "Segmentation fault
(core dumped)", but the program then continues and runs correctly.

ubuntu@master:~$ dmtcp-1.2.5/bin/dmtcp_checkpoint mpirun -np 4 \
    -H master,node001,node002,node003 \
    /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.4

But when I checkpoint and restart an executable that was compiled for 16
processes, the checkpoint works fine, but at restart it prints the output
below and then hangs indefinitely.

ubuntu@master:~$ dmtcp-1.2.5/bin/dmtcp_checkpoint mpirun -np 16 \
    -H master,node001,node002,node003 \
    /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.16

It looks like I am missing a simple detail.

Here is the output I got:

-------------------------------------------------------
dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
Copyright (C) 2006-2011  Jason Ansel, Michael Rieker, Kapil Arya, and
                                                       Gene Cooperman
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING file for details.
(Use flag "-q" to hide this message.)

dmtcp_coordinator starting...
    Port: 7779
    Checkpoint Interval: disabled (checkpoint manually instead)
    Exit on last client: 1
Backgrounding...
[6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( 
sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
     (strerror((*__errno_location ()))) = Address already in use
     id() = 18af1fad8d756-6416-52f43ea3(99072)
Message: Bind failed.
[6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( 
sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
     (strerror((*__errno_location ()))) = Address already in use
     id() = 18af1fad8d756-6419-52f43ea3(99092)
Message: Bind failed.
[6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( 
sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
     (strerror((*__errno_location ()))) = Address already in use
     id() = 18af1fad8d756-6422-52f43ea3(99112)
Message: Bind failed.
dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
Copyright (C) 2006-2011  Jason Ansel, Michael Rieker, Kapil Arya, and
                                                       Gene Cooperman
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING file for details.
(Use flag "-q" to hide this message.)

dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
Copyright (C) 2006-2011  Jason Ansel, Michael Rieker, Kapil Arya, and
                                                       Gene Cooperman
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING file for details.
(Use flag "-q" to hide this message.)

dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
Copyright (C) 2006-2011  Jason Ansel, Michael Rieker, Kapil Arya, and
                                                       Gene Cooperman
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING file for details.
(Use flag "-q" to hide this message.)

[3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( 
sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
     (strerror((*__errno_location ()))) = Address already in use
     id() = 20385667ca0e707-3257-52f43ea3(99074)
Message: Bind failed.
[3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( 
sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
     (strerror((*__errno_location ()))) = Address already in use
     id() = 20385667ca0e707-3261-52f43ea3(99094)
Message: Bind failed.
[3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( 
sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
     (strerror((*__errno_location ()))) = Address already in use
     id() = 20385667ca0e707-3265-52f43ea3(99114)
Message: Bind failed.
[2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( 
sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
     (strerror((*__errno_location ()))) = Address already in use
     id() = 20385667ca0e708-2483-52f43ea3(99074)
Message: Bind failed.
[2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( 
sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
     (strerror((*__errno_location ()))) = Address already in use
     id() = 20385667ca0e708-2487-52f43ea3(99094)
Message: Bind failed.
[2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( 
sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
     (strerror((*__errno_location ()))) = Address already in use
     id() = 20385667ca0e708-2491-52f43ea3(99114)
Message: Bind failed.
[2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( 
sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
     (strerror((*__errno_location ()))) = Address already in use
     id() = 20385667ca0e709-2475-52f43ea3(99076)
Message: Bind failed.
[2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( 
sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
     (strerror((*__errno_location ()))) = Address already in use
     id() = 20385667ca0e709-2479-52f43ea3(99096)
Message: Bind failed.
[2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( 
sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
     (strerror((*__errno_location ()))) = Address already in use
     id() = 20385667ca0e709-2483-52f43ea3(99116)
Message: Bind failed.
Segmentation fault (core dumped)
Segmentation fault (core dumped)
Segmentation fault (core dumped)
[6422] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
  mapping 
/tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master with 
data from ckpt image
[6416] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
  mapping 
/tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master with 
data from ckpt image
[6419] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
  mapping 
/tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master with 
data from ckpt image
[6416] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
  mapping 
/tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master with 
data from ckpt image
[6422] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
  mapping 
/tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master with 
data from ckpt image
[6419] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
  mapping 
/tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master with 
data from ckpt image



                                                                                
  
