Hi Gene/Kapil,
Thank you so much for your help.
About your question:
./dmtcp_restart_script.sh
(Yes, this is how I was invoking restart for dmtcp-1.2.5.)
Does version 1.2.5 have problems with 16 processes on a 4-node cluster? It
worked fine in these cases:
1- a single node, with 4 processes and with 16 processes
2- the 4-node cluster, with 4 processes
About this part:
Building it should be easy: ./configure && make
Shouldn't I also run "make install", so that all the required files can be
found on all nodes of the cluster?
Thank you
>
> Date: Thu, 6 Feb 2014 23:03:00 -0500
> From: [email protected]
> To: [email protected]
> CC: [email protected]
> Subject: Re: [Dmtcp-forum] FW: ver 1.2.5 - restart fail for 16 processes on a
> 4 nodes cluster ?
>
> Hi Basma,
> Would you mind re-doing this experiment with DMTCP 2.1 (the latest
> version)?
> You'll find it at: http://sourceforge.net/projects/dmtcp/files/dmtcp-2.x/2.1/
> Building it should be easy: ./configure && make
> We renamed the way to start. It will now be:
> bin/dmtcp_launch mpirun -np 4 -H master,node001,node002,node003
> /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.4
> Then to restart, it should be the same as before:
> ./dmtcp_restart_script.sh
> (Is this the way that you were invoking restart for dmtcp-1.2.5?)
>
> If this still gives you any problems, please do write back.
>
> Best wishes,
> - Gene
>
> ----- Original Message -----
> From: basma a.azeem <[email protected]>
> To: [email protected]
> Sent: Thu, 6 Feb 2014 21:39:17 -0500 (EST)
> Subject: [Dmtcp-forum] FW: ver 1.2.5 - restart fail for 16 processes on a 4
> nodes cluster ?
>
>
> From: [email protected]
> To: [email protected]
> Subject: ver 1.2.5 - restart fail for 16 processes on a 4 nodes cluster ?
> Date: Fri, 7 Feb 2014 04:37:58 +0200
>
>
>
>
> I am trying DMTCP version 1.2.5 with Open MPI.
> I use a 4-node cluster.
>
> When I checkpoint and restart an executable that was compiled for 4
> processes, the checkpoint works well. At restart it prints "Segmentation
> fault (core dumped)", but the restart then also proceeds correctly.
>
> ubuntu@master:~$ dmtcp-1.2.5/bin/dmtcp_checkpoint mpirun -np 4 -H
> master,node001,node002,node003 /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.4
>
> But when I checkpoint and restart an executable that was compiled for 16
> processes, the checkpoint works well, but the restart prints the output
> below and then hangs. It never finishes.
>
> ubuntu@master:~$ dmtcp-1.2.5/bin/dmtcp_checkpoint mpirun -np 16 -H
> master,node001,node002,node003
> /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.16
>
> It looks like I am missing a simple detail.
>
> Here is the output I got:
>
> -------------------------------------------------------
> dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
> Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and
> Gene Cooperman
> This program comes with ABSOLUTELY NO WARRANTY.
> This is free software, and you are welcome to redistribute it
> under certain conditions; see COPYING file for details.
> (Use flag "-q" to hide this message.)
>
> dmtcp_coordinator starting...
> Port: 7779
> Checkpoint Interval: disabled (checkpoint manually instead)
> Exit on last client: 1
> Backgrounding...
> [6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind (
> ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 18af1fad8d756-6416-52f43ea3(99072)
> Message: Bind failed.
> [6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind (
> ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 18af1fad8d756-6419-52f43ea3(99092)
> Message: Bind failed.
> [6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind (
> ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 18af1fad8d756-6422-52f43ea3(99112)
> Message: Bind failed.
> dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
> Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and
> Gene Cooperman
> This program comes with ABSOLUTELY NO WARRANTY.
> This is free software, and you are welcome to redistribute it
> under certain conditions; see COPYING file for details.
> (Use flag "-q" to hide this message.)
>
> dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
> Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and
> Gene Cooperman
> This program comes with ABSOLUTELY NO WARRANTY.
> This is free software, and you are welcome to redistribute it
> under certain conditions; see COPYING file for details.
> (Use flag "-q" to hide this message.)
>
> dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
> Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and
> Gene Cooperman
> This program comes with ABSOLUTELY NO WARRANTY.
> This is free software, and you are welcome to redistribute it
> under certain conditions; see COPYING file for details.
> (Use flag "-q" to hide this message.)
>
> [3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind (
> ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 20385667ca0e707-3257-52f43ea3(99074)
> Message: Bind failed.
> [3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind (
> ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 20385667ca0e707-3261-52f43ea3(99094)
> Message: Bind failed.
> [3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind (
> ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 20385667ca0e707-3265-52f43ea3(99114)
> Message: Bind failed.
> [2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind (
> ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 20385667ca0e708-2483-52f43ea3(99074)
> Message: Bind failed.
> [2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind (
> ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 20385667ca0e708-2487-52f43ea3(99094)
> Message: Bind failed.
> [2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind (
> ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 20385667ca0e708-2491-52f43ea3(99114)
> Message: Bind failed.
> [2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind (
> ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 20385667ca0e709-2475-52f43ea3(99076)
> Message: Bind failed.
> [2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind (
> ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 20385667ca0e709-2479-52f43ea3(99096)
> Message: Bind failed.
> [2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind (
> ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 20385667ca0e709-2483-52f43ea3(99116)
> Message: Bind failed.
> Segmentation fault (core dumped)
> Segmentation fault (core dumped)
> Segmentation fault (core dumped)
> [[6422] mtcp_restart_nolibc.c:[929 read_shared_memory_area_from_file:
> mapping
> 6416/tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master]
> mtcp_restart_nolibc.c with data from ckpt image
> 6419:929 read_shared_memory_area_from_file:
> ] mtcp_restart_nolibc.cmapping
> /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master:929 with
> data from ckpt image
> read_shared_memory_area_from_file:
> mapping
> /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master with
> data from ckpt image
> [6416] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
> mapping
> /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master
> with data from ckpt image
> [6422] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
> mapping
> /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master
> with data from ckpt image
> [6419] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
> mapping
> /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master
> with data from ckpt image
>
>
>
>
>