Hi Basma,
    I'm reading the e-mails in chronological order.  :-)

The excerpt below seems interesting.
>      controllingTty = /dev/pts/0
>      (strerror((*__errno_location ()))) = Permission denied
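
One quick way to confirm this from the shell that runs the restart script is to try opening the same terminal path directly. The sketch below is only an illustration and is not DMTCP code: the helper name check_tty_access is made up, and the O_RDWR flag is an assumption, since the log does not show which flags fileconnection.cpp:399 actually passes.

```python
import errno
import os


def check_tty_access(path):
    """Try to open `path` read/write and report the result.

    Hypothetical helper for illustration only. O_RDWR is an assumption
    about how the restart opens the terminal. O_NOCTTY keeps this check
    from acquiring the terminal as a controlling tty.
    Returns a tuple (ok, message).
    """
    try:
        fd = os.open(path, os.O_RDWR | os.O_NOCTTY)
    except OSError as e:
        # Map the errno number to its symbolic name, e.g. EACCES.
        name = errno.errorcode.get(e.errno, str(e.errno))
        return False, f"{path}: {name} ({e.strerror})"
    os.close(fd)
    return True, f"{path}: opened OK"


if __name__ == "__main__":
    # /dev/pts/0 is the controllingTty path from the error output.
    ok, msg = check_tty_access("/dev/pts/0")
    print(msg)
```

If this reports EACCES for /dev/pts/0, the restarting user cannot open the terminal that was recorded at checkpoint time. One possible cause is that the pty now belongs to a different login session.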

Kapil tells me that another group had reported a similar bug, but we
were unable to reproduce their bug locally.  Would it be possible
for us to get an account on your cluster?   That would be the shortest
path to analyzing this bug.  If that is not possible for you, we can
also propose a screen-sharing session, so that we can watch as
you reproduce the bug.

(If a guest account is possible, then we'll work with the DMTCP-2.1 that
 you've already installed.  We certainly won't need any privileges.)

Best wishes,
- Gene

On Fri, Feb 07, 2014 at 07:49:21PM +0200, basma a.azeem wrote:
> 
> 
> i tried version 2.1
> 
> in single node case for 16 processes at restart it gives me this error:
> 
> 
> ./dmtcp_restart_script.sh
> 
> dmtcp_restart (DMTCP + MTCP) 2.1
> 
> Copyright (C) 2006-2014  Jason Ansel, Michael Rieker, Kapil Arya, and
>                                                        Gene Cooperman
> License LGPLv3+: GNU LGPL version 3 or later 
> <http://gnu.org/licenses/lgpl.html>.
> This program comes with ABSOLUTELY NO WARRANTY.
> This is free software, and you are welcome to redistribute it
> under certain conditions; see COPYING file for details.
> (Use flag "-q" to hide this message.)
> 
> [40000] ERROR at fileconnection.cpp:399 in postRestart; 
> REASON='JASSERT(tempfd >= 0) failed'
>      tempfd = -1
>      controllingTty = /dev/pts/0
>      (strerror((*__errno_location ()))) = Permission denied
> Message: Error Opening the terminal attached with the process
> orterun (40000): Terminating...
> ubuntu@ip-10-43-154-61:~$ [41000] WARNING at socketconnection.cpp:504 in 
> postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) 
> failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-41000-52f519db(99080)
> Message: Bind failed.
> [41000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-41000-52f519db(99081)
> Message: Bind failed.
> [41000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-41000-52f519db(99097)
> Message: Bind failed.
> [41000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-41000-52f519db(99098)
> Message: Bind failed.
> [53000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-53000-52f519dc(99201)
> Message: Bind failed.
> [53000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-53000-52f519dc(99202)
> Message: Bind failed.
> [53000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-53000-52f519dc(99218)
> Message: Bind failed.
> [53000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-53000-52f519dc(99219)
> Message: Bind failed.
> [45000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-45000-52f519db(99121)
> Message: Bind failed.
> [45000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-45000-52f519db(99122)
> Message: Bind failed.
> [45000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-45000-52f519db(99138)
> Message: Bind failed.
> [45000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-45000-52f519db(99139)
> Message: Bind failed.
> [54000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-54000-52f519dc(99211)
> Message: Bind failed.
> [54000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-54000-52f519dc(99212)
> Message: Bind failed.
> [54000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-54000-52f519dc(99228)
> Message: Bind failed.
> [54000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-54000-52f519dc(99229)
> Message: Bind failed.
> [49000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-49000-52f519db(99161)
> Message: Bind failed.
> [49000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-49000-52f519db(99162)
> Message: Bind failed.
> [49000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-49000-52f519db(99178)
> Message: Bind failed.
> [49000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-49000-52f519db(99179)
> Message: Bind failed.
> [42000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-42000-52f519db(99091)
> Message: Bind failed.
> [42000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-42000-52f519db(99092)
> Message: Bind failed.
> [42000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-42000-52f519db(99108)
> Message: Bind failed.
> [42000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-42000-52f519db(99109)
> Message: Bind failed.
> [56000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-56000-52f519dc(99231)
> Message: Bind failed.
> [56000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-56000-52f519dc(99232)
> Message: Bind failed.
> [56000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-56000-52f519dc(99248)
> Message: Bind failed.
> [56000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-56000-52f519dc(99249)
> Message: Bind failed.
> [43000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-43000-52f519db(99101)
> Message: Bind failed.
> [43000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-43000-52f519db(99102)
> Message: Bind failed.
> [43000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-43000-52f519db(99118)
> Message: Bind failed.
> [43000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-43000-52f519db(99119)
> Message: Bind failed.
> [44000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-44000-52f519db(99111)
> Message: Bind failed.
> [44000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-44000-52f519db(99112)
> Message: Bind failed.
> [44000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-44000-52f519db(99128)
> Message: Bind failed.
> [44000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-44000-52f519db(99129)
> Message: Bind failed.
> [46000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-46000-52f519db(99131)
> Message: Bind failed.
> [46000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-46000-52f519db(99132)
> Message: Bind failed.
> [46000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-46000-52f519db(99148)
> Message: Bind failed.
> [46000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-46000-52f519db(99149)
> Message: Bind failed.
> [52000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-52000-52f519dc(99191)
> Message: Bind failed.
> [52000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-52000-52f519dc(99192)
> Message: Bind failed.
> [48000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-48000-52f519db(99151)
> Message: Bind failed.
> [52000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-52000-52f519dc(99208)
> Message: Bind failed.
> [52000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-52000-52f519dc(99209)
> Message: Bind failed.
> [48000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-48000-52f519db(99152)
> Message: Bind failed.
> [48000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-48000-52f519db(99168)
> Message: Bind failed.
> [55000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-55000-52f519dc(99221)
> Message: Bind failed.
> [48000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-48000-52f519db(99169)
> Message: Bind failed.
> [55000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-55000-52f519dc(99222)
> Message: Bind failed.
> [55000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-55000-52f519dc(99238)
> Message: Bind failed.
> [55000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-55000-52f519dc(99239)
> Message: Bind failed.
> [47000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-47000-52f519db(99141)
> Message: Bind failed.
> [47000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-47000-52f519db(99142)
> Message: Bind failed.
> [47000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-47000-52f519db(99158)
> Message: Bind failed.
> [47000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-47000-52f519db(99159)
> Message: Bind failed.
> [51000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-51000-52f519db(99181)
> Message: Bind failed.
> [51000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-51000-52f519db(99182)
> Message: Bind failed.
> [51000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-51000-52f519db(99198)
> Message: Bind failed.
> [51000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-51000-52f519db(99199)
> Message: Bind failed.
> [50000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-50000-52f519db(99171)
> Message: Bind failed.
> [50000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-50000-52f519db(99172)
> Message: Bind failed.
> [50000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-50000-52f519db(99188)
> Message: Bind failed.
> [50000] WARNING at socketconnection.cpp:504 in postRestart; 
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 6da2961af00014aa-50000-52f519db(99189)
> Message: Bind failed.
> [41000] ERROR at connectionrewirer.cpp:89 in doReconnect; 
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, 
> remoteAddr.len) == 0) failed'
>      id = 6da2961af00014aa-40000-52f519db(99049)
>      (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (41000): Terminating...
> [49000] ERROR at connectionrewirer.cpp:89 in doReconnect; 
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, 
> remoteAddr.len) == 0) failed'
>      id = 6da2961af00014aa-40000-52f519db(99129)
>      (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (49000): Terminating...
> [46000] ERROR at connectionrewirer.cpp:89 in doReconnect; 
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, 
> remoteAddr.len) == 0) failed'
>      id = 6da2961af00014aa-40000-52f519db(99099)
>      (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (46000): Terminating...
> [55000] ERROR at connectionrewirer.cpp:89 in doReconnect; 
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, 
> remoteAddr.len) == 0) failed'
>      id = 6da2961af00014aa-40000-52f519db(99189)
>      (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (55000): Terminating...
> [43000] ERROR at connectionrewirer.cpp:89 in doReconnect; 
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, 
> remoteAddr.len) == 0) failed'
>      id = 6da2961af00014aa-40000-52f519db(99069)
>      (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (43000): Terminating...
> [47000] ERROR at connectionrewirer.cpp:89 in doReconnect; 
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, 
> remoteAddr.len) == 0) failed'
>      id = 6da2961af00014aa-40000-52f519db(99109)
>      (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (47000): Terminating...
> [42000] ERROR at connectionrewirer.cpp:89 in doReconnect; 
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, 
> remoteAddr.len) == 0) failed'
>      id = 6da2961af00014aa-40000-52f519db(99059)
>      (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (42000): Terminating...
> [45000] ERROR at connectionrewirer.cpp:89 in doReconnect; 
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, 
> remoteAddr.len) == 0) failed'
>      id = 6da2961af00014aa-40000-52f519db(99089)
>      (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (45000): Terminating...
> [54000] ERROR at connectionrewirer.cpp:89 in doReconnect; 
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, 
> remoteAddr.len) == 0) failed'
>      id = 6da2961af00014aa-40000-52f519db(99179)
>      (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (54000): Terminating...
> [51000] ERROR at connectionrewirer.cpp:89 in doReconnect; 
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, 
> remoteAddr.len) == 0) failed'
>      id = 6da2961af00014aa-40000-52f519db(99149)
>      (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (51000): Terminating...
> [50000] ERROR at connectionrewirer.cpp:89 in doReconnect; 
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, 
> remoteAddr.len) == 0) failed'
>      id = 6da2961af00014aa-40000-52f519db(99139)
>      (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (50000): Terminating...
> [44000] ERROR at connectionrewirer.cpp:89 in doReconnect; 
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, 
> remoteAddr.len) == 0) failed'
>      id = 6da2961af00014aa-40000-52f519db(99079)
>      (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (44000): Terminating...
> [52000] ERROR at connectionrewirer.cpp:89 in doReconnect; 
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, 
> remoteAddr.len) == 0) failed'
>      id = 6da2961af00014aa-40000-52f519db(99159)
>      (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (52000): Terminating...
> [56000] ERROR at connectionrewirer.cpp:89 in doReconnect; 
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, 
> remoteAddr.len) == 0) failed'
>      id = 6da2961af00014aa-40000-52f519db(99199)
>      (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (56000): Terminating...
> [53000] ERROR at connectionrewirer.cpp:89 in doReconnect; 
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, 
> remoteAddr.len) == 0) failed'
>      id = 6da2961af00014aa-40000-52f519db(99169)
>      (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (53000): Terminating...
> [48000] ERROR at connectionrewirer.cpp:89 in doReconnect; 
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, 
> remoteAddr.len) == 0) failed'
>      id = 6da2961af00014aa-40000-52f519db(99119)
>      (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (48000): Terminating...
> 
> 
> 
> 
> From: [email protected]
> To: [email protected]; [email protected]
> Subject: RE: [Dmtcp-forum] FW: ver 1.2.5 - restart fail for 16 processes on a 
> 4 nodes cluster ?
> Date: Fri, 7 Feb 2014 19:14:44 +0200
> 
> 
> 
> 
> Hi Gene/Kapil
> thank you so much for your help
> 
> about your question:
> 
>   ./dmtcp_restart_script.sh
> (yes , this Is the way by which i was invoking restart for dmtcp-1.2.5)
> 
> does version 1.2.5 have problems in case of 16 process on 4 nodes cluster it 
> worked fine in these cases:
> 
> 1- single node for 4 processes and 16 processes
> 2- 4 nodes cluster for 4 processes
> 
> 
> about this part:
>  Building it should be easy:  ./configure && make
> should not i do "make install " also in order to find all the required files 
> in all nodes of the cluster ?
> 
> thank you
> > 
> 
> > Date: Thu, 6 Feb 2014 23:03:00 -0500
> > From: [email protected]
> > To: [email protected]
> > CC: [email protected]
> > Subject: Re: [Dmtcp-forum] FW: ver 1.2.5 - restart fail for 16 processes on 
> > a 4 nodes cluster ?
> > 
> > Hi Basma,
> >     Would you mind re-doing this experiment with DMTCP 2.1 (the latest 
> > version)?
> > You'll find it at:  
> > http://sourceforge.net/projects/dmtcp/files/dmtcp-2.x/2.1/
> > Building it should be easy:  ./configure && make
> > We renamed the way to start.  It will now be:
> >   bin/dmtcp_launch mpirun -np 4   -H master,node001,node002,node003   
> > /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.4
> > Then to restart, it should be the same as before:
> >   ./dmtcp_restart_script.sh
> > (Is this the way that you were invoking restart for dmtcp-1.2.5?)
> > 
> > If this still gives you any problems, please do write back.
> > 
> > Best wishes,
> > - Gene
> > 
> > ----- Original Message -----
> > From: basma a.azeem <[email protected]>
> > To: [email protected]
> > Sent: Thu, 6 Feb 2014 21:39:17 -0500 (EST)
> > Subject: [Dmtcp-forum] FW: ver 1.2.5 - restart fail for 16 processes on a 4 
> > nodes cluster ?
> > 
> > 
> > From: [email protected]
> > To: [email protected]
> > Subject: ver 1.2.5 - restart fail for 16 processes on a 4 nodes cluster ?
> > Date: Fri, 7 Feb 2014 04:37:58 +0200
> > 
> > 
> > 
> > 
> > i  am trying dmtcp version 1.2.5 with open mpi
> > i use a 4 node cluster
> > 
> > when i try to checkpoint and restart an exe that was compiled for 4
> > processes, the checkpoint works. at restart it first prints "Segmentation
> > fault (core dumped)", but then the restart also works correctly
> > 
> > ubuntu@master:~$ dmtcp-1.2.5/bin/dmtcp_checkpoint mpirun -np 4   -H 
> > master,node001,node002,node003   /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.4
> > 
> > but when i try to checkpoint and restart an exe that was compiled for 16
> > processes, the checkpoint works, but at restart it gives this output
> > and hangs. it stops forever
> > 
> > ubuntu@master:~$ dmtcp-1.2.5/bin/dmtcp_checkpoint mpirun -np 16   -H 
> > master,node001,node002,node003   
> > /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.16
> > 
> > it looks like i am missing a simple detail
> > 
> > here is the output i had :
> > 
> > -------------------------------------------------------
> > dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
> > Copyright (C) 2006-2011  Jason Ansel, Michael Rieker, Kapil Arya, and
> >                                                        Gene Cooperman
> > This program comes with ABSOLUTELY NO WARRANTY.
> > This is free software, and you are welcome to redistribute it
> > under certain conditions; see COPYING file for details.
> > (Use flag "-q" to hide this message.)
> > 
> > dmtcp_coordinator starting...
> >     Port: 7779
> >     Checkpoint Interval: disabled (checkpoint manually instead)
> >     Exit on last client: 1
> > Backgrounding...
> > [6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind 
> > ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> >      (strerror((*__errno_location ()))) = Address already in use
> >      id() = 18af1fad8d756-6416-52f43ea3(99072)
> > Message: Bind failed.
> > [6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind 
> > ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> >      (strerror((*__errno_location ()))) = Address already in use
> >      id() = 18af1fad8d756-6419-52f43ea3(99092)
> > Message: Bind failed.
> > [6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind 
> > ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> >      (strerror((*__errno_location ()))) = Address already in use
> >      id() = 18af1fad8d756-6422-52f43ea3(99112)
> > Message: Bind failed.
> > dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
> > Copyright (C) 2006-2011  Jason Ansel, Michael Rieker, Kapil Arya, and
> >                                                        Gene Cooperman
> > This program comes with ABSOLUTELY NO WARRANTY.
> > This is free software, and you are welcome to redistribute it
> > under certain conditions; see COPYING file for details.
> > (Use flag "-q" to hide this message.)
> > 
> > dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
> > Copyright (C) 2006-2011  Jason Ansel, Michael Rieker, Kapil Arya, and
> >                                                        Gene Cooperman
> > This program comes with ABSOLUTELY NO WARRANTY.
> > This is free software, and you are welcome to redistribute it
> > under certain conditions; see COPYING file for details.
> > (Use flag "-q" to hide this message.)
> > 
> > dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
> > Copyright (C) 2006-2011  Jason Ansel, Michael Rieker, Kapil Arya, and
> >                                                        Gene Cooperman
> > This program comes with ABSOLUTELY NO WARRANTY.
> > This is free software, and you are welcome to redistribute it
> > under certain conditions; see COPYING file for details.
> > (Use flag "-q" to hide this message.)
> > 
> > [3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind 
> > ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> >      (strerror((*__errno_location ()))) = Address already in use
> >      id() = 20385667ca0e707-3257-52f43ea3(99074)
> > Message: Bind failed.
> > [3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind 
> > ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> >      (strerror((*__errno_location ()))) = Address already in use
> >      id() = 20385667ca0e707-3261-52f43ea3(99094)
> > Message: Bind failed.
> > [3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind 
> > ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> >      (strerror((*__errno_location ()))) = Address already in use
> >      id() = 20385667ca0e707-3265-52f43ea3(99114)
> > Message: Bind failed.
> > [2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind 
> > ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> >      (strerror((*__errno_location ()))) = Address already in use
> >      id() = 20385667ca0e708-2483-52f43ea3(99074)
> > Message: Bind failed.
> > [2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind 
> > ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> >      (strerror((*__errno_location ()))) = Address already in use
> >      id() = 20385667ca0e708-2487-52f43ea3(99094)
> > Message: Bind failed.
> > [2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind 
> > ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> >      (strerror((*__errno_location ()))) = Address already in use
> >      id() = 20385667ca0e708-2491-52f43ea3(99114)
> > Message: Bind failed.
> > [2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind 
> > ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> >      (strerror((*__errno_location ()))) = Address already in use
> >      id() = 20385667ca0e709-2475-52f43ea3(99076)
> > Message: Bind failed.
> > [2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind 
> > ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> >      (strerror((*__errno_location ()))) = Address already in use
> >      id() = 20385667ca0e709-2479-52f43ea3(99096)
> > Message: Bind failed.
> > [2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind 
> > ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> >      (strerror((*__errno_location ()))) = Address already in use
> >      id() = 20385667ca0e709-2483-52f43ea3(99116)
> > Message: Bind failed.
> > Segmentation fault (core dumped)
> > Segmentation fault (core dumped)
> > Segmentation fault (core dumped)
> > [[6422] mtcp_restart_nolibc.c:[929 read_shared_memory_area_from_file:
> >   mapping 
> > 6416/tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master] 
> > mtcp_restart_nolibc.c with data from ckpt image
> > 6419:929 read_shared_memory_area_from_file:
> >   ] mtcp_restart_nolibc.cmapping 
> > /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master:929 
> > with data from ckpt image
> >  read_shared_memory_area_from_file:
> >   mapping 
> > /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master with 
> > data from ckpt image
> > [6416] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
> >   mapping 
> > /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master 
> > with data from ckpt image
> > [6422] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
> >   mapping 
> > /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master 
> > with data from ckpt image
> > [6419] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
> >   mapping 
> > /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master 
> > with data from ckpt image
> > 
> > 
> > 

_______________________________________________
Dmtcp-forum mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
