Hi Basma,
I'm pleased to say that Kapil and I were running on a different
computer today, and saw your bug. If you check out the latest DMTCP from
svn, we believe this will fix the DMTCP bug that you reported.
The safest thing will be to download svn revision 2566 (with the bug fix):
http://sourceforge.net/code-snapshots/svn/d/dm/dmtcp/code/dmtcp-code-2566-trunk.zip
This bug fix will also be part of DMTCP version 2.2 (the next release,
perhaps in about a month).
In general, for background, this was the issue:
We suspect that you were initially starting a process on a terminal
as root. So, your controlling terminal was owned by root.
We suspect that you then did 'su username'. At this point, you inherited
the file descriptor of root for your controlling terminal. But on restart,
DMTCP tries to open its own fresh file descriptor to the controlling
terminal. If the terminal is owned by root, this is not possible.
Hence the message you saw about 'no permission' for controlling terminal.
The piece of information we were missing is that you had probably
opened a new terminal as one user ('root' or another user), and then
ran 'su' after starting work in that terminal. Is this
what happened?
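If you would like to check this on your machine, here is a rough sketch
in plain Python (this is not DMTCP code, only an illustration of the
permission problem described above). It tries to open a terminal device
read/write, which is roughly what a fresh open of the controlling terminal
at restart needs, and reports the error if the open fails. The function
names check_tty_access and describe_owner are made up for this example;
/dev/pts/0 is simply the terminal named in your error message.

```python
import os


def check_tty_access(tty_path):
    """Try to open tty_path read/write, as a fresh reopen at restart would.

    Returns (True, "ok") on success, or (False, <error text>) on failure.
    """
    try:
        # O_NOCTTY: do not let this open change our own controlling terminal.
        fd = os.open(tty_path, os.O_RDWR | os.O_NOCTTY)
    except OSError as err:
        return False, err.strerror
    os.close(fd)
    return True, "ok"


def describe_owner(tty_path):
    """Return (owner uid of tty_path, our effective uid) for comparison."""
    return os.stat(tty_path).st_uid, os.geteuid()


if __name__ == "__main__":
    # Hypothetical default: the terminal from the error report.
    path = "/dev/pts/0"
    ok, reason = check_tty_access(path)
    print(path, "->", "openable" if ok else "cannot open: " + reason)
```

If this reports "Permission denied" in the shell where you run
dmtcp_restart_script.sh, and describe_owner shows an owner uid different
from your own, that would be consistent with the 'su' scenario above.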
We hope this fixes everything.
Best wishes,
- Gene and Kapil
On Fri, Feb 07, 2014 at 07:49:21PM +0200, basma a.azeem wrote:
>
>
> i tried version 2.1
>
> in the single-node case with 16 processes, restart gives me this error:
>
>
> ./dmtcp_restart_script.sh
>
> dmtcp_restart (DMTCP + MTCP) 2.1
>
> Copyright (C) 2006-2014 Jason Ansel, Michael Rieker, Kapil Arya, and
> Gene Cooperman
> License LGPLv3+: GNU LGPL version 3 or later
> <http://gnu.org/licenses/lgpl.html>.
> This program comes with ABSOLUTELY NO WARRANTY.
> This is free software, and you are welcome to redistribute it
> under certain conditions; see COPYING file for details.
> (Use flag "-q" to hide this message.)
>
> [40000] ERROR at fileconnection.cpp:399 in postRestart;
> REASON='JASSERT(tempfd >= 0) failed'
> tempfd = -1
> controllingTty = /dev/pts/0
> (strerror((*__errno_location ()))) = Permission denied
> Message: Error Opening the terminal attached with the process
> orterun (40000): Terminating...
> ubuntu@ip-10-43-154-61:~$ [41000] WARNING at socketconnection.cpp:504 in
> postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen))
> failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-41000-52f519db(99080)
> Message: Bind failed.
> [41000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-41000-52f519db(99081)
> Message: Bind failed.
> [41000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-41000-52f519db(99097)
> Message: Bind failed.
> [41000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-41000-52f519db(99098)
> Message: Bind failed.
> [53000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-53000-52f519dc(99201)
> Message: Bind failed.
> [53000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-53000-52f519dc(99202)
> Message: Bind failed.
> [53000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-53000-52f519dc(99218)
> Message: Bind failed.
> [53000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-53000-52f519dc(99219)
> Message: Bind failed.
> [45000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-45000-52f519db(99121)
> Message: Bind failed.
> [45000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-45000-52f519db(99122)
> Message: Bind failed.
> [45000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-45000-52f519db(99138)
> Message: Bind failed.
> [45000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-45000-52f519db(99139)
> Message: Bind failed.
> [54000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-54000-52f519dc(99211)
> Message: Bind failed.
> [54000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-54000-52f519dc(99212)
> Message: Bind failed.
> [54000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-54000-52f519dc(99228)
> Message: Bind failed.
> [54000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-54000-52f519dc(99229)
> Message: Bind failed.
> [49000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-49000-52f519db(99161)
> Message: Bind failed.
> [49000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-49000-52f519db(99162)
> Message: Bind failed.
> [49000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-49000-52f519db(99178)
> Message: Bind failed.
> [49000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-49000-52f519db(99179)
> Message: Bind failed.
> [42000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-42000-52f519db(99091)
> Message: Bind failed.
> [42000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-42000-52f519db(99092)
> Message: Bind failed.
> [42000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-42000-52f519db(99108)
> Message: Bind failed.
> [42000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-42000-52f519db(99109)
> Message: Bind failed.
> [56000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-56000-52f519dc(99231)
> Message: Bind failed.
> [56000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-56000-52f519dc(99232)
> Message: Bind failed.
> [56000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-56000-52f519dc(99248)
> Message: Bind failed.
> [56000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-56000-52f519dc(99249)
> Message: Bind failed.
> [43000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-43000-52f519db(99101)
> Message: Bind failed.
> [43000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-43000-52f519db(99102)
> Message: Bind failed.
> [43000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-43000-52f519db(99118)
> Message: Bind failed.
> [43000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-43000-52f519db(99119)
> Message: Bind failed.
> [44000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-44000-52f519db(99111)
> Message: Bind failed.
> [44000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-44000-52f519db(99112)
> Message: Bind failed.
> [44000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-44000-52f519db(99128)
> Message: Bind failed.
> [44000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-44000-52f519db(99129)
> Message: Bind failed.
> [46000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-46000-52f519db(99131)
> Message: Bind failed.
> [46000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-46000-52f519db(99132)
> Message: Bind failed.
> [46000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-46000-52f519db(99148)
> Message: Bind failed.
> [46000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-46000-52f519db(99149)
> Message: Bind failed.
> [52000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-52000-52f519dc(99191)
> Message: Bind failed.
> [52000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-52000-52f519dc(99192)
> Message: Bind failed.
> [48000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-48000-52f519db(99151)
> Message: Bind failed.
> [52000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-52000-52f519dc(99208)
> Message: Bind failed.
> [52000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-52000-52f519dc(99209)
> Message: Bind failed.
> [48000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-48000-52f519db(99152)
> Message: Bind failed.
> [48000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-48000-52f519db(99168)
> Message: Bind failed.
> [55000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-55000-52f519dc(99221)
> Message: Bind failed.
> [48000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-48000-52f519db(99169)
> Message: Bind failed.
> [55000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-55000-52f519dc(99222)
> Message: Bind failed.
> [55000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-55000-52f519dc(99238)
> Message: Bind failed.
> [55000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-55000-52f519dc(99239)
> Message: Bind failed.
> [47000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-47000-52f519db(99141)
> Message: Bind failed.
> [47000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-47000-52f519db(99142)
> Message: Bind failed.
> [47000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-47000-52f519db(99158)
> Message: Bind failed.
> [47000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-47000-52f519db(99159)
> Message: Bind failed.
> [51000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-51000-52f519db(99181)
> Message: Bind failed.
> [51000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-51000-52f519db(99182)
> Message: Bind failed.
> [51000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-51000-52f519db(99198)
> Message: Bind failed.
> [51000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-51000-52f519db(99199)
> Message: Bind failed.
> [50000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-50000-52f519db(99171)
> Message: Bind failed.
> [50000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-50000-52f519db(99172)
> Message: Bind failed.
> [50000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-50000-52f519db(99188)
> Message: Bind failed.
> [50000] WARNING at socketconnection.cpp:504 in postRestart;
> REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed'
> (strerror((*__errno_location ()))) = Address already in use
> id() = 6da2961af00014aa-50000-52f519db(99189)
> Message: Bind failed.
> [41000] ERROR at connectionrewirer.cpp:89 in doReconnect;
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr,
> remoteAddr.len) == 0) failed'
> id = 6da2961af00014aa-40000-52f519db(99049)
> (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (41000): Terminating...
> [49000] ERROR at connectionrewirer.cpp:89 in doReconnect;
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr,
> remoteAddr.len) == 0) failed'
> id = 6da2961af00014aa-40000-52f519db(99129)
> (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (49000): Terminating...
> [46000] ERROR at connectionrewirer.cpp:89 in doReconnect;
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr,
> remoteAddr.len) == 0) failed'
> id = 6da2961af00014aa-40000-52f519db(99099)
> (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (46000): Terminating...
> [55000] ERROR at connectionrewirer.cpp:89 in doReconnect;
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr,
> remoteAddr.len) == 0) failed'
> id = 6da2961af00014aa-40000-52f519db(99189)
> (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (55000): Terminating...
> [43000] ERROR at connectionrewirer.cpp:89 in doReconnect;
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr,
> remoteAddr.len) == 0) failed'
> id = 6da2961af00014aa-40000-52f519db(99069)
> (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (43000): Terminating...
> [47000] ERROR at connectionrewirer.cpp:89 in doReconnect;
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr,
> remoteAddr.len) == 0) failed'
> id = 6da2961af00014aa-40000-52f519db(99109)
> (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (47000): Terminating...
> [42000] ERROR at connectionrewirer.cpp:89 in doReconnect;
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr,
> remoteAddr.len) == 0) failed'
> id = 6da2961af00014aa-40000-52f519db(99059)
> (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (42000): Terminating...
> [45000] ERROR at connectionrewirer.cpp:89 in doReconnect;
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr,
> remoteAddr.len) == 0) failed'
> id = 6da2961af00014aa-40000-52f519db(99089)
> (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (45000): Terminating...
> [54000] ERROR at connectionrewirer.cpp:89 in doReconnect;
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr,
> remoteAddr.len) == 0) failed'
> id = 6da2961af00014aa-40000-52f519db(99179)
> (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (54000): Terminating...
> [51000] ERROR at connectionrewirer.cpp:89 in doReconnect;
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr,
> remoteAddr.len) == 0) failed'
> id = 6da2961af00014aa-40000-52f519db(99149)
> (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (51000): Terminating...
> [50000] ERROR at connectionrewirer.cpp:89 in doReconnect;
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr,
> remoteAddr.len) == 0) failed'
> id = 6da2961af00014aa-40000-52f519db(99139)
> (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (50000): Terminating...
> [44000] ERROR at connectionrewirer.cpp:89 in doReconnect;
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr,
> remoteAddr.len) == 0) failed'
> id = 6da2961af00014aa-40000-52f519db(99079)
> (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (44000): Terminating...
> [52000] ERROR at connectionrewirer.cpp:89 in doReconnect;
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr,
> remoteAddr.len) == 0) failed'
> id = 6da2961af00014aa-40000-52f519db(99159)
> (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (52000): Terminating...
> [56000] ERROR at connectionrewirer.cpp:89 in doReconnect;
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr,
> remoteAddr.len) == 0) failed'
> id = 6da2961af00014aa-40000-52f519db(99199)
> (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (56000): Terminating...
> [53000] ERROR at connectionrewirer.cpp:89 in doReconnect;
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr,
> remoteAddr.len) == 0) failed'
> id = 6da2961af00014aa-40000-52f519db(99169)
> (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (53000): Terminating...
> [48000] ERROR at connectionrewirer.cpp:89 in doReconnect;
> REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr,
> remoteAddr.len) == 0) failed'
> id = 6da2961af00014aa-40000-52f519db(99119)
> (strerror((*__errno_location ()))) = Invalid argument
> Message: failed to restore connection
> lu.A.16 (48000): Terminating...
>
>
>
>
> From: [email protected]
> To: [email protected]; [email protected]
> Subject: RE: [Dmtcp-forum] FW: ver 1.2.5 - restart fail for 16 processes on a
> 4 nodes cluster ?
> Date: Fri, 7 Feb 2014 19:14:44 +0200
>
>
>
>
> Hi Gene/Kapil
> thank you so much for your help
>
> about your question:
>
> ./dmtcp_restart_script.sh
> (yes, this is the way i was invoking restart for dmtcp-1.2.5)
>
> does version 1.2.5 have problems in the case of 16 processes on a 4-node
> cluster? it worked fine in these cases:
>
> 1- single node for 4 processes and 16 processes
> 2- 4 nodes cluster for 4 processes
>
>
> about this part:
> Building it should be easy: ./configure && make
> shouldn't i also do "make install" so that all the required files can be
> found on all nodes of the cluster?
>
> thank you
> >
>
> > Date: Thu, 6 Feb 2014 23:03:00 -0500
> > From: [email protected]
> > To: [email protected]
> > CC: [email protected]
> > Subject: Re: [Dmtcp-forum] FW: ver 1.2.5 - restart fail for 16 processes on
> > a 4 nodes cluster ?
> >
> > Hi Basma,
> > Would you mind re-doing this experiment with DMTCP 2.1 (the latest
> > version)?
> > You'll find it at:
> > http://sourceforge.net/projects/dmtcp/files/dmtcp-2.x/2.1/
> > Building it should be easy: ./configure && make
> > We renamed the way to start. It will now be:
> > bin/dmtcp_launch mpirun -np 4 -H master,node001,node002,node003
> > /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.4
> > Then to restart, it should be the same as before:
> > ./dmtcp_restart_script.sh
> > (Is this the way that you were invoking restart for dmtcp-1.2.5?)
> >
> > If this still gives you any problems, please do write back.
> >
> > Best wishes,
> > - Gene
> >
> > ----- Original Message -----
> > From: basma a.azeem <[email protected]>
> > To: [email protected]
> > Sent: Thu, 6 Feb 2014 21:39:17 -0500 (EST)
> > Subject: [Dmtcp-forum] FW: ver 1.2.5 - restart fail for 16 processes on a 4
> > nodes cluster ?
> >
> >
> > From: [email protected]
> > To: [email protected]
> > Subject: ver 1.2.5 - restart fail for 16 processes on a 4 nodes cluster ?
> > Date: Fri, 7 Feb 2014 04:37:58 +0200
> >
> >
> >
> >
> > i am trying dmtcp version 1.2.5 with open mpi
> > i use a 4 node cluster
> >
> > when i try to checkpoint and restart an executable that was compiled for
> > 4 processes, checkpointing works. at restart it gives me a "Segmentation
> > fault (core dumped)" message, but then it also runs correctly
> >
> > ubuntu@master:~$ dmtcp-1.2.5/bin/dmtcp_checkpoint mpirun -np 4 -H
> > master,node001,node002,node003 /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.4
> >
> > but when i try to checkpoint and restart an executable that was compiled
> > for 16 processes, checkpointing works, but at restart it gives this output
> > and hangs forever
> >
> > ubuntu@master:~$ dmtcp-1.2.5/bin/dmtcp_checkpoint mpirun -np 16 -H
> > master,node001,node002,node003
> > /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.16
> >
> > it looks like i am missing a simple detail
> >
> > here is the output i had :
> >
> > -------------------------------------------------------
> > dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
> > Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and
> > Gene Cooperman
> > This program comes with ABSOLUTELY NO WARRANTY.
> > This is free software, and you are welcome to redistribute it
> > under certain conditions; see COPYING file for details.
> > (Use flag "-q" to hide this message.)
> >
> > dmtcp_coordinator starting...
> > Port: 7779
> > Checkpoint Interval: disabled (checkpoint manually instead)
> > Exit on last client: 1
> > Backgrounding...
> > [6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind
> > ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> > (strerror((*__errno_location ()))) = Address already in use
> > id() = 18af1fad8d756-6416-52f43ea3(99072)
> > Message: Bind failed.
> > [6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind
> > ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> > (strerror((*__errno_location ()))) = Address already in use
> > id() = 18af1fad8d756-6419-52f43ea3(99092)
> > Message: Bind failed.
> > [6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind
> > ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> > (strerror((*__errno_location ()))) = Address already in use
> > id() = 18af1fad8d756-6422-52f43ea3(99112)
> > Message: Bind failed.
> > dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
> > Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and
> > Gene Cooperman
> > This program comes with ABSOLUTELY NO WARRANTY.
> > This is free software, and you are welcome to redistribute it
> > under certain conditions; see COPYING file for details.
> > (Use flag "-q" to hide this message.)
> >
> > dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
> > Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and
> > Gene Cooperman
> > This program comes with ABSOLUTELY NO WARRANTY.
> > This is free software, and you are welcome to redistribute it
> > under certain conditions; see COPYING file for details.
> > (Use flag "-q" to hide this message.)
> >
> > dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
> > Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and
> > Gene Cooperman
> > This program comes with ABSOLUTELY NO WARRANTY.
> > This is free software, and you are welcome to redistribute it
> > under certain conditions; see COPYING file for details.
> > (Use flag "-q" to hide this message.)
> >
> > [3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind
> > ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> > (strerror((*__errno_location ()))) = Address already in use
> > id() = 20385667ca0e707-3257-52f43ea3(99074)
> > Message: Bind failed.
> > [3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind
> > ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> > (strerror((*__errno_location ()))) = Address already in use
> > id() = 20385667ca0e707-3261-52f43ea3(99094)
> > Message: Bind failed.
> > [3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind
> > ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> > (strerror((*__errno_location ()))) = Address already in use
> > id() = 20385667ca0e707-3265-52f43ea3(99114)
> > Message: Bind failed.
> > [2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind
> > ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> > (strerror((*__errno_location ()))) = Address already in use
> > id() = 20385667ca0e708-2483-52f43ea3(99074)
> > Message: Bind failed.
> > [2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind
> > ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> > (strerror((*__errno_location ()))) = Address already in use
> > id() = 20385667ca0e708-2487-52f43ea3(99094)
> > Message: Bind failed.
> > [2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind
> > ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> > (strerror((*__errno_location ()))) = Address already in use
> > id() = 20385667ca0e708-2491-52f43ea3(99114)
> > Message: Bind failed.
> > [2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind
> > ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> > (strerror((*__errno_location ()))) = Address already in use
> > id() = 20385667ca0e709-2475-52f43ea3(99076)
> > Message: Bind failed.
> > [2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind
> > ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> > (strerror((*__errno_location ()))) = Address already in use
> > id() = 20385667ca0e709-2479-52f43ea3(99096)
> > Message: Bind failed.
> > [2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind
> > ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> > (strerror((*__errno_location ()))) = Address already in use
> > id() = 20385667ca0e709-2483-52f43ea3(99116)
> > Message: Bind failed.
> > Segmentation fault (core dumped)
> > Segmentation fault (core dumped)
> > Segmentation fault (core dumped)
> > [[6422] mtcp_restart_nolibc.c:[929 read_shared_memory_area_from_file:
> > mapping
> > 6416/tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master]
> > mtcp_restart_nolibc.c with data from ckpt image
> > 6419:929 read_shared_memory_area_from_file:
> > ] mtcp_restart_nolibc.cmapping
> > /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master:929
> > with data from ckpt image
> > read_shared_memory_area_from_file:
> > mapping
> > /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master with
> > data from ckpt image
> > [6416] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
> > mapping
> > /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master
> > with data from ckpt image
> > [6422] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
> > mapping
> > /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master
> > with data from ckpt image
> > [6419] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
> > mapping
> > /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master
> > with data from ckpt image
> >
> >
> >
> >
> >
>
>
_______________________________________________
Dmtcp-forum mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum