(Sorry for the excessive delay in replying)
I do not have any experience with the DMTCP project, so I can only
speculate on what might be going on here. If you are using DMTCP to
transparently checkpoint Open MPI, you will need to make sure that you
are not using any interconnect other than TCP.
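One way to restrict Open MPI to TCP is the "btl" MCA parameter on the
mpirun command line. The sketch below only builds such a command line;
the helper name tcp_only_mpirun and the program name ./my_app are
placeholders made up for illustration, and how the result should be
wrapped by DMTCP is not covered here.

```python
def tcp_only_mpirun(app_argv, nprocs):
    """Build an mpirun argv that limits Open MPI's BTLs to tcp and self.

    "tcp" is the TCP transport and "self" is the loopback component a
    process uses to send messages to itself.
    """
    return ["mpirun", "--mca", "btl", "tcp,self",
            "-np", str(nprocs)] + list(app_argv)


if __name__ == "__main__":
    # "./my_app" is a hypothetical MPI program, used only as an example.
    print(" ".join(tcp_only_mpirun(["./my_app"], 4)))
```

The same selection can also be made through the environment variable
OMPI_MCA_btl=tcp,self instead of the command-line option.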
If you are building an OPAL CRS component for DMTCP (actually, you
probably want their MTCP project, which is just the local
checkpoint/restart service), then what you might be seeing are the TCP
sockets that are left open across a checkpoint operation. As an
optimization for the checkpoint-then-continue case, we leave sockets
open when we checkpoint and close them only on restart. This works
because most checkpoint/restart services do not support sockets, so
they skip over the socket fds and take the checkpoint anyway. I suspect
that DMTCP is erroring out because it is trying to do something else
with those fds.
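As an aid for reading the quoted warnings: assuming the run was on
Linux, a socket domain of 10 is AF_INET6 and a socket type of 1 is
SOCK_STREAM, so the sockets in question look like IPv6 stream sockets,
which the logged condition (AF_INET or AF_UNIX only) does not accept.
A minimal sketch of that decoding follows. The lookup tables are
hand-copied Linux values (other platforms, and a few Linux
architectures, use different numbers), and describe_socket is a
made-up helper, not part of DMTCP or Open MPI.

```python
# Numeric values as defined on common Linux architectures (assumption:
# the log was produced on such a system).
LINUX_SOCK_DOMAINS = {1: "AF_UNIX", 2: "AF_INET", 10: "AF_INET6"}
LINUX_SOCK_TYPES = {1: "SOCK_STREAM", 2: "SOCK_DGRAM"}

# Domains accepted by the JWARNING condition quoted in the log.
ACCEPTED_DOMAINS = {"AF_INET", "AF_UNIX"}


def describe_socket(domain, sock_type):
    """Return (domain name, type name, whether the logged check passes)."""
    domain_name = LINUX_SOCK_DOMAINS.get(domain, "unknown(%d)" % domain)
    type_name = LINUX_SOCK_TYPES.get(sock_type, "unknown(%d)" % sock_type)
    passes = domain_name in ACCEPTED_DOMAINS and type_name == "SOCK_STREAM"
    return domain_name, type_name, passes


if __name__ == "__main__":
    # Values copied from the warnings: _sockDomain = 10, _sockType = 1
    print(describe_socket(10, 1))  # ('AF_INET6', 'SOCK_STREAM', False)
```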
You may want to try just using the MTCP project, or ask for a way to
shut off the socket negotiation and just ignore the socket fds.
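To illustrate what "ignore the socket fds" could mean in practice, here
is a rough sketch of how a checkpointer might tell socket descriptors
apart from everything else using fstat. This is not how DMTCP, MTCP, or
Open MPI actually do it; the function names are invented for the
example, and it assumes every fd passed in is currently open.

```python
import os
import socket
import stat


def is_socket_fd(fd):
    """True if the open descriptor fd refers to a socket."""
    return stat.S_ISSOCK(os.fstat(fd).st_mode)


def split_fds(fds):
    """Partition open fds into (to_checkpoint, sockets_to_skip)."""
    keep, skip = [], []
    for fd in fds:
        if is_socket_fd(fd):
            skip.append(fd)   # left open and untouched at checkpoint time
        else:
            keep.append(fd)   # handled by the checkpointer as usual
    return keep, skip


if __name__ == "__main__":
    a, b = socket.socketpair()   # two connected sockets
    r, w = os.pipe()             # two non-socket descriptors
    keep, skip = split_fds([a.fileno(), r, w, b.fileno()])
    print("checkpoint:", keep, "skip:", skip)
    a.close()
    b.close()
    os.close(r)
    os.close(w)
```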
Let me know how it goes.
-- Josh
On Sep 28, 2009, at 9:55 AM, Kritiraj Sajadah wrote:
Dear All,
I am trying to integrate DMTCP with Open MPI. If I run a C
application, it works fine. But when I execute the program using
mpirun, it checkpoints the application but gives errors when
restarting it.
#############
[31007] WARNING at connection.cpp:303 in restore; REASON='JWARNING
((_sockDomain == AF_INET || _sockDomain == AF_UNIX ) && _sockType ==
SOCK_STREAM) failed'
id() = 2ab3f248-30933-4ac0d75a(99007)
_sockDomain = 10
_sockType = 1
_sockProtocol = 0
Message: socket type not yet [fully] supported
[31007] WARNING at connection.cpp:303 in restore; REASON='JWARNING
((_sockDomain == AF_INET || _sockDomain == AF_UNIX ) && _sockType ==
SOCK_STREAM) failed'
id() = 2ab3f248-30943-4ac0d75c(99007)
_sockDomain = 10
_sockType = 1
_sockProtocol = 0
Message: socket type not yet [fully] supported
[31013] WARNING at connection.cpp:87 in restartDup2; REASON='JWARNING
(_real_dup2 ( oldFd, fd ) == fd) failed'
oldFd = 537
fd = 1
(strerror((*__errno_location ()))) = Bad file descriptor
[31013] WARNING at connectionmanager.cpp:627 in closeAll;
REASON='JWARNING(_real_close ( i->second ) ==0) failed'
i->second = 537
(strerror((*__errno_location ()))) = Bad file descriptor
[31015] WARNING at connectionmanager.cpp:627 in closeAll;
REASON='JWARNING(_real_close ( i->second ) ==0) failed'
i->second = 537
(strerror((*__errno_location ()))) = Bad file descriptor
[31017] WARNING at connectionmanager.cpp:627 in closeAll;
REASON='JWARNING(_real_close ( i->second ) ==0) failed'
i->second = 537
(strerror((*__errno_location ()))) = Bad file descriptor
[31007] WARNING at connectionmanager.cpp:627 in closeAll;
REASON='JWARNING(_real_close ( i->second ) ==0) failed'
i->second = 537
(strerror((*__errno_location ()))) = Bad file descriptor
MTCP: mtcp_restart_nolibc: mapping current version of
/usr/lib/gconv/gconv-modules.cache into memory;
_not_ file as it existed at time of checkpoint.
Change mtcp_restart_nolibc.c:634 and re-compile, if you want
different behavior.
[31015] ERROR at connection.cpp:372 in restoreOptions;
REASON='JASSERT(ret == 0) failed'
(strerror((*__errno_location ()))) = Invalid argument
fds[0] = 6
opt->first = 26
opt->second.size() = 4
Message: restoring setsockopt failed
Terminating...
#############################################################
Any suggestions are very welcome.
regards,
Raj