(Sorry for the excessive delay in replying)

I do not have any experience with the DMTCP project, so I can only speculate on what might be going on here. If you are using DMTCP to transparently checkpoint Open MPI, you will need to make sure that you are not using any interconnect other than TCP.

If you are building an OPAL CRS component for DMTCP (actually, you probably want their MTCP project, which is just the local checkpoint/restart service), then what you might be seeing are the TCP sockets that are left open across a checkpoint operation. As an optimization for checkpoint->continue, we leave sockets open when we checkpoint. Most checkpoint/restart services skip over socket fds (since they do not support them) and take the checkpoint anyway, so we leave the sockets open and close them only on restart. I suspect that DMTCP is erroring out because it is trying to do something else with those fds.
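
To make the "skip over the socket fd" idea concrete, here is a minimal sketch of how a checkpointer could tell socket descriptors apart from everything else. It is only an illustration in Python, not code from Open MPI, MTCP, or DMTCP; the helper name socket_fds is made up for this example.

```python
import os
import socket
import stat


def socket_fds(fds):
    """Return the subset of `fds` that refer to sockets (hypothetical helper)."""
    found = []
    for fd in fds:
        try:
            mode = os.fstat(fd).st_mode
        except OSError:
            # The descriptor is closed or invalid, so there is nothing to classify.
            continue
        if stat.S_ISSOCK(mode):
            found.append(fd)
    return found


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # stdin/stdout/stderr plus one freshly created TCP socket.
        candidates = [0, 1, 2, s.fileno()]
        print("socket fds:", socket_fds(candidates))
```

A checkpointer that wanted to ignore sockets could leave the descriptors returned here out of the image and save everything else.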

You may want to try just using the MTCP project, or ask the DMTCP developers for a way to shut off the socket negotiation and just ignore the socket fds.
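
One more guess, based only on the numbers in the log quoted below: the first warnings report _sockDomain = 10 and _sockType = 1. Assuming the log came from a Linux machine, 10 is AF_INET6 there, which the condition printed in the warning (AF_INET or AF_UNIX, with SOCK_STREAM) would reject. The sketch below decodes such values using the constants of the platform it runs on; the function names are invented for illustration, and the check only mirrors the condition text in the log, not DMTCP's actual source.

```python
import socket


def describe_socket(domain, sock_type):
    """Map numeric domain/type values to this platform's constant names."""
    try:
        domain_name = socket.AddressFamily(domain).name
    except ValueError:
        domain_name = "unknown({})".format(domain)
    try:
        type_name = socket.SocketKind(sock_type).name
    except ValueError:
        type_name = "unknown({})".format(sock_type)
    return domain_name, type_name


def passes_logged_condition(domain, sock_type):
    """Mirror the condition shown in the JWARNING text of the log."""
    return (domain in (socket.AF_INET, socket.AF_UNIX)
            and sock_type == socket.SOCK_STREAM)


if __name__ == "__main__":
    # Values copied from the log; their meaning depends on the platform.
    print(describe_socket(10, 1))
    print(passes_logged_condition(10, 1))
```

If that reading is right, the sockets being complained about are IPv6 ones, which may be worth mentioning when you ask about it.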

Let me know how it goes.

-- Josh

On Sep 28, 2009, at 9:55 AM, Kritiraj Sajadah wrote:

Dear All,
I am trying to integrate DMTCP with Open MPI. If I run a C application, it works fine. But when I execute the program using mpirun, it checkpoints the application but gives an error when restarting it.

#############
[31007] WARNING at connection.cpp:303 in restore; REASON='JWARNING ((_sockDomain == AF_INET || _sockDomain == AF_UNIX ) && _sockType == SOCK_STREAM) failed'
    id() = 2ab3f248-30933-4ac0d75a(99007)
    _sockDomain = 10
    _sockType = 1
    _sockProtocol = 0
Message: socket type not yet [fully] supported
[31007] WARNING at connection.cpp:303 in restore; REASON='JWARNING ((_sockDomain == AF_INET || _sockDomain == AF_UNIX ) && _sockType == SOCK_STREAM) failed'
    id() = 2ab3f248-30943-4ac0d75c(99007)
    _sockDomain = 10
    _sockType = 1
    _sockProtocol = 0
Message: socket type not yet [fully] supported
[31013] WARNING at connection.cpp:87 in restartDup2; REASON='JWARNING (_real_dup2 ( oldFd, fd ) == fd) failed'
    oldFd = 537
    fd = 1
    (strerror((*__errno_location ()))) = Bad file descriptor
[31013] WARNING at connectionmanager.cpp:627 in closeAll; REASON='JWARNING(_real_close ( i->second ) ==0) failed'
    i->second = 537
    (strerror((*__errno_location ()))) = Bad file descriptor
[31015] WARNING at connectionmanager.cpp:627 in closeAll; REASON='JWARNING(_real_close ( i->second ) ==0) failed'
    i->second = 537
    (strerror((*__errno_location ()))) = Bad file descriptor
[31017] WARNING at connectionmanager.cpp:627 in closeAll; REASON='JWARNING(_real_close ( i->second ) ==0) failed'
    i->second = 537
    (strerror((*__errno_location ()))) = Bad file descriptor
[31007] WARNING at connectionmanager.cpp:627 in closeAll; REASON='JWARNING(_real_close ( i->second ) ==0) failed'
    i->second = 537
    (strerror((*__errno_location ()))) = Bad file descriptor
MTCP: mtcp_restart_nolibc: mapping current version of /usr/lib/gconv/gconv-modules.cache into memory;
 _not_ file as it existed at time of checkpoint.
Change mtcp_restart_nolibc.c:634 and re-compile, if you want different behavior.
[31015] ERROR at connection.cpp:372 in restoreOptions; REASON='JASSERT(ret == 0) failed'
    (strerror((*__errno_location ()))) = Invalid argument
    fds[0] = 6
    opt->first = 26
    opt->second.size() = 4
Message: restoring setsockopt failed
Terminating...
#############################################################

Any suggestions are very welcome.

regards,

Raj



_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users