Hi Marcela,

Sorry for the delay in getting back to you.

We haven't tested DMTCP on the Amazon cloud, but I do think that the issues
can be easily resolved.

It looks like restart itself is failing, even for simple programs. Let's
start by diagnosing that.  Can you post the output of the following
command:

    AUTOTEST="-v" make check-dmtcp1

This will help us get started. Also, if it is possible for us to get a
guest account to run some quick tests, I would be glad to help resolve
these issues.  If a guest account is not possible, we can try some sort of
screen-sharing utility to help with debugging.

Best,
Kapil


On Thu, May 15, 2014 at 10:00 AM, Marcela Castro León <[email protected]>
wrote:

> Hi,
> I'm having problems with DMTCP 2.1 installed in VMs on two clouds, Amazon
> WS and Bonfire.
> In both cases I'm testing by running the bt benchmark (NAS), class A, with
> 4 processes, 1 per node, using Open MPI 1.6.5.
>
> a) Amazon WS:
> I'm using m1.small instances.
> I get a segmentation fault when I try to checkpoint using the
> dmtcp_coordinator console.
> This is the output in the app console.
>  NAS Parallel Benchmarks 3.3 -- BT Benchmark
>  No input file inputbt.data. Using compiled defaults
>  Size:  102x 102x 102
>  Iterations:  200    dt:   0.0003000
>  Number of active processes:     4
>
>  Time step    1
>  Time step   20
> [56000] WARNING at jsocket.cpp:291 in readAll; REASON='JWARNING(cnt>=0)
> failed'
>      sockfd() = 0
>      cnt = -1
>      len = 112
>      (strerror((*__errno_location ()))) = Connection reset by peer
> Message: JSocket read failure
> [56000] ERROR at connectionidentifier.h:96 in assertValid;
> REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
>      sign =
> Message: read invalid message, signature mismatch. (External socket?)
> bt.B.4 (56000): Terminating...
> Segmentation fault (core dumped)
>
> This is the output in dmtcp_coordinator console
> l
> Client List:
> #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
> 1, orterun[40000:3242]@master, 18af1fad8d756-40000-537450cc, RUNNING
> 18, orted_(forked)[52000:2060]@node003, 20385667ca0e709-52000-537450cd,
> RUNNING
> 19, orted_(forked)[53000:2071]@node001, 20385667ca0e707-53000-537450ce,
> RUNNING
> 22, orted_(forked)[55000:2059]@node002, 20385667ca0e708-55000-537450d0,
> RUNNING
> 26, bt.B.4[56000:3262]@master, 18af1fad8d756-56000-537450d0, RUNNING
> 27, bt.B.4[57000:2075]@node001, 20385667ca0e707-57000-537450d0, RUNNING
> 29, bt.B.4[58000:2063]@node002, 20385667ca0e708-58000-537450d0, RUNNING
> 30, bt.B.4[59000:2065]@node003, 20385667ca0e709-59000-537450d1, RUNNING
> c
> [3241] NOTE at dmtcp_coordinator.cpp:1256 in startCheckpoint;
> REASON='starting checkpoint, suspending all nodes'
>      s.numPeers = 8
> [3241] NOTE at dmtcp_coordinator.cpp:1258 in startCheckpoint;
> REASON='Incremented Generation'
>      UniquePid::ComputationId().generation() = 1
> [3241] NOTE at dmtcp_coordinator.cpp:613 in updateMinimumState;
> REASON='locking all nodes'
> [3241] NOTE at dmtcp_coordinator.cpp:621 in updateMinimumState;
> REASON='draining all nodes'
> [3241] NOTE at dmtcp_coordinator.cpp:627 in updateMinimumState;
> REASON='checkpointing all nodes'
> [3241] NOTE at dmtcp_coordinator.cpp:881 in onDisconnect; REASON='client
> disconnected'
>      client->identity() = 18af1fad8d756-40000-537450cc
> [3241] NOTE at dmtcp_coordinator.cpp:881 in onDisconnect; REASON='client
> disconnected'
>      client->identity() = 20385667ca0e709-52000-537450cd
> [3241] NOTE at dmtcp_coordinator.cpp:881 in onDisconnect; REASON='client
> disconnected'
>      client->identity() = 18af1fad8d756-56000-537450d0
> [3241] NOTE at dmtcp_coordinator.cpp:881 in onDisconnect; REASON='client
> disconnected'
>      client->identity() = 20385667ca0e708-55000-537450d0
> [3241] NOTE at dmtcp_coordinator.cpp:881 in onDisconnect; REASON='client
> disconnected'
>      client->identity() = 20385667ca0e707-53000-537450ce
> [3241] WARNING at dmtcp_coordinator.cpp:1492 in writeRestartScript;
> REASON='JWARNING(symlinkat(uniqueFilename.c_str(), dirfd, filename.c_str())
> == 0) failed'
> [3241] NOTE at dmtcp_coordinator.cpp:641 in updateMinimumState;
> REASON='building name service database'
> [3241] NOTE at dmtcp_coordinator.cpp:657 in updateMinimumState;
> REASON='entertaining queries now'
> [3241] NOTE at dmtcp_coordinator.cpp:662 in updateMinimumState;
> REASON='refilling all nodes'
> [3241] NOTE at dmtcp_coordinator.cpp:881 in onDisconnect; REASON='client
> disconnected'
>      client->identity() = 20385667ca0e709-59000-537450d1
> [3241] NOTE at dmtcp_coordinator.cpp:881 in onDisconnect; REASON='client
> disconnected'
>      client->identity() = 20385667ca0e707-57000-537450d0
> [3241] NOTE at dmtcp_coordinator.cpp:881 in onDisconnect; REASON='client
> disconnected'
>      client->identity() = 20385667ca0e708-58000-537450d0
>
> The output of "make check" in this environment is:
> Making all in plugin
> == Tests ==
> dmtcp1          ckpt: PASSED  rstr: FAILED  (first process rec'd signal
> 11)  retry: FAILED
>                 root-pids: [3876] msg: restart error, 1 expected, 0 found,
> running=0
> dmtcp2          ckpt: PASSED  rstr: FAILED  (first process rec'd signal
> 11)  retry: FAILED
>                 root-pids: [3893] msg: restart error, 1 expected, 0 found,
> running=0
> dmtcp3          ckpt: PASSED  rstr: FAILED  (first process rec'd signal
> 11)  retry: FAILED
>                 root-pids: [3918] msg: restart error, 1 expected, 0 found,
> running=0
> dmtcp4          ckpt: PASSED  rstr: FAILED  (first process rec'd signal
> 11)  retry: FAILED
>                 root-pids: [3934] msg: restart error, 1 expected, 0 found,
> running=0
> dmtcp5          ckpt: PASSED  rstr: FAILED  (first process rec'd signal
> 11)  retry: FAILED
>                 root-pids: [3958] msg: restart error, 2 expected, 0 found,
> running=0
>
>
> b) Bonfire cloud.
> mpirun is not able to finish when it is run under dmtcp_launch; it
> hangs.
> The VMs run Debian, kernel 2.6.32-5-xen-amd64.
> model name : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
> The results of "make check" are similar:
> dmtcp1          ckpt: PASSED  rstr: FAILED  (first process rec'd signal
> 11)  retry: FAILED
>                 root-pids: [1565] msg: restart error, 1 expected, 0 found,
> running=0
> dmtcp2          ckpt: PASSED  rstr: FAILED  (first process rec'd signal
> 11)  retry: FAILED
>                 root-pids: [1579] msg: restart error, 1 expected, 0 found,
> running=0
> dmtcp3          ckpt: PASSED  rstr: FAILED  (first process rec'd signal
> 11)  retry: FAILED
>                 root-pids: [1602] msg: restart error, 1 expected, 0 found,
> running=0
> dmtcp4          ckpt: PASSED  rstr: FAILED  (first process rec'd signal
> 11)  retry: FAILED
>                 root-pids: [1616] msg: restart error, 1 expected, 0 found,
> running=0
>
>
> Any clues on what to look for are appreciated.
> Thank you very much in advance.
> Marcela
>
> _______________________________________________
> Dmtcp-forum mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>
>