Hi,
I'm having problems with DMTCP 2.1 installed in VMs on two clouds, Amazon WS and
Bonfire.
In both cases I'm testing by running the NAS bt benchmark, class A, with 4
processes, one per node, using Open MPI 1.6.5.
a) Amazon WS:
I'm using m1.small instances.
I get a segmentation fault when I try to checkpoint using the
dmtcp_coordinator console.
This is the output in the application console:
NAS Parallel Benchmarks 3.3 -- BT Benchmark
No input file inputbt.data. Using compiled defaults
Size: 102x 102x 102
Iterations: 200 dt: 0.0003000
Number of active processes: 4
Time step 1
Time step 20
[56000] WARNING at jsocket.cpp:291 in readAll; REASON='JWARNING(cnt>=0)
failed'
sockfd() = 0
cnt = -1
len = 112
(strerror((*__errno_location ()))) = Connection reset by peer
Message: JSocket read failure
[56000] ERROR at connectionidentifier.h:96 in assertValid;
REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
sign =
Message: read invalid message, signature mismatch. (External socket?)
bt.B.4 (56000): Terminating...
Segmentation fault (core dumped)
This is the output in the dmtcp_coordinator console:
l
Client List:
#, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
1, orterun[40000:3242]@master, 18af1fad8d756-40000-537450cc, RUNNING
18, orted_(forked)[52000:2060]@node003, 20385667ca0e709-52000-537450cd,
RUNNING
19, orted_(forked)[53000:2071]@node001, 20385667ca0e707-53000-537450ce,
RUNNING
22, orted_(forked)[55000:2059]@node002, 20385667ca0e708-55000-537450d0,
RUNNING
26, bt.B.4[56000:3262]@master, 18af1fad8d756-56000-537450d0, RUNNING
27, bt.B.4[57000:2075]@node001, 20385667ca0e707-57000-537450d0, RUNNING
29, bt.B.4[58000:2063]@node002, 20385667ca0e708-58000-537450d0, RUNNING
30, bt.B.4[59000:2065]@node003, 20385667ca0e709-59000-537450d1, RUNNING
c
[3241] NOTE at dmtcp_coordinator.cpp:1256 in startCheckpoint;
REASON='starting checkpoint, suspending all nodes'
s.numPeers = 8
[3241] NOTE at dmtcp_coordinator.cpp:1258 in startCheckpoint;
REASON='Incremented Generation'
UniquePid::ComputationId().generation() = 1
[3241] NOTE at dmtcp_coordinator.cpp:613 in updateMinimumState;
REASON='locking all nodes'
[3241] NOTE at dmtcp_coordinator.cpp:621 in updateMinimumState;
REASON='draining all nodes'
[3241] NOTE at dmtcp_coordinator.cpp:627 in updateMinimumState;
REASON='checkpointing all nodes'
[3241] NOTE at dmtcp_coordinator.cpp:881 in onDisconnect; REASON='client
disconnected'
client->identity() = 18af1fad8d756-40000-537450cc
[3241] NOTE at dmtcp_coordinator.cpp:881 in onDisconnect; REASON='client
disconnected'
client->identity() = 20385667ca0e709-52000-537450cd
[3241] NOTE at dmtcp_coordinator.cpp:881 in onDisconnect; REASON='client
disconnected'
client->identity() = 18af1fad8d756-56000-537450d0
[3241] NOTE at dmtcp_coordinator.cpp:881 in onDisconnect; REASON='client
disconnected'
client->identity() = 20385667ca0e708-55000-537450d0
[3241] NOTE at dmtcp_coordinator.cpp:881 in onDisconnect; REASON='client
disconnected'
client->identity() = 20385667ca0e707-53000-537450ce
[3241] WARNING at dmtcp_coordinator.cpp:1492 in writeRestartScript;
REASON='JWARNING(symlinkat(uniqueFilename.c_str(), dirfd, filename.c_str())
== 0) failed'
[3241] NOTE at dmtcp_coordinator.cpp:641 in updateMinimumState;
REASON='building name service database'
[3241] NOTE at dmtcp_coordinator.cpp:657 in updateMinimumState;
REASON='entertaining queries now'
[3241] NOTE at dmtcp_coordinator.cpp:662 in updateMinimumState;
REASON='refilling all nodes'
[3241] NOTE at dmtcp_coordinator.cpp:881 in onDisconnect; REASON='client
disconnected'
client->identity() = 20385667ca0e709-59000-537450d1
[3241] NOTE at dmtcp_coordinator.cpp:881 in onDisconnect; REASON='client
disconnected'
client->identity() = 20385667ca0e707-57000-537450d0
[3241] NOTE at dmtcp_coordinator.cpp:881 in onDisconnect; REASON='client
disconnected'
client->identity() = 20385667ca0e708-58000-537450d0
The output of "make check" in this environment is:
Making all in plugin
== Tests ==
dmtcp1 ckpt: PASSED rstr: FAILED (first process rec'd signal 11)
retry: FAILED
root-pids: [3876] msg: restart error, 1 expected, 0 found,
running=0
dmtcp2 ckpt: PASSED rstr: FAILED (first process rec'd signal 11)
retry: FAILED
root-pids: [3893] msg: restart error, 1 expected, 0 found,
running=0
dmtcp3 ckpt: PASSED rstr: FAILED (first process rec'd signal 11)
retry: FAILED
root-pids: [3918] msg: restart error, 1 expected, 0 found,
running=0
dmtcp4 ckpt: PASSED rstr: FAILED (first process rec'd signal 11)
retry: FAILED
root-pids: [3934] msg: restart error, 1 expected, 0 found,
running=0
dmtcp5 ckpt: PASSED rstr: FAILED (first process rec'd signal 11)
retry: FAILED
root-pids: [3958] msg: restart error, 2 expected, 0 found,
running=0
b) Bonfire cloud.
mpirun never finishes when it is run under dmtcp_launch; it hangs.
The VMs run Debian, kernel 2.6.32-5-xen-amd64.
model name : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
The results of "make check" are similar:
dmtcp1 ckpt: PASSED rstr: FAILED (first process rec'd signal 11)
retry: FAILED
root-pids: [1565] msg: restart error, 1 expected, 0 found,
running=0
dmtcp2 ckpt: PASSED rstr: FAILED (first process rec'd signal 11)
retry: FAILED
root-pids: [1579] msg: restart error, 1 expected, 0 found,
running=0
dmtcp3 ckpt: PASSED rstr: FAILED (first process rec'd signal 11)
retry: FAILED
root-pids: [1602] msg: restart error, 1 expected, 0 found,
running=0
dmtcp4 ckpt: PASSED rstr: FAILED (first process rec'd signal 11)
retry: FAILED
root-pids: [1616] msg: restart error, 1 expected, 0 found,
running=0
Any clues on what to look for would be appreciated.
Thank you very much in advance.
Marcela
_______________________________________________
Dmtcp-forum mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum