Package: dmtcp
Version: 1.2.4-1
Severity: normal

[ This is just for the record to avoid hiding the issue in private
email. Upstream has already received this bug report and fixed half of
the issue within a few hours -- pretty cool. ]

The attached script demonstrates the bug. It will print a couple of
letters ten times (sleeping 2 seconds between two letters), verify
correct output and exit. When run with an integer argument larger than
1, it will distribute the letter printing over the number of requested
processes (using pprocess and capped by the actual number of CPUs on a
machine).

Snapshot and restart work beautifully in single-process mode, but fail
with pprocess.

Below are the logs from dmtcp_checkpoint and dmtcp_coordinator.
Checkpoints were requested manually.

dmtcp_coordinator
=================

michael@meiner ~/debian/dmtcptest % dmtcp_coordinator
dmtcp_coordinator starting...
    Port: 7779
    Checkpoint Interval: disabled (checkpoint manually instead)
    Exit on last client: 0
Type '?' for help.

[15518] NOTE at dmtcp_coordinator.cpp:1020 in onConnect; REASON='worker connected'
     hello_remote.from = 1313a2c6-15537-4f563b29(-1)
[15518] NOTE at dmtcp_coordinator.cpp:1026 in onConnect; REASON='CheckpointInterval Updated'
     oldInterval = 0
     theCheckpointInterval = 0
[15518] NOTE at dmtcp_coordinator.cpp:1020 in onConnect; REASON='worker connected'
     hello_remote.from = 1313a2c6-15537-4f563b29(-1)
[15518] NOTE at dmtcp_coordinator.cpp:1026 in onConnect; REASON='CheckpointInterval Updated'
     oldInterval = 0
     theCheckpointInterval = 0
[15518] NOTE at dmtcp_coordinator.cpp:880 in onData; REASON='Updating process Information after fork()'
     hostname = meiner
     progname = python2.7_(forked)
     msg.from.pid() = 1313a2c6-15544-4f563b29
     client->identity() = 1313a2c6-15537-4f563b29
[15518] NOTE at dmtcp_coordinator.cpp:1020 in onConnect; REASON='worker connected'
     hello_remote.from = 1313a2c6-15537-4f563b29(-1)
[15518] NOTE at dmtcp_coordinator.cpp:1026 in onConnect; REASON='CheckpointInterval Updated'
     oldInterval = 0
     theCheckpointInterval = 0
[15518] NOTE at dmtcp_coordinator.cpp:880 in onData; REASON='Updating process Information after fork()'
     hostname = meiner
     progname = python2.7_(forked)
     msg.from.pid() = 1313a2c6-15546-4f563b29
     client->identity() = 1313a2c6-15537-4f563b29
c
[15518] NOTE at dmtcp_coordinator.cpp:1294 in startCheckpoint; REASON='starting checkpoint, suspending all nodes'
     s.numPeers = 3
[15518] NOTE at dmtcp_coordinator.cpp:1296 in startCheckpoint; REASON='Incremented Generation'
     UniquePid::ComputationId().generation() = 1
[15518] NOTE at dmtcp_coordinator.cpp:630 in onData; REASON='locking all nodes'
[15518] NOTE at dmtcp_coordinator.cpp:665 in onData; REASON='draining all nodes'
[15518] NOTE at dmtcp_coordinator.cpp:671 in onData; REASON='checkpointing all nodes'
[15518] NOTE at dmtcp_coordinator.cpp:681 in onData; REASON='building name service database'
[15518] NOTE at dmtcp_coordinator.cpp:700 in onData; REASON='entertaining queries now'
[15518] NOTE at dmtcp_coordinator.cpp:705 in onData; REASON='refilling all nodes'
[15518] NOTE at dmtcp_coordinator.cpp:734 in onData; REASON='restarting all nodes'
[15518] NOTE at dmtcp_coordinator.cpp:905 in onDisconnect; REASON='client disconnected'
     client.identity() = 1313a2c6-15537-4f563b29
[15518] NOTE at dmtcp_coordinator.cpp:905 in onDisconnect; REASON='client disconnected'
     client.identity() = 1313a2c6-15546-4f563b29
[15518] NOTE at dmtcp_coordinator.cpp:905 in onDisconnect; REASON='client disconnected'
     client.identity() = 1313a2c6-15544-4f563b29
^C[15518] NOTE at dmtcp_coordinator.cpp:522 in handleUserCommand; REASON='killing all connected peers and quitting ...'
DMTCP coordinator exiting... (per request)

dmtcp_checkpoint
================

michael@meiner ~/debian/dmtcptest % dmtcp_checkpoint python pproc_runner.py 2
dmtcp_checkpoint (DMTCP + MTCP) 1.2.4
Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and Gene Cooperman
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING file for details.
(Use flag "-q" to hide this message.)

A
B
A
B
A
B
A
B
A
B
Traceback (most recent call last):
  File "pproc_runner.py", line 25, in <module>
    results = np.hstack(p_results)
  File "/usr/lib/pymodules/python2.7/numpy/core/shape_base.py", line 258, in hstack
    return _nx.concatenate(map(atleast_1d,tup),1)
  File "/usr/lib/pymodules/python2.7/pprocess.py", line 757, in next
    self.store()
  File "/usr/lib/pymodules/python2.7/pprocess.py", line 396, in store
    for channel in self.ready(timeout):
  File "/usr/lib/pymodules/python2.7/pprocess.py", line 304, in ready
    fds = self.poller.poll(timeout)
select.error: (4, 'Interrupted system call')

dmtcp_restart_script.sh
=======================

michael@meiner ~/debian/dmtcptest % ./dmtcp_restart_script.sh
dmtcp_checkpoint (DMTCP + MTCP) 1.2.4
Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and Gene Cooperman
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING file for details.
(Use flag "-q" to hide this message.)

dmtcp_coordinator starting...
    Port: 7779
    Checkpoint Interval: disabled (checkpoint manually instead)
    Exit on last client: 1
Backgrounding...
B
Traceback (most recent call last):
  File "pproc_runner.py", line 25, in <module>
    results = np.hstack(p_results)
  File "/usr/lib/pymodules/python2.7/numpy/core/shape_base.py", line 258, in hstack
    return _nx.concatenate(map(atleast_1d,tup),1)
  File "/usr/lib/pymodules/python2.7/pprocess.py", line 757, in next
    self.store()
  File "/usr/lib/pymodules/python2.7/pprocess.py", line 396, in store
    for channel in self.ready(timeout):
  File "/usr/lib/pymodules/python2.7/pprocess.py", line 304, in ready
A
    fds = self.poller.poll(timeout)
select.error: (4, 'Interrupted system call')

-- System Information:
Debian Release: wheezy/sid
  APT prefers testing
  APT policy: (500, 'testing'), (1, 'experimental')
Architecture: i386 (i686)

Kernel: Linux 3.1.0-1-686-pae (SMP w/2 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages dmtcp depends on:
ii  libc6       2.13-26
ii  libgcc1     1:4.6.1-4
ii  libmtcp1    1.2.4-1
ii  libstdc++6  4.6.1-4

dmtcp recommends no packages.

dmtcp suggests no packages.

-- no debconf information

import pprocess
import numpy as np
import time
import sys

def dummy(printme):
    results = []
    for p in printme:
        print p
        results.append(p)
        time.sleep(2)
    return results

# get number of processes from arg
nelement = 10
blocks = ('A', 'B', 'C', 'D')
nproc = int(sys.argv[1])

if nproc > 1:
    p_results = pprocess.Map(limit=nproc)
    compute = p_results.manage(
                pprocess.MakeParallel(dummy))
    for block in blocks:
        compute(np.repeat(block, nelement))
    results = np.hstack(p_results)
else:
    results = np.hstack([dummy(np.repeat(block, nelement))
                            for block in blocks])

# collect results
for block in blocks:
    assert np.sum(results == block) == nelement
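
A note on the failure itself: error 4 in both tracebacks is EINTR, which poll() reports when a signal handler runs while the call is blocked. Whether the checkpoint signal is what interrupts pprocess here is an assumption on my part, not something the logs prove. Under that assumption, a caller-side workaround would be to retry the interrupted poll. The helper below is only a sketch: poll_retrying_eintr and max_retries are made-up names, it is not part of pprocess or DMTCP, and it is not the upstream fix.

```python
import errno
import select


def poll_retrying_eintr(poller, timeout=None, max_retries=100):
    """Call poller.poll(timeout), retrying if a signal interrupts it.

    Hypothetical helper for illustration; not part of pprocess or DMTCP.
    Each retry waits for the full timeout again, so the total wait can
    exceed `timeout` when interruptions occur.
    """
    for _ in range(max_retries):
        try:
            return poller.poll(timeout)
        except (OSError, select.error) as exc:
            # select.error is a separate class on Python 2 and an alias
            # of OSError on Python 3; both carry the errno in args[0].
            if exc.args and exc.args[0] == errno.EINTR:
                continue
            # Any other error is a real failure and is passed through.
            raise
    raise RuntimeError("poll() was interrupted %d times in a row" % max_retries)
```

The retry only matters on Python 2, as used in this report; Python 3.5 and later already retry an interrupted poll() internally (PEP 475).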