Hi Marina,

Can you restart from the same checkpoint image as in the previous email after creating the following directory: "/tmp/openmpi-sessions-hpcpro@m112a_0/7859/1"?
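
For reference, that step could look roughly like the sketch below. It assumes the session path from the restart error quoted further down and the trunk build under ~/dmtcp-trunk used in the earlier message; the -x guard is only an illustration so the script stops quietly if that binary is not where assumed.

```shell
#!/bin/sh
# Session directory named in the "unable to create file" errors (assumed unchanged).
SESSION_DIR="/tmp/openmpi-sessions-hpcpro@m112a_0/7859/1"

# -p creates every missing parent component and succeeds if the path already exists.
mkdir -p "$SESSION_DIR"

# Restart from the same checkpoint images, run from the directory that holds them.
# Assumed location of the trunk build; adjust if dmtcp_restart lives elsewhere.
DMTCP_RESTART="$HOME/dmtcp-trunk/bin/dmtcp_restart"
if [ -x "$DMTCP_RESTART" ]; then
    "$DMTCP_RESTART" ckpt_*.dmtcp
fi
```

If the restart then gets past open_shared_file, that would confirm the missing directory is the cause.
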
Apparently, DMTCP is unable to create the directory path and that's why it can't create the file in there. Once we confirm that this is indeed the problem, I will try to come up with a fix by tomorrow.

Kapil

On Mon, Oct 27, 2014 at 6:00 PM, Marina Moran <[email protected]> wrote:
> Hi Kapil!
>
> It is the same as before.
>
> hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ ~/dmtcp-trunk/bin/dmtcp_restart
> ckpt_*.dmtcp
> [7894] mtcp_restart.c:1310 open_shared_file:
> unable to create file
> /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1/shared_mem_pool.m112a: 2
> [7892] mtcp_restart.c:1310 open_shared_file:
> unable to create file
> /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1/shared_mem_pool.m112a: 2
> [7895] mtcp_restart.c:1310 open_shared_file:
> unable to create file
> /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1/shared_mem_pool.m112a: 2
> [7893] mtcp_restart.c:1310 open_shared_file:
> unable to create file
> /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1/shared_mem_pool.m112a: 2
>
>
> and at the coordinator:
> [7832] NOTE at dmtcp_coordinator.cpp:1096 in
> validateRestartingWorkerProcess; REASON='FIRST dmtcp_restart
> connection. Set numPeers. Generate timestamp'
> numPeers = 5
> curTimeStamp = 22631325571
> compId = 1310c956110-40000-544ee9ab
> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> connected'
> hello_remote.from = 1310c956110-40000-544ee9ab
> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> connected'
> hello_remote.from = 1310c956110-41000-544ee9ab
> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> connected'
> hello_remote.from = 1310c956110-42000-544ee9ab
> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> connected'
> hello_remote.from = 1310c956110-43000-544ee9ab
> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> connected'
> hello_remote.from = 1310c956110-44000-544ee9ab
> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> REASON='client disconnected'
> client->identity() = 1310c956110-43000-544ee9ab
> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> REASON='client disconnected'
> client->identity() = 1310c956110-41000-544ee9ab
> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> REASON='client disconnected'
> client->identity() = 1310c956110-44000-544ee9ab
> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> REASON='client disconnected'
> client->identity() = 1310c956110-42000-544ee9ab
> l
> Client List:
> #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
> 11, orterun[40000:7881]@m112a, 1310c956110-40000-544ee9ab, CHECKPOINTED
>
>
> On 10/27/14, Kapil Arya <[email protected]> wrote:
> > Hi Marina,
> >
> > Could you do the following and then reproduce the error and send us the
> > output:
> >
> > git clone https://github.com/dmtcp/dmtcp.git dmtcp-trunk
> > cd dmtcp-trunk
> > ./configure
> > make
> >
> > Now use this code to run your tests.
> >
> > This will pull the latest trunk to allow us to diagnose the error.
> >
> > Kapil
> >
> > On Mon, Oct 27, 2014 at 8:24 PM, Marina Moran
> > <[email protected]>
> > wrote:
> >
> >> Hi everyone:
> >>
> >> I have a node (intel i5) with 4 cores with:
> >> Debian jessie amd64
> >> OpenMPI 1.6.5
> >> DMTCP: 2.3.1
> >> NAS benchmarks
> >>
> >> My first try is using one node (four processes):
> >>
> >> I started the coordinator in one terminal:
> >>
> >> hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$dmtcp_coordinator
> >>
> >>
> >> In another terminal I launch the program:
> >>
> >> hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$dmtcp_launch mpirun -np 4 lu.A.4
> >>
> >>
> >> In another terminal I call the checkpoint:
> >> hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ dmtcp_command --checkpoint
> >>
> >>
> >> Call the restart script, where it hangs out:
> >>
> >> hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ ./dmtcp_restart_script.sh
> >> [1057] mtcp_restart.c:1303 open_shared_file:
> >> unable to create file
> >> /tmp/openmpi-sessions-hpcpro@m112a_0/16803/1/shared_mem_pool.m112a
> >> [1058] mtcp_restart.c:1303 open_shared_file:
> >> unable to create file
> >> /tmp/openmpi-sessions-hpcpro@m112a_0/16803/1/shared_mem_pool.m112a
> >> [1060] mtcp_restart.c:1303 open_shared_file:
> >> unable to create file
> >> /tmp/openmpi-sessions-hpcpro@m112a_0/16803/1/shared_mem_pool.m112a
> >> [1059] mtcp_restart.c:1303 open_shared_file:
> >> unable to create file
> >> /tmp/openmpi-sessions-hpcpro@m112a_0/16803/1/shared_mem_pool.m112a
> >>
> >>
> >> While the coordinator window show this:
> >>
> >> [964] NOTE at dmtcp_coordinator.cpp:1096 in
> >> validateRestartingWorkerProcess; REASON='FIRST dmtcp_restart
> >> connection. Set numPeers. Generate timestamp'
> >> numPeers = 5
> >> curTimeStamp = 22631250933
> >> compId = 1310c956110-60000-544ed77d
> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> >> connected'
> >> hello_remote.from = 1310c956110-60000-544ed77d
> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> >> connected'
> >> hello_remote.from = 1310c956110-61000-544ed77d
> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> >> connected'
> >> hello_remote.from = 1310c956110-62000-544ed77d
> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> >> connected'
> >> hello_remote.from = 1310c956110-63000-544ed77d
> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> >> connected'
> >> hello_remote.from = 1310c956110-64000-544ed77d
> >> [964] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> >> REASON='client disconnected'
> >> client->identity() = 1310c956110-63000-544ed77d
> >> [964] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> >> REASON='client disconnected'
> >> client->identity() = 1310c956110-62000-544ed77d
> >> [964] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> >> REASON='client disconnected'
> >> client->identity() = 1310c956110-64000-544ed77d
> >> [964] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> >> REASON='client disconnected'
> >> client->identity() = 1310c956110-61000-544ed77d
> >> l
> >> Client List:
> >> #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
> >> 41, orterun[60000:1405]@m112a, 1310c956110-60000-544ed77d, CHECKPOINTED
> >>
> >>
> >> I was looking in this foro and internet about this error but can't get
> >> any luck. Any help will be very appreciated!
> >>
> >> Regards,
> >> Marina
> >>
> >>
> >>
> ------------------------------------------------------------------------------
> >> _______________________________________________
> >> Dmtcp-forum mailing list
> >> [email protected]
> >> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
> >>
> >
>
