I created the folder named "1" inside /tmp/openmpi-sessions-hpcpro@m112a_0/7859, which already existed, and it works!
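For anyone who hits the same error, the workaround above can be sketched as a small shell helper. This is only a sketch under assumptions: the function name ensure_session_dir is hypothetical (it is not part of DMTCP or Open MPI), and the trailing "2" in the restart error is assumed to be errno ENOENT (missing path component), which is consistent with the fix but not confirmed in the thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of DMTCP): make sure the Open MPI session
# directory recorded in the checkpoint image exists before restarting.
ensure_session_dir() {
    # $1: directory in which the restart needs to create its shared file
    # mkdir -p creates missing parent directories and succeeds if the
    # directory already exists; the two tests then confirm that it is a
    # directory and that it is writable.
    mkdir -p -- "$1" && [ -d "$1" ] && [ -w "$1" ]
}
```

With the paths from this thread, the call would be: ensure_session_dir "/tmp/openmpi-sessions-hpcpro@m112a_0/7859/1" && ~/dmtcp-trunk/bin/dmtcp_restart ckpt_*.dmtcp. The job number (7859 here) differs per run, so it has to be taken from the error message of the failed restart.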
On 10/27/14, Kapil Arya <[email protected]> wrote:
> Hi Marina,
>
> Can you restart from the same checkpoint image as in the previous email
> after creating the following directory
> "/tmp/openmpi-sessions-hpcpro@m112a_0/7859/1"?
>
> Apparently, DMTCP is unable to create the directory path and that's why it
> can't create the file in there. Once we confirm that this is indeed the
> problem, I will try to come up with a fix by tomorrow.
>
> Kapil
>
> On Mon, Oct 27, 2014 at 6:00 PM, Marina Moran
> <[email protected]>
> wrote:
>
>> Hi Kapil!
>>
>> It is the same as before.
>>
>> hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ ~/dmtcp-trunk/bin/dmtcp_restart
>> ckpt_*.dmtcp
>> [7894] mtcp_restart.c:1310 open_shared_file:
>> unable to create file
>> /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1/shared_mem_pool.m112a: 2
>> [7892] mtcp_restart.c:1310 open_shared_file:
>> unable to create file
>> /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1/shared_mem_pool.m112a: 2
>> [7895] mtcp_restart.c:1310 open_shared_file:
>> unable to create file
>> /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1/shared_mem_pool.m112a: 2
>> [7893] mtcp_restart.c:1310 open_shared_file:
>> unable to create file
>> /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1/shared_mem_pool.m112a: 2
>>
>>
>> and at the coordinator:
>> [7832] NOTE at dmtcp_coordinator.cpp:1096 in
>> validateRestartingWorkerProcess; REASON='FIRST dmtcp_restart
>> connection. Set numPeers. Generate timestamp'
>> numPeers = 5
>> curTimeStamp = 22631325571
>> compId = 1310c956110-40000-544ee9ab
>> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>> connected'
>> hello_remote.from = 1310c956110-40000-544ee9ab
>> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>> connected'
>> hello_remote.from = 1310c956110-41000-544ee9ab
>> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>> connected'
>> hello_remote.from = 1310c956110-42000-544ee9ab
>> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>> connected'
>> hello_remote.from = 1310c956110-43000-544ee9ab
>> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>> connected'
>> hello_remote.from = 1310c956110-44000-544ee9ab
>> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
>> REASON='client disconnected'
>> client->identity() = 1310c956110-43000-544ee9ab
>> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
>> REASON='client disconnected'
>> client->identity() = 1310c956110-41000-544ee9ab
>> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
>> REASON='client disconnected'
>> client->identity() = 1310c956110-44000-544ee9ab
>> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
>> REASON='client disconnected'
>> client->identity() = 1310c956110-42000-544ee9ab
>> l
>> Client List:
>> #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
>> 11, orterun[40000:7881]@m112a, 1310c956110-40000-544ee9ab, CHECKPOINTED
>>
>>
>> On 10/27/14, Kapil Arya <[email protected]> wrote:
>> > Hi Marina,
>> >
>> > Could you do the following and then reproduce the error and send us the
>> > output:
>> >
>> > git clone https://github.com/dmtcp/dmtcp.git dmtcp-trunk
>> > cd dmtcp-trunk
>> > ./configure
>> > make
>> >
>> > Now use this code to run your tests.
>> >
>> > This will pull the latest trunk to allow us to diagnose the error.
>> >
>> > Kapil
>> >
>> > On Mon, Oct 27, 2014 at 8:24 PM, Marina Moran
>> > <[email protected]>
>> > wrote:
>> >
>> >> Hi everyone:
>> >>
>> >> I have a node (Intel i5) with 4 cores with:
>> >> Debian jessie amd64
>> >> OpenMPI 1.6.5
>> >> DMTCP: 2.3.1
>> >> NAS benchmarks
>> >>
>> >> My first try is using one node (four processes):
>> >>
>> >> I started the coordinator in one terminal:
>> >>
>> >> hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$dmtcp_coordinator
>> >>
>> >>
>> >> In another terminal I launch the program:
>> >>
>> >> hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$dmtcp_launch mpirun -np 4 lu.A.4
>> >>
>> >>
>> >> In another terminal I request the checkpoint:
>> >> hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ dmtcp_command --checkpoint
>> >>
>> >>
>> >> Then I call the restart script, where it hangs:
>> >>
>> >> hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ ./dmtcp_restart_script.sh
>> >> [1057] mtcp_restart.c:1303 open_shared_file:
>> >> unable to create file
>> >> /tmp/openmpi-sessions-hpcpro@m112a_0/16803/1/shared_mem_pool.m112a
>> >> [1058] mtcp_restart.c:1303 open_shared_file:
>> >> unable to create file
>> >> /tmp/openmpi-sessions-hpcpro@m112a_0/16803/1/shared_mem_pool.m112a
>> >> [1060] mtcp_restart.c:1303 open_shared_file:
>> >> unable to create file
>> >> /tmp/openmpi-sessions-hpcpro@m112a_0/16803/1/shared_mem_pool.m112a
>> >> [1059] mtcp_restart.c:1303 open_shared_file:
>> >> unable to create file
>> >> /tmp/openmpi-sessions-hpcpro@m112a_0/16803/1/shared_mem_pool.m112a
>> >>
>> >>
>> >> Meanwhile, the coordinator window shows this:
>> >>
>> >> [964] NOTE at dmtcp_coordinator.cpp:1096 in
>> >> validateRestartingWorkerProcess; REASON='FIRST dmtcp_restart
>> >> connection. Set numPeers. Generate timestamp'
>> >> numPeers = 5
>> >> curTimeStamp = 22631250933
>> >> compId = 1310c956110-60000-544ed77d
>> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>> >> connected'
>> >> hello_remote.from = 1310c956110-60000-544ed77d
>> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>> >> connected'
>> >> hello_remote.from = 1310c956110-61000-544ed77d
>> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>> >> connected'
>> >> hello_remote.from = 1310c956110-62000-544ed77d
>> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>> >> connected'
>> >> hello_remote.from = 1310c956110-63000-544ed77d
>> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>> >> connected'
>> >> hello_remote.from = 1310c956110-64000-544ed77d
>> >> [964] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
>> >> REASON='client disconnected'
>> >> client->identity() = 1310c956110-63000-544ed77d
>> >> [964] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
>> >> REASON='client disconnected'
>> >> client->identity() = 1310c956110-62000-544ed77d
>> >> [964] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
>> >> REASON='client disconnected'
>> >> client->identity() = 1310c956110-64000-544ed77d
>> >> [964] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
>> >> REASON='client disconnected'
>> >> client->identity() = 1310c956110-61000-544ed77d
>> >> l
>> >> Client List:
>> >> #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
>> >> 41, orterun[60000:1405]@m112a, 1310c956110-60000-544ed77d,
>> >> CHECKPOINTED
>> >>
>> >>
>> >> I searched this forum and the internet for this error but had no
>> >> luck. Any help would be much appreciated!
>> >>
>> >> Regards,
>> >> Marina
>> >>
>> >> ------------------------------------------------------------------------------
>> >> _______________________________________________
>> >> Dmtcp-forum mailing list
>> >> [email protected]
>> >> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
