Hi Marina,

Could you create the following directory and then restart from the same
checkpoint image as in the previous email?

    /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1

Apparently, DMTCP is unable to recreate this directory path on restart, and
that is why it cannot create the shared_mem_pool file inside it (the trailing
"2" in the error is presumably errno 2, ENOENT).  Once we confirm that this is
indeed the problem, I will try to come up with a fix by tomorrow.
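
To illustrate the suspected failure mode, here is a minimal sketch that
reproduces it outside DMTCP. It assumes only a POSIX shell with mktemp; the
scratch directory and the file name "shared_mem_pool.demo" are hypothetical
stand-ins for the real Open MPI session directory, and this is not how
mtcp_restart itself opens the file.

```shell
# Scratch location standing in for /tmp/openmpi-sessions-<user>@<host>_0 (hypothetical).
base=$(mktemp -d)
dir="$base/openmpi-sessions-demo/7859/1"
file="$dir/shared_mem_pool.demo"

# Creating a file fails while its parent directories do not exist (ENOENT).
# The subshell keeps a failed redirection from terminating the script.
if ( : > "$file" ) 2>/dev/null; then before=created; else before=failed; fi
echo "before mkdir -p: $before"

# Create the whole directory path, then retry the same file creation.
mkdir -p "$dir"
if ( : > "$file" ) 2>/dev/null; then after=created; else after=failed; fi
echo "after mkdir -p: $after"

# Remove the scratch directory again.
rm -rf "$base"
```

If the diagnosis is right, running "mkdir -p" on the real session directory
before dmtcp_restart should have the same effect as the second half of the
sketch.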

Kapil

On Mon, Oct 27, 2014 at 6:00 PM, Marina Moran <[email protected]>
wrote:

> Hi Kapil!
>
> It is the same as before.
>
> hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ ~/dmtcp-trunk/bin/dmtcp_restart
> ckpt_*.dmtcp
> [7894] mtcp_restart.c:1310 open_shared_file:
>   unable to create file
> /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1/shared_mem_pool.m112a: 2
> [7892] mtcp_restart.c:1310 open_shared_file:
>   unable to create file
> /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1/shared_mem_pool.m112a: 2
> [7895] mtcp_restart.c:1310 open_shared_file:
>   unable to create file
> /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1/shared_mem_pool.m112a: 2
> [7893] mtcp_restart.c:1310 open_shared_file:
>   unable to create file
> /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1/shared_mem_pool.m112a: 2
>
>
> and at the coordinator:
> [7832] NOTE at dmtcp_coordinator.cpp:1096 in
> validateRestartingWorkerProcess; REASON='FIRST dmtcp_restart
> connection.  Set numPeers. Generate timestamp'
>      numPeers = 5
>      curTimeStamp = 22631325571
>      compId = 1310c956110-40000-544ee9ab
> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> connected'
>      hello_remote.from = 1310c956110-40000-544ee9ab
> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> connected'
>      hello_remote.from = 1310c956110-41000-544ee9ab
> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> connected'
>      hello_remote.from = 1310c956110-42000-544ee9ab
> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> connected'
>      hello_remote.from = 1310c956110-43000-544ee9ab
> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> connected'
>      hello_remote.from = 1310c956110-44000-544ee9ab
> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> REASON='client disconnected'
>      client->identity() = 1310c956110-43000-544ee9ab
> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> REASON='client disconnected'
>      client->identity() = 1310c956110-41000-544ee9ab
> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> REASON='client disconnected'
>      client->identity() = 1310c956110-44000-544ee9ab
> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> REASON='client disconnected'
>      client->identity() = 1310c956110-42000-544ee9ab
> l
> Client List:
> #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
> 11, orterun[40000:7881]@m112a, 1310c956110-40000-544ee9ab, CHECKPOINTED
>
>
> On 10/27/14, Kapil Arya <[email protected]> wrote:
> > Hi Marina,
> >
> > Could you do the following and then reproduce the error and send us the
> > output:
> >
> >     git clone https://github.com/dmtcp/dmtcp.git dmtcp-trunk
> >     cd dmtcp-trunk
> >     ./configure
> >     make
> >
> > Now use this code to run your tests.
> >
> > This will pull the latest trunk to allow us to diagnose the error.
> >
> > Kapil
> >
> > On Mon, Oct 27, 2014 at 8:24 PM, Marina Moran
> > <[email protected]>
> > wrote:
> >
> >> Hi everyone:
> >>
> >> I have a node (Intel i5, 4 cores) with the following setup:
> >> Debian jessie amd64
> >> OpenMPI 1.6.5
> >> DMTCP: 2.3.1
> >> NAS benchmarks
> >>
> >> My first attempt uses one node (four processes):
> >>
> >> I started the coordinator in one terminal:
> >>
> >>     hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$dmtcp_coordinator
> >>
> >>
> >> In another terminal I launch the program:
> >>
> >>     hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$dmtcp_launch mpirun -np 4 lu.A.4
> >>
> >>
> >> In another terminal I call the checkpoint:
> >>     hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ dmtcp_command --checkpoint
> >>
> >>
> >> Then I call the restart script, where it hangs:
> >>
> >>    hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ ./dmtcp_restart_script.sh
> >>  [1057] mtcp_restart.c:1303 open_shared_file:
> >>   unable to create file
> >> /tmp/openmpi-sessions-hpcpro@m112a_0/16803/1/shared_mem_pool.m112a
> >> [1058] mtcp_restart.c:1303 open_shared_file:
> >>   unable to create file
> >> /tmp/openmpi-sessions-hpcpro@m112a_0/16803/1/shared_mem_pool.m112a
> >> [1060] mtcp_restart.c:1303 open_shared_file:
> >>   unable to create file
> >> /tmp/openmpi-sessions-hpcpro@m112a_0/16803/1/shared_mem_pool.m112a
> >> [1059] mtcp_restart.c:1303 open_shared_file:
> >>   unable to create file
> >> /tmp/openmpi-sessions-hpcpro@m112a_0/16803/1/shared_mem_pool.m112a
> >>
> >>
> >> Meanwhile, the coordinator window shows this:
> >>
> >> [964] NOTE at dmtcp_coordinator.cpp:1096 in
> >> validateRestartingWorkerProcess; REASON='FIRST dmtcp_restart
> >> connection.  Set numPeers. Generate timestamp'
> >>      numPeers = 5
> >>      curTimeStamp = 22631250933
> >>      compId = 1310c956110-60000-544ed77d
> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> >> connected'
> >>      hello_remote.from = 1310c956110-60000-544ed77d
> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> >> connected'
> >>      hello_remote.from = 1310c956110-61000-544ed77d
> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> >> connected'
> >>      hello_remote.from = 1310c956110-62000-544ed77d
> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> >> connected'
> >>      hello_remote.from = 1310c956110-63000-544ed77d
> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> >> connected'
> >>      hello_remote.from = 1310c956110-64000-544ed77d
> >> [964] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> >> REASON='client disconnected'
> >>      client->identity() = 1310c956110-63000-544ed77d
> >> [964] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> >> REASON='client disconnected'
> >>      client->identity() = 1310c956110-62000-544ed77d
> >> [964] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> >> REASON='client disconnected'
> >>      client->identity() = 1310c956110-64000-544ed77d
> >> [964] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> >> REASON='client disconnected'
> >>      client->identity() = 1310c956110-61000-544ed77d
> >> l
> >> Client List:
> >> #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
> >> 41, orterun[60000:1405]@m112a, 1310c956110-60000-544ed77d, CHECKPOINTED
> >>
> >>
> >> I searched this forum and the internet for this error but had no
> >> luck. Any help would be greatly appreciated!
> >>
> >> Regards,
> >> Marina
> >>
> >>
> >>
> >> ------------------------------------------------------------------------------
> >> _______________________________________________
> >> Dmtcp-forum mailing list
> >> [email protected]
> >> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
> >>
> >
>
