I created the folder named "1" inside the existing folder
/tmp/openmpi-sessions-hpcpro@m112a_0/7859, and the restart
works!
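
In case it helps anyone else who hits the same message before a fix lands, here is a minimal sketch of the workaround as a shell function. It assumes the restart output was saved to a file and that the missing paths always end in a file named shared_mem_pool.<host>, as in the log quoted below. The function name make_missing_dirs is my own and is not part of DMTCP. The trailing "2" in the error looks like errno 2 (ENOENT, "No such file or directory"), which would match Kapil's diagnosis.

```shell
#!/bin/sh
# make_missing_dirs: hypothetical helper, NOT part of DMTCP.
# Reads saved restart output on stdin, picks out the lines that hold an
# absolute path to a shared_mem_pool.* file, and creates the parent
# directory of each one so the restart can create the file there.
make_missing_dirs() {
    # Keep only the directory part of each matching path,
    # e.g. /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1
    sed -n 's|^\(/.*\)/shared_mem_pool\..*$|\1|p' |
        sort -u |
        while IFS= read -r dir; do
            mkdir -p -- "$dir"
        done
}

# Example (assumes the failed restart output was saved to restart.log):
#   make_missing_dirs < restart.log
#   ~/dmtcp-trunk/bin/dmtcp_restart ckpt_*.dmtcp
```

For a single known path, a plain mkdir -p "/tmp/openmpi-sessions-hpcpro@m112a_0/7859/1" before running the restart does the same thing.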



On 10/27/14, Kapil Arya <[email protected]> wrote:
> Hi Marina,
>
> Can you restart from the same checkpoint image as in the previous email
> after creating the following directory
> "/tmp/openmpi-sessions-hpcpro@m112a_0/7859/1"
> ?
>
> Apparently, DMTCP is unable to create the directory path and that's why it
> can't create the file in there.  Once we confirm that this is indeed the
> problem, I will try to come up with a fix by tomorrow.
>
> Kapil
>
> On Mon, Oct 27, 2014 at 6:00 PM, Marina Moran
> <[email protected]>
> wrote:
>
>> Hi Kapil!
>>
>> It is the same as before.
>>
>> hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ ~/dmtcp-trunk/bin/dmtcp_restart
>> ckpt_*.dmtcp
>> [7894] mtcp_restart.c:1310 open_shared_file:
>>   unable to create file
>> /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1/shared_mem_pool.m112a: 2
>> [7892] mtcp_restart.c:1310 open_shared_file:
>>   unable to create file
>> /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1/shared_mem_pool.m112a: 2
>> [7895] mtcp_restart.c:1310 open_shared_file:
>>   unable to create file
>> /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1/shared_mem_pool.m112a: 2
>> [7893] mtcp_restart.c:1310 open_shared_file:
>>   unable to create file
>> /tmp/openmpi-sessions-hpcpro@m112a_0/7859/1/shared_mem_pool.m112a: 2
>>
>>
>> and at the coordinator:
>> [7832] NOTE at dmtcp_coordinator.cpp:1096 in
>> validateRestartingWorkerProcess; REASON='FIRST dmtcp_restart
>> connection.  Set numPeers. Generate timestamp'
>>      numPeers = 5
>>      curTimeStamp = 22631325571
>>      compId = 1310c956110-40000-544ee9ab
>> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>> connected'
>>      hello_remote.from = 1310c956110-40000-544ee9ab
>> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>> connected'
>>      hello_remote.from = 1310c956110-41000-544ee9ab
>> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>> connected'
>>      hello_remote.from = 1310c956110-42000-544ee9ab
>> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>> connected'
>>      hello_remote.from = 1310c956110-43000-544ee9ab
>> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>> connected'
>>      hello_remote.from = 1310c956110-44000-544ee9ab
>> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
>> REASON='client disconnected'
>>      client->identity() = 1310c956110-43000-544ee9ab
>> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
>> REASON='client disconnected'
>>      client->identity() = 1310c956110-41000-544ee9ab
>> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
>> REASON='client disconnected'
>>      client->identity() = 1310c956110-44000-544ee9ab
>> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
>> REASON='client disconnected'
>>      client->identity() = 1310c956110-42000-544ee9ab
>> l
>> Client List:
>> #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
>> 11, orterun[40000:7881]@m112a, 1310c956110-40000-544ee9ab, CHECKPOINTED
>>
>>
>> On 10/27/14, Kapil Arya <[email protected]> wrote:
>> > Hi Marina,
>> >
>> > Could you do the following and then reproduce the error and send us the
>> > output:
>> >
>> >     git clone https://github.com/dmtcp/dmtcp.git dmtcp-trunk
>> >     cd dmtcp-trunk
>> >     ./configure
>> >     make
>> >
>> > Now use this code to run your tests.
>> >
>> > This will pull the latest trunk to allow us to diagnose the error.
>> >
>> > Kapil
>> >
>> > On Mon, Oct 27, 2014 at 8:24 PM, Marina Moran
>> > <[email protected]>
>> > wrote:
>> >
>> >> Hi everyone:
>> >>
>> >> I have a 4-core node (Intel i5) running:
>> >> Debian jessie amd64
>> >> OpenMPI 1.6.5
>> >> DMTCP: 2.3.1
>> >> NAS benchmarks
>> >>
>> >> My first try is using one node (four processes):
>> >>
>> >> I started the coordinator in one terminal:
>> >>
>> >>     hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ dmtcp_coordinator
>> >>
>> >>
>> >> In another terminal I launch the program:
>> >>
>> >>     hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ dmtcp_launch mpirun -np 4 lu.A.4
>> >>
>> >>
>> >> In another terminal I call the checkpoint:
>> >>     hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ dmtcp_command --checkpoint
>> >>
>> >>
>> >> Then I call the restart script, which hangs:
>> >>
>> >>    hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ ./dmtcp_restart_script.sh
>> >>  [1057] mtcp_restart.c:1303 open_shared_file:
>> >>   unable to create file
>> >> /tmp/openmpi-sessions-hpcpro@m112a_0/16803/1/shared_mem_pool.m112a
>> >> [1058] mtcp_restart.c:1303 open_shared_file:
>> >>   unable to create file
>> >> /tmp/openmpi-sessions-hpcpro@m112a_0/16803/1/shared_mem_pool.m112a
>> >> [1060] mtcp_restart.c:1303 open_shared_file:
>> >>   unable to create file
>> >> /tmp/openmpi-sessions-hpcpro@m112a_0/16803/1/shared_mem_pool.m112a
>> >> [1059] mtcp_restart.c:1303 open_shared_file:
>> >>   unable to create file
>> >> /tmp/openmpi-sessions-hpcpro@m112a_0/16803/1/shared_mem_pool.m112a
>> >>
>> >>
>> >> Meanwhile, the coordinator window shows this:
>> >>
>> >> [964] NOTE at dmtcp_coordinator.cpp:1096 in
>> >> validateRestartingWorkerProcess; REASON='FIRST dmtcp_restart
>> >> connection.  Set numPeers. Generate timestamp'
>> >>      numPeers = 5
>> >>      curTimeStamp = 22631250933
>> >>      compId = 1310c956110-60000-544ed77d
>> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>> >> connected'
>> >>      hello_remote.from = 1310c956110-60000-544ed77d
>> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>> >> connected'
>> >>      hello_remote.from = 1310c956110-61000-544ed77d
>> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>> >> connected'
>> >>      hello_remote.from = 1310c956110-62000-544ed77d
>> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>> >> connected'
>> >>      hello_remote.from = 1310c956110-63000-544ed77d
>> >> [964] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>> >> connected'
>> >>      hello_remote.from = 1310c956110-64000-544ed77d
>> >> [964] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
>> >> REASON='client disconnected'
>> >>      client->identity() = 1310c956110-63000-544ed77d
>> >> [964] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
>> >> REASON='client disconnected'
>> >>      client->identity() = 1310c956110-62000-544ed77d
>> >> [964] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
>> >> REASON='client disconnected'
>> >>      client->identity() = 1310c956110-64000-544ed77d
>> >> [964] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
>> >> REASON='client disconnected'
>> >>      client->identity() = 1310c956110-61000-544ed77d
>> >> l
>> >> Client List:
>> >> #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
>> >> 41, orterun[60000:1405]@m112a, 1310c956110-60000-544ed77d,
>> >> CHECKPOINTED
>> >>
>> >>
>> >> I searched this forum and the internet for this error but had no
>> >> luck. Any help would be much appreciated!
>> >>
>> >> Regards,
>> >> Marina
>> >>
>> >>
>> >>
>> >> ------------------------------------------------------------------------------
>> >> _______________________________________________
>> >> Dmtcp-forum mailing list
>> >> [email protected]
>> >> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>> >>
>> >
>>
>

