Nitinder,

Can you try specifying --host for dmtcp_restart as well?  If that doesn't
help, the next thing to try would be dropping the --join flag.
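
As a rough sketch of those two attempts: the host, port, and checkpoint
file name below are taken from your message, and I am assuming that
dmtcp_restart in your version accepts --port the same way dmtcp_checkpoint
does. restart_cmd is only a hypothetical helper that prints the command
instead of running it, so you can check it before pasting.

```shell
# Values copied from the quoted message; adjust them for your setup.
COORD_HOST=192.168.32.192   # coordinator machine's IP address
COORD_PORT=7779             # coordinator port
CKPT=ckpt_dmtcp1_16886b7f9e541c55-41000-5457f5f2.dmtcp

# Hypothetical helper: prints the restart command rather than executing it.
# $1 = coordinator host, $2 = coordinator port, $3 = checkpoint image,
# $4 = optional extra flag such as --join
restart_cmd() {
  echo "dmtcp_restart --host $1 --port $2 ${4:+$4 }$3"
}

# Attempt 1: explicit coordinator address, keeping --join.
restart_cmd "$COORD_HOST" "$COORD_PORT" "$CKPT" --join

# Attempt 2: same address, with --join dropped.
restart_cmd "$COORD_HOST" "$COORD_PORT" "$CKPT"
```

Run the printed command on each node with that node's own checkpoint file.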

Finally, what DMTCP version are you using?  I would encourage you to check
out our git repository and try with that.  Here are the commands:

git clone https://github.com/dmtcp/dmtcp.git dmtcp-git
cd dmtcp-git
./configure && make

Kapil

On Mon, Nov 3, 2014 at 11:32 AM, Nitinder Mohan <[email protected]>
wrote:

> Dear All,
>
> I am trying to use DMTCP and still learning to use it. I want to
> checkpoint across multiple nodes using IP addresses. I am starting small,
> with only two nodes to checkpoint. The application that I am trying to
> checkpoint is sample app "dmtcp1". This is what I have done so far:
>
> 1. dmtcp_coordinator is running on one of the nodes.
>
> 2. Local Node [Node 1] is connected to coordinator using following command:
>                  dmtcp_checkpoint --host 127.0.1.1 --port 7779 test/dmtcp1
>                      (as coordinator started on host address 127.0.1.1)
>
> 3. Remote Node [Node 2] is connected to coordinator using command:
>                     dmtcp_checkpoint --host 192.168.32.192 --port 7779 test/dmtcp1
>                     (as coordinator machine's IP address is 192.168.32.192)
>
> 4. Both the machines are connected to coordinator and counting.
>
> 5. Stop Node1 and Node 2 (Note that coordinator is still up and running)
>
> Now, the problem comes into play when restarting:
>
> *Step 1:* Restart on Node 1 using command:
>       dmtcp_restart --join ckpt_dmtcp1_16886b7f9e541c55-40000-5457a7f2.dmtcp
>       (Note the join flag for joining to running coordinator)
>
> This is the output shown:
>
> [2766] ERROR at coordinatorapi.cpp:567 in sendRecvHandshake;
> REASON='JASSERT(msg.type == DMT_ACCEPT) failed'
> dmtcp_restart (2766): Terminating...
>
> *Step 2:* Restart on Node 2 using command:
> dmtcp_restart --join ckpt_dmtcp1_16886b7f9e541c55-41000-5457f5f2.dmtcp
>
> This is the output I get:
>
> dmtcp_coordinator starting...
>     Host: iiitd-HP-Compaq-8200-Elite-MT-PC (127.0.1.1)
>     Port: 7779
>     Checkpoint Interval: disabled (checkpoint manually instead)
>     Exit on last client: 1
>
> I am pretty sure I am missing something small and trivial.
>
> Any help will be deeply appreciated.
>
> Thanks and Regards
>
> Nitinder Mohan
> MTech (CE) IIIT Delhi
> http://home.iiitd.edu.in/~nitinder1369/
>
------------------------------------------------------------------------------
_______________________________________________
Dmtcp-forum mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
