Hello Kapil,

Removing the --join flag and adding --host did the trick. Thank you for
that.
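
For reference, a sketch of the restart invocation described above. It assumes
the coordinator address, port, and checkpoint file name given in the original
message quoted further down; the exact command line that was run is not
recorded in this thread. The flag spellings are those of DMTCP 2.3.1 and may
differ in other versions. The helper name build_restart_cmd is hypothetical.

```shell
#!/bin/sh
# Assumed values, copied from the original message; adjust for your setup.
COORD_HOST=192.168.32.192   # IP of the machine running dmtcp_coordinator
COORD_PORT=7779             # port the coordinator listens on

# Build a dmtcp_restart command line for one checkpoint image.
# There is no --join flag; the coordinator is named with --host and --port.
build_restart_cmd() {
    printf 'dmtcp_restart --host %s --port %s %s\n' \
        "$COORD_HOST" "$COORD_PORT" "$1"
}

# Print the command instead of running it, so it can be checked first.
build_restart_cmd ckpt_dmtcp1_16886b7f9e541c55-40000-5457a7f2.dmtcp
```

Each node would use its own checkpoint image as the argument.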

I am using DMTCP 2.3.1 on my systems. Is the git version more stable than
the one I am using?

Thanks and Regards

Nitinder Mohan
MTech (CE) IIIT Delhi
http://home.iiitd.edu.in/~nitinder1369/

On Mon, Nov 3, 2014 at 10:31 PM, Kapil Arya <[email protected]> wrote:

> Nitinder,
>
> Can you try specifying --host for dmtcp_restart as well?  The next thing
> to try would then be dropping the --join flag.
>
> Finally, what DMTCP version are you using?  I would encourage you to check
> out our git repository and try with that.  Here are the commands:
>
> git clone https://github.com/dmtcp/dmtcp.git dmtcp-git
> cd dmtcp-git
> ./configure && make
>
> Kapil
>
> On Mon, Nov 3, 2014 at 11:32 AM, Nitinder Mohan <[email protected]>
> wrote:
>
>> Dear All,
>>
>> I am trying to use DMTCP and still learning to use it. I want to
>> checkpoint across multiple nodes using IP addresses. I am starting small,
>> with only two nodes to checkpoint. The application that I am trying to
>> checkpoint is the sample app "dmtcp1". This is what I have done so far:
>>
>> 1. dmtcp_coordinator is running on one of the nodes.
>>
>> 2. Local Node [Node 1] is connected to the coordinator using the following
>> command:
>>       dmtcp_checkpoint --host 127.0.1.1 --port 7779 test/dmtcp1
>>       (as the coordinator was started on host address 127.0.1.1)
>>
>> 3. Remote Node [Node 2] is connected to the coordinator using the command:
>>       dmtcp_checkpoint --host 192.168.32.192 --port 7779 test/dmtcp1
>>       (as the coordinator machine's IP address is 192.168.32.192)
>>
>> 4. Both machines are connected to the coordinator and counting.
>>
>> 5. Stop Node 1 and Node 2 (note that the coordinator is still up and running).
>>
>> Now, the problem comes into play when restarting:
>>
>> *Step 1:* Restart on Node 1 using command:
>>       dmtcp_restart --join ckpt_dmtcp1_16886b7f9e541c55-40000-5457a7f2.dmtcp
>>       (Note the join flag for joining the running coordinator)
>>
>> This is the output shown:
>>
>> [2766] ERROR at coordinatorapi.cpp:567 in sendRecvHandshake;
>> REASON='JASSERT(msg.type == DMT_ACCEPT) failed'
>> dmtcp_restart (2766): Terminating...
>>
>> *Step 2:* Restart on Node 2 using command:
>>       dmtcp_restart --join ckpt_dmtcp1_16886b7f9e541c55-41000-5457f5f2.dmtcp
>>
>> This is the output I get:
>>
>> dmtcp_coordinator starting...
>>     Host: iiitd-HP-Compaq-8200-Elite-MT-PC (127.0.1.1)
>>     Port: 7779
>>     Checkpoint Interval: disabled (checkpoint manually instead)
>>     Exit on last client: 1
>>
>> I am pretty sure I am missing something small and trivial.
>>
>> Any help will be deeply appreciated.
>>
>> Thanks and Regards
>>
>> Nitinder Mohan
>> MTech (CE) IIIT Delhi
>> http://home.iiitd.edu.in/~nitinder1369/
>>
>
>
------------------------------------------------------------------------------
_______________________________________________
Dmtcp-forum mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
