Hello Kapil,

Removing the --join flag and adding --host did the trick. Thank you for that.
I am using dmtcp 2.3.1 on my systems. Is the git version more stable than the one I am using?

Thanks and Regards

Nitinder Mohan
MTech (CE) IIIT Delhi
http://home.iiitd.edu.in/~nitinder1369/

On Mon, Nov 3, 2014 at 10:31 PM, Kapil Arya <[email protected]> wrote:

> Nitinder,
>
> Can you try by specifying --host for dmtcp_restart as well? Next thing
> would then be to drop the --join flag.
>
> Finally, what DMTCP version are you using? I would encourage to checkout
> our git repository and try with that. Here are the commands:
>
> git clone https://github.com/dmtcp/dmtcp.git dmtcp-git
> cd dmtcp-git
> ./configure && make
>
> Kapil
>
> On Mon, Nov 3, 2014 at 11:32 AM, Nitinder Mohan <[email protected]>
> wrote:
>
>> Dear All,
>>
>> I am trying to use DMTCP and still learning to use it. I want to
>> checkpoint across multiple nodes using IP addresses. I am starting small,
>> with only two nodes to checkpoint. The application that I am trying to
>> checkpoint is sample app "dmtcp1". This is what I have done so far:
>>
>> 1. dmtcp_coordinator is running on one of the nodes.
>>
>> 2. Local Node [Node 1] is connected to coordinator using following
>>    command:
>>    dmtcp_checkpoint --host 127.0.1.1 --port 7779 test/dmtcp1
>>    (as coordinator started on host address 127.0.1.1)
>>
>> 3. Remote Node [Node 2] is connected to coordinator using command:
>>    dmtcp_checkpoint --host 192.168.32.192 --port 7779 test/dmtcp1
>>    (as coordinator machine's IP address is 192.168.32.192)
>>
>> 4. Both the machines are connected to coordinator and counting.
>>
>> 5. Stop Node1 and Node 2 (Note that coordinator is still up and running)
>>
>> Now, the problem comes into play when restarting:
>>
>> *Step 1:* Restart on Node 1 using command:
>> dmtcp_restart --join ckpt_dmtcp1_16886b7f9e541c55-40000-5457a7f2.dmtcp
>> (Note the join flag for joining to running coordinator)
>>
>> This is the output shown:
>>
>> [2766] ERROR at coordinatorapi.cpp:567 in sendRecvHandshake;
>> REASON='JASSERT(msg.type == DMT_ACCEPT) failed'
>> dmtcp_restart (2766): Terminating...
>>
>> *Step 2:* Restart on Node 2 using command:
>> dmtcp_restart --join ckpt_dmtcp1_16886b7f9e541c55-41000-5457f5f2.dmtcp
>>
>> This is the output I get:
>>
>> dmtcp_coordinator starting...
>> Host: iiitd-HP-Compaq-8200-Elite-MT-PC (127.0.1.1)
>> Port: 7779
>> Checkpoint Interval: disabled (checkpoint manually instead)
>> Exit on last client: 1
>>
>> I am pretty sure I am missing something small and trivial.
>>
>> Any help will be deeply appreciated.
>>
>> Thanks and Regards
>>
>> Nitinder Mohan
>> MTech (CE) IIIT Delhi
>> http://home.iiitd.edu.in/~nitinder1369/
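
For anyone landing on this thread later, here is a rough sketch of the restart invocation for Node 2 with the fix applied (no --join, explicit --host). The thread does not say which host value was actually passed, so the coordinator IP below is an assumption taken from the earlier dmtcp_checkpoint command for Node 2. The variable names are only illustrative. The thread also does not say whether --port was given on restart, so it is left out here.

```shell
# Assumed values copied from the original post; adjust for your own setup.
COORD_HOST=192.168.32.192   # assumption: IP of the machine running dmtcp_coordinator
CKPT_IMAGE=ckpt_dmtcp1_16886b7f9e541c55-41000-5457f5f2.dmtcp   # Node 2 checkpoint image

# Build the restart command: --join dropped, coordinator named via --host.
RESTART_CMD="dmtcp_restart --host $COORD_HOST $CKPT_IMAGE"

# Print the command for review instead of running it right away.
echo "$RESTART_CMD"
```

Once the printed command looks right, it can be run as-is on the node that holds the checkpoint image. On Node 1, the same pattern should apply with that node's own checkpoint file and the host address it used to reach the coordinator.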
------------------------------------------------------------------------------
_______________________________________________
Dmtcp-forum mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
