Thanks for the extra information.  Yes, why don't I try to reproduce
the issue locally here at my site.  This will be much easier.
[ I also decided to cc my reply to dmtcp-forum in case this is
  of general interest.  I hope that's okay. ]

First of all, the normal operation of DMTCP is to support checkpoint/restart
of a computation.  A computation is all the processes talking
to one coordinator.  So, normally, if an IRC_server and IRC_client
are part of a single computation, then we would usually checkpoint and
restart both at the same time.  We do this for consistency.
It is possible to use '--join' to join an existing computation,
but then I need to understand better the scenario you're trying to create.

Are you trying to checkpoint IRC_server/IRC_client, kill the IRC_client,
and then restart _only_ the IRC_client?  In that case, it would violate
the model of checkpoint and restart of a single computation.
So, in that case, it would be better to run IRC_server outside of DMTCP,
and checkpoint only IRC_client, and then restart IRC_client.  You
would want IRC_client to disconnect at checkpoint time, and re-connect
at the time of restart or resume.  DMTCP has hooks to help you do that,
and we could talk about that.
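
To make the disconnect/re-connect idea concrete, here is a minimal
sketch in Python.  It is only an illustration of the pattern, not
DMTCP's hook API: the class and the method names before_checkpoint()
and after_resume_or_restart() are hypothetical, and a real client
would call the equivalent logic from whatever hooks DMTCP provides.

```python
import socket


class ReconnectingClient:
    """Toy client that drops its connection before a checkpoint and
    re-opens it afterwards.  All names here are hypothetical; this is
    not DMTCP's actual hook API."""

    def __init__(self, host, port):
        self.host = host
        self.port = port
        self.sock = None

    def connect(self):
        # Open a fresh TCP connection to the server.
        self.sock = socket.create_connection((self.host, self.port), timeout=5)

    def is_connected(self):
        return self.sock is not None

    def before_checkpoint(self):
        # Close the socket so that the checkpoint image holds no
        # connection to a server living outside the computation.
        if self.sock is not None:
            self.sock.close()
            self.sock = None

    def after_resume_or_restart(self):
        # Intended to run both when the original process resumes and
        # when a restarted copy comes up: open a new connection.
        if self.sock is None:
            self.connect()
```

With this shape, the server never sees a half-restored socket: from
its point of view the client simply disconnected and later connected
again, which an IRC server already knows how to handle.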

In the other situation, maybe you want to checkpoint IRC_server/IRC_client,
kill both, and then restart both.  That should work normally with DMTCP,
and if it doesn't, then this is a bug in DMTCP.
    In case we are in this scenario, here is my understanding:
1.  I assume we can just use two ordinary hosts.  There is nothing
    here that seems to depend on using a VM.
2.  We should download Unreal IRC from:
      http://www.unrealircd.com/downloads.php  (version 3.2.9 for Linux)
3.  HOST-1:  dmtcp_coordinator
             dmtcp_checkpoint IRC_server  [I'll fix the syntax later.]
    HOST-2:  dmtcp_checkpoint --host HOST-1 IRC_client  [ fix syntax later ]
4.  Kill IRC_server/IRC_client
5.  Restart both (either with the old dmtcp_coordinator, or with a new one).

So, which is your scenario (two paragraphs up, or one paragraph up)?
Once you specify, we'll take the next step from there.

Thanks,
- Gene

On Tue, Mar 13, 2012 at 08:25:18PM +0000, Harezga, Nick wrote:
> There is an old coordinator running, but that is sort of the idea of our 
> research. The clients continue to run while the server will be restarted.
> 
> I have tried re-starting the IRC server with both the --join and --no-check 
> options, but those don't work either. 
> 
> There are no issues as long as the server is restarted and the clients aren't 
> running at the time.
> 
> The last item that is output as part of the debug statements when trying to 
> restart the server is below.
> [2266] TRACE at connection.cpp:585 in restore; REASON='registerOutgoing'
>       id() = 407f2176-43000-4f5f9a5e(99008)
>       _acceptRemoteId = 72006bd7-47000-4f5f9ab0(99002)
>       fds[0] = 501
> [2266] TRACE at connection.cpp:478 in restore; REASON='Creating dead socket.'
>       fds[0] = 502
>       fds.size = 1
> 
> If you would like to try to reproduce this bug, we have the following 
> situation.
> On one VM:
> - Running Unreal IRC server
> - Running dmtcp_coordinator
> 
> Another VM (x2):
> - Running BitchX terminal IRC client to connect to Unreal IRC server
> - Set DMTCP_HOST and DMTCP_PORT environment variables to point to other VM.
> 
> After running for a while, a checkpoint is initiated and the IRC server is 
> then stopped. We then try to restart the IRC server from a checkpoint
> and then we get the error from my last e-mail. If we then kill the IRC 
> clients, we are able to start the IRC server followed by the IRC clients.
> 
> 
> -----Original Message-----
> From: Gene Cooperman [mailto:[email protected]] 
> Sent: Tuesday, March 13, 2012 1:06 PM
> To: Harezga, Nick
> Cc: [email protected]
> Subject: Re: [Dmtcp-forum] Error when trying to restart server
> 
> Hi Nick,
>     Have you looked to see if there is an old coordinator still running?
> If there is, try killing it, and then again trying the restart.
> We recently improved the error message about "funny state" to make
> this possibility clearer.
>     And as always, if you have a DMTCP bug that you can reproduce, we'd
> be very eager to get a copy of the code (or a small test case) that
> demonstrates the bug.
> 
> Best,
> - Gene
> 
> On Mon, Mar 12, 2012 at 07:52:38PM +0000, Harezga, Nick wrote:
> > Hi all,
> > 
> > For the purposes of getting a demonstration running, we have decided to 
> > attach both the IRC server and IRC client to the dmtcp_coordinator. We are 
> > able to successfully checkpoint all programs, with the client and server 
> > running on different virtual machines. We can restart everything when 
> > starting the IRC server first, but if we kill the IRC server while the 
> > clients are running, the server gives us the following error when 
> > attempting to restart.
> > 
> > Message: Coordinator in a funny state.  Peers exist, not restarting,
> > but not in a running state.  Checkpointing?
> > Or maybe restarting and running with peers existing?
> > 
> > It then advised to use the utils/dmtcp_backtrace.py utility to dump the 
> > error output, which I have attached below.
> > 
> > Examing stack for call frames from:
> >   /usr/local/bin/dmtcp_restart
> > FORMAT:  FNC: ..., followed by file:line_number (most recent first).
> > 
> > ** FNC: writeBacktrace
> > dmtcp-1.2.4/dmtcp/src/../jalib/jassert.cpp:193
> > ** FNC: jassert_internal::JAssert::jbacktrace()
> > dmtcp-1.2.4/dmtcp/src/../jalib/jassert.cpp:228
> > ** FNC: ~JAssert
> > dmtcp-1.2.4/dmtcp/src/../jalib/jassert.cpp:116
> > ** FNC: dmtcp::DmtcpCoordinatorAPI::recvCoordinatorHandshake(int*)
> > dmtcp-1.2.4/dmtcp/src/dmtcpcoordinatorapi.cpp:224
> > ** FNC: restoreSockets
> > dmtcp-1.2.4/dmtcp/src/dmtcp_restart.cpp:704
> > ** FNC: main
> > dmtcp-1.2.4/dmtcp/src/dmtcp_restart.cpp:925
> > /lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x469113]
> > ** FNC: _start
> > ??:0
> > 
> > Any ideas? Is there a reason that we wouldn't be able to restart the server 
> > while the clients are still running?
> > 
> > Thanks,
> > Nick
> 
> > _______________________________________________
> > Dmtcp-forum mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
> 
