Hi Viren,
    I was going through some older dmtcp-forum emails, and I realized that
we may not have answered your e-mail.  Hopefully you figured it out already,
but the short story is:
  Choose one machine as the coordinator host, and run the coordinator
there only.  Each DMTCP computation corresponds to a _single_ DMTCP coordinator.
Initially, that computation has no processes.  (If you run
dmtcp_coordinator interactively, try 'h' for help, and you'll see that
you can list all the processes.)
    The most common case is then to start your distributed process
on a single machine (e.g., running an MPI job):
  dmtcp_checkpoint mpirun -np 4 mpi_hello_world
By default, dmtcp_checkpoint looks for a coordinator at localhost
and port 7779.  Using the flags --host and --port, you can change that.

    It's also possible to separately start up processes that will
communicate together.  If they specify the same coordinator, they
will be part of the same computation unit.
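
    As a rough sketch of the point above, here is a small shell helper
that prints the dmtcp_checkpoint command line for a non-default
coordinator.  The host name node01, the programs ./server and ./client,
and the helper name under_dmtcp are made up for illustration; only the
--host and --port flags come from the description above.

```shell
#!/bin/sh
# Hypothetical coordinator location; substitute your own host and port.
COORD_HOST=node01
COORD_PORT=7779

# Print (rather than run) a dmtcp_checkpoint command aimed at that coordinator.
# Drop the 'echo' to launch the program for real.
under_dmtcp() {
    echo dmtcp_checkpoint --host "$COORD_HOST" --port "$COORD_PORT" "$@"
}

# Both programs name the same coordinator, so they join one computation.
under_dmtcp ./server
under_dmtcp ./client
```

Because both commands carry the same --host and --port, the coordinator
on node01 would treat the two programs as one computation.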

    Finally, the checkpoint interval is remembered by the coordinator.
You can specify the interval directly to the coordinator, or you can
pass --interval as part of the dmtcp_checkpoint command, in which case
dmtcp_checkpoint will instruct the coordinator to change its checkpoint
interval.

    As for limiting users, so they don't set the checkpoint interval
to be too small, I can think of two options:
1.  PREFERRED:
      Modify the DMTCP source code inside dmtcp/src/dmtcp_coordinator.cpp
      so that it enforces a minimum interval.
2.  OR: Write a wrapper around dmtcp_coordinator and dmtcp_checkpoint
      that rejects intervals that are too small.
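
    For option 2, a minimal sketch of such a wrapper might look like
the following.  It is only an illustration, not part of DMTCP: the
600-second minimum and the names MIN_INTERVAL, interval_ok, and
check_args are all hypothetical, and it assumes that an interval of 0
means "no automatic checkpointing" and that users pass the interval
as "--interval N".

```shell
#!/bin/sh
# Sketch of a wrapper check that refuses checkpoint intervals below a
# site-chosen minimum.  MIN_INTERVAL and the function names are hypothetical.
MIN_INTERVAL=600   # seconds

# Succeed if $1 is 0 (assumed: automatic checkpointing off) or an
# integer that is at least MIN_INTERVAL.
interval_ok() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric values
    esac
    [ "$1" -eq 0 ] || [ "$1" -ge "$MIN_INTERVAL" ]
}

# Scan the arguments for "--interval N" and fail if N is too small.
check_args() {
    prev=""
    for arg in "$@"; do
        if [ "$prev" = "--interval" ] && ! interval_ok "$arg"; then
            echo "interval must be 0 or >= $MIN_INTERVAL seconds" >&2
            return 1
        fi
        prev="$arg"
    done
    return 0
}

# A real wrapper would end with something like:
#   check_args "$@" && exec /path/to/real/dmtcp_checkpoint "$@"
```

This only inspects the "--interval N" spelling; any other way of setting
the interval (another flag spelling, an environment variable, or the
coordinator's own interface) would need its own check, which is one
reason modifying the coordinator source is the preferred option.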

Remember that each user will want to have their own coordinator,
since a DMTCP computation is defined as the processes connected
to a particular coordinator.  A checkpoint is initiated when the
coordinator tells all its connected processes to create a checkpoint.
Each client of the coordinator writes a file, ckpt_*.dmtcp, on its
local machine.

The coordinator then writes "dmtcp_restart_script.sh", which will
restart all processes using the ckpt_*.dmtcp files on the various hosts.

Best wishes,
- Gene (for the DMTCP team)

On Tue, Apr 03, 2012 at 12:05:56PM -0400, Viren Patel wrote:
> I have installed DMTCP 1.2.4 on our cluster and was wondering if
> there are any recommendations on configuring it for a cluster
> environment? Specifically I am not clear whether the
> dmtcp_coordinator process should be started on each node or only on the
> head node? Also if a user starts dmtcp_coordinator with automatic
> checkpointing (checkpointing interval > 0), would this policy also
> apply to another user? Finally how to limit user checkpointing
> intervals (i.e. not allow a user to set a too small interval)?
> Thanks.
> 
> Viren

> _______________________________________________
> Dmtcp-forum mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum

