On Thu, Mar 06, 2014 at 07:47:22PM -0800, Ralph Castain wrote:
> >>>>> Sorry for delay - yes, that looks like the right direction. I would
> >>>>> suggest doing it via the current state machine, though, by simply
> >>>>> defining another job or proc state in orte/mca/plm/plm_types.h, and
> >>>>> then registering a callback function using the
> >>>>> orte_state.add_job[proc]_state(state, function to be called,
> >>>>> ORTE_ERR_PRI). Then you can activate it by calling
> >>>>> ORTE_ACTIVATE_JOB[PROC]_STATE(NULL, state) and it will be handled in
> >>>>> the proper order.
> >>>>
> >>>> What is a job/proc in the Open MPI context?
> >>>
> >>> A "job" is the entire application, while a "proc" is just one process in
> >>> that application. In this case you could use either one as you are
> >>> checkpointing the entire job, but all this activity is occurring inside
> >>> each proc. So I'd suggest defining it as a proc state since it only
> >>> really involves local actions.
> >>>
> >>> If you like, I can define the required code in the trunk and let you fill
> >>> in the event functionality.
> >>
> >> That would be great.
> >
> > Thanks for your changes. When using --with-ft there are a few compiler
> > errors which I tried to fix with the following patch:
> >
> > https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=71521789ef9d248a7eef53030d2ec5de900faa4c
>
> That looks okay, with the only caveat being that you wouldn't ordinarily pass
> the state_caddy_t into a function. It's just there to pass along the job etc
> in case the callback function needs to reference something. In this case, I
> can't think of anything the FT event function would need to know - you just
> want it to quiet all messaging.

I need to pass the type of state to the ft_event() functions:

    enum opal_crs_state_type_t {
        OPAL_CRS_NONE = 0,
        OPAL_CRS_CHECKPOINT = 1,
        OPAL_CRS_RESTART_PRE = 2,
        OPAL_CRS_RESTART = 3, /* RESTART_POST */

so an int is all I need, and I probably need to encode it into *cbdata.
Should I store the int directly in *cbdata, or should it be part of a
struct?
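
To make the question concrete, the two options would look roughly like the
sketch below. It is only an illustration under stated assumptions: the enum
is a local copy of the values quoted above, and ft_event(), cb_direct(),
cb_struct() and ft_caddy_t are hypothetical names, not existing Open MPI
API. The real callback signature may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Local copy of the values quoted above; the real enum has more entries. */
enum {
    OPAL_CRS_NONE        = 0,
    OPAL_CRS_CHECKPOINT  = 1,
    OPAL_CRS_RESTART_PRE = 2,
    OPAL_CRS_RESTART     = 3  /* RESTART_POST */
};

/* Hypothetical stand-in for the real ft_event(): records the last state. */
static int last_state = -1;
static void ft_event(int state) { last_state = state; }

/* Option 1: carry the int in the pointer value itself.
 * Nothing is allocated, so nothing has to be freed. */
static void *state_to_cbdata(int state)
{
    return (void *)(intptr_t)state;
}

static void cb_direct(void *cbdata)
{
    ft_event((int)(intptr_t)cbdata);
}

/* Option 2: wrap the int in a heap-allocated struct (hypothetical type).
 * The callback takes ownership and frees it after use. */
typedef struct {
    int state;
} ft_caddy_t;

static void *caddy_new(int state)
{
    ft_caddy_t *c = malloc(sizeof(*c));
    assert(c != NULL);
    c->state = state;
    return c;
}

static void cb_struct(void *cbdata)
{
    ft_caddy_t *c = cbdata;
    ft_event(c->state);
    free(c);
}
```

Calling cb_direct(state_to_cbdata(OPAL_CRS_CHECKPOINT)) or
cb_struct(caddy_new(OPAL_CRS_CHECKPOINT)) both end up in ft_event() with the
same value. The first avoids any allocation; the second leaves room for
additional fields later, at the cost of an ownership rule for the struct.
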
Adrian