On Thu, Mar 06, 2014 at 07:47:22PM -0800, Ralph Castain wrote:
> >>>>> Sorry for delay - yes, that looks like the right direction. I would
> >>>>> suggest doing it via the current state machine, though, by simply
> >>>>> defining another job or proc state in orte/mca/plm/plm_types.h, and
> >>>>> then registering a callback function using the
> >>>>> orte_state.add_job[proc]_state(state, function to be called,
> >>>>> ORTE_ERR_PRI). Then you can activate it by calling
> >>>>> ORTE_ACTIVATE_JOB[PROC]_STATE(NULL, state) and it will be handled in
> >>>>> the proper order.
> >>>>
> >>>> What is a job/proc in the Open MPI context.
> >>>
> >>> A "job" is the entire application, while a "proc" is just one process in
> >>> that application. In this case you could use either one as you are
> >>> checkpointing the entire job, but all this activity is occurring inside
> >>> each proc. So I'd suggest defining it as a proc state since it only
> >>> really involves local actions.
> >>>
> >>> If you like, I can define the required code in the trunk and let you fill
> >>> in the event functionality.
> >>
> >> That would be great.
> >
> > Thanks for your changes. When using --with-ft there are a few compiler
> > errors which I tried to fix with following patch:
> >
> > https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=71521789ef9d248a7eef53030d2ec5de900faa4c
>
> That looks okay, with the only caveat being that you wouldn't ordinarily pass
> the state_caddy_t into a function. It's just there to pass along the job etc
> in case the callback function needs to reference something. In this case, I
> can't think of anything the FT event function would need to know - you just
> want it to quiet all messaging.
I need to pass the type of state to the ft_event() functions:

    enum opal_crs_state_type_t {
        OPAL_CRS_NONE = 0,
        OPAL_CRS_CHECKPOINT = 1,
        OPAL_CRS_RESTART_PRE = 2,
        OPAL_CRS_RESTART = 3, /* RESTART_POST */

so an int is all I need. So I probably need to encode it into *cbdata.
Do I just use an int directly in *cbdata or should it be part of a struct?

		Adrian
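
For reference, the two options in the question can be sketched in plain C, independent of the Open MPI tree. Everything named here is hypothetical and only for illustration: `demo_crs_state` stands in for the four enum values quoted above (the real enum is not reproduced in full), `demo_ft_caddy`, `encode_state()`, `alloc_caddy()` and the two callbacks are made up, and the `(int fd, short args, void *cbdata)` signature is assumed to match the event-style callbacks discussed in this thread. This is a sketch of the general `void *` idiom, not the way the trunk code does it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for the four opal_crs_state_type_t values quoted in the mail.
 * The names are hypothetical; the real enum has further members. */
enum demo_crs_state {
    DEMO_CRS_NONE        = 0,
    DEMO_CRS_CHECKPOINT  = 1,
    DEMO_CRS_RESTART_PRE = 2,
    DEMO_CRS_RESTART     = 3
};

/* Records what the callbacks received, so the behaviour can be checked. */
int last_state_seen = -1;

/* Option 1: carry the int inside the pointer value itself.
 * No allocation and no ownership question, but the conversion through
 * intptr_t is implementation-defined and only fits small integers. */
void *encode_state(int state)
{
    return (void *)(intptr_t)state;
}

void ft_cb_direct(int fd, short args, void *cbdata)
{
    (void)fd;
    (void)args;
    last_state_seen = (int)(intptr_t)cbdata;
}

/* Option 2: wrap the int in a heap-allocated struct.
 * Costs a malloc/free pair, but further fields can be added later
 * without changing how the callback is registered. */
struct demo_ft_caddy {
    int state;
};

void *alloc_caddy(int state)
{
    struct demo_ft_caddy *caddy = malloc(sizeof(*caddy));
    if (caddy != NULL) {
        caddy->state = state;
    }
    return caddy;
}

void ft_cb_struct(int fd, short args, void *cbdata)
{
    struct demo_ft_caddy *caddy = cbdata;
    (void)fd;
    (void)args;
    assert(caddy != NULL);
    last_state_seen = caddy->state;
    free(caddy); /* in this sketch the callback owns the caddy */
}
```

Calling `ft_cb_direct(0, 0, encode_state(DEMO_CRS_CHECKPOINT))` or `ft_cb_struct(0, 0, alloc_caddy(DEMO_CRS_CHECKPOINT))` delivers the same value to the callback. The direct cast is the lighter choice when a single int is all that is ever needed; the struct is easier to extend.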