Status update on C/R with Open MPI:
With the last two patches applied, I now see communication
between orte-checkpoint and orterun:
orte-checkpoint 23975:
[dcbz:23986] orte_checkpoint: Checkpointing...
[dcbz:23986] PID 23975
[dcbz:23986] Connected to Mpirun [[45520,0],0]
[dcbz:23986] orte_checkpoint: notify_hnp: Contact Head Node Process PID 23975
[dcbz:23986] [[45509,0],0] rml_send_buffer to peer [[45520,0],0] at tag 13
[dcbz:23986] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[dcbz:23986] [[45509,0],0] posting recv
[dcbz:23986] [[45509,0],0] posting persistent recv on tag 9 for peer [[WILDCARD],WILDCARD]
[dcbz:23986] [[45509,0],0] posting recv
[dcbz:23986] [[45509,0],0] posting persistent recv on tag 13 for peer [[WILDCARD],WILDCARD]
[dcbz:23986] [[45509,0],0] rml_send_msg to peer [[45520,0],0] at tag 13
[dcbz:23986] [[45509,0],0]-[[45520,0],0] Send message complete at ../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:220
[dcbz:23986] [[45509,0],0] Message posted at ../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:519
[dcbz:23986] [[45509,0],0] message received 39 bytes from [[45520,0],0] for tag 13
[dcbz:23986] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:23986] orte_checkpoint: hnp_receiver: Status Update.
--
Error: The application (PID = 23975) failed to checkpoint properly.
Returned -1.
--
orterun:
[dcbz:23975] [[45520,0],0] Message posted at ../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:519
[dcbz:23975] [[45520,0],0] message received 50 bytes from [[45509,0],0] for tag 13
[dcbz:23975] Global) Command Line: Start a checkpoint operation [Sender = [[45509,0],0]]
[dcbz:23975] Global) Command line requested a checkpoint [command 1]
[dcbz:23975] Global-Local) base:ckpt_init_cmd: Receiving commands
[dcbz:23975] Global-Local) base:ckpt_init_cmd: Received [0, 0, [INVALID]]
[dcbz:23975] Global) request_cmd(): Checkpointing currently disabled, rejecting request
[dcbz:23975] 23975: Failed to checkpoint process [45520,0].
[dcbz:23975] Global-Local) base:ckpt_update_cmd: Sending update command
[dcbz:23975] Global-Local) base:ckpt_update_cmd: Sending update command +
[dcbz:23975] [[45520,0],0] rml_send_buffer to peer [[45509,0],0] at tag 13
[dcbz:23975] Global) Startup Command Line Channel
[dcbz:23975] [[45520,0],0] rml_recv_buffer_nb for peer [[WILDCARD],WILDCARD] tag 13
[dcbz:23975] [[45520,0],0] rml_send_msg to peer [[45509,0],0] at tag 13
[dcbz:23975] [[45520,0],0] posting recv
[dcbz:23975] [[45520,0],0] posting non-persistent recv on tag 13 for peer [[WILDCARD],WILDCARD]
[dcbz:23975] [[45520,0],0]-[[45509,0],0] Send message complete at ../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:220
It's still not working: orterun rejects the request with
"Checkpointing currently disabled". But at least both processes are
talking to each other, which is good.
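
To illustrate the buffer ownership rule from Ralph's reply quoted
below (the payload is not copied, so the buffer must stay alive until
the send completes), here is a toy sketch in plain C. It does not use
the real ORTE API: every toy_* name is made up for this example, and
the "network" is just a static array. The only point is that the
completion callback, not the caller, releases the buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for illustration only, NOT the ORTE/OPAL API. */
typedef struct toy_buffer {
    char  *data;
    size_t len;   /* payload length including the trailing NUL */
} toy_buffer_t;

typedef void (*toy_send_cb_t)(int status, toy_buffer_t *buf, void *cbdata);

/* One pending send. The buffer pointer is borrowed: nothing is copied. */
static struct {
    toy_buffer_t *buf;
    toy_send_cb_t cb;
    void         *cbdata;
    int           active;
} pending;

char toy_wire[64];          /* pretend network: receives the bytes */
int  toy_buffers_released;  /* counts calls to toy_buffer_release() */

toy_buffer_t *toy_buffer_new(const char *payload)
{
    toy_buffer_t *buf = malloc(sizeof(*buf));
    if (NULL == buf) {
        return NULL;
    }
    buf->len  = strlen(payload) + 1;
    buf->data = malloc(buf->len);
    if (NULL == buf->data) {
        free(buf);
        return NULL;
    }
    memcpy(buf->data, payload, buf->len);
    return buf;
}

void toy_buffer_release(toy_buffer_t *buf)
{
    free(buf->data);
    free(buf);
    toy_buffers_released++;
}

/* Queue a non-blocking send. Returns 0 on success, -1 if a send is
 * already pending. The caller must NOT release buf after this call. */
int toy_send_nb(toy_buffer_t *buf, toy_send_cb_t cb, void *cbdata)
{
    assert(NULL != buf);
    if (pending.active) {
        return -1;
    }
    pending.buf    = buf;
    pending.cb     = cb;
    pending.cbdata = cbdata;
    pending.active = 1;
    return 0;
}

/* Drive the pending send to completion: only now is the payload read. */
void toy_progress(void)
{
    if (!pending.active) {
        return;
    }
    toy_buffer_t *buf = pending.buf;
    size_t n = buf->len < sizeof(toy_wire) ? buf->len : sizeof(toy_wire) - 1;
    memcpy(toy_wire, buf->data, n);
    toy_wire[n] = '\0';
    pending.active = 0;
    if (NULL != pending.cb) {
        pending.cb(0, buf, pending.cbdata);
    }
}

/* Completion callback: the send is done, so the buffer can go away. */
void toy_send_callback(int status, toy_buffer_t *buf, void *cbdata)
{
    (void)status;
    (void)cbdata;
    toy_buffer_release(buf);
}
```

A caller would create a buffer with toy_buffer_new(), hand it to
toy_send_nb() together with toy_send_callback, and then leave it
alone. Releasing it right after toy_send_nb() returns (as the removed
cleanup code did) would free memory that toy_progress() still has to
read.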
Adrian
On Thu, Jan 23, 2014 at 11:27:42AM -0600, Josh Hursey wrote:
> +1
>
>
> On Thu, Jan 23, 2014 at 10:16 AM, Ralph Castain wrote:
>
> > Looks correct to me - you are right in that you cannot release the buffer
> > until after the send completes. We don't copy the data underneath to save
> > memory and time.
> >
> >
> > On Jan 23, 2014, at 6:51 AM, Adrian Reber wrote:
> >
> > > Following patch makes orte-checkpoint communicate with orterun again:
> > >
> > > diff --git a/orte/tools/orte-checkpoint/orte-checkpoint.c
> > b/orte/tools/orte-checkpoint/orte-checkpoint.c
> > > index 7106342..8539f34 100644
> > > --- a/orte/tools/orte-checkpoint/orte-checkpoint.c
> > > +++ b/orte/tools/orte-checkpoint/orte-checkpoint.c
> > > @@ -834,7 +834,7 @@ static int
> > notify_process_for_checkpoint(opal_crs_base_ckpt_options_t *options)
> > > }
> > >
> > > if (ORTE_SUCCESS != (ret =
> > orte_rml.send_buffer_nb(&(orterun_hnp->name), buffer,
> > > -
> > ORTE_RML_TAG_CKPT, hnp_receiver,
> > > +
> > ORTE_RML_TAG_CKPT, orte_rml_send_callback,
> > >NULL))) {
> > > exit_status = ret;
> > > goto cleanup;
> > > @@ -845,11 +845,6 @@ static int
> > notify_process_for_checkpoint(opal_crs_base_ckpt_options_t *options)
> > > ORTE_JOBID_PRINT(jobid));
> > >
> > > cleanup:
> > > -if( NULL != buffer) {
> > > -OBJ_RELEASE(buffer);
> > > -buffer = NULL;
> > > -}
> > > -
> > > if( ORTE_SUCCESS != exit_status ) {
> > > opal_show_help("help-orte-checkpoint.txt", "unable_to_connect",
> > true,
> > >orte_checkpoint_globals.pid);
> > >
> > >
> > > Before committing the code into the repository I wanted to make
> > > sure it is the correct way to fix it.
> > >
> > > The first change changes the callback to orte_rml_send_callback().
> > > When I initially made the code compile again I used hnp_receiver()
> > > to