Re: [OMPI devel] [PATCH] make orte-checkpoint communicate with orterun again

2014-01-24 Thread Jeff Squyres (jsquyres)
On Jan 24, 2014, at 12:35 PM, Adrian Reber wrote:

> It's still not working but at least both processes are
> talking to each other which is good.


w00t!

-- 
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI devel] [PATCH] make orte-checkpoint communicate with orterun again

2014-01-24 Thread Adrian Reber
Status update of C/R with Open MPI:

With the last two patches applied I am now seeing communication
between orte-checkpoint and orterun:

orte-checkpoint 23975:

[dcbz:23986] orte_checkpoint: Checkpointing...
[dcbz:23986] PID 23975
[dcbz:23986] Connected to Mpirun [[45520,0],0]
[dcbz:23986] orte_checkpoint: notify_hnp: Contact Head Node Process PID 23975
[dcbz:23986] [[45509,0],0] rml_send_buffer to peer [[45520,0],0] at tag 13
[dcbz:23986] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[dcbz:23986] [[45509,0],0] posting recv
[dcbz:23986] [[45509,0],0] posting persistent recv on tag 9 for peer [[WILDCARD],WILDCARD]
[dcbz:23986] [[45509,0],0] posting recv
[dcbz:23986] [[45509,0],0] posting persistent recv on tag 13 for peer [[WILDCARD],WILDCARD]
[dcbz:23986] [[45509,0],0] rml_send_msg to peer [[45520,0],0] at tag 13
[dcbz:23986] [[45509,0],0]-[[45520,0],0] Send message complete at ../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:220
[dcbz:23986] [[45509,0],0] Message posted at ../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:519
[dcbz:23986] [[45509,0],0] message received 39 bytes from [[45520,0],0] for tag 13
[dcbz:23986] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:23986] orte_checkpoint: hnp_receiver: Status Update.
--
Error: The application (PID = 23975) failed to checkpoint properly.
   Returned -1.
--

orterun:

[dcbz:23975] [[45520,0],0] Message posted at ../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:519
[dcbz:23975] [[45520,0],0] message received 50 bytes from [[45509,0],0] for tag 13
[dcbz:23975] Global) Command Line: Start a checkpoint operation [Sender = [[45509,0],0]]
[dcbz:23975] Global) Command line requested a checkpoint [command 1]
[dcbz:23975] Global-Local) base:ckpt_init_cmd: Receiving commands
[dcbz:23975] Global-Local) base:ckpt_init_cmd: Received [0, 0, [INVALID]]
[dcbz:23975] Global) request_cmd(): Checkpointing currently disabled, rejecting request
[dcbz:23975] 23975: Failed to checkpoint process [45520,0].
[dcbz:23975] Global-Local) base:ckpt_update_cmd: Sending update command
[dcbz:23975] Global-Local) base:ckpt_update_cmd: Sending update command  +
[dcbz:23975] [[45520,0],0] rml_send_buffer to peer [[45509,0],0] at tag 13
[dcbz:23975] Global) Startup Command Line Channel
[dcbz:23975] [[45520,0],0] rml_recv_buffer_nb for peer [[WILDCARD],WILDCARD] tag 13
[dcbz:23975] [[45520,0],0] rml_send_msg to peer [[45509,0],0] at tag 13
[dcbz:23975] [[45520,0],0] posting recv
[dcbz:23975] [[45520,0],0] posting non-persistent recv on tag 13 for peer [[WILDCARD],WILDCARD]
[dcbz:23975] [[45520,0],0]-[[45509,0],0] Send message complete at ../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:220

It's still not working but at least both processes are
talking to each other which is good.

Adrian


On Thu, Jan 23, 2014 at 11:27:42AM -0600, Josh Hursey wrote:
> +1
> 
> 
> On Thu, Jan 23, 2014 at 10:16 AM, Ralph Castain wrote:
> 
> > Looks correct to me - you are right in that you cannot release the buffer
> > until after the send completes. We don't copy the data underneath to save
> > memory and time.

Re: [OMPI devel] [PATCH] make orte-checkpoint communicate with orterun again

2014-01-23 Thread Ralph Castain
Looks correct to me - you are right in that you cannot release the buffer until 
after the send completes. We don't copy the data underneath to save memory and 
time.
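
To illustrate the lifetime rule described above, here is a minimal, self-contained sketch. It does not use the ORTE API: toy_buffer_t, toy_send_nb(), toy_progress() and release_on_complete() are hypothetical names invented for this example, and the single-slot "transport" only assumes what the message states, namely that the payload is not copied when the send is queued. Under that assumption the buffer has to stay alive until the completion callback runs, so the callback is the place that releases it.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a reference-counted message buffer. */
typedef struct {
    int  refcount;
    char payload[32];
} toy_buffer_t;

/* Completion callback: invoked once the transport is done with buf. */
typedef void (*toy_send_cb_t)(int status, toy_buffer_t *buf, void *cbdata);

/* One queued send. The transport keeps the pointer, not a copy. */
static struct {
    toy_buffer_t *buf;
    toy_send_cb_t cb;
    void         *cbdata;
    int           active;
} pending;

static char wire[32];        /* what "arrived" at the peer */
static int  live_buffers;    /* number of buffers not yet freed */

static toy_buffer_t *toy_buffer_new(const char *text)
{
    toy_buffer_t *b = calloc(1, sizeof(*b));
    if (NULL == b) {
        abort();
    }
    b->refcount = 1;
    strncpy(b->payload, text, sizeof(b->payload) - 1);
    live_buffers++;
    return b;
}

static void toy_buffer_release(toy_buffer_t *b)
{
    assert(b->refcount > 0);
    if (0 == --b->refcount) {
        free(b);
        live_buffers--;
    }
}

/* Non-blocking send: only queues the request and returns immediately. */
static int toy_send_nb(toy_buffer_t *buf, toy_send_cb_t cb, void *cbdata)
{
    if (pending.active) {
        return -1;
    }
    pending.buf = buf;
    pending.cb = cb;
    pending.cbdata = cbdata;
    pending.active = 1;
    return 0;
}

/* Progress step: the payload is read here, after toy_send_nb() returned. */
static void toy_progress(void)
{
    if (!pending.active) {
        return;
    }
    memcpy(wire, pending.buf->payload, sizeof(wire));
    pending.active = 0;
    pending.cb(0, pending.buf, pending.cbdata);
}

/* Completion callback that takes over responsibility for the buffer. */
static void release_on_complete(int status, toy_buffer_t *buf, void *cbdata)
{
    (void)status;
    (void)cbdata;
    toy_buffer_release(buf);
}

/* Returns the number of leaked buffers (0 when everything was released). */
int toy_demo(void)
{
    toy_buffer_t *buf = toy_buffer_new("checkpoint request");

    if (0 != toy_send_nb(buf, release_on_complete, NULL)) {
        toy_buffer_release(buf);   /* send was never queued */
        return -1;
    }
    /* Releasing buf at this point would free memory that toy_progress()
     * still has to read, because the transport holds only a pointer. */
    printf("queued:    live buffers = %d\n", live_buffers);
    toy_progress();
    printf("completed: live buffers = %d, wire = \"%s\"\n", live_buffers, wire);
    return live_buffers;
}
```

Calling toy_demo() from a main() shows the buffer still alive right after the send is queued and released only once the progress step has consumed it.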


On Jan 23, 2014, at 6:51 AM, Adrian Reber wrote:

> Following patch makes orte-checkpoint communicate with orterun again:
> 
> diff --git a/orte/tools/orte-checkpoint/orte-checkpoint.c b/orte/tools/orte-checkpoint/orte-checkpoint.c
> index 7106342..8539f34 100644
> --- a/orte/tools/orte-checkpoint/orte-checkpoint.c
> +++ b/orte/tools/orte-checkpoint/orte-checkpoint.c
> @@ -834,7 +834,7 @@ static int notify_process_for_checkpoint(opal_crs_base_ckpt_options_t *options)
>      }
> 
>      if (ORTE_SUCCESS != (ret = orte_rml.send_buffer_nb(&(orterun_hnp->name), buffer,
> -                                                       ORTE_RML_TAG_CKPT, hnp_receiver,
> +                                                       ORTE_RML_TAG_CKPT, orte_rml_send_callback,
>                                                         NULL))) {
>          exit_status = ret;
>          goto cleanup;
> @@ -845,11 +845,6 @@ static int notify_process_for_checkpoint(opal_crs_base_ckpt_options_t *options)
>                           ORTE_JOBID_PRINT(jobid));
> 
>   cleanup:
> -    if( NULL != buffer) {
> -        OBJ_RELEASE(buffer);
> -        buffer = NULL;
> -    }
> -
>      if( ORTE_SUCCESS != exit_status ) {
>          opal_show_help("help-orte-checkpoint.txt", "unable_to_connect", true,
>                         orte_checkpoint_globals.pid);
> 
> 
> Before committing the code into the repository I wanted to make
> sure it is the correct way to fix it.
> 
> The first change changes the callback to orte_rml_send_callback().
> When I initially made the code compile again I used hnp_receiver()
> to change the code from blocking to non-blocking and that was
> wrong.
> 
> The second change (removal of OBJ_RELEASE(buffer)) is necessary
> because this seems to delete buffer during communication and then
> everything breaks badly.
> 
>   Adrian



[OMPI devel] [PATCH] make orte-checkpoint communicate with orterun again

2014-01-23 Thread Adrian Reber
The following patch makes orte-checkpoint communicate with orterun again:

diff --git a/orte/tools/orte-checkpoint/orte-checkpoint.c b/orte/tools/orte-checkpoint/orte-checkpoint.c
index 7106342..8539f34 100644
--- a/orte/tools/orte-checkpoint/orte-checkpoint.c
+++ b/orte/tools/orte-checkpoint/orte-checkpoint.c
@@ -834,7 +834,7 @@ static int notify_process_for_checkpoint(opal_crs_base_ckpt_options_t *options)
     }

     if (ORTE_SUCCESS != (ret = orte_rml.send_buffer_nb(&(orterun_hnp->name), buffer,
-                                                       ORTE_RML_TAG_CKPT, hnp_receiver,
+                                                       ORTE_RML_TAG_CKPT, orte_rml_send_callback,
                                                        NULL))) {
         exit_status = ret;
         goto cleanup;
@@ -845,11 +845,6 @@ static int notify_process_for_checkpoint(opal_crs_base_ckpt_options_t *options)
                          ORTE_JOBID_PRINT(jobid));

  cleanup:
-    if( NULL != buffer) {
-        OBJ_RELEASE(buffer);
-        buffer = NULL;
-    }
-
     if( ORTE_SUCCESS != exit_status ) {
         opal_show_help("help-orte-checkpoint.txt", "unable_to_connect", true,
                        orte_checkpoint_globals.pid);


Before committing the code into the repository I wanted to make
sure it is the correct way to fix it.

The first change switches the send callback to orte_rml_send_callback().
When I initially made the code compile again, I passed hnp_receiver()
as the callback while converting the send from blocking to non-blocking,
and that was wrong.
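
The thread does not spell out how passing hnp_receiver() misbehaved, so the following is only a sketch of the general hazard, not a reconstruction of the actual failure. It assumes a toy messaging layer in which send-completion callbacks and receive handlers share one function signature, so the compiler accepts either in either position. All names (toy_msg_cb_t, toy_exchange(), on_send_complete(), on_reply(), toy_compare_wirings()) are hypothetical and not part of ORTE.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical: one signature serves both callback roles. */
typedef void (*toy_msg_cb_t)(int status, const char *msg, void *cbdata);

typedef struct {
    int  sends_completed;
    int  replies_handled;
    char last_reply[32];
} toy_stats_t;

/* Send-completion role: the message is our own outgoing request. */
static void on_send_complete(int status, const char *msg, void *cbdata)
{
    toy_stats_t *stats = cbdata;
    (void)msg;
    if (0 == status) {
        stats->sends_completed++;
    }
}

/* Receive role: the message is a reply coming from the peer. */
static void on_reply(int status, const char *msg, void *cbdata)
{
    toy_stats_t *stats = cbdata;
    if (0 != status) {
        return;
    }
    stats->replies_handled++;
    strncpy(stats->last_reply, msg, sizeof(stats->last_reply) - 1);
    stats->last_reply[sizeof(stats->last_reply) - 1] = '\0';
}

/* Toy exchange: report send completion, then deliver the peer's reply. */
static void toy_exchange(const char *request, toy_msg_cb_t send_cb,
                         toy_msg_cb_t recv_cb, void *cbdata)
{
    send_cb(0, request, cbdata);        /* called with the outgoing message */
    recv_cb(0, "status: ok", cbdata);   /* called with the incoming message */
}

/* Runs the exchange once with each wiring and prints the resulting counts. */
void toy_compare_wirings(toy_stats_t *good, toy_stats_t *bad)
{
    memset(good, 0, sizeof(*good));
    memset(bad, 0, sizeof(*bad));

    /* Intended wiring: each callback in its own role. */
    toy_exchange("ckpt request", on_send_complete, on_reply, good);

    /* Mix-up: the receive handler is also used as the send callback,
     * so the outgoing request is processed as if it were a reply. */
    toy_exchange("ckpt request", on_reply, on_reply, bad);

    printf("good: sends=%d replies=%d\n",
           good->sends_completed, good->replies_handled);
    printf("bad:  sends=%d replies=%d\n",
           bad->sends_completed, bad->replies_handled);
}
```

In the mixed-up wiring the program still compiles and runs, but the receive handler fires once for the outgoing request and once for the real reply, which is the kind of error a dedicated send-completion callback avoids.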

The second change (removal of OBJ_RELEASE(buffer)) is necessary
because releasing the buffer at that point seems to delete it while
the send is still in progress, and then everything breaks badly.

Adrian