[OMPI devel] Another Open MPI <-> PSM question (MPI_Isend()/MPI_Cancel())

2015-01-15 Thread Adrian Reber
Calling

MPI_Isend()

followed by

MPI_Cancel()

fails on my PSM-based system with Open MPI 1.8.4 like this:

n040108:0.1.Cannot cancel send requests (req=0x2b6279787f80)
n040108:0.0.Cannot cancel send requests (req=0x2b3a3dc92f80)
---
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---
--
mpirun detected that one or more processes exited with non-zero status,
thus causing
the job to be terminated. The first process to do so was:

  Process name: [[58364,1],1]
  Exit code:255
--

Is this something PSM actually cannot do or an Open MPI error?

Adrian


Re: [OMPI devel] Another Open MPI <-> PSM question (MPI_Isend()/MPI_Cancel())

2015-01-15 Thread George Bosilca
From the MPI standard's perspective, MPI_Cancel doesn't have to succeed; it
can also fail gracefully. However, the PSM MTL diverges from the MPI
standard: if a request cannot be canceled, an error is returned. Here is
a patch to fix this issue.

diff --git a/ompi/mca/mtl/psm/mtl_psm_cancel.c b/ompi/mca/mtl/psm/mtl_psm_cancel
index 6da3386..277c761 100644
--- a/ompi/mca/mtl/psm/mtl_psm_cancel.c
+++ b/ompi/mca/mtl/psm/mtl_psm_cancel.c
@@ -37,10 +37,8 @@ int ompi_mtl_psm_cancel(struct mca_mtl_base_module_t* mtl,
     if(PSM_OK == err) {
       mtl_request->ompi_req->req_status._cancelled = true;
       mtl_psm_request->super.completion_callback(&mtl_psm_request->super);
-      return OMPI_SUCCESS;
-    } else {
-      return OMPI_ERROR;
     }
+    return OMPI_SUCCESS;
   } else if(PSM_MQ_INCOMPLETE == err) {
     return OMPI_SUCCESS;
   }

  George.
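
The control flow that results from this patch can be sketched in isolation. This is a minimal, self-contained model and not the Open MPI source: the MOCK_* enums and patched_cancel_result() are hypothetical stand-ins for psm_error_t, the OMPI_* return codes, and ompi_mtl_psm_cancel(). The two error arguments stand for the values checked by the outer and inner "if" in the hunk; which PSM calls produce them is not visible in the diff, and the final error return is an assumption about code outside the hunk.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the PSM and Open MPI status codes. */
typedef enum { MOCK_PSM_OK, MOCK_PSM_MQ_INCOMPLETE, MOCK_PSM_OTHER } mock_psm_err_t;
typedef enum { MOCK_OMPI_SUCCESS = 0, MOCK_OMPI_ERROR = -1 } mock_ompi_rc_t;

/*
 * Models the patched hunk.
 *   outer_err: value tested by the enclosing if / else-if.
 *   inner_err: value tested by the nested if.
 *   cancelled: set to true only when both checks see OK, mirroring the
 *              req_status._cancelled assignment in the diff.
 */
mock_ompi_rc_t patched_cancel_result(mock_psm_err_t outer_err,
                                     mock_psm_err_t inner_err,
                                     bool *cancelled)
{
    *cancelled = false;
    if (MOCK_PSM_OK == outer_err) {
        if (MOCK_PSM_OK == inner_err) {
            *cancelled = true;  /* the request really was cancelled */
        }
        /* After the patch, a cancel that did not take effect is still
         * reported as success, only without the cancelled flag. */
        return MOCK_OMPI_SUCCESS;
    } else if (MOCK_PSM_MQ_INCOMPLETE == outer_err) {
        return MOCK_OMPI_SUCCESS;
    }
    /* Assumption: the code following the hunk is not shown in the diff;
     * an error return is used here purely for illustration. */
    return MOCK_OMPI_ERROR;
}
```

Under this model, a call such as patched_cancel_result(MOCK_PSM_OK, MOCK_PSM_OTHER, &flag) returns success with flag left false, which is the "graceful failure" the MPI standard permits for MPI_Cancel.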




Re: [OMPI devel] Another Open MPI <-> PSM question (MPI_Isend()/MPI_Cancel())

2015-01-15 Thread Howard Pritchard
Thanks, George!




Re: [OMPI devel] Another Open MPI <-> PSM question (MPI_Isend()/MPI_Cancel())

2015-01-15 Thread Adrian Reber
As PSM on master is still broken, I applied the patch to 1.8.4. Unfortunately,
it does not work. The error is the same as before.

Looking at your patch, I would also expect it to be the correct fix,
and I even tried changing ompi_mtl_psm_cancel() to always return
OMPI_SUCCESS. MPI_Cancel() still fails.

Looking at the PSM code, it seems it can call exit(-1) directly,
terminating the process without ever returning to Open MPI. I do not see
any debug output from Open MPI after "Cannot cancel send requests" from PSM.

Adrian



Re: [OMPI devel] Another Open MPI <-> PSM question (MPI_Isend()/MPI_Cancel())

2015-01-15 Thread Adrian Reber
It even says so in the code:

ompi/mca/mtl/psm/mtl_psm.c:

    /* Default error handling is enabled, errors will not be returned to
     * user.  PSM prints the error and the offending endpoint's hostname
     * and exits with -1 */

Disabling the default PSM error handler makes MPI_Cancel() fail
gracefully. But then no PSM error is handled by that handler anymore.

Adrian


-- 
Adrian Reber http://lisas.de/~adrian/
C-3PO: 
Don't call me a mindless philosopher, you overweight
glob of grease!
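
To illustrate the behaviour described in this message, here is a small self-contained model of a library whose default error handler prints and exits, so that the caller never sees a return value. It is only a sketch: every mock_* name is hypothetical and none of this is the PSM API. It shows why changing the return value of ompi_mtl_psm_cancel() could not help while such a handler is active, and why the error becomes visible to the caller once no handler is registered.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical error codes for the model library. */
typedef enum { MOCK_OK = 0, MOCK_CANNOT_CANCEL = 1 } mock_err_t;

/* A handler sees the error before the caller does and may never return. */
typedef void (*mock_errhandler_t)(mock_err_t err, const char *msg);

/* Models a "default" handler: print the error and exit with -1. */
void mock_default_handler(mock_err_t err, const char *msg)
{
    (void)err;
    fprintf(stderr, "%s\n", msg);
    exit(-1);
}

/* The default handler is active until the caller replaces it. */
static mock_errhandler_t registered_handler = mock_default_handler;

/* Passing NULL means "no handler": errors are returned to the caller. */
void mock_register_handler(mock_errhandler_t handler)
{
    registered_handler = handler;
}

/* Models a library call that is unable to cancel a send request. */
mock_err_t mock_cancel_send(void)
{
    if (registered_handler != NULL) {
        /* With the default handler, control never comes back from here. */
        registered_handler(MOCK_CANNOT_CANCEL, "Cannot cancel send requests");
    }
    /* Reached only when no exiting handler is registered. */
    return MOCK_CANNOT_CANCEL;
}
```

After mock_register_handler(NULL), mock_cancel_send() returns MOCK_CANNOT_CANCEL and the caller can decide what to do with it. With the default handler left in place, the same call prints the message and ends the process; on POSIX systems exit(-1) is reported as exit status 255, which matches the exit code in the mpirun output at the start of the thread.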


Re: [OMPI devel] Another Open MPI <-> PSM question (MPI_Isend()/MPI_Cancel())

2015-01-15 Thread George Bosilca
Skimming through the PSM code shows that the return values of the PSM
functions are handled in most cases. Thus, removing the default error
handler might not be such a bad idea.

Did you experience any trouble running with the version without the default
error handler registered?

  George.




Re: [OMPI devel] Another Open MPI <-> PSM question (MPI_Isend()/MPI_Cancel())

2015-01-16 Thread Adrian Reber
See my comment on https://github.com/open-mpi/ompi/issues/347
