It even says so in the code: ompi/mca/mtl/psm/mtl_psm.c:
/* Default error handling is enabled, errors will not be returned to
 * user. PSM prints the error and the offending endpoint's hostname
 * and exits with -1 */

Disabling the default PSM error handler makes MPI_Cancel() fail
gracefully. But then no error is handled anymore.

Adrian

On Thu, Jan 15, 2015 at 10:21:05PM +0100, Adrian Reber wrote:
> As PSM on master is still broken I applied it on 1.8.4. Unfortunately it
> does not work. The error is the same as before.
>
> Looking at your patch I would also expect that this is the correct fix
> and I even tried to change ompi_mtl_psm_cancel() to always return
> OMPI_SUCCESS. MPI_Cancel() still fails.
>
> Looking at the PSM code it seems it can directly call exit(-1) and thus
> terminating and never returning to Open MPI. I do not see any debug
> output from Open MPI after "Cannot cancel send requests" from PSM.
>
> Adrian
>
> On Thu, Jan 15, 2015 at 01:43:11PM -0500, George Bosilca wrote:
> > From the MPI standard perspective MPI_Cancel doesn't have to succeed, it
> > can also gracefully fail. However, the PSM MTL diverges from the MPI
> > standard and if a request cannot be canceled an error is returned. Here is
> > a patch to fix this issue.
> >
> > diff --git a/ompi/mca/mtl/psm/mtl_psm_cancel.c
> > b/ompi/mca/mtl/psm/mtl_psm_cancel
> > index 6da3386..277c761 100644
> > --- a/ompi/mca/mtl/psm/mtl_psm_cancel.c
> > +++ b/ompi/mca/mtl/psm/mtl_psm_cancel.c
> > @@ -37,10 +37,8 @@ int ompi_mtl_psm_cancel(struct mca_mtl_base_module_t*
> > mtl,
> >      if(PSM_OK == err) {
> >        mtl_request->ompi_req->req_status._cancelled = true;
> >        mtl_psm_request->super.completion_callback(&mtl_psm_request->super);
> > -      return OMPI_SUCCESS;
> > -    } else {
> > -      return OMPI_ERROR;
> >      }
> > +    return OMPI_SUCCESS;
> >    } else if(PSM_MQ_INCOMPLETE == err) {
> >      return OMPI_SUCCESS;
> >    }
> >
> > George.
> >
> >
> > On Thu, Jan 15, 2015 at 1:30 PM, Adrian Reber <adr...@lisas.de> wrote:
> >
> > > Doing
> > >
> > > MPI_Isend()
> > >
> > > followed by a
> > >
> > > MPI_Cancel()
> > >
> > > fails on my PSM based system with 1.8.4 like this:
> > >
> > > n040108:0.1.Cannot cancel send requests (req=0x2b6279787f80)
> > > n040108:0.0.Cannot cancel send requests (req=0x2b3a3dc92f80)
> > > -------------------------------------------------------
> > > Primary job terminated normally, but 1 process returned
> > > a non-zero exit code.. Per user-direction, the job has been aborted.
> > > -------------------------------------------------------
> > > --------------------------------------------------------------------------
> > > mpirun detected that one or more processes exited with non-zero status,
> > > thus causing
> > > the job to be terminated. The first process to do so was:
> > >
> > >   Process name: [[58364,1],1]
> > >   Exit code:    255
> > > --------------------------------------------------------------------------
> > >
> > > Is this something PSM actually cannot do or an Open MPI error?
> > >
> > > Adrian
> > > _______________________________________________
> > > devel mailing list
> > > de...@open-mpi.org
> > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > Link to this post:
> > > http://www.open-mpi.org/community/lists/devel/2015/01/16783.php
> >
> >
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2015/01/16784.php
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/01/16786.php

Adrian

-- 
Adrian Reber <adr...@lisas.de>            http://lisas.de/~adrian/
C-3PO: Don't call me a mindless philosopher, you overweight glob of grease!
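
For reference, the return-code logic that George's patch aims for can be sketched in isolation. This is only an illustration, not the Open MPI source: the sk_* names, the enum values, and the idea of passing the two PSM results in as arguments are hypothetical stand-ins, and the final fall-through to an error is an assumption about code outside the quoted hunk. As reported above, this logic alone did not help on 1.8.4, because PSM appears to exit before control returns to Open MPI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the PSM and Open MPI return codes named in
 * the thread; the real values come from the PSM and Open MPI headers. */
typedef enum { SK_PSM_OK, SK_PSM_MQ_INCOMPLETE, SK_PSM_OTHER } sk_psm_err_t;
typedef enum { SK_OMPI_SUCCESS = 0, SK_OMPI_ERROR = -1 } sk_ompi_rc_t;

/* Hypothetical request: only the "cancelled" flag matters for this sketch. */
typedef struct {
    bool cancelled;
} sk_request_t;

/* Mirrors the control flow of the patched hunk. cancel_err simulates the
 * result of the cancel attempt, test_err the result of the follow-up check
 * (the two-step structure is an assumption about the surrounding code). */
sk_ompi_rc_t sk_cancel(sk_psm_err_t cancel_err, sk_psm_err_t test_err,
                       sk_request_t *req)
{
    assert(req != NULL);

    if (SK_PSM_OK == cancel_err) {
        if (SK_PSM_OK == test_err) {
            /* The request really was cancelled: record it. */
            req->cancelled = true;
        }
        /* Patched behaviour: not being able to cancel is not an error. */
        return SK_OMPI_SUCCESS;
    } else if (SK_PSM_MQ_INCOMPLETE == cancel_err) {
        /* Request still in flight; the cancel fails gracefully. */
        return SK_OMPI_SUCCESS;
    }
    /* Assumed fall-through for any other failure (not shown in the diff). */
    return SK_OMPI_ERROR;
}
```

With this shape the caller learns whether the cancel took effect from the request's cancelled flag (what MPI_Test_cancelled() reports in MPI terms) rather than from the return code, which matches George's reading of the standard.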