Re: [OMPI devel] Another Open MPI <-> PSM question (MPI_Isend()/MPI_Cancel())
See my comment on https://github.com/open-mpi/ompi/issues/347

On Thu, Jan 15, 2015 at 05:01:00PM -0500, George Bosilca wrote:
> Skimming through the PSM code shows that the return values of the PSM
> functions are handled in most cases. Thus, removing the default error
> handler might not be such a bad idea.
>
> Did you experience any trouble running with the version without the default
> error handler registered?
>
>   George.
Re: [OMPI devel] Another Open MPI <-> PSM question (MPI_Isend()/MPI_Cancel())
It even says so in the code:

ompi/mca/mtl/psm/mtl_psm.c:

    /* Default error handling is enabled, errors will not be returned to
     * user.  PSM prints the error and the offending endpoint's hostname
     * and exits with -1 */

Disabling the default PSM error handler makes MPI_Cancel() fail
gracefully. But then no error is handled anymore.

Adrian

On Thu, Jan 15, 2015 at 10:21:05PM +0100, Adrian Reber wrote:
> As PSM on master is still broken I applied it on 1.8.4. Unfortunately it
> does not work. The error is the same as before.
>
> Looking at your patch I would also expect that this is the correct fix
> and I even tried to change ompi_mtl_psm_cancel() to always return
> OMPI_SUCCESS. MPI_Cancel() still fails.
>
> Looking at the PSM code it seems it can directly call exit(-1), thus
> terminating and never returning to Open MPI. I do not see any debug
> output from Open MPI after "Cannot cancel send requests" from PSM.
>
> Adrian

Adrian

-- 
Adrian Reber http://lisas.de/~adrian/
C-3PO: Don't call me a mindless philosopher, you overweight glob of grease!
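The trade-off described in this message can be modelled in a few lines of C. This is only a sketch under stated assumptions: every name below (`sk_exit_handler`, `sk_return_handler`, `sk_cancel_send`, `SK_ERR_CANNOT_CANCEL`) is a hypothetical stand-in and not part of the PSM or Open MPI API. It shows why a handler that exits leaves the caller nothing to handle, while a handler that returns pushes every check onto the caller.

```c
#include <stdio.h>
#include <stdlib.h>

/* All names here are hypothetical stand-ins, not the PSM API. */
#define SK_ERR_CANNOT_CANCEL 1

typedef int (*sk_errhandler_fn)(int code, const char *msg);

/* Models the default behaviour from the quoted comment: print, then exit.
 * It never returns, so the caller cannot react to the error. */
int sk_exit_handler(int code, const char *msg)
{
    fprintf(stderr, "error %d: %s\n", code, msg);
    exit(-1);  /* on POSIX the parent sees this as exit status 255 */
}

/* Models "no default handler": do nothing and hand the code back. */
int sk_return_handler(int code, const char *msg)
{
    (void)msg;
    return code;
}

/* A library call that fails and routes the failure through the handler. */
int sk_cancel_send(sk_errhandler_fn handler)
{
    return handler(SK_ERR_CANNOT_CANCEL, "Cannot cancel send requests");
}

/* Caller-side check, only reachable when the handler returns. */
int sk_cancel_failed_gracefully(sk_errhandler_fn handler)
{
    int err = sk_cancel_send(handler);
    return err != 0;  /* 1: the failure was reported as a return value */
}
```

With `sk_return_handler` the caller gets the error code back and has to check it at every call site. With `sk_exit_handler` the process ends inside the library call, which would be consistent with the exit code 255 in the original report.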
Re: [OMPI devel] Another Open MPI <-> PSM question (MPI_Isend()/MPI_Cancel())
As PSM on master is still broken I applied it on 1.8.4. Unfortunately it
does not work. The error is the same as before.

Looking at your patch I would also expect that this is the correct fix
and I even tried to change ompi_mtl_psm_cancel() to always return
OMPI_SUCCESS. MPI_Cancel() still fails.

Looking at the PSM code it seems it can directly call exit(-1), thus
terminating and never returning to Open MPI. I do not see any debug
output from Open MPI after "Cannot cancel send requests" from PSM.

Adrian

On Thu, Jan 15, 2015 at 01:43:11PM -0500, George Bosilca wrote:
> From the MPI standard perspective MPI_Cancel doesn't have to succeed, it
> can also gracefully fail. However, the PSM MTL diverges from the MPI
> standard and if a request cannot be canceled an error is returned. Here is
> a patch to fix this issue.
>
> diff --git a/ompi/mca/mtl/psm/mtl_psm_cancel.c b/ompi/mca/mtl/psm/mtl_psm_cancel
> index 6da3386..277c761 100644
> --- a/ompi/mca/mtl/psm/mtl_psm_cancel.c
> +++ b/ompi/mca/mtl/psm/mtl_psm_cancel.c
> @@ -37,10 +37,8 @@ int ompi_mtl_psm_cancel(struct mca_mtl_base_module_t* mtl,
>      if(PSM_OK == err) {
>        mtl_request->ompi_req->req_status._cancelled = true;
>        mtl_psm_request->super.completion_callback(&mtl_psm_request->super);
> -      return OMPI_SUCCESS;
> -    } else {
> -      return OMPI_ERROR;
>      }
> +    return OMPI_SUCCESS;
>    } else if(PSM_MQ_INCOMPLETE == err) {
>      return OMPI_SUCCESS;
>    }
>
>   George.
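The return-code mapping that the quoted patch aims for can be sketched as a small stand-alone function. This is a model under stated assumptions, not the Open MPI source: the `sk_` names and constants are hypothetical, the outer condition (not visible in the hunk) is assumed to test the result of the cancel attempt itself, and the final fall-through to an error is also an assumption.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; these are not the real PSM or Open MPI constants. */
enum sk_psm_err { SK_PSM_OK, SK_PSM_MQ_INCOMPLETE, SK_PSM_OTHER };
enum sk_ompi_rc { SK_OMPI_SUCCESS = 0, SK_OMPI_ERROR = -1 };

struct sk_result {
    enum sk_ompi_rc rc;   /* value handed back to the caller */
    bool cancelled;       /* whether the request was marked cancelled */
};

/*
 * cancel_err: assumed result of the cancel attempt (outer check).
 * test_err:   result of the follow-up check (inner check in the hunk).
 */
struct sk_result sk_cancel_patched(enum sk_psm_err cancel_err,
                                   enum sk_psm_err test_err)
{
    struct sk_result r = { SK_OMPI_ERROR, false };

    if (SK_PSM_OK == cancel_err) {
        if (SK_PSM_OK == test_err) {
            r.cancelled = true;      /* only a confirmed cancel sets the flag */
        }
        r.rc = SK_OMPI_SUCCESS;      /* patched: success either way */
    } else if (SK_PSM_MQ_INCOMPLETE == cancel_err) {
        r.rc = SK_OMPI_SUCCESS;      /* request still in flight, not an error */
    }
    return r;                        /* anything else: SK_OMPI_ERROR (assumed) */
}
```

The point of the patch is that a cancel which does not take effect is reported as success with the cancelled flag left unset, rather than as an error. As the reply above notes, this mapping only matters if control actually returns from the PSM call.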
[OMPI devel] Another Open MPI <-> PSM question (MPI_Isend()/MPI_Cancel())
Doing

    MPI_Isend()

followed by a

    MPI_Cancel()

fails on my PSM based system with 1.8.4 like this:

n040108:0.1.Cannot cancel send requests (req=0x2b6279787f80)
n040108:0.0.Cannot cancel send requests (req=0x2b3a3dc92f80)
---
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---
--
mpirun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[58364,1],1]
  Exit code:    255
--

Is this something PSM actually cannot do or an Open MPI error?

Adrian