I do not see any resetting of sequence numbers. It has been a long time since I have looked at the matching code, so I don't know whether the out-of-order handling has been taken out. If not, the sequence number has to be dealt with in some manner; otherwise there will be a gap in the arriving sequence numbers, and the matching logic will prevent any further progress.

Rich
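[To make the stall Rich describes concrete, here is a minimal sketch of in-order matching. The types, fields, and function names below are invented for illustration; they are not the actual ob1 structures. Each peer admits only the fragment carrying the next expected sequence number and parks early arrivals:]

    /* Hypothetical sketch of in-order matching, NOT the real ob1 code:
     * all names here are invented for illustration. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct frag {
        uint16_t     seq;    /* per-peer sequence number on the fragment */
        struct frag *next;
    } frag_t;

    typedef struct peer {
        uint16_t next_expected;  /* only this sequence number may match now */
        frag_t  *cant_match;     /* fragments that arrived ahead of a gap */
    } peer_t;

    static void deliver(frag_t *f) { printf("matched seq %u\n", (unsigned)f->seq); }

    /* Admit fragments strictly in sequence order; park early arrivals. */
    static void try_match(peer_t *p, frag_t *f)
    {
        if (f->seq != p->next_expected) {
            f->next = p->cant_match;   /* out of order: park it */
            p->cant_match = f;
            return;
        }
        deliver(f);
        p->next_expected++;

        /* Drain any parked fragments that are now in order. */
        for (frag_t **cur = &p->cant_match; *cur; ) {
            if ((*cur)->seq == p->next_expected) {
                frag_t *hit = *cur;
                *cur = hit->next;
                deliver(hit);
                p->next_expected++;
                cur = &p->cant_match;  /* restart the scan */
            } else {
                cur = &(*cur)->next;
            }
        }
    }

    int main(void)
    {
        peer_t p  = { 0, NULL };
        frag_t f0 = { 0, NULL }, f2 = { 2, NULL };
        try_match(&p, &f0);  /* in order: matched, next_expected becomes 1 */
        try_match(&p, &f2);  /* seq 1 is missing: parked, progress stops */
        return 0;
    }

[If sequence 1 is consumed without ever being delivered, for instance by a receive that was cancelled after it had matched, next_expected stays at 1, f2 is never drained, and every later fragment from that peer parks behind it forever. That is the gap Rich is asking about.]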
-----Original Message-----
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of George Bosilca
Sent: Thursday, July 26, 2012 10:06 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] [patch] MPI_Cancel should not cancel a request if it has a matched recv frag

Takahiro,

Indeed, we were way too lax in canceling requests. I modified your patch to deal correctly with the MEMCHECK macro (removing the call from the branch that will require a completion function). The modified patch is attached below. I will commit asap.

Thanks,
  george.

Index: ompi/mca/pml/ob1/pml_ob1_recvreq.c
===================================================================
--- ompi/mca/pml/ob1/pml_ob1_recvreq.c  (revision 26870)
+++ ompi/mca/pml/ob1/pml_ob1_recvreq.c  (working copy)
@@ -3,7 +3,7 @@
  * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
  *                         University Research and Technology
  *                         Corporation. All rights reserved.
- * Copyright (c) 2004-2009 The University of Tennessee and The University
+ * Copyright (c) 2004-2012 The University of Tennessee and The University
  *                         of Tennessee Research Foundation. All rights
  *                         reserved.
  * Copyright (c) 2004-2008 High Performance Computing Center Stuttgart,
@@ -15,6 +15,7 @@
  * Copyright (c) 2012      NVIDIA Corporation. All rights reserved.
  * Copyright (c) 2011-2012 Los Alamos National Security, LLC. All rights
  *                         reserved.
+ * Copyright (c) 2012      FUJITSU LIMITED. All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -97,36 +98,26 @@
     mca_pml_ob1_recv_request_t* request = (mca_pml_ob1_recv_request_t*)ompi_request;
     mca_pml_ob1_comm_t* comm = request->req_recv.req_base.req_comm->c_pml_comm;

-    if( true == ompi_request->req_complete ) { /* way to late to cancel this one */
-        /*
-         * Receive request completed, make user buffer accessable.
-         */
-        MEMCHECKER(
-            memchecker_call(&opal_memchecker_base_mem_defined,
-                            request->req_recv.req_base.req_addr,
-                            request->req_recv.req_base.req_count,
-                            request->req_recv.req_base.req_datatype);
-        );
+    if( true == request->req_match_received ) { /* way to late to cancel this one */
+        assert( OMPI_ANY_TAG != ompi_request->req_status.MPI_TAG );  /* not matched isn't it */
         return OMPI_SUCCESS;
     }

     /* The rest should be protected behind the match logic lock */
     OPAL_THREAD_LOCK(&comm->matching_lock);
-    if( OMPI_ANY_TAG == ompi_request->req_status.MPI_TAG ) { /* the match has not been already done */
-        if( request->req_recv.req_base.req_peer == OMPI_ANY_SOURCE ) {
-            opal_list_remove_item( &comm->wild_receives, (opal_list_item_t*)request );
-        } else {
-            mca_pml_ob1_comm_proc_t* proc = comm->procs + request->req_recv.req_base.req_peer;
-            opal_list_remove_item(&proc->specific_receives, (opal_list_item_t*)request);
-        }
-        PERUSE_TRACE_COMM_EVENT( PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q,
-                                 &(request->req_recv.req_base), PERUSE_RECV );
-        /**
-         * As now the PML is done with this request we have to force the pml_complete
-         * to true. Otherwise, the request will never be freed.
-         */
-        request->req_recv.req_base.req_pml_complete = true;
+    if( request->req_recv.req_base.req_peer == OMPI_ANY_SOURCE ) {
+        opal_list_remove_item( &comm->wild_receives, (opal_list_item_t*)request );
+    } else {
+        mca_pml_ob1_comm_proc_t* proc = comm->procs + request->req_recv.req_base.req_peer;
+        opal_list_remove_item(&proc->specific_receives, (opal_list_item_t*)request);
     }
+    PERUSE_TRACE_COMM_EVENT( PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q,
+                             &(request->req_recv.req_base), PERUSE_RECV );
+    /**
+     * As now the PML is done with this request we have to force the pml_complete
+     * to true. Otherwise, the request will never be freed.
+     */
+    request->req_recv.req_base.req_pml_complete = true;
     OPAL_THREAD_UNLOCK(&comm->matching_lock);

     OPAL_THREAD_LOCK(&ompi_request_lock);
@@ -138,7 +129,7 @@
     MCA_PML_OB1_RECV_REQUEST_MPI_COMPLETE(request);
     OPAL_THREAD_UNLOCK(&ompi_request_lock);
     /*
-     * Receive request cancelled, make user buffer accessable.
+     * Receive request cancelled, make user buffer accessible.
      */
     MEMCHECKER(
         memchecker_call(&opal_memchecker_base_mem_defined,

On Jul 26, 2012, at 13:41 , Kawashima, Takahiro wrote:

> Hi Open MPI developers,
>
> I found a small bug in Open MPI.
>
> See the attached program cancelled.c. In this program, rank 1 tries to
> cancel an MPI_Irecv and calls MPI_Recv instead if the cancellation
> succeeds. The program should terminate whether the cancellation
> succeeds or not, but it leads to a deadlock in MPI_Recv after printing
> "MPI_Test_cancelled: 1". I confirmed that it works fine with MPICH2.
>
> The problem is in the mca_pml_ob1_recv_request_cancel function in
> ompi/mca/pml/ob1/pml_ob1_recvreq.c. It accepts the cancellation unless
> the request has completed. I think it should not accept the
> cancellation once the request has been matched. If it wants to accept
> the cancellation anyway, it must push the recv frag back onto the
> unexpected message queue and redo the matching. Furthermore, the
> receive buffer must be reverted if the received message has been
> partially written to it by a pipeline protocol.
>
> The attached patch cancel-recv.patch is a sample fix for this bug
> against the Open MPI trunk. Though the patch has 65 lines, the main
> modifications are adding one if-statement and deleting another; the
> remaining lines are just indentation changes.
> I cannot confirm that the MEMCHECKER part is correct. Could anyone
> review it before committing?
>
> Regards,
>
> Takahiro Kawashima,
> MPI development team,
> Fujitsu
>
> <cancelled.c><cancel-recv.patch><License.txt>

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
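[For readers without the attachments: below is a minimal reproducer in the spirit of the cancelled.c that Takahiro describes. The actual attachment is not included in the thread, so the tag, count, buffer, and exact control flow here are assumptions; only the Irecv/Cancel/Test_cancelled/Recv sequence and the deadlock symptom come from his report. Run with two processes, e.g. mpirun -np 2 ./cancelled.]

    /* Hypothetical reconstruction of cancelled.c; see the note above. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, buf = 0, cancelled = 0;
        MPI_Request req;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (0 == rank) {
            MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (1 == rank) {
            MPI_Irecv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
            MPI_Cancel(&req);
            MPI_Wait(&req, &status);
            MPI_Test_cancelled(&status, &cancelled);
            printf("MPI_Test_cancelled: %d\n", cancelled);
            if (cancelled) {
                /* The message from rank 0 is still outstanding, so the
                 * receive must be reposted.  With the pre-patch ob1, a
                 * request that had already matched the incoming frag
                 * could still report a successful cancel; the frag was
                 * then lost and this MPI_Recv deadlocked. */
                MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            }
        }

        MPI_Finalize();
        return 0;
    }

[Whether the cancel races ahead of or behind the match is timing-dependent, so the hang may take several runs to reproduce.]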