George,

Thanks for the review and commit! I've confirmed your modification.
Takahiro Kawashima,
MPI development team, Fujitsu

> Takahiro,
> 
> Indeed we were way too lax on canceling the requests. I modified your patch
> to correctly deal with the MEMCHECKER macro (remove the call from the branch
> that requires a completion function). The modified patch is attached below.
> I will commit asap.
> 
> Thanks,
>   george.
> 
> 
> Index: ompi/mca/pml/ob1/pml_ob1_recvreq.c
> ===================================================================
> --- ompi/mca/pml/ob1/pml_ob1_recvreq.c  (revision 26870)
> +++ ompi/mca/pml/ob1/pml_ob1_recvreq.c  (working copy)
> @@ -3,7 +3,7 @@
>   * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
>   *                         University Research and Technology
>   *                         Corporation.  All rights reserved.
> - * Copyright (c) 2004-2009 The University of Tennessee and The University
> + * Copyright (c) 2004-2012 The University of Tennessee and The University
>   *                         of Tennessee Research Foundation.  All rights
>   *                         reserved.
>   * Copyright (c) 2004-2008 High Performance Computing Center Stuttgart,
> @@ -15,6 +15,7 @@
>   * Copyright (c) 2012      NVIDIA Corporation.  All rights reserved.
>   * Copyright (c) 2011-2012 Los Alamos National Security, LLC. All rights
>   *                         reserved.
> + * Copyright (c) 2012      FUJITSU LIMITED.  All rights reserved.
>   * $COPYRIGHT$
>   *
>   * Additional copyrights may follow
> @@ -97,36 +98,26 @@
>      mca_pml_ob1_recv_request_t* request = (mca_pml_ob1_recv_request_t*)ompi_request;
>      mca_pml_ob1_comm_t* comm = request->req_recv.req_base.req_comm->c_pml_comm;
>  
> -    if( true == ompi_request->req_complete ) { /* way to late to cancel this one */
> -        /*
> -         * Receive request completed, make user buffer accessable.
> -         */
> -        MEMCHECKER(
> -            memchecker_call(&opal_memchecker_base_mem_defined,
> -                            request->req_recv.req_base.req_addr,
> -                            request->req_recv.req_base.req_count,
> -                            request->req_recv.req_base.req_datatype);
> -        );
> +    if( true == request->req_match_received ) { /* way to late to cancel this one */
> +        assert( OMPI_ANY_TAG != ompi_request->req_status.MPI_TAG ); /* not matched isn't it */
>          return OMPI_SUCCESS;
>      }
>  
>      /* The rest should be protected behind the match logic lock */
>      OPAL_THREAD_LOCK(&comm->matching_lock);
> -    if( OMPI_ANY_TAG == ompi_request->req_status.MPI_TAG ) { /* the match has not been already done */
> -        if( request->req_recv.req_base.req_peer == OMPI_ANY_SOURCE ) {
> -            opal_list_remove_item( &comm->wild_receives, (opal_list_item_t*)request );
> -        } else {
> -            mca_pml_ob1_comm_proc_t* proc = comm->procs + request->req_recv.req_base.req_peer;
> -            opal_list_remove_item(&proc->specific_receives, (opal_list_item_t*)request);
> -        }
> -        PERUSE_TRACE_COMM_EVENT( PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q,
> -                                 &(request->req_recv.req_base), PERUSE_RECV );
> -        /**
> -         * As now the PML is done with this request we have to force the pml_complete
> -         * to true. Otherwise, the request will never be freed.
> -         */
> -        request->req_recv.req_base.req_pml_complete = true;
> +    if( request->req_recv.req_base.req_peer == OMPI_ANY_SOURCE ) {
> +        opal_list_remove_item( &comm->wild_receives, (opal_list_item_t*)request );
> +    } else {
> +        mca_pml_ob1_comm_proc_t* proc = comm->procs + request->req_recv.req_base.req_peer;
> +        opal_list_remove_item(&proc->specific_receives, (opal_list_item_t*)request);
>      }
> +    PERUSE_TRACE_COMM_EVENT( PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q,
> +                             &(request->req_recv.req_base), PERUSE_RECV );
> +    /**
> +     * As now the PML is done with this request we have to force the pml_complete
> +     * to true. Otherwise, the request will never be freed.
> +     */
> +    request->req_recv.req_base.req_pml_complete = true;
>      OPAL_THREAD_UNLOCK(&comm->matching_lock);
>  
>      OPAL_THREAD_LOCK(&ompi_request_lock);
> @@ -138,7 +129,7 @@
>      MCA_PML_OB1_RECV_REQUEST_MPI_COMPLETE(request);
>      OPAL_THREAD_UNLOCK(&ompi_request_lock);
>      /*
> -     * Receive request cancelled, make user buffer accessable.
> +     * Receive request cancelled, make user buffer accessible.
>      */
>      MEMCHECKER(
>          memchecker_call(&opal_memchecker_base_mem_defined,
> 
> On Jul 26, 2012, at 13:41 , Kawashima, Takahiro wrote:
> 
> > Hi Open MPI developers,
> > 
> > I found a small bug in Open MPI.
> > 
> > See the attached program cancelled.c. In this program, rank 1 tries to
> > cancel an MPI_Irecv and calls MPI_Recv instead if the cancellation
> > succeeds. This program should terminate whether the cancellation succeeds
> > or not, but it deadlocks in MPI_Recv after printing
> > "MPI_Test_cancelled: 1". I confirmed that it works fine with MPICH2.
> > 
> > The problem is in the mca_pml_ob1_recv_request_cancel function in
> > ompi/mca/pml/ob1/pml_ob1_recvreq.c. It accepts the cancellation unless
> > the request has been completed. I think it should not accept the
> > cancellation if the request has already been matched. If it wants to
> > accept the cancellation anyway, it must push the recv frag back onto the
> > unexpected message queue and redo the matching. Furthermore, the receive
> > buffer must be reverted if the received message has been partially
> > written to it by a pipeline protocol.
> > 
> > The attached patch cancel-recv.patch is a sample fix for this bug against
> > the Open MPI trunk. Though the patch is 65 lines long, the main
> > modifications are adding one if-statement and deleting another; the other
> > lines are just indentation changes. I cannot confirm that the MEMCHECKER
> > part is correct. Could anyone review it before committing?
> > 
> > Regards,
> > 
> > Takahiro Kawashima,
> > MPI development team, Fujitsu
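The cancelled.c attachment itself is not reproduced inline. A minimal
reproducer along the lines described above might look like the following
sketch; the message tag, count, and exact output format are assumptions,
not the contents of the original attachment:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, buf = 0, cancelled = 0;
        MPI_Request req;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (0 == rank) {
            /* Rank 0 always sends exactly one message. */
            MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (1 == rank) {
            /* Post a receive, then immediately try to cancel it. Depending
             * on timing, the message may or may not have been matched by
             * the time MPI_Cancel runs. */
            MPI_Irecv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
            MPI_Cancel(&req);
            MPI_Wait(&req, &status);
            MPI_Test_cancelled(&status, &cancelled);
            printf("MPI_Test_cancelled: %d\n", cancelled);
            if (cancelled) {
                /* A successful cancel means the message was not consumed,
                 * so this matching receive must complete. */
                MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            }
        }

        MPI_Finalize();
        return 0;
    }

Built and launched with two processes (e.g. mpicc cancelled.c -o cancelled
&& mpirun -np 2 ./cancelled), a trunk without the patch could hang in the
final MPI_Recv after printing "MPI_Test_cancelled: 1", whereas a patched
build should terminate whether or not the cancellation succeeds.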