Rich,

There is no matching in this case. Canceling a receive operation is possible 
only up to the moment the request has been matched. Up to this point the 
sequence numbers of the peers are not used, so removing a non-matched request 
has no impact on the sequence number.

  george.

On Jul 26, 2012, at 16:31 , Richard Graham wrote:

> I do not see any resetting of sequence numbers.  It has been a long time 
> since I have looked at the matching code, so don't know if the out-of-order 
> handling has been taken out.  If not, the sequence number has to be dealt 
> with in some manner, or else there will be a gap in the arriving sequence 
> numbers, and the matching logic will prevent any further progress.
> 
> Rich
> 
> -----Original Message-----
> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On 
> Behalf Of George Bosilca
> Sent: Thursday, July 26, 2012 10:06 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] [patch] MPI_Cancel should not cancel a request if 
> it has a matched recv frag
> 
> Takahiro,
> 
> Indeed we were way to lax on canceling the requests. I modified your patch to 
> correctly deal with the MEMCHECK macro (remove the call from the branch that 
> will requires a completion function). The modified patch is attached below. I 
> will commit asap.
> 
>  Thanks,
>    george.
> 
> 
> Index: ompi/mca/pml/ob1/pml_ob1_recvreq.c
> ===================================================================
> --- ompi/mca/pml/ob1/pml_ob1_recvreq.c        (revision 26870)
> +++ ompi/mca/pml/ob1/pml_ob1_recvreq.c        (working copy)
> @@ -3,7 +3,7 @@
>  * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
>  *                         University Research and Technology
>  *                         Corporation.  All rights reserved.
> - * Copyright (c) 2004-2009 The University of Tennessee and The University
> + * Copyright (c) 2004-2012 The University of Tennessee and The 
> + University
>  *                         of Tennessee Research Foundation.  All rights
>  *                         reserved.
>  * Copyright (c) 2004-2008 High Performance Computing Center Stuttgart, @@ 
> -15,6 +15,7 @@
>  * Copyright (c) 2012      NVIDIA Corporation.  All rights reserved.
>  * Copyright (c) 2011-2012 Los Alamos National Security, LLC. All rights
>  *                         reserved.
> + * Copyright (c) 2012      FUJITSU LIMITED.  All rights reserved.
>  * $COPYRIGHT$
>  *
>  * Additional copyrights may follow
> @@ -97,36 +98,26 @@
>     mca_pml_ob1_recv_request_t* request = 
> (mca_pml_ob1_recv_request_t*)ompi_request;
>     mca_pml_ob1_comm_t* comm = 
> request->req_recv.req_base.req_comm->c_pml_comm;
> 
> -    if( true == ompi_request->req_complete ) { /* way to late to cancel this 
> one */
> -        /*
> -         * Receive request completed, make user buffer accessable.
> -         */
> -        MEMCHECKER(
> -            memchecker_call(&opal_memchecker_base_mem_defined,
> -                            request->req_recv.req_base.req_addr,
> -                            request->req_recv.req_base.req_count,
> -                            request->req_recv.req_base.req_datatype);
> -        );
> +    if( true == request->req_match_received ) { /* way to late to cancel 
> this one */
> +        assert( OMPI_ANY_TAG != ompi_request->req_status.MPI_TAG ); /* 
> + not matched isn't it */
>         return OMPI_SUCCESS;
>     }
> 
>     /* The rest should be protected behind the match logic lock */
>     OPAL_THREAD_LOCK(&comm->matching_lock);
> -    if( OMPI_ANY_TAG == ompi_request->req_status.MPI_TAG ) { /* the match 
> has not been already done */
> -       if( request->req_recv.req_base.req_peer == OMPI_ANY_SOURCE ) {
> -          opal_list_remove_item( &comm->wild_receives, 
> (opal_list_item_t*)request );
> -       } else {
> -          mca_pml_ob1_comm_proc_t* proc = comm->procs + 
> request->req_recv.req_base.req_peer;
> -          opal_list_remove_item(&proc->specific_receives, 
> (opal_list_item_t*)request);
> -       }
> -       PERUSE_TRACE_COMM_EVENT( PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q,
> -                                &(request->req_recv.req_base), PERUSE_RECV );
> -       /**
> -        * As now the PML is done with this request we have to force the 
> pml_complete
> -        * to true. Otherwise, the request will never be freed.
> -        */
> -       request->req_recv.req_base.req_pml_complete = true;
> +    if( request->req_recv.req_base.req_peer == OMPI_ANY_SOURCE ) {
> +        opal_list_remove_item( &comm->wild_receives, 
> (opal_list_item_t*)request );
> +    } else {
> +        mca_pml_ob1_comm_proc_t* proc = comm->procs + 
> request->req_recv.req_base.req_peer;
> +        opal_list_remove_item(&proc->specific_receives, 
> + (opal_list_item_t*)request);
>     }
> +    PERUSE_TRACE_COMM_EVENT( PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q,
> +                             &(request->req_recv.req_base), PERUSE_RECV );
> +    /**
> +     * As now the PML is done with this request we have to force the 
> pml_complete
> +     * to true. Otherwise, the request will never be freed.
> +     */
> +    request->req_recv.req_base.req_pml_complete = true;
>     OPAL_THREAD_UNLOCK(&comm->matching_lock);
> 
>     OPAL_THREAD_LOCK(&ompi_request_lock);
> @@ -138,7 +129,7 @@
>     MCA_PML_OB1_RECV_REQUEST_MPI_COMPLETE(request);
>     OPAL_THREAD_UNLOCK(&ompi_request_lock);
>     /*
> -     * Receive request cancelled, make user buffer accessable.
> +     * Receive request cancelled, make user buffer accessible.
>      */
>     MEMCHECKER(
>         memchecker_call(&opal_memchecker_base_mem_defined,
> 
> On Jul 26, 2012, at 13:41 , Kawashima, Takahiro wrote:
> 
>> Hi Open MPI developers,
>> 
>> I found a small bug in Open MPI.
>> 
>> See attached program cancelled.c.
>> In this program, rank 1 tries to cancel a MPI_Irecv and calls a 
>> MPI_Recv instead if the cancellation succeeds. This program should 
>> terminate whether the cancellation succeeds or not. But it leads a 
>> deadlock in MPI_Recv after printing "MPI_Test_cancelled: 1".
>> I confirmed it works fine with MPICH2.
>> 
>> The problem is in mca_pml_ob1_recv_request_cancel function in 
>> ompi/mca/pml/ob1/pml_ob1_recvreq.c. It accepts the cancellation unless 
>> the request has been completed. I think it should not accept the 
>> cancellation if the request has been matched. If it want to accept the 
>> cancellation, it must push the recv frag to the unexpected message 
>> queue back and redo matching. Furthermore, the receive buffer must be 
>> reverted if the received message has been written to the receive 
>> buffer partially in a pipeline protocol.
>> 
>> Attached patch cancel-recv.patch is a sample fix for this bug for Open 
>> MPI trunk. Though this patch has 65 lines, main modifications are 
>> adding one if-statement and deleting one if-statement. Other lines are 
>> just for indent alignment.
>> I cannot confirm the MEMCHECKER part is correct. Could anyone review 
>> it before committing?
>> 
>> Regards,
>> 
>> Takahiro Kawashima,
>> MPI development team,
>> Fujitsu
>> <cancelled.c><cancel-recv.patch><License.txt>_________________________
>> ______________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


Reply via email to