Alexey,

There is a conceptual difference between GET and WAIT: one can return NULL
while the other cannot. If you want a solution with do {} while, I think the
best place is specifically in the PML OB1 recv functions (around the
OMPI_FREE_LIST_GET_MT) and not inside the OMPI_FREE_LIST_GET_MT macro itself.
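Something along these lines in the recv request allocation path (an untested
sketch; mca_pml_ob1.recv_requests is an assumption about which list is
involved here, and whether the loop should also call opal_progress, or give
up after a bounded number of retries, is a separate question):

    ompi_free_list_item_t *item;
    mca_pml_ob1_recv_request_t *recvreq;

    do {
        /* GET already grows the list on its slow path, but can still
         * return NULL if other threads drain the new items first */
        OMPI_FREE_LIST_GET_MT(&mca_pml_ob1.recv_requests, item);
    } while (OPAL_UNLIKELY(NULL == item));

    recvreq = (mca_pml_ob1_recv_request_t *)item;

That keeps the NULL semantics of GET intact for the callers that can handle
an empty list, and confines the retry policy to the one place that cannot.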
George.

On Thu, Sep 17, 2015 at 2:35 AM, Алексей Рыжих <avryzh...@compcenter.org> wrote:

> George,
>
> Thank you for the response.
>
> In my opinion our solution with a do/while() loop in OMPI_FREE_LIST_GET_MT
> is better for our MPI+OpenMP hybrid application than using
> OMPI_FREE_LIST_WAIT_MT, because with OMPI_FREE_LIST_WAIT_MT an MPI_Irecv()
> will be suspended in opal_progress() until one of the MPI_Irecv() requests
> from another thread completes.
>
> And this is not a case of the list having reached the free_list_max_num
> limit. The situation is that the threads consumed all items from the free
> list before one other thread completed ompi_free_list_grow(), so the thread
> executing ompi_free_list_grow() got NULL.
>
> Sorry for my poor English.
>
> Alexey.
>
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of George Bosilca
> Sent: Wednesday, September 16, 2015 10:18 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] The issue with OMPI_FREE_LIST_GET_MT()
>
> On Wed, Sep 16, 2015 at 3:11 PM, Владимир Трущин <vdtrusc...@compcenter.org> wrote:
>
> Sorry, “We saw the following problem in OMPI_FREE_LIST_GET_MT…”.
>
> That's exactly what the WAIT macro is supposed to solve: wait (grow the
> freelist and call opal_progress) until an item becomes available.
>
> George.
>
> From: Владимир Трущин [mailto:vdtrusc...@compcenter.org]
> Sent: Wednesday, September 16, 2015 10:09 PM
> To: 'Open MPI Developers'
> Subject: RE: [OMPI devel] The issue with OMPI_FREE_LIST_GET_MT()
>
> George,
>
> You are right. The sequence of calls in our test is MPI_Irecv ->
> mca_pml_ob1_irecv -> MCA_PML_OB1_RECV_REQUEST_ALLOC. We will try to use
> OMPI_FREE_LIST_WAIT_MT.
>
> We saw the following problem in OMPI_FREE_LIST_WAIT_MT: it returned NULL
> when thread A was suspended after the call to ompi_free_list_grow. At this
> time other threads took all the items from the free list at the first call
> of opal_atomic_lifo_pop in the macro. So, when thread A resumed and called
> the second opal_atomic_lifo_pop in the macro, it returned NULL.
>
> Best regards,
> Vladimir.
>
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of George Bosilca
> Sent: Wednesday, September 16, 2015 7:00 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] The issue with OMPI_FREE_LIST_GET_MT()
>
> Alexey,
>
> This is not necessarily the fix for all cases. Most of the internal uses
> of the free_list can easily accommodate the fact that no more elements are
> available. Based on your description of the problem I would assume you
> encounter this problem once MCA_PML_OB1_RECV_REQUEST_ALLOC is called. In
> this particular case the problem is the fact that we call
> OMPI_FREE_LIST_GET_MT and that the upper level is unable to correctly deal
> with the case where the returned item is NULL. Here the real fix is to use
> the blocking version of the free_list accessor (similar to the send case),
> OMPI_FREE_LIST_WAIT_MT.
>
> It is also possible that I misunderstood your problem. If the solution
> above doesn't work, can you describe exactly where the NULL return of
> OMPI_FREE_LIST_GET_MT is creating an issue?
>
> George.
>
> On Wed, Sep 16, 2015 at 9:03 AM, Алексей Рыжих <avryzh...@compcenter.org> wrote:
>
> Hi all,
>
> We experimented with an MPI+OpenMP hybrid application (MPI_THREAD_MULTIPLE
> support level) where several threads submit a lot of MPI_Irecv() requests
> simultaneously, and encountered an intermittent bug:
> OMPI_ERR_TEMP_OUT_OF_RESOURCE after MCA_PML_OB1_RECV_REQUEST_ALLOC(),
> because OMPI_FREE_LIST_GET_MT() returned NULL. Investigating this bug we
> found that sometimes the thread calling ompi_free_list_grow() doesn't find
> any free items in the LIFO list on exit, because other threads retrieved
> all the new items via opal_atomic_lifo_pop().
>
> So we suggest changing OMPI_FREE_LIST_GET_MT() as below:
>
> #define OMPI_FREE_LIST_GET_MT(fl, item)                                        \
>     {                                                                          \
>         item = (ompi_free_list_item_t*) opal_atomic_lifo_pop(&((fl)->super));  \
>         if( OPAL_UNLIKELY(NULL == item) ) {                                    \
>             if( opal_using_threads() ) {                                       \
>                 int rc;                                                        \
>                 opal_mutex_lock(&((fl)->fl_lock));                             \
>                 do {                                                           \
>                     rc = ompi_free_list_grow((fl), (fl)->fl_num_per_alloc);    \
>                     if( OPAL_UNLIKELY(rc != OMPI_SUCCESS) ) break;             \
>                     item = (ompi_free_list_item_t*)                            \
>                         opal_atomic_lifo_pop(&((fl)->super));                  \
>                 } while (!item);                                               \
>                 opal_mutex_unlock(&((fl)->fl_lock));                           \
>             } else {                                                           \
>                 ompi_free_list_grow((fl), (fl)->fl_num_per_alloc);             \
>                 item = (ompi_free_list_item_t*)                                \
>                     opal_atomic_lifo_pop(&((fl)->super));                      \
>             } /* opal_using_threads() */                                       \
>         } /* NULL == item */                                                   \
>     }
>
> Another workaround is to increase the value of the pml_ob1_free_list_inc
> parameter.
>
> Regards,
> Alexey
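For reference, the race described above is easy to reproduce outside of Open
MPI. Below is a toy model in plain pthreads (all identifiers are stand-ins,
none of this is OMPI source): an integer pool plays the free list, try_pop()
plays opal_atomic_lifo_pop(), grow() plays ompi_free_list_grow(). Build with
gcc -pthread:

    /* Even though grow() runs under a lock (like fl_lock), the pop fast
     * path does not take that lock, so other threads can drain the freshly
     * grown items before the growing thread pops again. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 8
    #define NITER    100000
    #define GROW_BY  4   /* analogue of fl_num_per_alloc */

    static int pool = 0;                 /* items currently in the "free list" */
    static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t grow_lock = PTHREAD_MUTEX_INITIALIZER;
    static int null_after_grow = 0;      /* counts the reported failure */

    /* The mutex only emulates the atomicity of opal_atomic_lifo_pop();
     * the real pop is lock-free, which is why holding fl_lock during
     * grow does not serialize it against other threads' pops. */
    static int try_pop(void)
    {
        int got = 0;
        pthread_mutex_lock(&pool_lock);
        if (pool > 0) { pool--; got = 1; }
        pthread_mutex_unlock(&pool_lock);
        return got;
    }

    static void grow(void)               /* emulates ompi_free_list_grow() */
    {
        pthread_mutex_lock(&pool_lock);
        pool += GROW_BY;
        pthread_mutex_unlock(&pool_lock);
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        for (int i = 0; i < NITER; i++) {
            if (try_pop())               /* fast path: first pop attempt */
                continue;
            pthread_mutex_lock(&grow_lock);   /* slow path, like fl_lock */
            grow();
            /* window: other threads' fast-path pops can take all GROW_BY
             * items we just added, without ever touching grow_lock */
            if (!try_pop())
                __sync_fetch_and_add(&null_after_grow, 1);
            pthread_mutex_unlock(&grow_lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, consumer, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("second pop after grow returned NULL %d times\n",
               null_after_grow);
        return 0;
    }

Raising GROW_BY makes the NULL returns rarer but does not eliminate them,
which matches both the pml_ob1_free_list_inc workaround and the need for a
retry loop somewhere above the pop.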