Alexey,
There is a conceptual difference between GET and WAIT: one can return NULL
while the other cannot. If you want a solution with a do {} while loop, I
think the best place for it is in the PML OB1 recv functions themselves
(around the OMPI_FREE_LIST_GET_MT call), not inside the
OMPI_FREE_LIST_GET_MT macro itself.
George.
On Thu, Sep 17, 2015 at 2:35 AM, Алексей Рыжих <[email protected]>
wrote:
> George,
>
> Thank you for response.
>
> In my opinion, our solution with a do/while() loop in OMPI_FREE_LIST_GET_MT
> is better for our MPI+OpenMP hybrid application than using
> OMPI_FREE_LIST_WAIT_MT, because with OMPI_FREE_LIST_WAIT_MT the MPI_Irecv()
> call would be suspended in opal_progress() until one of the MPI_Irecv()
> requests from another thread completed.
>
> Also, this is not a case of the list reaching the free_list_max_num limit.
> The situation is that the other threads consumed all items from the free
> list before one thread completed ompi_free_list_grow(), so the thread
> executing ompi_free_list_grow() got NULL.
>
>
>
> Sorry for my poor English.
>
>
>
> Alexey.
>
>
>
> *From:* devel [mailto:[email protected]] *On Behalf Of *George
> Bosilca
> *Sent:* Wednesday, September 16, 2015 10:18 PM
>
> *To:* Open MPI Developers
> *Subject:* Re: [OMPI devel] The issue with OMPI_FREE_LIST_GET_MT()
>
>
>
> On Wed, Sep 16, 2015 at 3:11 PM, Владимир Трущин <
> [email protected]> wrote:
>
> Sorry, “We saw the following problem in OMPI_FREE_LIST_GET_MT…”.
>
>
>
> That's exactly what the WAIT macro is supposed to solve: wait (growing the
> freelist and calling opal_progress) until an item becomes available.
>
>
>
> George.
>
>
>
>
>
>
>
> *From:* Владимир Трущин [mailto:[email protected]]
> *Sent:* Wednesday, September 16, 2015 10:09 PM
> *To:* 'Open MPI Developers'
> *Subject:* RE: [OMPI devel] The issue with OMPI_FREE_LIST_GET_MT()
>
>
>
> George,
>
>
>
> You are right. The sequence of calls in our test is MPI_Irecv ->
> mca_pml_ob1_irecv -> MCA_PML_OB1_RECV_REQUEST_ALLOC. We will try to use
> OMPI_FREE_LIST_WAIT_MT.
>
>
>
> We saw the following problem in OMPI_FREE_LIST_WAIT_MT: it returned NULL
> when thread A was suspended right after its call to ompi_free_list_grow.
> During that time, other threads took all the items from the free list via
> the first opal_atomic_lifo_pop in the macro. So when thread A resumed and
> called the second opal_atomic_lifo_pop in the macro, it returned NULL.
>
>
>
> Best regards,
>
> Vladimir.
>
>
>
> *From:* devel [mailto:[email protected]] *On Behalf Of *George Bosilca
> *Sent:* Wednesday, September 16, 2015 7:00 PM
> *To:* Open MPI Developers
> *Subject:* Re: [OMPI devel] The issue with OMPI_FREE_LIST_GET_MT()
>
>
>
> Alexey,
>
>
>
> This is not necessarily the fix for all cases. Most of the internal users
> of the free_list can easily accommodate the fact that no more elements
> are available. Based on your description of the problem, I would assume you
> encounter it when MCA_PML_OB1_RECV_REQUEST_ALLOC is called. In this
> particular case the problem is the fact that we call OMPI_FREE_LIST_GET_MT
> and the upper level is unable to correctly deal with the case where the
> returned item is NULL. Here the real fix is to use the blocking version of
> the free_list accessor (as is done on the send path),
> OMPI_FREE_LIST_WAIT_MT.
>
>
>
>
>
> It is also possible that I misunderstood your problem. If the solution
> above doesn't work, can you describe exactly where the NULL return of
> OMPI_FREE_LIST_GET_MT is creating an issue?
>
>
>
> George.
>
>
>
>
>
> On Wed, Sep 16, 2015 at 9:03 AM, Алексей Рыжих <[email protected]>
> wrote:
>
> Hi all,
>
> We experimented with an MPI+OpenMP hybrid application (MPI_THREAD_MULTIPLE
> support level) where several threads submit a lot of MPI_Irecv() requests
> simultaneously, and encountered an intermittent
> OMPI_ERR_TEMP_OUT_OF_RESOURCE error after MCA_PML_OB1_RECV_REQUEST_ALLOC()
> because OMPI_FREE_LIST_GET_MT() returned NULL. Investigating this bug, we
> found that sometimes the thread calling ompi_free_list_grow() doesn't have
> any free items in the LIFO list on exit, because other threads retrieved
> all the new items via opal_atomic_lifo_pop().
>
> So we suggest changing OMPI_FREE_LIST_GET_MT() as shown below:
>
>
>
> #define OMPI_FREE_LIST_GET_MT(fl, item)                                    \
> {                                                                          \
>     item = (ompi_free_list_item_t*) opal_atomic_lifo_pop(&((fl)->super));  \
>     if( OPAL_UNLIKELY(NULL == item) ) {                                    \
>         if( opal_using_threads() ) {                                       \
>             int rc;                                                        \
>             opal_mutex_lock(&((fl)->fl_lock));                             \
>             do {                                                           \
>                 rc = ompi_free_list_grow((fl), (fl)->fl_num_per_alloc);    \
>                 if( OPAL_UNLIKELY(rc != OMPI_SUCCESS) ) break;             \
>                 item = (ompi_free_list_item_t*)                            \
>                     opal_atomic_lifo_pop(&((fl)->super));                  \
>             } while (!item);                                               \
>             opal_mutex_unlock(&((fl)->fl_lock));                           \
>         } else {                                                           \
>             ompi_free_list_grow((fl), (fl)->fl_num_per_alloc);             \
>             item = (ompi_free_list_item_t*)                                \
>                 opal_atomic_lifo_pop(&((fl)->super));                      \
>         } /* opal_using_threads() */                                       \
>     } /* NULL == item */                                                   \
> }
>
>
>
>
>
> Another workaround is to increase the value of the pml_ob1_free_list_inc
> parameter.
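As a configuration example only, an MCA parameter like the one named above can be raised on the mpirun command line; the value 256, the process count, and the application name here are arbitrary placeholders, not recommendations.

```shell
# Placeholder values: 256 and ./hybrid_app are examples only.
mpirun --mca pml_ob1_free_list_inc 256 -np 4 ./hybrid_app
```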
>
>
>
> Regards,
>
> Alexey
>
>
>
>
> _______________________________________________
> devel mailing list
> [email protected]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/09/18039.php