Hi all,
We experimented with an MPI+OpenMP hybrid application (MPI_THREAD_MULTIPLE
support level) in which several threads submit a lot of MPI_Irecv() requests
simultaneously, and we hit an intermittent OMPI_ERR_TEMP_OUT_OF_RESOURCE
error after MCA_PML_OB1_RECV_REQUEST_ALLOC(), because
OMPI_FREE_LIST_GET_MT() returned NULL. Investigating this bug, we found
that the thread calling ompi_free_list_grow() sometimes has no free items
left in the LIFO list when it exits, because other threads have already
retrieved all of the newly grown items via opal_atomic_lifo_pop().
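For reference, the access pattern looks roughly like the sketch below (the
request count, tags and buffer layout are illustrative only, not taken from
our real application):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NREQS_PER_THREAD 4096  /* illustrative value, not our real one */

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE is not supported\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Every OpenMP thread posts a burst of nonblocking receives at the
     * same time, so several threads go through
     * MCA_PML_OB1_RECV_REQUEST_ALLOC() / OMPI_FREE_LIST_GET_MT()
     * concurrently. */
#pragma omp parallel
    {
        int tid = omp_get_thread_num();
        MPI_Request *reqs = malloc(NREQS_PER_THREAD * sizeof(MPI_Request));
        int *bufs = malloc(NREQS_PER_THREAD * sizeof(int));

        for (int i = 0; i < NREQS_PER_THREAD; i++)
            MPI_Irecv(&bufs[i], 1, MPI_INT, MPI_ANY_SOURCE,
                      tid, MPI_COMM_WORLD, &reqs[i]);

        /* Matching sends / MPI_Waitall() omitted; here we just cancel. */
        for (int i = 0; i < NREQS_PER_THREAD; i++) {
            MPI_Cancel(&reqs[i]);
            MPI_Wait(&reqs[i], MPI_STATUS_IGNORE);
        }
        free(reqs);
        free(bufs);
    }

    MPI_Finalize();
    return 0;
}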
So we suggest changing OMPI_FREE_LIST_GET_MT() as follows:
#define OMPI_FREE_LIST_GET_MT(fl, item)                                      \
{                                                                            \
    item = (ompi_free_list_item_t*)opal_atomic_lifo_pop(&((fl)->super));     \
    if( OPAL_UNLIKELY(NULL == item) ) {                                      \
        if( opal_using_threads() ) {                                         \
            int rc;                                                          \
            opal_mutex_lock(&((fl)->fl_lock));                               \
            do {                                                             \
                rc = ompi_free_list_grow((fl), (fl)->fl_num_per_alloc);      \
                if( OPAL_UNLIKELY(rc != OMPI_SUCCESS) )                      \
                    break;                                                   \
                item = (ompi_free_list_item_t*)opal_atomic_lifo_pop(&((fl)->super)); \
            } while (!item);                                                 \
            opal_mutex_unlock(&((fl)->fl_lock));                             \
        } else {                                                             \
            ompi_free_list_grow((fl), (fl)->fl_num_per_alloc);               \
            item = (ompi_free_list_item_t*)opal_atomic_lifo_pop(&((fl)->super)); \
        } /* opal_using_threads() */                                         \
    } /* NULL == item */                                                     \
}
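The idea is that the thread holding fl_lock keeps growing the list and
retrying the pop until it either obtains an item or ompi_free_list_grow()
itself fails, so it can no longer come back with NULL just because other
threads drained the newly added items.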
Another workaround is to increase the value of the pml_ob1_free_list_inc
MCA parameter.
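For example (256 is only an illustrative value; the right setting depends
on how many requests the threads post concurrently):

    mpirun --mca pml_ob1_free_list_inc 256 ./your_app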
Regards,
Alexey