[OMPI devel] [PATCH] OSC/RDMA: Fix a potential deadlock
Hello,

This patch fixes a potential loss of a lock request in ompi_osc_rdma_passive_unlock_complete(). A new pending request is taken from the m_locks_pending list; if m_lock_status is not equal to 0, this new entry is then set to NULL and thus lost, which can lead to a deadlock. This patch therefore moves the update of new_pending to its proper place, inside the lock-status check.

This patch was tested on v1.5.

Regards,
Guillaume

---
diff --git a/ompi/mca/osc/rdma/osc_rdma_sync.c b/ompi/mca/osc/rdma/osc_rdma_sync.c
--- a/ompi/mca/osc/rdma/osc_rdma_sync.c
+++ b/ompi/mca/osc/rdma/osc_rdma_sync.c
@@ -748,9 +748,9 @@ ompi_osc_rdma_passive_unlock_complete(om
     /* if we were really unlocked, see if we have another lock request we
        can satisfy */
     OPAL_THREAD_LOCK(&(module->m_lock));
-    new_pending = (ompi_osc_rdma_pending_lock_t*)
-        opal_list_remove_first(&(module->m_locks_pending));
     if (0 == module->m_lock_status) {
+        new_pending = (ompi_osc_rdma_pending_lock_t*)
+            opal_list_remove_first(&(module->m_locks_pending));
         if (NULL != new_pending) {
             ompi_win_append_mode(module->m_win, OMPI_WIN_EXPOSE_EPOCH);
             /* set lock state and generate a lock request */
[OMPI devel] [PATCH] OSC/RDMA: Add a missing OBJ_DESTRUCT
Hello,

In ompi_osc_rdma_passive_unlock_complete(), the object copy_unlock_acks is constructed but never destructed. The following patch adds its destruction.

Tested on Open MPI v1.5.

Regards,
Guillaume

---
diff --git a/ompi/mca/osc/rdma/osc_rdma_sync.c b/ompi/mca/osc/rdma/osc_rdma_sync.c
--- a/ompi/mca/osc/rdma/osc_rdma_sync.c
+++ b/ompi/mca/osc/rdma/osc_rdma_sync.c
@@ -745,6 +745,8 @@ ompi_osc_rdma_passive_unlock_complete(om
         OBJ_RELEASE(new_pending);
     }

+    OBJ_DESTRUCT(&copy_unlock_acks);
+
     /* if we were really unlocked, see if we have another lock request we
        can satisfy */
     OPAL_THREAD_LOCK(&(module->m_lock));
[OMPI devel] [patch] return value not updated in ompi_mpi_init()
Hello,

It seems that a return value is not updated during the setup of process affinity in ompi_mpi_init() (ompi/runtime/ompi_mpi_init.c:459). The problem is in the following piece of code:

[... here ret == OPAL_SUCCESS ...]
phys_cpu = opal_paffinity_base_get_physical_processor_id(nrank);
if (0 > phys_cpu) {
    error = "Could not get physical processor id - cannot set processor affinity";
    goto error;
}
[...]

If opal_paffinity_base_get_physical_processor_id() fails, ret is not updated and we reach the "error:" label with ret still equal to OPAL_SUCCESS. As a result, MPI_Init() returns without having initialized the MPI_COMM_WORLD struct, leading to a segmentation fault on subsequent calls such as MPI_Comm_size().

I hit this bug recently on new Westmere processors, on which opal_paffinity_base_get_physical_processor_id() fails when running with the MCA parameter "opal_paffinity_alone 1".

I'm not sure this is the right way to fix the problem, but here is a patch, tested with v1.5, that reports the failure instead of segfaulting. With the patch, the output is:

--
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.

This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  Could not get physical processor id - cannot set processor affinity
  --> Returned "Not found" (-5) instead of "Success" (0)
--

Without the patch, the output was:

*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x10
[ 0] /lib64/libpthread.so.0 [0x3d4e20ee90]
[ 1] /home_nfs/thouveng/dev/openmpi-v1.5/lib/libmpi.so.0(MPI_Comm_size+0x9c) [0x7fce74468dfc]
[ 2] ./IMB-MPI1(IMB_init_pointers+0x2f) [0x40629f]
[ 3] ./IMB-MPI1(main+0x65) [0x4035c5]
[ 4] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3d4da1ea2d]
[ 5] ./IMB-MPI1 [0x403499]

Regards,
Guillaume

---
diff --git a/ompi/runtime/ompi_mpi_init.c b/ompi/runtime/ompi_mpi_init.c
--- a/ompi/runtime/ompi_mpi_init.c
+++ b/ompi/runtime/ompi_mpi_init.c
@@ -459,6 +459,7 @@ int ompi_mpi_init(int argc, char **argv,
     OPAL_PAFFINITY_CPU_ZERO(mask);
     phys_cpu = opal_paffinity_base_get_physical_processor_id(nrank);
     if (0 > phys_cpu) {
+        ret = phys_cpu;
         error = "Could not get physical processor id - cannot set processor affinity";
         goto error;
     }
[OMPI devel] [patch] MPI_Comm_spawn(), parent name is empty
Hello,

When calling MPI_Comm_get_name() on the communicator returned by MPI_Comm_get_parent() after a call to MPI_Comm_spawn(), we expect the name "MPI_COMM_PARENT", as stated in the MPI 2.2 Standard. In practice, MPI_Comm_get_name() returns an empty string.

As far as I understand the problem, there is a bug in dyn_init(): the name is set but the flag is not updated. The following patch fixes the problem.

Guillaume

---
diff --git a/ompi/mca/dpm/orte/dpm_orte.c b/ompi/mca/dpm/orte/dpm_orte.c
--- a/ompi/mca/dpm/orte/dpm_orte.c
+++ b/ompi/mca/dpm/orte/dpm_orte.c
@@ -965,6 +965,7 @@ static int dyn_init(void)

     /* Set name for debugging purposes */
     snprintf(newcomm->c_name, MPI_MAX_OBJECT_NAME, "MPI_COMM_PARENT");
+    newcomm->c_flags |= OMPI_COMM_NAMEISSET;

     return OMPI_SUCCESS;
 }