[OMPI devel] [PATCH] OSC/RDMA: Fix a potential deadlock

2010-12-13 Thread Guillaume Thouvenin
Hello,

 This patch fixes a potential loss of a lock request in
ompi_osc_rdma_passive_unlock_complete(). A new pending request is
removed from the m_locks_pending list before m_lock_status is checked;
if m_lock_status is not equal to 0, the entry that was just removed is
set to NULL and therefore lost, which can lead to a deadlock. This patch
moves the dequeuing of new_pending to its proper place, inside the
status check.
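
 As a side illustration (a stand-alone toy program, not OMPI code), the
pattern at fault looks like this: dequeuing the request before checking
the lock status means the "still locked" path throws the request away,
and nobody will ever grant it.

/* toy_pending_lock.c -- illustrates the pop-before-check bug */
#include <stdio.h>
#include <stdlib.h>

struct request { int requester; struct request *next; };
static struct request *pending = NULL;        /* pending lock requests */

static void push(int who)
{
    struct request *r = malloc(sizeof *r);
    r->requester = who;
    r->next = pending;
    pending = r;
}

static struct request *pop(void)
{
    struct request *r = pending;
    if (r) pending = r->next;
    return r;
}

/* Buggy: dequeues before checking the lock, so when the window is still
 * locked the removed request is discarded and can never be granted. */
static void unlock_complete_buggy(int lock_status)
{
    struct request *next = pop();
    if (0 == lock_status) {
        if (next) { printf("grant to %d\n", next->requester); free(next); }
    } else {
        next = NULL;   /* the dequeued request is lost (and leaked) here */
    }
}

/* Fixed: only dequeue once the lock is known to be free. */
static void unlock_complete_fixed(int lock_status)
{
    if (0 == lock_status) {
        struct request *next = pop();
        if (next) { printf("grant to %d\n", next->requester); free(next); }
    }
}

int main(void)
{
    push(1);
    unlock_complete_buggy(1);   /* still locked: request 1 silently dropped */
    unlock_complete_buggy(0);   /* lock now free, but the queue is empty */

    push(2);
    unlock_complete_fixed(1);   /* still locked: request 2 stays queued */
    unlock_complete_fixed(0);   /* prints "grant to 2" */
    return 0;
}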

 This patch was tested on v1.5.

Regards
Guillaume
---

diff --git a/ompi/mca/osc/rdma/osc_rdma_sync.c b/ompi/mca/osc/rdma/osc_rdma_sync.c
--- a/ompi/mca/osc/rdma/osc_rdma_sync.c
+++ b/ompi/mca/osc/rdma/osc_rdma_sync.c
@@ -748,9 +748,9 @@ ompi_osc_rdma_passive_unlock_complete(om
     /* if we were really unlocked, see if we have another lock request
        we can satisfy */
     OPAL_THREAD_LOCK(&(module->m_lock));
-    new_pending = (ompi_osc_rdma_pending_lock_t*)
-        opal_list_remove_first(&(module->m_locks_pending));
     if (0 == module->m_lock_status) {
+        new_pending = (ompi_osc_rdma_pending_lock_t*)
+            opal_list_remove_first(&(module->m_locks_pending));
         if (NULL != new_pending) {
             ompi_win_append_mode(module->m_win, OMPI_WIN_EXPOSE_EPOCH);
             /* set lock state and generate a lock request */


[OMPI devel] [PATCH] OSC/RDMA: Add a missing OBJ_DESTRUCT

2010-12-13 Thread Guillaume Thouvenin
Hello,

 In ompi_osc_rdma_passive_unlock_complete(), the object copy_unlock_acks
is constructed but never destroyed. The following patch adds the missing
OBJ_DESTRUCT.
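
 For context, a minimal sketch of the OPAL object-lifetime convention
involved, assuming copy_unlock_acks is a stack-allocated opal_list_t as
its name and the OBJ_DESTRUCT call suggest: an object initialized in
place with OBJ_CONSTRUCT must be torn down with OBJ_DESTRUCT (heap
objects created with OBJ_NEW use OBJ_RELEASE instead).

/* Sketch only; compiles inside the Open MPI tree. */
#include "opal/class/opal_list.h"

static void sketch(void)
{
    opal_list_t copy_unlock_acks;

    OBJ_CONSTRUCT(&copy_unlock_acks, opal_list_t);
    /* ... move and process the unlock acks ... */
    OBJ_DESTRUCT(&copy_unlock_acks);   /* the call the patch adds */
}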

 Tested on Open MPI v1.5

Regards,
Guillaume
---

diff --git a/ompi/mca/osc/rdma/osc_rdma_sync.c b/ompi/mca/osc/rdma/osc_rdma_sync.c
--- a/ompi/mca/osc/rdma/osc_rdma_sync.c
+++ b/ompi/mca/osc/rdma/osc_rdma_sync.c
@@ -745,6 +745,8 @@ ompi_osc_rdma_passive_unlock_complete(om
         OBJ_RELEASE(new_pending);
     }
 
+    OBJ_DESTRUCT(&copy_unlock_acks);
+
     /* if we were really unlocked, see if we have another lock request
        we can satisfy */
     OPAL_THREAD_LOCK(&(module->m_lock));



[OMPI devel] [patch] return value not updated in ompi_mpi_init()

2010-02-09 Thread Guillaume Thouvenin
Hello,

 It seems that a return value is not updated during the setup of process
affinity in ompi_mpi_init() (see ompi/runtime/ompi_mpi_init.c:459).

 The problem is in the following piece of code:

[... here ret == OPAL_SUCCESS ...]
phys_cpu = opal_paffinity_base_get_physical_processor_id(nrank);
if (0 > phys_cpu) {
    error = "Could not get physical processor id - cannot set processor affinity";
    goto error;
}
[...]

 If opal_paffinity_base_get_physical_processor_id() fails, ret is not
updated and we reach the "error:" label while ret == OPAL_SUCCESS.

 As a result, MPI_Init() returns without having initialized the
MPI_COMM_WORLD structure, leading to a segmentation fault in calls like
MPI_Comm_size().
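
 To make the failure mode concrete, here is a stand-alone sketch (not
the OMPI code itself) of the goto-error idiom: if ret is not set before
the jump, the caller only sees SUCCESS and happily uses half-initialized
state.

/* toy_init.c -- why ret must be set before "goto error" */
#include <stdio.h>

#define SUCCESS       0
#define ERR_NOT_FOUND (-5)

/* stand-in for opal_paffinity_base_get_physical_processor_id() */
static int get_physical_cpu(void) { return ERR_NOT_FOUND; }

static int toy_init(const char **error)
{
    int ret = SUCCESS;
    int phys_cpu;

    phys_cpu = get_physical_cpu();
    if (0 > phys_cpu) {
        ret = phys_cpu;     /* without this line, ret stays SUCCESS */
        *error = "could not get physical processor id";
        goto error;
    }
    /* ... more initialization ... */
    return SUCCESS;

 error:
    return ret;             /* SUCCESS here hides the failure from the caller */
}

int main(void)
{
    const char *msg = NULL;
    int rc = toy_init(&msg);
    printf("toy_init returned %d (%s)\n", rc, msg ? msg : "ok");
    return 0;
}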

 I hit this bug recently on new Westmere processors, where
opal_paffinity_base_get_physical_processor_id() fails when running with
the MCA parameter "opal_paffinity_alone 1".

 I'm not sure this is the right way to fix the problem, but here is a
patch tested with v1.5. It reports the error instead of causing a
segmentation fault.

With the patch, the output is:

--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  Could not get physical processor id - cannot set processor affinity
  --> Returned "Not found" (-5) instead of "Success" (0)
--

Without the patch, the output was:

 *** Process received signal ***
 Signal: Segmentation fault (11)
 Signal code: Address not mapped (1)
 Failing at address: 0x10
[ 0] /lib64/libpthread.so.0 [0x3d4e20ee90]
[ 1] /home_nfs/thouveng/dev/openmpi-v1.5/lib/libmpi.so.0(MPI_Comm_size+0x9c) [0x7fce74468dfc]
[ 2] ./IMB-MPI1(IMB_init_pointers+0x2f) [0x40629f]
[ 3] ./IMB-MPI1(main+0x65) [0x4035c5]
[ 4] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3d4da1ea2d]
[ 5] ./IMB-MPI1 [0x403499]


Regards,
Guillaume

---
diff --git a/ompi/runtime/ompi_mpi_init.c b/ompi/runtime/ompi_mpi_init.c
--- a/ompi/runtime/ompi_mpi_init.c
+++ b/ompi/runtime/ompi_mpi_init.c
@@ -459,6 +459,7 @@ int ompi_mpi_init(int argc, char **argv,
         OPAL_PAFFINITY_CPU_ZERO(mask);
         phys_cpu = opal_paffinity_base_get_physical_processor_id(nrank);
         if (0 > phys_cpu) {
+            ret = phys_cpu;
             error = "Could not get physical processor id - cannot set processor affinity";
             goto error;
         }


[OMPI devel] [patch] MPI_Comm_Spawn(), parent name is empty

2010-01-20 Thread Guillaume Thouvenin
Hello,

 When calling MPI_Comm_get_name() on the parent communicator
MPI_COMM_PARENT after a call to MPI_Comm_spawn(), we expect the
predefined name "MPI_COMM_PARENT", as stated in the MPI 2.2 standard.

 In practice, MPI_Comm_get_name() returns an empty string. As far as I
understand the problem, there is a bug in dyn_init(): the name is set,
but the corresponding flag is not updated. The following patch fixes the
problem.
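
 For completeness, a small child-side check (my own test code, not part
of the patch): when run as the command spawned by MPI_Comm_spawn(), it
prints the parent communicator's name, which should be
"MPI_COMM_PARENT".

/* check_parent_name.c -- run as the spawned child program */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm parent;
    char name[MPI_MAX_OBJECT_NAME];
    int len;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);
    if (MPI_COMM_NULL != parent) {
        MPI_Comm_get_name(parent, name, &len);
        /* empty string without the fix, "MPI_COMM_PARENT" with it */
        printf("parent communicator name: '%s'\n", name);
    }
    MPI_Finalize();
    return 0;
}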

Guillaume
---

diff --git a/ompi/mca/dpm/orte/dpm_orte.c b/ompi/mca/dpm/orte/dpm_orte.c
--- a/ompi/mca/dpm/orte/dpm_orte.c
+++ b/ompi/mca/dpm/orte/dpm_orte.c
@@ -965,6 +965,7 @@ static int dyn_init(void)

     /* Set name for debugging purposes */
     snprintf(newcomm->c_name, MPI_MAX_OBJECT_NAME, "MPI_COMM_PARENT");
+    newcomm->c_flags |= OMPI_COMM_NAMEISSET;

 return OMPI_SUCCESS;
 }