Re: [OMPI devel] [patch] return value not updated in ompi_mpi_init()

2010-02-09 Thread Ralph Castain
Oops - yep, that is an oversight! Will fix - thanks!

On Feb 9, 2010, at 7:13 AM, Guillaume Thouvenin wrote:

> Hello,
> 
> It seems that a return value is not updated during the setup of
> process affinity in ompi_mpi_init()
> (ompi/runtime/ompi_mpi_init.c:459).
> 
> The problem is in the following piece of code:
> 
> [... here ret == OPAL_SUCCESS ...]
> phys_cpu = opal_paffinity_base_get_physical_processor_id(nrank);
> if (0 > phys_cpu) {
>     error = "Could not get physical processor id - cannot set processor affinity";
>     goto error;
> }
> [...]
> 
> If opal_paffinity_base_get_physical_processor_id() fails, ret is not
> updated and we reach the "error:" label while ret == OPAL_SUCCESS.
> 
> As a result, MPI_Init() returns without having initialized the
> MPI_COMM_WORLD struct, which leads to a segmentation fault on later
> calls such as MPI_Comm_size().
> 
> I hit the bug recently on new Westmere processors, for which
> opal_paffinity_base_get_physical_processor_id() fails when the MCA
> parameter "opal_paffinity_alone 1" is set at run time.
> 
> I'm not sure this is the right way to fix the problem, but here is a
> patch tested with v1.5. With it, the problem is reported instead of
> causing a segmentation fault.
> 
> With the patch, the output is:
> 
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
> 
>  Could not get physical processor id - cannot set processor affinity
>  --> Returned "Not found" (-5) instead of "Success" (0)
> --
> 
> Without the patch, the output was:
> 
> *** Process received signal ***
> Signal: Segmentation fault (11)
> Signal code: Address not mapped (1)
> Failing at address: 0x10
> [ 0] /lib64/libpthread.so.0 [0x3d4e20ee90]
> [ 1] /home_nfs/thouveng/dev/openmpi-v1.5/lib/libmpi.so.0(MPI_Comm_size+0x9c) [0x7fce74468dfc]
> [ 2] ./IMB-MPI1(IMB_init_pointers+0x2f) [0x40629f]
> [ 3] ./IMB-MPI1(main+0x65) [0x4035c5]
> [ 4] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3d4da1ea2d]
> [ 5] ./IMB-MPI1 [0x403499]
> 
> 
> Regards,
> Guillaume
> 
> ---
> diff --git a/ompi/runtime/ompi_mpi_init.c b/ompi/runtime/ompi_mpi_init.c
> --- a/ompi/runtime/ompi_mpi_init.c
> +++ b/ompi/runtime/ompi_mpi_init.c
> @@ -459,6 +459,7 @@ int ompi_mpi_init(int argc, char **argv,
> OPAL_PAFFINITY_CPU_ZERO(mask);
> phys_cpu = opal_paffinity_base_get_physical_processor_id(nrank);
> if (0 > phys_cpu) {
> +ret = phys_cpu;
> error = "Could not get physical processor id - cannot set processor affinity";
> goto error;
> }




[OMPI devel] [patch] return value not updated in ompi_mpi_init()

2010-02-09 Thread Guillaume Thouvenin
Hello,

 It seems that a return value is not updated during the setup of
process affinity in ompi_mpi_init()
(ompi/runtime/ompi_mpi_init.c:459).

 The problem is in the following piece of code:

[... here ret == OPAL_SUCCESS ...]
phys_cpu = opal_paffinity_base_get_physical_processor_id(nrank);
if (0 > phys_cpu) {
    error = "Could not get physical processor id - cannot set processor affinity";
    goto error;
}
[...]

 If opal_paffinity_base_get_physical_processor_id() fails, ret is not
updated and we reach the "error:" label while ret == OPAL_SUCCESS.
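
 To make the control flow concrete, here is a minimal standalone sketch
of the pattern. It is not the Open MPI code: fake_get_physical_processor_id(),
init_buggy(), init_fixed() and the SKETCH_* constants are hypothetical names
made up for illustration, and the lookup is hard-wired to fail with -5 (the
"Not found" value shown in the patched output).

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_SUCCESS       0
#define SKETCH_ERR_NOT_FOUND (-5)

/* Hypothetical stand-in for the processor-id lookup; it always fails
 * with a negative error code so that the error path is exercised. */
static int fake_get_physical_processor_id(int rank)
{
    (void)rank;
    return SKETCH_ERR_NOT_FOUND;
}

/* Buggy variant: jumps to the error label without touching ret, so the
 * caller sees success even though an error message was recorded. */
static int init_buggy(int rank, const char **error_out)
{
    int ret = SKETCH_SUCCESS;
    const char *error = NULL;
    int phys_cpu;

    assert(error_out != NULL);

    phys_cpu = fake_get_physical_processor_id(rank);
    if (0 > phys_cpu) {
        error = "Could not get physical processor id";
        goto error;
    }
    /* ... the rest of the initialization would run here ... */

error:
    *error_out = error;
    return ret;
}

/* Fixed variant: propagates the failing code into ret before the jump. */
static int init_fixed(int rank, const char **error_out)
{
    int ret = SKETCH_SUCCESS;
    const char *error = NULL;
    int phys_cpu;

    assert(error_out != NULL);

    phys_cpu = fake_get_physical_processor_id(rank);
    if (0 > phys_cpu) {
        ret = phys_cpu;
        error = "Could not get physical processor id";
        goto error;
    }
    /* ... the rest of the initialization would run here ... */

error:
    *error_out = error;
    return ret;
}
```

 In this sketch, init_buggy() returns 0 with the error string set, while
init_fixed() returns -5, so only the fixed variant lets a caller detect
the failure from the return value.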

 As a result, MPI_Init() returns without having initialized the
MPI_COMM_WORLD struct, which leads to a segmentation fault on later
calls such as MPI_Comm_size().

 I hit the bug recently on new Westmere processors, for which
opal_paffinity_base_get_physical_processor_id() fails when the MCA
parameter "opal_paffinity_alone 1" is set at run time.

 I'm not sure this is the right way to fix the problem, but here is a
patch tested with v1.5. With it, the problem is reported instead of
causing a segmentation fault.

With the patch, the output is:

--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  Could not get physical processor id - cannot set processor affinity
  --> Returned "Not found" (-5) instead of "Success" (0)
--

Without the patch, the output was:

 *** Process received signal ***
 Signal: Segmentation fault (11)
 Signal code: Address not mapped (1)
 Failing at address: 0x10
[ 0] /lib64/libpthread.so.0 [0x3d4e20ee90]
[ 1] /home_nfs/thouveng/dev/openmpi-v1.5/lib/libmpi.so.0(MPI_Comm_size+0x9c) [0x7fce74468dfc]
[ 2] ./IMB-MPI1(IMB_init_pointers+0x2f) [0x40629f]
[ 3] ./IMB-MPI1(main+0x65) [0x4035c5]
[ 4] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3d4da1ea2d]
[ 5] ./IMB-MPI1 [0x403499]


Regards,
Guillaume

---
diff --git a/ompi/runtime/ompi_mpi_init.c b/ompi/runtime/ompi_mpi_init.c
--- a/ompi/runtime/ompi_mpi_init.c
+++ b/ompi/runtime/ompi_mpi_init.c
@@ -459,6 +459,7 @@ int ompi_mpi_init(int argc, char **argv,
 OPAL_PAFFINITY_CPU_ZERO(mask);
 phys_cpu = opal_paffinity_base_get_physical_processor_id(nrank);
 if (0 > phys_cpu) {
+ret = phys_cpu;
 error = "Could not get physical processor id - cannot set processor affinity";
 goto error;
 }