Oops - yep, that is an oversight! Will fix - thanks!
On Feb 9, 2010, at 7:13 AM, Guillaume Thouvenin wrote:
> Hello,
>
> It seems that a return value is not updated during the setup of
> process affinity in the function ompi_mpi_init()
> (ompi/runtime/ompi_mpi_init.c:459).
>
> The problem is in the following piece of code:
>
> [... here ret == OPAL_SUCCESS ...]
> phys_cpu = opal_paffinity_base_get_physical_processor_id(nrank);
> if (0 > phys_cpu) {
>     error = "Could not get physical processor id - cannot set processor affinity";
>     goto error;
> }
> [...]
>
> If opal_paffinity_base_get_physical_processor_id() fails, ret is not
> updated, and we reach the "error:" label while ret == OPAL_SUCCESS.
>
> As a result, MPI_Init() returns without having initialized the
> MPI_COMM_WORLD struct, which leads to a segmentation fault on calls
> like MPI_Comm_size().
>
> I hit the bug recently on new Westmere processors, where
> opal_paffinity_base_get_physical_processor_id() fails when the MCA
> parameter "opal_paffinity_alone 1" is set during execution.
>
> I'm not sure this is the right way to fix the problem, but here is a
> patch tested with v1.5. With it, the problem is reported instead of
> causing a segmentation fault.
>
> With the patch, the output is:
>
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> Could not get physical processor id - cannot set processor affinity
> --> Returned "Not found" (-5) instead of "Success" (0)
> --
>
> Without the patch, the output was:
>
> *** Process received signal ***
> Signal: Segmentation fault (11)
> Signal code: Address not mapped (1)
> Failing at address: 0x10
> [ 0] /lib64/libpthread.so.0 [0x3d4e20ee90]
> [ 1] /home_nfs/thouveng/dev/openmpi-v1.5/lib/libmpi.so.0(MPI_Comm_size+0x9c) [0x7fce74468dfc]
> [ 2] ./IMB-MPI1(IMB_init_pointers+0x2f) [0x40629f]
> [ 3] ./IMB-MPI1(main+0x65) [0x4035c5]
> [ 4] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3d4da1ea2d]
> [ 5] ./IMB-MPI1 [0x403499]
>
>
> Regards,
> Guillaume
>
> ---
> diff --git a/ompi/runtime/ompi_mpi_init.c b/ompi/runtime/ompi_mpi_init.c
> --- a/ompi/runtime/ompi_mpi_init.c
> +++ b/ompi/runtime/ompi_mpi_init.c
> @@ -459,6 +459,7 @@ int ompi_mpi_init(int argc, char **argv,
>  OPAL_PAFFINITY_CPU_ZERO(mask);
>  phys_cpu = opal_paffinity_base_get_physical_processor_id(nrank);
>  if (0 > phys_cpu) {
> +    ret = phys_cpu;
>      error = "Could not get physical processor id - cannot set processor affinity";
>      goto error;
>  }
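
For anyone reading this thread later, the failure mode can be shown in isolation. What follows is only a minimal standalone sketch, not Open MPI code: get_physical_id(), init_buggy(), init_fixed() and the SUCCESS / ERR_NOT_FOUND macros are hypothetical stand-ins, and the -5 value is simply taken from the "Not found" (-5) output quoted above. It assumes the lookup always fails, which is the case that triggered the report.

```c
#include <stdio.h>

#define SUCCESS        0
#define ERR_NOT_FOUND (-5)   /* assumed value, copied from the quoted output */

/* Hypothetical stand-in for the processor-id lookup; it always fails here. */
static int get_physical_id(int rank)
{
    (void)rank;
    return ERR_NOT_FOUND;
}

/* Buggy shape: jumps to the error label without updating ret. */
static int init_buggy(int rank)
{
    int ret = SUCCESS;
    const char *msg = NULL;
    int phys_cpu = get_physical_id(rank);

    if (0 > phys_cpu) {
        msg = "Could not get physical processor id";
        goto error;              /* ret is still SUCCESS at this point */
    }

error:
    if (SUCCESS != ret) {        /* never true in the failing case */
        fprintf(stderr, "%s (%d)\n", msg, ret);
    }
    return ret;                  /* caller sees success despite the failure */
}

/* Fixed shape: propagate the failing value into ret before jumping. */
static int init_fixed(int rank)
{
    int ret = SUCCESS;
    const char *msg = NULL;
    int phys_cpu = get_physical_id(rank);

    if (0 > phys_cpu) {
        ret = phys_cpu;          /* the one-line change from the patch */
        msg = "Could not get physical processor id";
        goto error;
    }

error:
    if (SUCCESS != ret) {
        fprintf(stderr, "%s (%d)\n", msg, ret);
    }
    return ret;
}
```

Calling init_buggy(0) returns SUCCESS even though the lookup failed, so a caller would carry on with uninitialized state; init_fixed(0) returns the negative code and prints the message.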