I don't really want to throw FUD on this list, but we've seen all sorts of oddities with OMPI 1.3.4 when built with Intel's 11.1 compiler versus their 11.0 or other compilers (gcc, Sun Studio, PGI, and PathScale). I have not tested your specific failing case, but since your issue doesn't show up with gcc, I am wondering if there is some sort of optimization issue with the 11.1 compiler.

It might be interesting to see whether certain optimization levels with the Intel 11.1 compiler produce a working OMPI library.
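
For instance (just a sketch; the install prefix is a placeholder and the exact flag set may need adjusting for your environment), one could rebuild with optimization dialed down and compare:

  ./configure CC=icc CXX=icpc F77=ifort FC=ifort \
      CFLAGS=-O1 CXXFLAGS=-O1 FFLAGS=-O1 FCFLAGS=-O1 \
      --prefix=/opt/openmpi-1.3.4-intel-O1
  make all install

If the segfault disappears at -O0/-O1 but reappears at -O2 or higher, that would point toward an optimizer issue rather than a plain coding bug.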

--td

Daan van Rossum wrote:
Hi Ralph,

I took the Dec 10th snapshot, but got exactly the same behavior as with version 
1.3.4.

I just noticed that even this rankfile, with a single process, doesn't work:
 rank 0=node01 slot=0-3

------------
[node01:31105] mca:base:select:(paffinity) Querying component [linux]
[node01:31105] mca:base:select:(paffinity) Query of component [linux] set 
priority to 10
[node01:31105] mca:base:select:(paffinity) Selected component [linux]
[node01:31105] paffinity slot assignment: slot_list == 0-3
[node01:31105] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
[node01:31105] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
[node01:31105] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
[node01:31105] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
[node01:31106] mca:base:select:(paffinity) Querying component [linux]
[node01:31106] mca:base:select:(paffinity) Query of component [linux] set 
priority to 10
[node01:31106] mca:base:select:(paffinity) Selected component [linux]
[node01:31106] paffinity slot assignment: slot_list == 0-3
[node01:31106] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
[node01:31106] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
[node01:31106] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
[node01:31106] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
[node01:31106] *** An error occurred in MPI_Comm_rank
[node01:31106] *** on a NULL communicator
[node01:31106] *** Unknown error
[node01:31106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
forrtl: severe (174): SIGSEGV, segmentation fault occurred
------------

The spawned compute process doesn't sense that it should skip setting the paffinity...


I saw the posting from last July about a similar problem (the one I mentioned at the bottom, with the slot=0:* notation not working). But that is a different problem (and besides, it still seems not to work).

Best,
Daan

* on Saturday, 12.12.09 at 18:48, Ralph Castain <r...@open-mpi.org> wrote:

This looks like an uninitialized variable that GNU C handles one way and Intel another. Someone recently contributed a patch to the OMPI trunk to fix just such a thing in this code area - I don't know whether it addresses this problem or not.
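
Purely to illustrate the class of bug (this is not the actual OMPI code), reading an automatic variable before it has been assigned is undefined behavior, so two compilers can legitimately disagree:

  /* Illustration only -- not the actual OMPI code. */
  #include <stdio.h>

  int main(void)
  {
      int use_slot_list;       /* never initialized */
      if (use_slot_list) {     /* undefined behavior: reads stack garbage */
          printf("one compiler may happen to take this path...\n");
      } else {
          printf("...while another takes this one.\n");
      }
      return 0;
  }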

Can you try the OMPI trunk (a nightly tarball from the last day or so, or anything newer) and see if this still occurs?

Thanks
Ralph

On Dec 11, 2009, at 4:06 PM, Daan van Rossum wrote:

Hi all,

There's a problem with OMPI 1.3.4 when compiled with the Intel 11.1.059 C compiler, related to the built-in processor binding functionality. The problem does not occur when OMPI is compiled with the GNU C compiler.

An MPI program fails (segfaults) in mpi_init() when the following rankfile is used:
rank 0=node01 slot=0-3
rank 1=node01 slot=0-3
but runs fine with:
rank 0=node01 slot=0
rank 1=node01 slot=1-3
and fine with:
rank 0=node01 slot=0-1
rank 1=node01 slot=1-3
but segfaults with:
rank 0=node01 slot=0-2
rank 1=node01 slot=1-3

This is on a two-processor quad-core Opteron machine (it occurs on all nodes of the cluster) with Ubuntu 8.10, kernel 2.6.27-16.
This is the simplest case that fails. Generally, I would like to bind processes to physical processors but always allow any core, like
rank 0=node01 slot=p0:0-3
rank 1=node01 slot=p0:0-3
rank 2=node01 slot=p0:0-3
rank 3=node01 slot=p0:0-3
rank 4=node01 slot=p1:0-3
rank 5=node01 slot=p1:0-3
rank 6=node01 slot=p1:0-3
rank 7=node01 slot=p1:0-3
which fails too.

This happens with a test code that contains only two lines, calling mpi_init and then mpi_finalize, and it happens in both Fortran and C.
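
The C version is essentially the following (reconstructed from that description):

  /* Minimal reproducer: initialize and finalize MPI, nothing else. */
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      MPI_Finalize();
      return 0;
  }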

One more interesting thing: the problem with setting the process affinity does not occur on our four-processor quad-core Opteron nodes, with exactly the same OS etc.


Setting "--mca paffinity_base_verbose 5" shows what is going wrong for this 
rankfile:
rank 0=node01 slot=0-3
rank 1=node01 slot=0-3
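
For reference, the runs below were launched roughly like this (the rankfile and binary names are placeholders):

  mpirun -np 2 -rf myrankfile --mca paffinity_base_verbose 5 ./mpitest
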
------------- WRONG -----------------
[node01:23174] mca:base:select:(paffinity) Querying component [linux]
[node01:23174] mca:base:select:(paffinity) Query of component [linux] set 
priority to 10
[node01:23174] mca:base:select:(paffinity) Selected component [linux]
[node01:23174] paffinity slot assignment: slot_list == 0-3
[node01:23174] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
[node01:23174] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
[node01:23174] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
[node01:23174] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
[node01:23174] paffinity slot assignment: slot_list == 0-3
[node01:23174] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
[node01:23174] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
[node01:23174] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
[node01:23174] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
[node01:23175] mca:base:select:(paffinity) Querying component [linux]
[node01:23175] mca:base:select:(paffinity) Query of component [linux] set 
priority to 10
[node01:23175] mca:base:select:(paffinity) Selected component [linux]
[node01:23176] mca:base:select:(paffinity) Querying component [linux]
[node01:23176] mca:base:select:(paffinity) Query of component [linux] set 
priority to 10
[node01:23176] mca:base:select:(paffinity) Selected component [linux]
[node01:23175] paffinity slot assignment: slot_list == 0-3
[node01:23175] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
[node01:23175] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
[node01:23175] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
[node01:23175] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
[node01:23176] paffinity slot assignment: slot_list == 0-3
[node01:23176] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
[node01:23176] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
[node01:23176] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
[node01:23176] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
[node01:23175] *** Process received signal ***
[node01:23176] *** Process received signal ***
[node01:23175] Signal: Segmentation fault (11)
[node01:23175] Signal code: Address not mapped (1)
[node01:23175] Failing at address: 0x30
[node01:23176] Signal: Segmentation fault (11)
[node01:23176] Signal code: Address not mapped (1)
[node01:23176] Failing at address: 0x30
------------- WRONG -----------------

------------- RIGHT -----------------
[node25:23241] mca:base:select:(paffinity) Querying component [linux]
[node25:23241] mca:base:select:(paffinity) Query of component [linux] set 
priority to 10
[node25:23241] mca:base:select:(paffinity) Selected component [linux]
[node25:23241] paffinity slot assignment: slot_list == 0-3
[node25:23241] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
[node25:23241] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
[node25:23241] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
[node25:23241] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
[node25:23241] paffinity slot assignment: slot_list == 0-3
[node25:23241] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
[node25:23241] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
[node25:23241] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
[node25:23241] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
[node25:23242] mca:base:select:(paffinity) Querying component [linux]
[node25:23242] mca:base:select:(paffinity) Query of component [linux] set 
priority to 10
[node25:23242] mca:base:select:(paffinity) Selected component [linux]
[node25:23243] mca:base:select:(paffinity) Querying component [linux]
[node25:23243] mca:base:select:(paffinity) Query of component [linux] set 
priority to 10
[node25:23243] mca:base:select:(paffinity) Selected component [linux]
------------- RIGHT -----------------

Apparently, in the RIGHT case only the master process (IDs [node01:23174] and [node25:23241]) sets the paffinity, whereas in the WRONG case the compute processes ([node01:23175] and [node01:23176], rank 0 and rank 1) also try to set their own paffinity properties.



Note that the slot=0:* notation in the rankfile below does not work either. But that seems to have a different origin, as it tries to bind to core #4, whereas there are just cores 0-3.
rank 0=node01 slot=0:*
rank 1=node01 slot=0:*


Thanks for your help on this!

--
Daan van Rossum

--
Daan van Rossum

University of Chicago
Department of Astronomy and Astrophysics
5640 S. Ellis Ave
Chicago, IL 60637
phone: 773-7020624
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
