I don't really want to spread FUD on this list, but we've seen all sorts
of oddities with OMPI 1.3.4 built with Intel's 11.1 compiler that we
don't see with their 11.0 or with other compilers (gcc, Sun Studio, PGI,
and PathScale). I have not tested your specific failing case, but since
your issue doesn't show up with gcc, I wonder whether there is some
sort of optimization issue with the 11.1 compiler.
It might be interesting to see whether particular optimization levels with
the Intel 11.1 compiler produce a working OMPI library.
--td
Daan van Rossum wrote:
Hi Ralph,
I took the Dec 10th snapshot, but got exactly the same behavior as with version
1.3.4.
I just noticed that even this rankfile doesn't work, with a single process:
rank 0=node01 slot=0-3
------------
[node01:31105] mca:base:select:(paffinity) Querying component [linux]
[node01:31105] mca:base:select:(paffinity) Query of component [linux] set priority to 10
[node01:31105] mca:base:select:(paffinity) Selected component [linux]
[node01:31105] paffinity slot assignment: slot_list == 0-3
[node01:31105] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
[node01:31105] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
[node01:31105] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
[node01:31105] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
[node01:31106] mca:base:select:(paffinity) Querying component [linux]
[node01:31106] mca:base:select:(paffinity) Query of component [linux] set priority to 10
[node01:31106] mca:base:select:(paffinity) Selected component [linux]
[node01:31106] paffinity slot assignment: slot_list == 0-3
[node01:31106] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
[node01:31106] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
[node01:31106] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
[node01:31106] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
[node01:31106] *** An error occurred in MPI_Comm_rank
[node01:31106] *** on a NULL communicator
[node01:31106] *** Unknown error
[node01:31106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
forrtl: severe (174): SIGSEGV, segmentation fault occurred
------------
The spawned compute process doesn't sense that it should skip setting
paffinity...
I saw the posting from last July about a similar problem (the one I
mentioned at the bottom, with the slot=0:* notation not working). But that is a
different problem (and, it seems, that one is still not working either).
Best,
Daan
* on Saturday, 12.12.09 at 18:48, Ralph Castain <r...@open-mpi.org> wrote:
This looks like an uninitialized variable that gnu c handles one way and intel
another. Someone recently contributed a patch to the ompi trunk to fix just
such a thing in this code area - don't know if it addresses this problem or
not.
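
To illustrate the class of bug described here (this is not the actual Open MPI
code), the sketch below shows a state struct whose flag decides whether a
process should set its own affinity. If such a struct is read before being
initialized, the result is undefined and may differ between compilers and
optimization levels. The names binding_state_t, binding_state_init, and
should_bind are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for per-process binding state; not an Open MPI type. */
typedef struct {
    int already_bound;  /* nonzero if a launcher has already set the affinity */
    int num_cpus;       /* number of CPUs in the assigned slot list */
} binding_state_t;

/*
 * Buggy pattern, shown only as a comment because reading an uninitialized
 * automatic variable is undefined behavior:
 *
 *     binding_state_t st;           // fields hold indeterminate values
 *     if (!st.already_bound) ...    // outcome may vary by compiler
 */

/* Safer pattern: zero every field before any code reads the struct. */
void binding_state_init(binding_state_t *st)
{
    assert(st != NULL);
    memset(st, 0, sizeof(*st));
}

/* Returns 1 if the process still needs to set its own affinity, else 0. */
int should_bind(const binding_state_t *st)
{
    assert(st != NULL);
    return !st->already_bound;
}
```

With explicit initialization the decision no longer depends on whatever happened
to be on the stack, so a build with one compiler cannot silently take a
different branch than a build with another.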
Can you try the ompi trunk (a nightly tarball from the last day or so forward)
and see if this still occurs?
Thanks
Ralph
On Dec 11, 2009, at 4:06 PM, Daan van Rossum wrote:
Hi all,
There's a problem with ompi 1.3.4 when compiled with the intel 11.1.059 c
compiler, related to the built-in processor binding functionality. The
problem does not occur when ompi is compiled with the gnu c compiler.
An MPI program fails (segfault) in mpi_init() when the following rank
file is used:
rank 0=node01 slot=0-3
rank 1=node01 slot=0-3
but runs fine with:
rank 0=node01 slot=0
rank 1=node01 slot=1-3
and fine with:
rank 0=node01 slot=0-1
rank 1=node01 slot=1-3
but segfaults with:
rank 0=node01 slot=0-2
rank 1=node01 slot=1-3
This is on a two-processor quad-core Opteron machine (occurs on all nodes of
the cluster) with Ubuntu 8.10, kernel 2.6.27-16.
This is the simplest case that fails. Generally, I would like to bind processes
to physical processors but always allow any core, like
rank 0=node01 slot=p0:0-3
rank 1=node01 slot=p0:0-3
rank 2=node01 slot=p0:0-3
rank 3=node01 slot=p0:0-3
rank 4=node01 slot=p1:0-3
rank 5=node01 slot=p1:0-3
rank 6=node01 slot=p1:0-3
rank 7=node01 slot=p1:0-3
which fails too.
This happens with a test code that contains only two lines of code, calling
mpi_init and mpi_finalize in succession, and happens in both Fortran and C.
One more interesting thing is that the problem with setting the process
affinity does not occur on our four-processor quad-core Opteron nodes, with
exactly the same OS etc.
Setting "--mca paffinity_base_verbose 5" shows what is going wrong for this
rankfile:
rank 0=node01 slot=0-3
rank 1=node01 slot=0-3
------------- WRONG -----------------
[node01:23174] mca:base:select:(paffinity) Querying component [linux]
[node01:23174] mca:base:select:(paffinity) Query of component [linux] set priority to 10
[node01:23174] mca:base:select:(paffinity) Selected component [linux]
[node01:23174] paffinity slot assignment: slot_list == 0-3
[node01:23174] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
[node01:23174] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
[node01:23174] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
[node01:23174] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
[node01:23174] paffinity slot assignment: slot_list == 0-3
[node01:23174] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
[node01:23174] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
[node01:23174] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
[node01:23174] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
[node01:23175] mca:base:select:(paffinity) Querying component [linux]
[node01:23175] mca:base:select:(paffinity) Query of component [linux] set priority to 10
[node01:23175] mca:base:select:(paffinity) Selected component [linux]
[node01:23176] mca:base:select:(paffinity) Querying component [linux]
[node01:23176] mca:base:select:(paffinity) Query of component [linux] set priority to 10
[node01:23176] mca:base:select:(paffinity) Selected component [linux]
[node01:23175] paffinity slot assignment: slot_list == 0-3
[node01:23175] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
[node01:23175] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
[node01:23175] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
[node01:23175] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
[node01:23176] paffinity slot assignment: slot_list == 0-3
[node01:23176] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
[node01:23176] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
[node01:23176] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
[node01:23176] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
[node01:23175] *** Process received signal ***
[node01:23176] *** Process received signal ***
[node01:23175] Signal: Segmentation fault (11)
[node01:23175] Signal code: Address not mapped (1)
[node01:23175] Failing at address: 0x30
[node01:23176] Signal: Segmentation fault (11)
[node01:23176] Signal code: Address not mapped (1)
[node01:23176] Failing at address: 0x30
------------- WRONG -----------------
------------- RIGHT -----------------
[node25:23241] mca:base:select:(paffinity) Querying component [linux]
[node25:23241] mca:base:select:(paffinity) Query of component [linux] set priority to 10
[node25:23241] mca:base:select:(paffinity) Selected component [linux]
[node25:23241] paffinity slot assignment: slot_list == 0-3
[node25:23241] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
[node25:23241] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
[node25:23241] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
[node25:23241] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
[node25:23241] paffinity slot assignment: slot_list == 0-3
[node25:23241] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
[node25:23241] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
[node25:23241] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
[node25:23241] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
[node25:23242] mca:base:select:(paffinity) Querying component [linux]
[node25:23242] mca:base:select:(paffinity) Query of component [linux] set priority to 10
[node25:23242] mca:base:select:(paffinity) Selected component [linux]
[node25:23243] mca:base:select:(paffinity) Querying component [linux]
[node25:23243] mca:base:select:(paffinity) Query of component [linux] set priority to 10
[node25:23243] mca:base:select:(paffinity) Selected component [linux]
------------- RIGHT -----------------
Apparently, only the master process (ID [node01:23174] and [node25:23241]) sets
the paffinity in the RIGHT case, but in the WRONG case the compute processes
([node01:23175] and [node01:23176], rank 0 and rank 1) also try to set their
own paffinity properties.
Note that the slot=0:* notation in the rankfile below does not work either.
But that seems to have a different origin, as it tries to bind to core #4,
whereas there are only cores 0-3.
rank 0=node01 slot=0:*
rank 1=node01 slot=0:*
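
As a reference for what a correct expansion looks like, here is a minimal
sketch of turning a plain slot range such as "0-3" into a set of CPU indices,
with a bounds check that rejects an index like 4 on a machine with cores 0-3.
It is an illustration only, not the Open MPI parser: the function name
parse_slot_range and its signature are assumptions, and it does not handle the
socket:core forms (0:* or p0:0-3).

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse "N" or "N-M" into a bitmask with one bit set per CPU index.
 * Returns 0 on success, or -1 on malformed input or when an index is
 * outside 0 .. max_cpus-1. On failure *mask is left as 0.
 */
int parse_slot_range(const char *s, int max_cpus, unsigned long *mask)
{
    char *end;
    long lo, hi;

    assert(s != NULL && mask != NULL);
    *mask = 0;

    /* The mask can only represent as many CPUs as it has bits. */
    if (max_cpus <= 0 || max_cpus > (int)(sizeof(*mask) * CHAR_BIT))
        return -1;

    lo = strtol(s, &end, 10);
    if (end == s || lo < 0)
        return -1;

    hi = lo;
    if (*end == '-') {
        const char *p = end + 1;
        hi = strtol(p, &end, 10);
        if (end == p || hi < lo)
            return -1;
    }

    /* Reject trailing characters and out-of-range indices (e.g. core 4 of 0-3). */
    if (*end != '\0' || hi >= max_cpus)
        return -1;

    for (long c = lo; c <= hi; c++)
        *mask |= 1UL << c;
    return 0;
}
```

Calling parse_slot_range("0-3", 4, &mask) sets bits 0 through 3, which matches
the four "runs on cpu" lines in the verbose output, while
parse_slot_range("4", 4, &mask) fails instead of attempting to bind to a core
that does not exist.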
Thanks for your help on this!
--
Daan van Rossum
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Daan van Rossum
University of Chicago
Department of Astronomy and Astrophysics
5640 S. Ellis Ave
Chicago, IL 60637
phone: 773-7020624