This looks like an uninitialized variable that GNU C handles one way and Intel another. Someone recently contributed a patch to the ompi trunk to fix just such a thing in this code area; I don't know whether it addresses this problem or not.
Can you try the ompi trunk (a nightly tarball from the last day or so forward) and see if this still occurs?

Thanks
Ralph

On Dec 11, 2009, at 4:06 PM, Daan van Rossum wrote:

> Hi all,
>
> There's a problem with ompi 1.3.4 when compiled with the intel 11.1.059 c
> compiler, related to the built-in processor binding functionality. The
> problem does not occur when ompi is compiled with the gnu c compiler.
>
> An MPI program execution fails (segfault) on mpi_init() when the following
> rank file is used:
> rank 0=node01 slot=0-3
> rank 1=node01 slot=0-3
> but runs fine with:
> rank 0=node01 slot=0
> rank 1=node01 slot=1-3
> and fine with:
> rank 0=node01 slot=0-1
> rank 1=node01 slot=1-3
> but segfaults with:
> rank 0=node01 slot=0-2
> rank 1=node01 slot=1-3
>
> This is on a two-processor quad-core opteron machine (occurs on all nodes of
> the cluster) with Ubuntu 8.10, kernel 2.6.27-16.
> This is the simplest case that fails. Generally, I would like to bind
> processes to physical processors but always allow any core, like
> rank 0=node01 slot=p0:0-3
> rank 1=node01 slot=p0:0-3
> rank 2=node01 slot=p0:0-3
> rank 3=node01 slot=p0:0-3
> rank 4=node01 slot=p1:0-3
> rank 5=node01 slot=p1:0-3
> rank 6=node01 slot=p1:0-3
> rank 7=node01 slot=p1:0-3
> which fails too.
>
> This happens with a test code that contains only two lines of code, calling
> mpi_init and mpi_finalize in sequence, and happens in both fortran and c.
>
> One more interesting thing is that the problem with setting the process
> affinity does not occur on our four-processor quad-core opteron nodes, with
> exactly the same OS etc.
>
>
> Setting "--mca paffinity_base_verbose 5" shows what is going wrong for this
> rankfile:
> rank 0=node01 slot=0-3
> rank 1=node01 slot=0-3
> ------------- WRONG -----------------
> [node01:23174] mca:base:select:(paffinity) Querying component [linux]
> [node01:23174] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> [node01:23174] mca:base:select:(paffinity) Selected component [linux]
> [node01:23174] paffinity slot assignment: slot_list == 0-3
> [node01:23174] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> [node01:23174] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> [node01:23174] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> [node01:23174] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> [node01:23174] paffinity slot assignment: slot_list == 0-3
> [node01:23174] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
> [node01:23174] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
> [node01:23174] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
> [node01:23174] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
> [node01:23175] mca:base:select:(paffinity) Querying component [linux]
> [node01:23175] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> [node01:23175] mca:base:select:(paffinity) Selected component [linux]
> [node01:23176] mca:base:select:(paffinity) Querying component [linux]
> [node01:23176] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> [node01:23176] mca:base:select:(paffinity) Selected component [linux]
> [node01:23175] paffinity slot assignment: slot_list == 0-3
> [node01:23175] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> [node01:23175] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> [node01:23175] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> [node01:23175] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> [node01:23176] paffinity slot assignment: slot_list == 0-3
> [node01:23176] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
> [node01:23176] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
> [node01:23176] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
> [node01:23176] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
> [node01:23175] *** Process received signal ***
> [node01:23176] *** Process received signal ***
> [node01:23175] Signal: Segmentation fault (11)
> [node01:23175] Signal code: Address not mapped (1)
> [node01:23175] Failing at address: 0x30
> [node01:23176] Signal: Segmentation fault (11)
> [node01:23176] Signal code: Address not mapped (1)
> [node01:23176] Failing at address: 0x30
> ------------- WRONG -----------------
>
> ------------- RIGHT -----------------
> [node25:23241] mca:base:select:(paffinity) Querying component [linux]
> [node25:23241] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> [node25:23241] mca:base:select:(paffinity) Selected component [linux]
> [node25:23241] paffinity slot assignment: slot_list == 0-3
> [node25:23241] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> [node25:23241] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> [node25:23241] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> [node25:23241] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> [node25:23241] paffinity slot assignment: slot_list == 0-3
> [node25:23241] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
> [node25:23241] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
> [node25:23241] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
> [node25:23241] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
> [node25:23242] mca:base:select:(paffinity) Querying component [linux]
> [node25:23242] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> [node25:23242] mca:base:select:(paffinity) Selected component [linux]
> [node25:23243] mca:base:select:(paffinity) Querying component [linux]
> [node25:23243] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> [node25:23243] mca:base:select:(paffinity) Selected component [linux]
> ------------- RIGHT -----------------
>
> Apparently, only the master process (ID [node01:23174] and [node25:23241])
> sets the paffinity in the RIGHT case, but in the WRONG case the compute
> processes ([node01:23175] and [node01:23176], rank0 and rank1) also try to
> set their own paffinity properties.
>
>
>
> Note that the notation in the following rankfile does not work either. But
> that seems to have a different origin, as it tries to bind to core #4,
> whereas there are just 0-3.
> rank 0=node01 slot=0:*
> rank 1=node01 slot=0:*
>
>
> Thanks for your help on this!
>
> --
> Daan van Rossum
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
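
For anyone comparing the four two-rank rankfiles in the report, here is a minimal Python sketch that expands each plain slot list into a set of CPU ids and lists the CPUs both ranks may run on. It is only an aid for reading the report, not Open MPI's rankfile parser: the names `parse_rankfile`, `shared_cpus`, and `CASES` are made up for this illustration, and the regular expression handles only the plain `slot=A` / `slot=A-B` form, not the `p0:0-3` or `0:*` notations.

```python
import re

# Hypothetical helper: matches only the plain "rank N=host slot=A" or
# "rank N=host slot=A-B" lines quoted in the report.
RANK_LINE = re.compile(r"^rank\s+(\d+)=(\S+)\s+slot=(\d+)(?:-(\d+))?$")


def parse_rankfile(text):
    """Return {rank: (host, set_of_cpu_ids)} for plain slot lists."""
    ranks = {}
    for line in text.strip().splitlines():
        match = RANK_LINE.match(line.strip())
        if match is None:
            raise ValueError(f"unsupported rankfile line: {line!r}")
        rank, host, first, last = match.groups()
        lo = int(first)
        hi = int(last) if last is not None else lo
        ranks[int(rank)] = (host, set(range(lo, hi + 1)))
    return ranks


def shared_cpus(ranks, a, b):
    """Sorted CPU ids that ranks a and b may both use (same host only)."""
    host_a, cpus_a = ranks[a]
    host_b, cpus_b = ranks[b]
    return sorted(cpus_a & cpus_b) if host_a == host_b else []


# The four rankfiles from the report, paired with the outcome the reporter saw.
CASES = [
    ("rank 0=node01 slot=0-3\nrank 1=node01 slot=0-3", "segfault"),
    ("rank 0=node01 slot=0\nrank 1=node01 slot=1-3", "ok"),
    ("rank 0=node01 slot=0-1\nrank 1=node01 slot=1-3", "ok"),
    ("rank 0=node01 slot=0-2\nrank 1=node01 slot=1-3", "segfault"),
]

if __name__ == "__main__":
    for text, outcome in CASES:
        ranks = parse_rankfile(text)
        print(outcome, shared_cpus(ranks, 0, 1))
```

Run as a script, this prints the shared CPUs per case: `[0, 1, 2, 3]` and `[1, 2]` for the two segfaulting rankfiles, `[]` and `[1]` for the two that run. That is four data points, so it is a pattern to check against the trunk build rather than a diagnosis.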