Hi Ralph, I took the Dec 10th snapshot, but got exactly the same behavior as with version 1.3.4.
I just noticed that even this rankfile doesn't work, with a single process:
rank 0=node01 slot=0-3

------------
[node01:31105] mca:base:select:(paffinity) Querying component [linux]
[node01:31105] mca:base:select:(paffinity) Query of component [linux] set priority to 10
[node01:31105] mca:base:select:(paffinity) Selected component [linux]
[node01:31105] paffinity slot assignment: slot_list == 0-3
[node01:31105] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
[node01:31105] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
[node01:31105] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
[node01:31105] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
[node01:31106] mca:base:select:(paffinity) Querying component [linux]
[node01:31106] mca:base:select:(paffinity) Query of component [linux] set priority to 10
[node01:31106] mca:base:select:(paffinity) Selected component [linux]
[node01:31106] paffinity slot assignment: slot_list == 0-3
[node01:31106] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
[node01:31106] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
[node01:31106] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
[node01:31106] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
[node01:31106] *** An error occurred in MPI_Comm_rank
[node01:31106] *** on a NULL communicator
[node01:31106] *** Unknown error
[node01:31106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
forrtl: severe (174): SIGSEGV, segmentation fault occurred
------------

The spawned compute process doesn't sense that it should skip setting the paffinity...

I saw the posting from last July about a similar problem (the problem that I mentioned at the bottom, with the slot=0:* notation not working). But that is a different problem (and besides, that one still does not seem to be working).

Best,
Daan

* on Saturday, 12.12.09 at 18:48, Ralph Castain <r...@open-mpi.org> wrote:

> This looks like an uninitialized variable that gnu c handles one way and
> intel another. Someone recently contributed a patch to the ompi trunk to fix
> just such a thing in this code area - don't know if it addresses this
> problem or not.
>
> Can you try the ompi trunk (a nightly tarball from the last day or so
> forward) and see if this still occurs?
>
> Thanks
> Ralph
>
> On Dec 11, 2009, at 4:06 PM, Daan van Rossum wrote:
>
> > Hi all,
> >
> > There's a problem with ompi 1.3.4 when compiled with the intel 11.1.059 c
> > compiler, related to the built-in processor binding functionality. The
> > problem does not occur when ompi is compiled with the gnu c compiler.
> >
> > An mpi program execution fails (segfault) on mpi_init() when the following
> > rank file is used:
> > rank 0=node01 slot=0-3
> > rank 1=node01 slot=0-3
> > but runs fine with:
> > rank 0=node01 slot=0
> > rank 1=node01 slot=1-3
> > and fine with:
> > rank 0=node01 slot=0-1
> > rank 1=node01 slot=1-3
> > but segfaults with:
> > rank 0=node01 slot=0-2
> > rank 1=node01 slot=1-3
> >
> > This is on a two-processor quad-core opteron machine (occurs on all nodes
> > of the cluster) with Ubuntu 8.10, kernel 2.6.27-16.
> > This is the simplest case that fails. Generally, I would like to bind
> > processes to physical procs but always allow any core, like
> > rank 0=node01 slot=p0:0-3
> > rank 1=node01 slot=p0:0-3
> > rank 2=node01 slot=p0:0-3
> > rank 3=node01 slot=p0:0-3
> > rank 4=node01 slot=p1:0-3
> > rank 5=node01 slot=p1:0-3
> > rank 6=node01 slot=p1:0-3
> > rank 7=node01 slot=p1:0-3
> > which fails too.
> >
> > This happens with a test code that contains only two lines of code, calling
> > mpi_init and mpi_finalize subsequently, and happens in both fortran and in
> > c.
> >
> > One more interesting thing is that the problem with setting the process
> > affinity does not occur on our four-processor quad-core opteron nodes, with
> > exactly the same OS etc.
> >
> >
> > Setting "--mca paffinity_base_verbose 5" shows what is going wrong for this
> > rankfile:
> > rank 0=node01 slot=0-3
> > rank 1=node01 slot=0-3
> > ------------- WRONG -----------------
> > [node01:23174] mca:base:select:(paffinity) Querying component [linux]
> > [node01:23174] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> > [node01:23174] mca:base:select:(paffinity) Selected component [linux]
> > [node01:23174] paffinity slot assignment: slot_list == 0-3
> > [node01:23174] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> > [node01:23174] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> > [node01:23174] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> > [node01:23174] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> > [node01:23174] paffinity slot assignment: slot_list == 0-3
> > [node01:23174] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
> > [node01:23174] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
> > [node01:23174] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
> > [node01:23174] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
> > [node01:23175] mca:base:select:(paffinity) Querying component [linux]
> > [node01:23175] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> > [node01:23175] mca:base:select:(paffinity) Selected component [linux]
> > [node01:23176] mca:base:select:(paffinity) Querying component [linux]
> > [node01:23176] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> > [node01:23176] mca:base:select:(paffinity) Selected component [linux]
> > [node01:23175] paffinity slot assignment: slot_list == 0-3
> > [node01:23175] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> > [node01:23175] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> > [node01:23175] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> > [node01:23175] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> > [node01:23176] paffinity slot assignment: slot_list == 0-3
> > [node01:23176] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
> > [node01:23176] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
> > [node01:23176] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
> > [node01:23176] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
> > [node01:23175] *** Process received signal ***
> > [node01:23176] *** Process received signal ***
> > [node01:23175] Signal: Segmentation fault (11)
> > [node01:23175] Signal code: Address not mapped (1)
> > [node01:23175] Failing at address: 0x30
> > [node01:23176] Signal: Segmentation fault (11)
> > [node01:23176] Signal code: Address not mapped (1)
> > [node01:23176] Failing at address: 0x30
> > ------------- WRONG -----------------
> >
> > ------------- RIGHT -----------------
> > [node25:23241] mca:base:select:(paffinity) Querying component [linux]
> > [node25:23241] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> > [node25:23241] mca:base:select:(paffinity) Selected component [linux]
> > [node25:23241] paffinity slot assignment: slot_list == 0-3
> > [node25:23241] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> > [node25:23241] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> > [node25:23241] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> > [node25:23241] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> > [node25:23241] paffinity slot assignment: slot_list == 0-3
> > [node25:23241] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
> > [node25:23241] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
> > [node25:23241] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
> > [node25:23241] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
> > [node25:23242] mca:base:select:(paffinity) Querying component [linux]
> > [node25:23242] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> > [node25:23242] mca:base:select:(paffinity) Selected component [linux]
> > [node25:23243] mca:base:select:(paffinity) Querying component [linux]
> > [node25:23243] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> > [node25:23243] mca:base:select:(paffinity) Selected component [linux]
> > ------------- RIGHT -----------------
> >
> > Apparently, only the master process (IDs [node01:23174] and [node25:23241])
> > sets the paffinity in the RIGHT case, but in the WRONG case the compute
> > processes ([node01:23175] and [node01:23176], rank0 and rank1) also try
> > to set their own paffinity properties.
> >
> >
> >
> > Note that the slot=0:* notation in the rankfile below does not work either.
> > But that seems to have a different origin, as it tries to bind to core #4,
> > whereas there are just cores 0-3.
> > rank 0=node01 slot=0:*
> > rank 1=node01 slot=0:*
> >
> >
> > Thanks for your help on this!
> >
> > --
> > Daan van Rossum
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Daan van Rossum

University of Chicago
Department of Astronomy and Astrophysics
5640 S. Ellis Ave
Chicago, IL 60637
phone: 773-7020624
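
For readers following the thread, here is a minimal Python sketch that lays out the two-rank rankfiles from the original report next to the outcome reported for each. The helper names (parse_rankfile, shared_cpus, summarize) are hypothetical, and this is not Open MPI's rankfile parser: it only understands plain "slot=A" and "slot=A-B" entries and rejects the "p0:0-3" and "0:*" forms. The outcomes are copied from the report, not computed.

```python
import re

# Hypothetical helper, NOT Open MPI's rankfile parser: it only handles the
# plain "rank N=host slot=A" and "rank N=host slot=A-B" lines from this thread.
LINE_RE = re.compile(r"^rank\s+(\d+)=(\S+)\s+slot=(\d+)(?:-(\d+))?$")


def parse_rankfile(text):
    """Return {rank: (host, set_of_cpu_ids)} for simple slot ranges."""
    ranks = {}
    for line in text.strip().splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            raise ValueError(f"unsupported rankfile line: {line!r}")
        rank, host, lo, hi = m.groups()
        lo = int(lo)
        hi = int(hi) if hi is not None else lo
        ranks[int(rank)] = (host, set(range(lo, hi + 1)))
    return ranks


def shared_cpus(ranks, a, b):
    """CPUs that ranks a and b may both run on (empty if hosts differ)."""
    host_a, cpus_a = ranks[a]
    host_b, cpus_b = ranks[b]
    return cpus_a & cpus_b if host_a == host_b else set()


# The four two-rank rankfiles from the original report, paired with the
# outcome stated there (observed behavior, not something this script derives).
CASES = [
    ("rank 0=node01 slot=0-3\nrank 1=node01 slot=0-3", "segfault"),
    ("rank 0=node01 slot=0\nrank 1=node01 slot=1-3", "ok"),
    ("rank 0=node01 slot=0-1\nrank 1=node01 slot=1-3", "ok"),
    ("rank 0=node01 slot=0-2\nrank 1=node01 slot=1-3", "segfault"),
]


def summarize(cases):
    """Return (rank0 cpus, rank1 cpus, shared cpus, reported outcome) rows."""
    rows = []
    for text, outcome in cases:
        ranks = parse_rankfile(text)
        rows.append((
            sorted(ranks[0][1]),
            sorted(ranks[1][1]),
            sorted(shared_cpus(ranks, 0, 1)),
            outcome,
        ))
    return rows


if __name__ == "__main__":
    for cpus0, cpus1, shared, outcome in summarize(CASES):
        print(f"rank0={cpus0} rank1={cpus1} shared={shared} -> {outcome}")
```

Laid out this way, the reported outcomes show that overlapping CPU sets alone do not separate the failing cases from the working ones: slot=0-1 with slot=1-3 shares CPU 1 and runs fine, while slot=0-2 with slot=1-3 shares CPUs 1 and 2 and segfaults.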