I'll have to look through the logic and see if I can spot something obvious. I don't have access to an Intel compiler, and as you noted it works fine with gcc. I'm afraid I can't do much more than that, so this may take a while and may not yield a positive result.
On Dec 14, 2009, at 12:32 PM, Daan van Rossum wrote:

> Hi Ralph,
>
> I took the Dec 10th snapshot, but got exactly the same behavior as with
> version 1.3.4.
>
> I just noticed that even this rankfile doesn't work, with a single process:
> rank 0=node01 slot=0-3
>
> ------------
> [node01:31105] mca:base:select:(paffinity) Querying component [linux]
> [node01:31105] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> [node01:31105] mca:base:select:(paffinity) Selected component [linux]
> [node01:31105] paffinity slot assignment: slot_list == 0-3
> [node01:31105] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> [node01:31105] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> [node01:31105] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> [node01:31105] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> [node01:31106] mca:base:select:(paffinity) Querying component [linux]
> [node01:31106] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> [node01:31106] mca:base:select:(paffinity) Selected component [linux]
> [node01:31106] paffinity slot assignment: slot_list == 0-3
> [node01:31106] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> [node01:31106] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> [node01:31106] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> [node01:31106] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> [node01:31106] *** An error occurred in MPI_Comm_rank
> [node01:31106] *** on a NULL communicator
> [node01:31106] *** Unknown error
> [node01:31106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> ------------
>
> The spawned compute process doesn't sense that it should skip setting the
> paffinity...
>
> I saw the posting from last July about a similar problem (the problem that I
> mentioned at the bottom, with the slot=0:* notation not working). But that is
> a different problem (and besides, it still seems not to be working).
>
> Best,
> Daan
>
> * on Saturday, 12.12.09 at 18:48, Ralph Castain <r...@open-mpi.org> wrote:
>
>> This looks like an uninitialized variable that gnu c handles one way and
>> intel another. Someone recently contributed a patch to the ompi trunk to fix
>> just such a thing in this code area - don't know if it addresses this
>> problem or not.
>>
>> Can you try the ompi trunk (a nightly tarball from the last day or so
>> forward) and see if this still occurs?
>>
>> Thanks
>> Ralph
>>
>> On Dec 11, 2009, at 4:06 PM, Daan van Rossum wrote:
>>
>>> Hi all,
>>>
>>> There's a problem with ompi 1.3.4 when compiled with the Intel 11.1.059 C
>>> compiler, related to the built-in processor binding functionality. The
>>> problem does not occur when ompi is compiled with the GNU C compiler.
>>>
>>> An MPI program execution fails (segfault) in mpi_init() when the following
>>> rankfile is used:
>>> rank 0=node01 slot=0-3
>>> rank 1=node01 slot=0-3
>>> but runs fine with:
>>> rank 0=node01 slot=0
>>> rank 1=node01 slot=1-3
>>> and fine with:
>>> rank 0=node01 slot=0-1
>>> rank 1=node01 slot=1-3
>>> but segfaults with:
>>> rank 0=node01 slot=0-2
>>> rank 1=node01 slot=1-3
>>>
>>> This is on a two-processor quad-core Opteron machine (it occurs on all
>>> nodes of the cluster) with Ubuntu 8.10, kernel 2.6.27-16.
>>> This is the simplest case that fails. Generally, I would like to bind
>>> processes to physical processors but always allow any core, like
>>> rank 0=node01 slot=p0:0-3
>>> rank 1=node01 slot=p0:0-3
>>> rank 2=node01 slot=p0:0-3
>>> rank 3=node01 slot=p0:0-3
>>> rank 4=node01 slot=p1:0-3
>>> rank 5=node01 slot=p1:0-3
>>> rank 6=node01 slot=p1:0-3
>>> rank 7=node01 slot=p1:0-3
>>> which fails too.
>>>
>>> This happens with a test code that contains only two lines, calling
>>> mpi_init and then mpi_finalize, and it happens in both Fortran and C.
>>>
>>> One more interesting thing is that the problem with setting the process
>>> affinity does not occur on our four-processor quad-core Opteron nodes,
>>> with exactly the same OS etc.
>>>
>>> Setting "--mca paffinity_base_verbose 5" shows what is going wrong for
>>> this rankfile:
>>> rank 0=node01 slot=0-3
>>> rank 1=node01 slot=0-3
>>> ------------- WRONG -----------------
>>> [node01:23174] mca:base:select:(paffinity) Querying component [linux]
>>> [node01:23174] mca:base:select:(paffinity) Query of component [linux] set priority to 10
>>> [node01:23174] mca:base:select:(paffinity) Selected component [linux]
>>> [node01:23174] paffinity slot assignment: slot_list == 0-3
>>> [node01:23174] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
>>> [node01:23174] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
>>> [node01:23174] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
>>> [node01:23174] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
>>> [node01:23174] paffinity slot assignment: slot_list == 0-3
>>> [node01:23174] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
>>> [node01:23174] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
>>> [node01:23174] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
>>> [node01:23174] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
>>> [node01:23175] mca:base:select:(paffinity) Querying component [linux]
>>> [node01:23175] mca:base:select:(paffinity) Query of component [linux] set priority to 10
>>> [node01:23175] mca:base:select:(paffinity) Selected component [linux]
>>> [node01:23176] mca:base:select:(paffinity) Querying component [linux]
>>> [node01:23176] mca:base:select:(paffinity) Query of component [linux] set priority to 10
>>> [node01:23176] mca:base:select:(paffinity) Selected component [linux]
>>> [node01:23175] paffinity slot assignment: slot_list == 0-3
>>> [node01:23175] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
>>> [node01:23175] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
>>> [node01:23175] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
>>> [node01:23175] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
>>> [node01:23176] paffinity slot assignment: slot_list == 0-3
>>> [node01:23176] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
>>> [node01:23176] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
>>> [node01:23176] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
>>> [node01:23176] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
>>> [node01:23175] *** Process received signal ***
>>> [node01:23176] *** Process received signal ***
>>> [node01:23175] Signal: Segmentation fault (11)
>>> [node01:23175] Signal code: Address not mapped (1)
>>> [node01:23175] Failing at address: 0x30
>>> [node01:23176] Signal: Segmentation fault (11)
>>> [node01:23176] Signal code: Address not mapped (1)
>>> [node01:23176] Failing at address: 0x30
>>> ------------- WRONG -----------------
>>>
>>> ------------- RIGHT -----------------
>>> [node25:23241] mca:base:select:(paffinity) Querying component [linux]
>>> [node25:23241] mca:base:select:(paffinity) Query of component [linux] set priority to 10
>>> [node25:23241] mca:base:select:(paffinity) Selected component [linux]
>>> [node25:23241] paffinity slot assignment: slot_list == 0-3
>>> [node25:23241] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
>>> [node25:23241] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
>>> [node25:23241] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
>>> [node25:23241] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
>>> [node25:23241] paffinity slot assignment: slot_list == 0-3
>>> [node25:23241] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
>>> [node25:23241] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
>>> [node25:23241] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
>>> [node25:23241] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
>>> [node25:23242] mca:base:select:(paffinity) Querying component [linux]
>>> [node25:23242] mca:base:select:(paffinity) Query of component [linux] set priority to 10
>>> [node25:23242] mca:base:select:(paffinity) Selected component [linux]
>>> [node25:23243] mca:base:select:(paffinity) Querying component [linux]
>>> [node25:23243] mca:base:select:(paffinity) Query of component [linux] set priority to 10
>>> [node25:23243] mca:base:select:(paffinity) Selected component [linux]
>>> ------------- RIGHT -----------------
>>>
>>> Apparently, only the master process ([node01:23174] and [node25:23241])
>>> sets the paffinity in the RIGHT case, but in the WRONG case the compute
>>> processes ([node01:23175] and [node01:23176], rank 0 and rank 1) also try
>>> to set their own paffinity properties.
>>>
>>> Note that the slot=0:* notation in the rankfile does not work either. But
>>> that seems to have a different origin, as it tries to bind to core #4,
>>> whereas there are just cores 0-3.
>>> rank 0=node01 slot=0:*
>>> rank 1=node01 slot=0:*
>>>
>>> Thanks for your help on this!
>>>
>>> --
>>> Daan van Rossum
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> --
> Daan van Rossum
>
> University of Chicago
> Department of Astronomy and Astrophysics
> 5640 S. Ellis Ave
> Chicago, IL 60637
> phone: 773-7020624
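
[Editor's note: the two-line test code Daan describes (just mpi_init followed by mpi_finalize) corresponds to something like the following C sketch. Building it requires an MPI compiler wrapper such as mpicc, and the rankfile and binary names in the usage line below are placeholders, not taken from the thread.]

```c
/* Minimal reproducer as described in the report: with a faulty rankfile
 * the segfault happens inside MPI_Init(), before any user code runs,
 * so no further MPI calls are needed to trigger it. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);   /* crash reported here when the bad slot list is applied */
    MPI_Finalize();
    return 0;
}
```

Compiled with mpicc and launched as, e.g., `mpirun -np 2 --rankfile myrankfile ./reproducer`, this is enough to exercise the paffinity slot-assignment path discussed above.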