Hi, can you provide the output of "$ cat /proc/cpuinfo"? I am not optimistic that it will help, but still... Thanks, Lenny.
On Wed, Dec 16, 2009 at 6:01 PM, Daan van Rossum <d...@flash.uchicago.edu> wrote:

> Hi Terry,
>
> Thanks for your hint. I tried configure --enable-debug and even compiled it
> with all kinds of manual debug flags turned on, but it doesn't help to get
> rid of this problem. So it definitely is not an optimization flaw.
> One more interesting test would be to try an older version of the Intel
> compiler. But the next older version that I have is 10.0.015, which is too
> old for the operating system (must be >10.1).
>
> A good thing is that this bug is very easy to test. You only need one line
> of MPI code and one process in the execution.
>
> A few more test cases:
> rank 0=node01 slot=1-7
> and
> rank 0=node01 slot=0,2-7
> and
> rank 0=node01 slot=0-1,3-7
> work WELL.
> But
> rank 0=node01 slot=0-2,4-7
> FAILS.
>
> As long as either slot 0, 1, OR 2 is excluded from the list, it's all right.
> Excluding a different slot, like slot 3, does not help.
>
> I'll try to get hold of an Intel v10.1 compiler version.
>
> Best,
> Daan
>
> * on Monday, 14.12.09 at 14:57, Terry Dontje <terry.don...@sun.com> wrote:
>
> > I don't really want to throw FUD on this list, but we've seen all
> > sorts of oddities with OMPI 1.3.4 being built with Intel's 11.1
> > compiler versus their 11.0 or other compilers (gcc, Sun Studio, pgi,
> > and pathscale). I have not tested your specific failing case, but
> > considering your issue doesn't show up with gcc, I am wondering if
> > there is some sort of optimization issue with the 11.1 compiler.
> >
> > It might be interesting to see if using certain optimization levels
> > with the Intel 11.1 compiler produces a working OMPI library.
> >
> > --td
> >
> > Daan van Rossum wrote:
> > >Hi Ralph,
> > >
> > >I took the Dec 10th snapshot, but got exactly the same behavior as with version 1.3.4.
> > >
> > >I just noticed that even this rankfile doesn't work, with a single process:
> > >  rank 0=node01 slot=0-3
> > >
> > >------------
> > >[node01:31105] mca:base:select:(paffinity) Querying component [linux]
> > >[node01:31105] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> > >[node01:31105] mca:base:select:(paffinity) Selected component [linux]
> > >[node01:31105] paffinity slot assignment: slot_list == 0-3
> > >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> > >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> > >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> > >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> > >[node01:31106] mca:base:select:(paffinity) Querying component [linux]
> > >[node01:31106] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> > >[node01:31106] mca:base:select:(paffinity) Selected component [linux]
> > >[node01:31106] paffinity slot assignment: slot_list == 0-3
> > >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> > >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> > >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> > >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> > >[node01:31106] *** An error occurred in MPI_Comm_rank
> > >[node01:31106] *** on a NULL communicator
> > >[node01:31106] *** Unknown error
> > >[node01:31106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> > >forrtl: severe (174): SIGSEGV, segmentation fault occurred
> > >------------
> > >
> > >The spawned compute process doesn't sense that it should skip setting the paffinity...
> > >
> > >I saw the posting from last July about a similar problem (the problem that I mentioned at the bottom, with the slot=0:* notation not working). But that is a different problem (besides, that is still not working, as it seems).
> > >
> > >Best,
> > >Daan
> > >
> > >* on Saturday, 12.12.09 at 18:48, Ralph Castain <r...@open-mpi.org> wrote:
> > >
> > >>This looks like an uninitialized variable that gnu c handles one way and intel another. Someone recently contributed a patch to the ompi trunk to fix just such a thing in this code area - don't know if it addresses this problem or not.
> > >>
> > >>Can you try the ompi trunk (a nightly tarball from the last day or so forward) and see if this still occurs?
> > >>
> > >>Thanks
> > >>Ralph
> > >>
> > >>On Dec 11, 2009, at 4:06 PM, Daan van Rossum wrote:
> > >>
> > >>>Hi all,
> > >>>
> > >>>There's a problem with ompi 1.3.4 when compiled with the intel 11.1.059 c compiler, related to the built-in processor binding functionality. The problem does not occur when ompi is compiled with the gnu c compiler.
> > >>>
> > >>>An MPI program execution fails (segfault) in mpi_init() when the following rank file is used:
> > >>>rank 0=node01 slot=0-3
> > >>>rank 1=node01 slot=0-3
> > >>>but runs fine with:
> > >>>rank 0=node01 slot=0
> > >>>rank 1=node01 slot=1-3
> > >>>and fine with:
> > >>>rank 0=node01 slot=0-1
> > >>>rank 1=node01 slot=1-3
> > >>>but segfaults with:
> > >>>rank 0=node01 slot=0-2
> > >>>rank 1=node01 slot=1-3
> > >>>
> > >>>This is on a two-processor quad-core opteron machine (it occurs on all nodes of the cluster) with Ubuntu 8.10, kernel 2.6.27-16.
> > >>>This is the simplest case that fails. Generally, I would like to bind processes to physical procs but always allow any core, like
> > >>>rank 0=node01 slot=p0:0-3
> > >>>rank 1=node01 slot=p0:0-3
> > >>>rank 2=node01 slot=p0:0-3
> > >>>rank 3=node01 slot=p0:0-3
> > >>>rank 4=node01 slot=p1:0-3
> > >>>rank 5=node01 slot=p1:0-3
> > >>>rank 6=node01 slot=p1:0-3
> > >>>rank 7=node01 slot=p1:0-3
> > >>>which fails too.
> > >>>
> > >>>This happens with a test code that contains only two lines of code, calling mpi_init and mpi_finalize subsequently, and it happens both in fortran and in c.
> > >>>
> > >>>One more interesting thing is that the problem with setting the process affinity does not occur on our four-processor quad-core opteron nodes, with exactly the same OS etc.
> > >>>
> > >>>Setting "--mca paffinity_base_verbose 5" shows what is going wrong for this rankfile:
> > >>>rank 0=node01 slot=0-3
> > >>>rank 1=node01 slot=0-3
> > >>>------------- WRONG -----------------
> > >>>[node01:23174] mca:base:select:(paffinity) Querying component [linux]
> > >>>[node01:23174] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> > >>>[node01:23174] mca:base:select:(paffinity) Selected component [linux]
> > >>>[node01:23174] paffinity slot assignment: slot_list == 0-3
> > >>>[node01:23174] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> > >>>[node01:23174] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> > >>>[node01:23174] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> > >>>[node01:23174] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> > >>>[node01:23174] paffinity slot assignment: slot_list == 0-3
> > >>>[node01:23174] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
> > >>>[node01:23174] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
> > >>>[node01:23174] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
> > >>>[node01:23174] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
> > >>>[node01:23175] mca:base:select:(paffinity) Querying component [linux]
> > >>>[node01:23175] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> > >>>[node01:23175] mca:base:select:(paffinity) Selected component [linux]
> > >>>[node01:23176] mca:base:select:(paffinity) Querying component [linux]
> > >>>[node01:23176] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> > >>>[node01:23176] mca:base:select:(paffinity) Selected component [linux]
> > >>>[node01:23175] paffinity slot assignment: slot_list == 0-3
> > >>>[node01:23175] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> > >>>[node01:23175] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> > >>>[node01:23175] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> > >>>[node01:23175] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> > >>>[node01:23176] paffinity slot assignment: slot_list == 0-3
> > >>>[node01:23176] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
> > >>>[node01:23176] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
> > >>>[node01:23176] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
> > >>>[node01:23176] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
> > >>>[node01:23175] *** Process received signal ***
> > >>>[node01:23176] *** Process received signal ***
> > >>>[node01:23175] Signal: Segmentation fault (11)
> > >>>[node01:23175] Signal code: Address not mapped (1)
> > >>>[node01:23175] Failing at address: 0x30
> > >>>[node01:23176] Signal: Segmentation fault (11)
> > >>>[node01:23176] Signal code: Address not mapped (1)
> > >>>[node01:23176] Failing at address: 0x30
> > >>>------------- WRONG -----------------
> > >>>
> > >>>------------- RIGHT -----------------
> > >>>[node25:23241] mca:base:select:(paffinity) Querying component [linux]
> > >>>[node25:23241] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> > >>>[node25:23241] mca:base:select:(paffinity) Selected component [linux]
> > >>>[node25:23241] paffinity slot assignment: slot_list == 0-3
> > >>>[node25:23241] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> > >>>[node25:23241] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> > >>>[node25:23241] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> > >>>[node25:23241] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> > >>>[node25:23241] paffinity slot assignment: slot_list == 0-3
> > >>>[node25:23241] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
> > >>>[node25:23241] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
> > >>>[node25:23241] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
> > >>>[node25:23241] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
> > >>>[node25:23242] mca:base:select:(paffinity) Querying component [linux]
> > >>>[node25:23242] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> > >>>[node25:23242] mca:base:select:(paffinity) Selected component [linux]
> > >>>[node25:23243] mca:base:select:(paffinity) Querying component [linux]
> > >>>[node25:23243] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> > >>>[node25:23243] mca:base:select:(paffinity) Selected component [linux]
> > >>>------------- RIGHT -----------------
> > >>>
> > >>>Apparently, only the master process (ID [node01:23174] and [node25:23241]) sets the paffinity in the RIGHT case, but in the WRONG case the compute processes ([node01:23175] and [node01:23176], rank 0 and rank 1) also try to set their own paffinity properties.
> > >>>
> > >>>Note that the following notation for the rankfile also does not work. But that seems to have a different origin, as it tries to bind to core #4, whereas there are just 0-3.
> > >>>rank 0=node01 slot=0:*
> > >>>rank 1=node01 slot=0:*
> > >>>
> > >>>Thanks for your help on this!
> > >>>
> > >>>--
> > >>>Daan van Rossum
> > >
> > >--
> > >Daan van Rossum
> > >
> > >University of Chicago
> > >Department of Astronomy and Astrophysics
> > >5640 S. Ellis Ave
> > >Chicago, IL 60637
> > >phone: 773-7020624
>
> --
> Daan van Rossum
>
> University of Chicago
> Department of Astronomy and Astrophysics
> 5640 S. Ellis Ave
> Chicago, IL 60637
> phone: 773-7020624
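For reference, the minimal test case Daan describes above (an mpi_init followed by an mpi_finalize, nothing else) would look roughly like this in C; the file name "repro.c" is just a placeholder:

    /* repro.c - minimal reproducer: initialize and finalize MPI, do nothing else */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);   /* the reported segfault occurs during init */
        MPI_Finalize();
        return 0;
    }

Assuming the usual Open MPI 1.3 command-line options, building it with the Intel-compiled mpicc and launching a single process with one of the failing rankfiles (e.g. a file "rankfile" containing the line "rank 0=node01 slot=0-3") while the verbose paffinity output is enabled would be something like:

    mpicc repro.c -o repro
    mpirun -np 1 -rf rankfile --mca paffinity_base_verbose 5 ./repro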