I'll have to look through the logic and see if I can spot something obvious. I 
don't have access to an Intel compiler, and as you noted it works fine with 
gcc. I'm afraid I can't do much more than that, so this may take a while and 
may not produce a positive result.


On Dec 14, 2009, at 12:32 PM, Daan van Rossum wrote:

> Hi Ralph,
> 
> I took the Dec 10th snapshot, but got exactly the same behavior as with 
> version 1.3.4.
> 
> I just noticed that even this rankfile doesn't work, with a single process:
> rank 0=node01 slot=0-3
> 
> ------------
> [node01:31105] mca:base:select:(paffinity) Querying component [linux]
> [node01:31105] mca:base:select:(paffinity) Query of component [linux] set 
> priority to 10
> [node01:31105] mca:base:select:(paffinity) Selected component [linux]
> [node01:31105] paffinity slot assignment: slot_list == 0-3
> [node01:31105] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> [node01:31105] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> [node01:31105] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> [node01:31105] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> [node01:31106] mca:base:select:(paffinity) Querying component [linux]
> [node01:31106] mca:base:select:(paffinity) Query of component [linux] set 
> priority to 10
> [node01:31106] mca:base:select:(paffinity) Selected component [linux]
> [node01:31106] paffinity slot assignment: slot_list == 0-3
> [node01:31106] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> [node01:31106] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> [node01:31106] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> [node01:31106] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> [node01:31106] *** An error occurred in MPI_Comm_rank
> [node01:31106] *** on a NULL communicator
> [node01:31106] *** Unknown error
> [node01:31106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> ------------
> 
> The spawned compute process doesn't sense that it should skip setting 
> paffinity...
> 
> 
> I saw the posting from last July about a similar problem (the one I 
> mentioned at the bottom, with the slot=0:* notation not working). But that is 
> a different problem (and besides, it still does not seem to work).
> 
> Best,
> Daan
> 
> * on Saturday, 12.12.09 at 18:48, Ralph Castain <r...@open-mpi.org> wrote:
> 
>> This looks like an uninitialized variable that gnu c handles one way and 
>> intel another. Someone recently contributed a patch to the ompi trunk to fix 
>> just such a thing in this code area - don't know if it addresses this 
>> problem or not.
>> 
>> Can you try the ompi trunk (a nightly tarball from the last day or so 
>> forward) and see if this still occurs?
>> 
>> Thanks
>> Ralph
>> 
>> On Dec 11, 2009, at 4:06 PM, Daan van Rossum wrote:
>> 
>>> Hi all,
>>> 
>>> There's a problem with ompi 1.3.4 when compiled with the intel 11.1.059 c 
>>> compiler, related to the built-in processor binding functionality. The 
>>> problem does not occur when ompi is compiled with the gnu c compiler.
>>> 
>>> An MPI program fails (segfault) in mpi_init() when the following 
>>> rank file is used:
>>> rank 0=node01 slot=0-3
>>> rank 1=node01 slot=0-3
>>> but runs fine with:
>>> rank 0=node01 slot=0
>>> rank 1=node01 slot=1-3
>>> and fine with:
>>> rank 0=node01 slot=0-1
>>> rank 1=node01 slot=1-3
>>> but segfaults with:
>>> rank 0=node01 slot=0-2
>>> rank 1=node01 slot=1-3
>>> 
>>> This is on a two-processor quad-core opteron machine (occurs on all nodes 
>>> of the cluster) with Ubuntu 8.10, kernel 2.6.27-16.
>>> This is the simplest case that fails. Generally, I would like to bind 
>>> processes to physical processors but always allow any core, like
>>> rank 0=node01 slot=p0:0-3
>>> rank 1=node01 slot=p0:0-3
>>> rank 2=node01 slot=p0:0-3
>>> rank 3=node01 slot=p0:0-3
>>> rank 4=node01 slot=p1:0-3
>>> rank 5=node01 slot=p1:0-3
>>> rank 6=node01 slot=p1:0-3
>>> rank 7=node01 slot=p1:0-3
>>> which fails too.
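
To illustrate how the plain slot lists in the rankfiles above expand to the CPU numbers that show up as "rank N runs on cpu #K" in the verbose output, here is a minimal Python sketch. It is not Open MPI's parser: the names expand_slot_list and parse_rankfile are hypothetical, and socket-qualified forms such as p0:0-3 or 0:* are rejected rather than interpreted.

```python
import re

# Hypothetical pattern for lines like "rank 0=node01 slot=0-3".
RANK_LINE = re.compile(r"^rank\s+(\d+)\s*=\s*(\S+)\s+slot=(\S+)$")


def expand_slot_list(spec):
    """Expand a plain core list such as "0-3" or "0,2-3" into CPU numbers.

    Assumption: socket-qualified specs ("p0:0-3", "0:*") are out of scope
    for this sketch and raise ValueError.
    """
    if ":" in spec or "*" in spec:
        raise ValueError(f"socket-qualified slot spec not handled: {spec!r}")
    cpus = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def parse_rankfile(text):
    """Return {rank: (host, [cpu, ...])} for plain-core rankfile lines."""
    layout = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = RANK_LINE.match(line)
        if m is None:
            raise ValueError(f"unrecognized rankfile line: {line!r}")
        rank, host, spec = m.groups()
        layout[int(rank)] = (host, expand_slot_list(spec))
    return layout


# One of the rankfiles reported above as segfaulting.
layout = parse_rankfile("rank 0=node01 slot=0-2\nrank 1=node01 slot=1-3")
for rank, (host, cpus) in sorted(layout.items()):
    print(rank, host, cpus)
```

This only makes the requested CPU sets explicit for comparison between the working and failing rankfiles; it says nothing about where the segfault comes from.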
>>> 
>>> This happens with a test code that contains only two lines of code, calling 
>>> mpi_init and then mpi_finalize, and happens in both fortran and 
>>> c.
>>> 
>>> One more interesting thing is that the problem with setting the process 
>>> affinity does not occur on our four-processor quad-core opteron nodes, with 
>>> exactly the same OS etc.
>>> 
>>> 
>>> Setting "--mca paffinity_base_verbose 5" shows what is going wrong for this 
>>> rankfile:
>>> rank 0=node01 slot=0-3
>>> rank 1=node01 slot=0-3
>>> ------------- WRONG -----------------
>>> [node01:23174] mca:base:select:(paffinity) Querying component [linux]
>>> [node01:23174] mca:base:select:(paffinity) Query of component [linux] set 
>>> priority to 10
>>> [node01:23174] mca:base:select:(paffinity) Selected component [linux]
>>> [node01:23174] paffinity slot assignment: slot_list == 0-3
>>> [node01:23174] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
>>> [node01:23174] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
>>> [node01:23174] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
>>> [node01:23174] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
>>> [node01:23174] paffinity slot assignment: slot_list == 0-3
>>> [node01:23174] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
>>> [node01:23174] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
>>> [node01:23174] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
>>> [node01:23174] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
>>> [node01:23175] mca:base:select:(paffinity) Querying component [linux]
>>> [node01:23175] mca:base:select:(paffinity) Query of component [linux] set 
>>> priority to 10
>>> [node01:23175] mca:base:select:(paffinity) Selected component [linux]
>>> [node01:23176] mca:base:select:(paffinity) Querying component [linux]
>>> [node01:23176] mca:base:select:(paffinity) Query of component [linux] set 
>>> priority to 10
>>> [node01:23176] mca:base:select:(paffinity) Selected component [linux]
>>> [node01:23175] paffinity slot assignment: slot_list == 0-3
>>> [node01:23175] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
>>> [node01:23175] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
>>> [node01:23175] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
>>> [node01:23175] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
>>> [node01:23176] paffinity slot assignment: slot_list == 0-3
>>> [node01:23176] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
>>> [node01:23176] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
>>> [node01:23176] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
>>> [node01:23176] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
>>> [node01:23175] *** Process received signal ***
>>> [node01:23176] *** Process received signal ***
>>> [node01:23175] Signal: Segmentation fault (11)
>>> [node01:23175] Signal code: Address not mapped (1)
>>> [node01:23175] Failing at address: 0x30
>>> [node01:23176] Signal: Segmentation fault (11)
>>> [node01:23176] Signal code: Address not mapped (1)
>>> [node01:23176] Failing at address: 0x30
>>> ------------- WRONG -----------------
>>> 
>>> ------------- RIGHT -----------------
>>> [node25:23241] mca:base:select:(paffinity) Querying component [linux]
>>> [node25:23241] mca:base:select:(paffinity) Query of component [linux] set 
>>> priority to 10
>>> [node25:23241] mca:base:select:(paffinity) Selected component [linux]
>>> [node25:23241] paffinity slot assignment: slot_list == 0-3
>>> [node25:23241] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
>>> [node25:23241] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
>>> [node25:23241] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
>>> [node25:23241] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
>>> [node25:23241] paffinity slot assignment: slot_list == 0-3
>>> [node25:23241] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
>>> [node25:23241] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
>>> [node25:23241] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
>>> [node25:23241] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
>>> [node25:23242] mca:base:select:(paffinity) Querying component [linux]
>>> [node25:23242] mca:base:select:(paffinity) Query of component [linux] set 
>>> priority to 10
>>> [node25:23242] mca:base:select:(paffinity) Selected component [linux]
>>> [node25:23243] mca:base:select:(paffinity) Querying component [linux]
>>> [node25:23243] mca:base:select:(paffinity) Query of component [linux] set 
>>> priority to 10
>>> [node25:23243] mca:base:select:(paffinity) Selected component [linux]
>>> ------------- RIGHT -----------------
>>> 
>>> Apparently, only the master process (IDs [node01:23174] and [node25:23241]) 
>>> sets the paffinity in the RIGHT case, but in the WRONG case the 
>>> compute processes ([node01:23175] and [node01:23176], rank0 and rank1) also 
>>> try to set their own paffinity properties.
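
The comparison described above (which process IDs emit "paffinity slot assignment" lines) can be checked mechanically. The following is a minimal Python sketch under the assumption that the verbose output keeps the "[host:pid] paffinity slot assignment: rank N runs on cpu #K" layout shown in the logs; the function name assignments_by_process is hypothetical, and the sample is an abbreviated excerpt of the WRONG log.

```python
import re
from collections import defaultdict

# Matches lines such as:
# [node01:23175] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
ASSIGN = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\] paffinity slot assignment: "
    r"rank (?P<rank>\d+) runs on cpu #(?P<cpu>\d+)"
)


def assignments_by_process(log_text):
    """Map (host, pid) -> {rank: [cpu, ...]} from verbose paffinity output."""
    seen = defaultdict(lambda: defaultdict(list))
    for m in ASSIGN.finditer(log_text):
        key = (m["host"], int(m["pid"]))
        seen[key][int(m["rank"])].append(int(m["cpu"]))
    return {key: dict(ranks) for key, ranks in seen.items()}


# Abbreviated excerpt of the WRONG log (one cpu line per rank and process).
sample = """\
[node01:23174] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
[node01:23174] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
[node01:23175] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
[node01:23176] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
"""

result = assignments_by_process(sample)
for (host, pid), ranks in sorted(result.items()):
    print(host, pid, ranks)
```

In the RIGHT log only one process ID would appear in the result; in the WRONG log the two additional process IDs each report a single rank, which is the pattern pointed out above.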
>>> 
>>> 
>>> 
>>> Note that the following rankfile notation does not work either. But that 
>>> seems to have a different origin, as it tries to bind to core #4, whereas 
>>> there are only cores 0-3.
>>> rank 0=node01 slot=0:*
>>> rank 1=node01 slot=0:*
>>> 
>>> 
>>> Thanks for your help on this!
>>> 
>>> --
>>> Daan van Rossum
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> 
> 
> --
> Daan van Rossum
> 
> University of Chicago
> Department of Astronomy and Astrophysics
> 5640 S. Ellis Ave
> Chicago, IL 60637
> phone: 773-7020624