Hi Ralph,

I took the Dec 10th snapshot, but got exactly the same behavior as with version 
1.3.4.

I just noticed that even this rankfile doesn't work, with a single process:
 rank 0=node01 slot=0-3

------------
[node01:31105] mca:base:select:(paffinity) Querying component [linux]
[node01:31105] mca:base:select:(paffinity) Query of component [linux] set 
priority to 10
[node01:31105] mca:base:select:(paffinity) Selected component [linux]
[node01:31105] paffinity slot assignment: slot_list == 0-3
[node01:31105] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
[node01:31105] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
[node01:31105] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
[node01:31105] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
[node01:31106] mca:base:select:(paffinity) Querying component [linux]
[node01:31106] mca:base:select:(paffinity) Query of component [linux] set 
priority to 10
[node01:31106] mca:base:select:(paffinity) Selected component [linux]
[node01:31106] paffinity slot assignment: slot_list == 0-3
[node01:31106] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
[node01:31106] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
[node01:31106] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
[node01:31106] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
[node01:31106] *** An error occurred in MPI_Comm_rank
[node01:31106] *** on a NULL communicator
[node01:31106] *** Unknown error
[node01:31106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
forrtl: severe (174): SIGSEGV, segmentation fault occurred
------------
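
For reference, the expansion of the rankfile line into the per-cpu assignments reported in the log above can be sketched in a few lines of Python. This is only an illustration, not Open MPI's actual rankfile parser: the names expand_slot_list and parse_rankfile_line are hypothetical, and only plain lists and ranges (0, 0-3, 0,2-3) are handled, not the socket forms such as p0:0-3 or 0:*.

```python
import re


def expand_slot_list(slot_list):
    """Expand a plain slot list such as "0-3" or "0,2-3" into CPU ids.

    Illustrative only: socket notations ("p0:0-3", "0:*") are not
    supported here and raise ValueError.
    """
    cpus = []
    for part in slot_list.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-", 1))
            if lo > hi:
                raise ValueError(f"descending range: {part!r}")
            cpus.extend(range(lo, hi + 1))
        else:
            cpus.append(int(part))
    return cpus


# Hypothetical pattern for lines of the form "rank 0=node01 slot=0-3".
_RANK_LINE = re.compile(r"^\s*rank\s+(\d+)\s*=\s*(\S+)\s+slot\s*=\s*(\S+)\s*$")


def parse_rankfile_line(line):
    """Return (rank, host, [cpu ids]) for one rankfile line."""
    m = _RANK_LINE.match(line)
    if m is None:
        raise ValueError(f"not a rankfile line: {line!r}")
    rank, host, slot_list = m.groups()
    return int(rank), host, expand_slot_list(slot_list)


if __name__ == "__main__":
    rank, host, cpus = parse_rankfile_line("rank 0=node01 slot=0-3")
    for cpu in cpus:
        print(f"rank {rank} on {host}: cpu #{cpu}")
```

Run on the single-line rankfile above, this lists cpus #0 through #3 for rank 0, the same set that the slot assignment lines in the log report.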

The spawned compute process doesn't sense that it should skip setting 
paffinity...
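
A way to check, independently of the paffinity debug output, whether a process ended up bound is to query its CPU affinity mask from the OS. A minimal sketch, assuming Linux and a Python interpreter on the node; os.sched_getaffinity is part of the Python standard library on platforms that support it, while the helper name current_affinity is made up for this example.

```python
import os


def current_affinity(pid=0):
    """Return the sorted CPU ids a process may run on.

    pid=0 means the calling process. Returns None on platforms where
    the standard library does not provide os.sched_getaffinity.
    """
    if not hasattr(os, "sched_getaffinity"):
        return None
    return sorted(os.sched_getaffinity(pid))


if __name__ == "__main__":
    cpus = current_affinity()
    if cpus is None:
        print("CPU affinity query not supported on this platform")
    else:
        print(f"pid {os.getpid()} may run on cpus {cpus}")
```

Passing the pid of a running compute process instead of 0 shows the mask of that process, which can be compared with the cpus listed for its rank in the rankfile.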


I saw the posting from last July about a similar problem (the problem that I 
mentioned at the bottom, with the slot=0:* notation not working). But that is a 
different problem (besides, it seems that is still not working).

Best,
Daan

* on Saturday, 12.12.09 at 18:48, Ralph Castain <r...@open-mpi.org> wrote:

> This looks like an uninitialized variable that gnu c handles one way and 
> intel another. Someone recently contributed a patch to the ompi trunk to fix 
> just such a thing in this code area - don't know if it addresses this 
> problem or not.
> 
> Can you try the ompi trunk (a nightly tarball from the last day or so 
> forward) and see if this still occurs?
> 
> Thanks
> Ralph
> 
> On Dec 11, 2009, at 4:06 PM, Daan van Rossum wrote:
> 
> > Hi all,
> > 
> > There's a problem with ompi 1.3.4 when compiled with the intel 11.1.059 c 
> > compiler, related to the built-in processor binding functionality. The 
> > problem does not occur when ompi is compiled with the gnu c compiler.
> > 
> > An MPI program fails (segfault) in mpi_init() when the following 
> > rank file is used:
> > rank 0=node01 slot=0-3
> > rank 1=node01 slot=0-3
> > but runs fine with:
> > rank 0=node01 slot=0
> > rank 1=node01 slot=1-3
> > and fine with:
> > rank 0=node01 slot=0-1
> > rank 1=node01 slot=1-3
> > but segfaults with:
> > rank 0=node01 slot=0-2
> > rank 1=node01 slot=1-3
> > 
> > This is on a two-processor quad-core opteron machine (occurs on all nodes 
> > of the cluster) with Ubuntu 8.10, kernel 2.6.27-16.
> > This is the simplest case that fails. Generally, I would like to bind 
> > processes to physical processors but always allow any core, like
> > rank 0=node01 slot=p0:0-3
> > rank 1=node01 slot=p0:0-3
> > rank 2=node01 slot=p0:0-3
> > rank 3=node01 slot=p0:0-3
> > rank 4=node01 slot=p1:0-3
> > rank 5=node01 slot=p1:0-3
> > rank 6=node01 slot=p1:0-3
> > rank 7=node01 slot=p1:0-3
> > which fails too.
> > 
> > This happens with a test code that contains only two lines of code, calling 
> > mpi_init and mpi_finalize in succession, and happens in both fortran and 
> > c.
> > 
> > One more interesting thing is that the problem with setting the process 
> > affinity does not occur on our four-processor quad-core opteron nodes, with 
> > exactly the same OS etc.
> > 
> > 
> > Setting "--mca paffinity_base_verbose 5" shows what is going wrong for this 
> > rankfile:
> > rank 0=node01 slot=0-3
> > rank 1=node01 slot=0-3
> > ------------- WRONG -----------------
> > [node01:23174] mca:base:select:(paffinity) Querying component [linux]
> > [node01:23174] mca:base:select:(paffinity) Query of component [linux] set 
> > priority to 10
> > [node01:23174] mca:base:select:(paffinity) Selected component [linux]
> > [node01:23174] paffinity slot assignment: slot_list == 0-3
> > [node01:23174] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> > [node01:23174] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> > [node01:23174] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> > [node01:23174] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> > [node01:23174] paffinity slot assignment: slot_list == 0-3
> > [node01:23174] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
> > [node01:23174] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
> > [node01:23174] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
> > [node01:23174] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
> > [node01:23175] mca:base:select:(paffinity) Querying component [linux]
> > [node01:23175] mca:base:select:(paffinity) Query of component [linux] set 
> > priority to 10
> > [node01:23175] mca:base:select:(paffinity) Selected component [linux]
> > [node01:23176] mca:base:select:(paffinity) Querying component [linux]
> > [node01:23176] mca:base:select:(paffinity) Query of component [linux] set 
> > priority to 10
> > [node01:23176] mca:base:select:(paffinity) Selected component [linux]
> > [node01:23175] paffinity slot assignment: slot_list == 0-3
> > [node01:23175] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> > [node01:23175] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> > [node01:23175] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> > [node01:23175] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> > [node01:23176] paffinity slot assignment: slot_list == 0-3
> > [node01:23176] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
> > [node01:23176] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
> > [node01:23176] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
> > [node01:23176] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
> > [node01:23175] *** Process received signal ***
> > [node01:23176] *** Process received signal ***
> > [node01:23175] Signal: Segmentation fault (11)
> > [node01:23175] Signal code: Address not mapped (1)
> > [node01:23175] Failing at address: 0x30
> > [node01:23176] Signal: Segmentation fault (11)
> > [node01:23176] Signal code: Address not mapped (1)
> > [node01:23176] Failing at address: 0x30
> > ------------- WRONG -----------------
> > 
> > ------------- RIGHT -----------------
> > [node25:23241] mca:base:select:(paffinity) Querying component [linux]
> > [node25:23241] mca:base:select:(paffinity) Query of component [linux] set 
> > priority to 10
> > [node25:23241] mca:base:select:(paffinity) Selected component [linux]
> > [node25:23241] paffinity slot assignment: slot_list == 0-3
> > [node25:23241] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> > [node25:23241] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> > [node25:23241] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> > [node25:23241] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> > [node25:23241] paffinity slot assignment: slot_list == 0-3
> > [node25:23241] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
> > [node25:23241] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
> > [node25:23241] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
> > [node25:23241] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
> > [node25:23242] mca:base:select:(paffinity) Querying component [linux]
> > [node25:23242] mca:base:select:(paffinity) Query of component [linux] set 
> > priority to 10
> > [node25:23242] mca:base:select:(paffinity) Selected component [linux]
> > [node25:23243] mca:base:select:(paffinity) Querying component [linux]
> > [node25:23243] mca:base:select:(paffinity) Query of component [linux] set 
> > priority to 10
> > [node25:23243] mca:base:select:(paffinity) Selected component [linux]
> > ------------- RIGHT -----------------
> > 
> > Apparently, only the master process (ID [node01:23174] and [node25:23241]) 
> > sets the paffinity in the RIGHT case, but in the WRONG case the compute 
> > processes ([node01:23175] and [node01:23176], rank0 and rank1) also try 
> > to set their own paffinity properties.
> > 
> > 
> > 
> > Note that the slot=0:* notation in the rankfile does not work either. But 
> > that seems to have a different origin, as it tries to bind to core #4, 
> > whereas there are just cores 0-3.
> > rank 0=node01 slot=0:*
> > rank 1=node01 slot=0:*
> > 
> > 
> > Thanks for your help on this!
> > 
> > --
> > Daan van Rossum
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 

--
Daan van Rossum

University of Chicago
Department of Astronomy and Astrophysics
5640 S. Ellis Ave
Chicago, IL 60637
phone: 773-7020624
