Ah, this is a fairly old kernel -- it does not support the topology stuff.
So in this case, logical and physical IDs should be the same. Hmm.
Need to think about that...
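For reference, I believe PLPA decides this by looking for the per-CPU
topology files that newer kernels expose under sysfs. A minimal
standalone check (just a sketch, not PLPA's actual code) would be
something like:

  /* Sketch: a kernel with topology support exposes per-CPU files under
   * /sys/devices/system/cpu/cpuN/topology/; a 2.6.5-era kernel does not. */
  #include <stdio.h>
  #include <sys/stat.h>

  int main(void)
  {
      struct stat st;
      const char *path =
          "/sys/devices/system/cpu/cpu0/topology/physical_package_id";

      if (stat(path, &st) == 0) {
          printf("kernel topology support: yes\n");
      } else {
          printf("kernel topology support: no (%s not found)\n", path);
      }
      return 0;
  }

That would match the "Kernel topology support: no" in the plpa-info
output below.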
On Aug 22, 2008, at 8:47 AM, Camille Coti wrote:
inria@behemoth:~$ uname -a
Linux behemoth 2.6.5-7.283-sn2 #1 SMP Wed Nov 29 16:55:53 UTC 2006
ia64 ia64 ia64 GNU/Linux
I am not sure the output of plpa-info --topo gives good news...
inria@behemoth:~$ plpa-info --topo
Kernel affinity support: yes
Kernel topology support: no
Number of processor sockets: unknown
Kernel topology not supported -- cannot show topology information
Camille
Jeff Squyres wrote:
Camille --
Can you also send the output of "uname -a"?
Also, just to be absolutely sure, let's check that PLPA is doing
the Right thing here (we don't think this is the problem, but it's
worth checking). Grab the latest beta:
http://www.open-mpi.org/software/plpa/v1.2/
It's a very small package and easy to install under your $HOME (or
whatever).
Can you send the output of "plpa-info --topo"?
On Aug 22, 2008, at 7:00 AM, Camille Coti wrote:
Actually, I have tried with several versions, since you were
working on the affinity thing. I tried with revision 19103 a
couple of weeks ago, and the problem was already there.
Part of /proc/cpuinfo is below:
processor : 0
vendor : GenuineIntel
arch : IA-64
family : Itanium 2
model : 0
revision : 7
archrev : 0
features : branchlong
cpu number : 0
cpu regs : 4
cpu MHz : 900.000000
itc MHz : 900.000000
BogoMIPS : 1325.40
siblings : 1
The machine is a 60-way Altix machine, so you have 60 times this
information in /proc/cpuinfo (yes, 60, not 64).
Camille
Ralph Castain wrote:
I believe I have found the problem, and it may indeed relate to
the change in paffinity. By any chance, do you have unfilled
sockets on that machine? Could you provide the output from
something like "cat /proc/cpuinfo" (or the equiv for your system)
so we could see what physical processors and sockets are present?
If I'm correct as to the problem, here is the issue. OMPI has
(until now) always assumed that the #logical processors (or
sockets, or cores) was the same as the #physical processors (or
sockets, or cores). As a result, several key subsystems were
written without making any distinction as to which (logical vs
physical) they were referring to. This was no problem until we
recently encountered systems with "holes" in their layout - a
processor turned "off", an unpopulated socket, etc.
In this case, the local processor id no longer matches the
physical processor id (ditto for sockets and cores). We adjusted
the paffinity subsystem to deal with it - it took much more effort
than we would have liked, and exposed lots of inconsistencies in
how the base operating systems handle such situations.
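To make the logical vs physical distinction concrete, here is a small
illustrative sketch (not our actual paffinity code) that builds a
logical-to-physical map by scanning the "processor" entries in
/proc/cpuinfo; on a fully populated machine the two columns are
identical, while on a machine with holes they diverge:

  /* Illustrative sketch only (not OMPI's paffinity code): map a dense
   * logical processor index -> physical processor ID from /proc/cpuinfo. */
  #include <stdio.h>

  #define MAX_CPUS 256

  int main(void)
  {
      FILE *fp = fopen("/proc/cpuinfo", "r");
      int phys[MAX_CPUS];   /* phys[logical index] = physical ID */
      int nlogical = 0, i, id;
      char line[256];

      if (fp == NULL) {
          perror("fopen");
          return 1;
      }
      while (fgets(line, sizeof(line), fp) != NULL && nlogical < MAX_CPUS) {
          /* every online processor contributes one "processor : N" line */
          if (sscanf(line, "processor : %d", &id) == 1) {
              phys[nlogical++] = id;
          }
      }
      fclose(fp);

      for (i = 0; i < nlogical; ++i) {
          /* with "holes", phys[i] != i for some i */
          printf("logical %d -> physical %d\n", i, phys[i]);
      }
      return 0;
  }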
Unfortunately, having gotten that straightened out, it is
possible that you have uncovered a similar inconsistency in
logical vs physical in another subsystem. I have asked better
eyes than mine to take a look at that now to confirm - if so, it
could take us a little while to fix.
My request for info was aimed at helping us determine why your
system is seeing this problem while our tests didn't. We have
tested the revised paffinity on completely filled systems and on at
least one system with "holes", but differences in OS levels,
processor types, etc could have caused our tests to pass while
your system fails. I'm particularly suspicious of the old kernel
you are running and how our revised code will handle it.
For now, I would suggest you work with revisions lower than
r19391 - could you please confirm that r19390 or earlier works?
Thanks
Ralph
On Aug 22, 2008, at 7:21 AM, Camille Coti wrote:
OK, thank you!
Camille
Ralph Castain wrote:
Okay, I'll look into it. I suspect the problem is due to the
redefinition of the paffinity API to clarify physical vs
logical processors - more than likely, the maffinity interface
suffers from the same problem we had to correct over there.
We'll report back later with an estimate of how quickly this
can be fixed.
Thanks
Ralph
On Aug 22, 2008, at 7:03 AM, Camille Coti wrote:
Ralph,
I compiled a clean checkout of the trunk (r19392); the
problem is still the same.
Camille
Ralph Castain wrote:
Hi Camille
What OMPI version are you using? We just changed the
paffinity module last night, but did nothing to maffinity.
However, it is possible that the maffinity framework makes
some calls into paffinity that need to be adjusted.
So version number would help a great deal in this case.
Thanks
Ralph
On Aug 22, 2008, at 5:23 AM, Camille Coti wrote:
Hello,
I am trying to run applications on a shared-memory machine.
For the moment I am just trying to run tests on point-to-
point communications (a trivial token ring) and collective
operations (from the SkaMPI test suite).
It runs smoothly if mpi_paffinity_alone is set to 0. For more
than about 10 processes, global communications just don't seem
possible; point-to-point communications seem to be OK.
But when I specify --mca mpi_paffinity_alone 1 in my
command line, I get the following error:
mbind: Invalid argument
I looked into the code of maffinity/libnuma, and found that
the error comes from
numa_setlocal_memory(segments[i].mbs_start_addr,
                     segments[i].mbs_len);
in maffinity_libnuma_module.c.
The machine I am using is a Linux box running a 2.6.5-7
kernel.
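In case it is useful, a small standalone program along these lines
(my own sketch, not the Open MPI code) should exercise the same
libnuma entry point: it mmap()s a region and hands it to
numa_setlocal_memory(). As far as I can tell, libnuma prints the
"mbind: ..." message itself when the kernel rejects the underlying
mbind() call, so if this also fails on this machine the problem would
be between libnuma and the kernel rather than in Open MPI:

  /* Standalone sketch: bind a freshly mmap()ed region to the local NUMA
   * node, roughly what maffinity/libnuma does for shared-memory segments.
   * Build: gcc repro.c -o repro -lnuma */
  #include <stdio.h>
  #include <sys/mman.h>
  #include <numa.h>

  int main(void)
  {
      size_t len = 4 * 1024 * 1024;   /* 4 MB, page-aligned via mmap() */
      void *addr;

      if (numa_available() < 0) {
          fprintf(stderr, "libnuma says NUMA is not available here\n");
          return 1;
      }

      addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (addr == MAP_FAILED) {
          perror("mmap");
          return 1;
      }

      /* libnuma reports "mbind: Invalid argument" itself if the kernel
       * rejects the underlying mbind() call. */
      numa_setlocal_memory(addr, len);

      printf("numa_setlocal_memory() completed\n");
      return 0;
  }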
Has anyone experienced a similar problem?
Camille
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Jeff Squyres
Cisco Systems