Re: [OMPI users] problem when mpi_paffinity_alone is set to 1

2008-08-22 Thread Jeff Squyres

Ah, this is a fairly old kernel -- it does not support the topology stuff.

So in this case, logical and physical IDs should be the same.  Hmm.   
Need to think about that...



On Aug 22, 2008, at 8:47 AM, Camille Coti wrote:



inria@behemoth:~$ uname -a
Linux behemoth 2.6.5-7.283-sn2 #1 SMP Wed Nov 29 16:55:53 UTC 2006  
ia64 ia64 ia64 GNU/Linux


I am not sure the output of plpa-info --topo gives good news...

inria@behemoth:~$ plpa-info --topo
Kernel affinity support: yes
Kernel topology support: no
Number of processor sockets: unknown
Kernel topology not supported -- cannot show topology information

Camille

Jeff Squyres wrote:

Camille --
Can you also send the output of "uname -a"?
Also, just to be absolutely sure, let's check that PLPA is doing
the Right thing here (we don't think this is the problem, but it's
worth checking).  Grab the latest beta:

   http://www.open-mpi.org/software/plpa/v1.2/
It's a very small package and easy to install under your $HOME (or  
whatever).

Can you send the output of "plpa-info --topo"?
On Aug 22, 2008, at 7:00 AM, Camille Coti wrote:


Actually, I have tried with several versions, since you were
working on the affinity thing. I tried with revision 19103 a
couple of weeks ago, and the problem was already there.


Part of /proc/cpuinfo is below:
processor  : 0
vendor : GenuineIntel
arch   : IA-64
family : Itanium 2
model  : 0
revision   : 7
archrev: 0
features   : branchlong
cpu number : 0
cpu regs   : 4
cpu MHz: 900.00
itc MHz: 900.00
BogoMIPS   : 1325.40
siblings   : 1

The machine is a 60-way Altix machine, so you have 60 times this  
information in /proc/cpuinfo (yes, 60, not 64).


Camille



Ralph Castain wrote:
I believe I have found the problem, and it may indeed relate to  
the change in paffinity. By any chance, do you have unfilled  
sockets on that machine? Could you provide the output from  
something like "cat /proc/cpuinfo" (or the equiv for your system)  
so we could see what physical processors and sockets are present?
If I'm correct as to the problem, here is the issue. OMPI has  
(until now) always assumed that the #logical processors (or  
sockets, or cores) was the same as the #physical processors (or  
sockets, or cores). As a result, several key subsystems were  
written without making any distinction as to which (logical vs  
physical) they were referring to. This was no problem until we  
recently encountered systems with "holes" in their system - a  
processor turned "off", or a socket unpopulated, etc.
In this case, the local processor id no longer matches the  
physical processor id (ditto for sockets and cores). We adjusted  
the paffinity subsystem to deal with it - took much more effort  
than we would have liked, and exposed lots of inconsistencies in  
how the base operating systems handle such situations.
Unfortunately, having gotten that straightened out, it is  
possible that you have uncovered a similar inconsistency in  
logical vs physical in another subsystem. I have asked better  
eyes than mine to take a look at that now to confirm - if so, it  
could take us a little while to fix.
My request for info was aimed at helping us to determine why your  
system is seeing this problem, but our tests didn't. We have  
tested the revised paffinity on both completely filled and on at  
least one system with "holes", but differences in OS levels,  
processor types, etc could have caused our tests to pass while  
your system fails. I'm particularly suspicious of the old kernel  
you are running and how our revised code will handle it.
For now, I would suggest you work with revisions lower than  
r19391 - could you please confirm that r19390 or earlier works?

Thanks
Ralph
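
(As an illustration of the logical-vs-physical distinction Ralph describes: on kernels that do export topology, the mapping can be read straight out of sysfs. The sketch below is hypothetical helper code, not part of OMPI or PLPA; it assumes the standard /sys/devices/system/cpu/cpuN/topology files, which a 2.6.5 kernel almost certainly does not provide.)

/* Hypothetical sketch, not OMPI code: print the logical-to-physical CPU
 * mapping that newer kernels export through sysfs.  On a kernel without
 * topology support (such as the 2.6.5 kernel in this thread) the files
 * simply do not exist and nothing is printed. */
#include <stdio.h>

static int read_id(int cpu, const char *file)
{
    char path[128];
    FILE *f;
    int id = -1;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/topology/%s", cpu, file);
    f = fopen(path, "r");
    if (f == NULL) {
        return -1;                    /* file absent: no topology info */
    }
    if (fscanf(f, "%d", &id) != 1) {
        id = -1;
    }
    fclose(f);
    return id;
}

int main(void)
{
    int cpu;

    for (cpu = 0; cpu < 64; ++cpu) {  /* 64 covers the 60-way Altix */
        int socket = read_id(cpu, "physical_package_id");
        int core   = read_id(cpu, "core_id");

        if (socket < 0 && core < 0) {
            continue;                 /* CPU absent or not exported */
        }
        printf("logical cpu %2d -> physical socket %d, core %d\n",
               cpu, socket, core);
    }
    return 0;
}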
On Aug 22, 2008, at 7:21 AM, Camille Coti wrote:


OK, thank you!

Camille

Ralph Castain wrote:
Okay, I'll look into it. I suspect the problem is due to the  
redefinition of the paffinity API to clarify physical vs  
logical processors - more than likely, the maffinity interface  
suffers from the same problem we had to correct over there.
We'll report back later with an estimate of how quickly this  
can be fixed.

Thanks
Ralph
On Aug 22, 2008, at 7:03 AM, Camille Coti wrote:


Ralph,

I compiled a clean checkout from the trunk (r19392), the  
problem is still the same.


Camille


Ralph Castain wrote:

Hi Camille
What OMPI version are you using? We just changed the  
paffinity module last night, but did nothing to maffinity.  
However, it is possible that the maffinity framework makes  
some calls into paffinity that need to adjust.

So version number would help a great deal in this case.
Thanks
Ralph
On Aug 22, 2008, at 5:23 AM, Camille Coti wrote:

Hello,

I am trying to run applications on a shared-memory machine.  
For the moment I am just trying to run tests on point-to- 
point communications (a  trivial token ring) and collective  
operations (from the SkaMPI tests suite).


It runs smoothly if mpi_paffinity_alone is set to 0.

Re: [OMPI users] problem when mpi_paffinity_alone is set to 1

2008-08-22 Thread Camille Coti


inria@behemoth:~$ uname -a
Linux behemoth 2.6.5-7.283-sn2 #1 SMP Wed Nov 29 16:55:53 UTC 2006 ia64 
ia64 ia64 GNU/Linux


I am not sure the output of plpa-info --topo gives good news...

inria@behemoth:~$ plpa-info --topo
Kernel affinity support: yes
Kernel topology support: no
Number of processor sockets: unknown
Kernel topology not supported -- cannot show topology information

Camille

Jeff Squyres wrote:

Camille --

Can you also send the output of "uname -a"?

Also, just to be absolutely sure, let's check that PLPA is doing the
Right thing here (we don't think this is the problem, but it's worth
checking).  Grab the latest beta:


http://www.open-mpi.org/software/plpa/v1.2/

It's a very small package and easy to install under your $HOME (or 
whatever).


Can you send the output of "plpa-info --topo"?



On Aug 22, 2008, at 7:00 AM, Camille Coti wrote:



Actually, I have tried with several versions, since you were working
on the affinity thing. I tried with revision 19103 a couple of
weeks ago, and the problem was already there.


Part of /proc/cpuinfo is below:
processor  : 0
vendor : GenuineIntel
arch   : IA-64
family : Itanium 2
model  : 0
revision   : 7
archrev: 0
features   : branchlong
cpu number : 0
cpu regs   : 4
cpu MHz: 900.00
itc MHz: 900.00
BogoMIPS   : 1325.40
siblings   : 1

The machine is a 60-way Altix machine, so you have 60 times this 
information in /proc/cpuinfo (yes, 60, not 64).


Camille



Ralph Castain wrote:
I believe I have found the problem, and it may indeed relate to the 
change in paffinity. By any chance, do you have unfilled sockets on 
that machine? Could you provide the output from something like "cat 
/proc/cpuinfo" (or the equiv for your system) so we could see what 
physical processors and sockets are present?
If I'm correct as to the problem, here is the issue. OMPI has (until 
now) always assumed that the #logical processors (or sockets, or 
cores) was the same as the #physical processors (or sockets, or 
cores). As a result, several key subsystems were written without 
making any distinction as to which (logical vs physical) they were 
referring to. This was no problem until we recently encountered 
systems with "holes" in their system - a processor turned "off", or a 
socket unpopulated, etc.
In this case, the local processor id no longer matches the physical 
processor id (ditto for sockets and cores). We adjusted the paffinity 
subsystem to deal with it - took much more effort than we would have 
liked, and exposed lots of inconsistencies in how the base operating 
systems handle such situations.
Unfortunately, having gotten that straightened out, it is possible 
that you have uncovered a similar inconsistency in logical vs 
physical in another subsystem. I have asked better eyes than mine to 
take a look at that now to confirm - if so, it could take us a little 
while to fix.
My request for info was aimed at helping us to determine why your 
system is seeing this problem, but our tests didn't. We have tested 
the revised paffinity on both completely filled and on at least one 
system with "holes", but differences in OS levels, processor types, 
etc could have caused our tests to pass while your system fails. I'm 
particularly suspicious of the old kernel you are running and how our 
revised code will handle it.
For now, I would suggest you work with revisions lower than r19391 - 
could you please confirm that r19390 or earlier works?

Thanks
Ralph
On Aug 22, 2008, at 7:21 AM, Camille Coti wrote:


OK, thank you!

Camille

Ralph Castain wrote:
Okay, I'll look into it. I suspect the problem is due to the 
redefinition of the paffinity API to clarify physical vs logical 
processors - more than likely, the maffinity interface suffers from 
the same problem we had to correct over there.
We'll report back later with an estimate of how quickly this can be 
fixed.

Thanks
Ralph
On Aug 22, 2008, at 7:03 AM, Camille Coti wrote:


Ralph,

I compiled a clean checkout from the trunk (r19392), the problem 
is still the same.


Camille


Ralph Castain wrote:

Hi Camille
What OMPI version are you using? We just changed the paffinity 
module last night, but did nothing to maffinity. However, it is 
possible that the maffinity framework makes some calls into 
paffinity that need to adjust.

So version number would help a great deal in this case.
Thanks
Ralph
On Aug 22, 2008, at 5:23 AM, Camille Coti wrote:

Hello,

I am trying to run applications on a shared-memory machine. For 
the moment I am just trying to run tests on point-to-point 
communications (a  trivial token ring) and collective operations 
(from the SkaMPI tests suite).


It runs smoothly if mpi_paffinity_alone is set to 0. For a 
number of processes which is larger than about 10, global 
communications just don't seem possible. Point-to-point 
communications seem to be OK.


But when I specify --mca mpi_paffinity_alone 1 in my command
line, I get the following error: mbind: Invalid argument

Re: [OMPI users] problem when mpi_paffinity_alone is set to 1

2008-08-22 Thread Jeff Squyres

Camille --

Can you also send the output of "uname -a"?

Also, just to be absolutely sure, let's check that PLPA is doing the
Right thing here (we don't think this is the problem, but it's worth
checking).  Grab the latest beta:


http://www.open-mpi.org/software/plpa/v1.2/

It's a very small package and easy to install under your $HOME (or  
whatever).


Can you send the output of "plpa-info --topo"?



On Aug 22, 2008, at 7:00 AM, Camille Coti wrote:



Actually, I have tried with several versions, since you were working
on the affinity thing. I tried with revision 19103 a couple of
weeks ago, and the problem was already there.


Part of /proc/cpuinfo is below:
processor  : 0
vendor : GenuineIntel
arch   : IA-64
family : Itanium 2
model  : 0
revision   : 7
archrev: 0
features   : branchlong
cpu number : 0
cpu regs   : 4
cpu MHz: 900.00
itc MHz: 900.00
BogoMIPS   : 1325.40
siblings   : 1

The machine is a 60-way Altix machine, so you have 60 times this  
information in /proc/cpuinfo (yes, 60, not 64).


Camille



Ralph Castain wrote:
I believe I have found the problem, and it may indeed relate to the  
change in paffinity. By any chance, do you have unfilled sockets on  
that machine? Could you provide the output from something like  
"cat /proc/cpuinfo" (or the equiv for your system) so we could see  
what physical processors and sockets are present?
If I'm correct as to the problem, here is the issue. OMPI has  
(until now) always assumed that the #logical processors (or  
sockets, or cores) was the same as the #physical processors (or  
sockets, or cores). As a result, several key subsystems were  
written without making any distinction as to which (logical vs  
physical) they were referring to. This was no problem until we  
recently encountered systems with "holes" in their system - a  
processor turned "off", or a socket unpopulated, etc.
In this case, the local processor id no longer matches the physical  
processor id (ditto for sockets and cores). We adjusted the  
paffinity subsystem to deal with it - took much more effort than we  
would have liked, and exposed lots of inconsistencies in how the  
base operating systems handle such situations.
Unfortunately, having gotten that straightened out, it is possible  
that you have uncovered a similar inconsistency in logical vs  
physical in another subsystem. I have asked better eyes than mine  
to take a look at that now to confirm - if so, it could take us a  
little while to fix.
My request for info was aimed at helping us to determine why your  
system is seeing this problem, but our tests didn't. We have tested  
the revised paffinity on both completely filled and on at least one  
system with "holes", but differences in OS levels, processor types,  
etc could have caused our tests to pass while your system fails.  
I'm particularly suspicious of the old kernel you are running and  
how our revised code will handle it.
For now, I would suggest you work with revisions lower than r19391  
- could you please confirm that r19390 or earlier works?

Thanks
Ralph
On Aug 22, 2008, at 7:21 AM, Camille Coti wrote:


OK, thank you!

Camille

Ralph Castain wrote:
Okay, I'll look into it. I suspect the problem is due to the  
redefinition of the paffinity API to clarify physical vs logical  
processors - more than likely, the maffinity interface suffers  
from the same problem we had to correct over there.
We'll report back later with an estimate of how quickly this can  
be fixed.

Thanks
Ralph
On Aug 22, 2008, at 7:03 AM, Camille Coti wrote:


Ralph,

I compiled a clean checkout from the trunk (r19392), the problem  
is still the same.


Camille


Ralph Castain wrote:

Hi Camille
What OMPI version are you using? We just changed the paffinity  
module last night, but did nothing to maffinity. However, it is  
possible that the maffinity framework makes some calls into  
paffinity that need to adjust.

So version number would help a great deal in this case.
Thanks
Ralph
On Aug 22, 2008, at 5:23 AM, Camille Coti wrote:

Hello,

I am trying to run applications on a shared-memory machine.  
For the moment I am just trying to run tests on point-to-point  
communications (a  trivial token ring) and collective  
operations (from the SkaMPI tests suite).


It runs smoothly if mpi_paffinity_alone is set to 0. For a  
number of processes which is larger than about 10, global  
communications just don't seem possible. Point-to-point  
communications seem to be OK.


But when I specify  --mca mpi_paffinity_alone 1 in my command  
line, I get the following error:


mbind: Invalid argument

I looked into the code of maffinity/libnuma, and found out the  
error comes from


numa_setlocal_memory(segments[i].mbs_start_addr,
 segments[i].mbs_len);

in maffinity_libnuma_module.c.

The machine I am using is a Linux box running a 2.6.5-7 kernel.

Has anyone experienced a similar problem?
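
(For what it's worth, the failing call can be exercised outside of Open MPI with a few lines of libnuma code. The test program below is a hypothetical sketch, not taken from OMPI; it relies only on the documented numa_available() and numa_setlocal_memory() entry points, and on a kernel with incomplete mbind() support libnuma itself would be expected to print the "mbind: Invalid argument" message quoted above.)

/* Hypothetical reproduction sketch, not OMPI code.
 * Build with:  gcc repro.c -o repro -lnuma */
#include <stdio.h>
#include <sys/mman.h>
#include <numa.h>

int main(void)
{
    size_t len = 4 * 1024 * 1024;        /* 4 MB dummy "shared segment" */
    void *seg;

    if (numa_available() < 0) {
        fprintf(stderr, "libnuma reports no NUMA support on this system\n");
        return 1;
    }

    /* mmap gives a page-aligned region, which mbind() requires. */
    seg = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (seg == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Same call the maffinity/libnuma component makes per segment. */
    numa_setlocal_memory(seg, len);

    printf("numa_setlocal_memory() returned; any mbind error was printed"
           " by libnuma on stderr\n");
    munmap(seg, len);
    return 0;
}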


Re: [OMPI users] problem when mpi_paffinity_alone is set to 1

2008-08-22 Thread Ralph Castain
Thanks! Well, it -is- nice to know that we didn't -create- the problem  
with the paffinity change!


We'll have to think about this one a little to try and figure out why  
this is happening.

Ralph

On Aug 22, 2008, at 8:00 AM, Camille Coti wrote:



Actually, I have tried with several versions, since you were working
on the affinity thing. I tried with revision 19103 a couple of
weeks ago, and the problem was already there.


Part of /proc/cpuinfo is below:
processor  : 0
vendor : GenuineIntel
arch   : IA-64
family : Itanium 2
model  : 0
revision   : 7
archrev: 0
features   : branchlong
cpu number : 0
cpu regs   : 4
cpu MHz: 900.00
itc MHz: 900.00
BogoMIPS   : 1325.40
siblings   : 1

The machine is a 60-way Altix machine, so you have 60 times this  
information in /proc/cpuinfo (yes, 60, not 64).


Camille



Ralph Castain wrote:
I believe I have found the problem, and it may indeed relate to the  
change in paffinity. By any chance, do you have unfilled sockets on  
that machine? Could you provide the output from something like  
"cat /proc/cpuinfo" (or the equiv for your system) so we could see  
what physical processors and sockets are present?
If I'm correct as to the problem, here is the issue. OMPI has  
(until now) always assumed that the #logical processors (or  
sockets, or cores) was the same as the #physical processors (or  
sockets, or cores). As a result, several key subsystems were  
written without making any distinction as to which (logical vs  
physical) they were referring to. This was no problem until we  
recently encountered systems with "holes" in their system - a  
processor turned "off", or a socket unpopulated, etc.
In this case, the local processor id no longer matches the physical  
processor id (ditto for sockets and cores). We adjusted the  
paffinity subsystem to deal with it - took much more effort than we  
would have liked, and exposed lots of inconsistencies in how the  
base operating systems handle such situations.
Unfortunately, having gotten that straightened out, it is possible  
that you have uncovered a similar inconsistency in logical vs  
physical in another subsystem. I have asked better eyes than mine  
to take a look at that now to confirm - if so, it could take us a  
little while to fix.
My request for info was aimed at helping us to determine why your  
system is seeing this problem, but our tests didn't. We have tested  
the revised paffinity on both completely filled and on at least one  
system with "holes", but differences in OS levels, processor types,  
etc could have caused our tests to pass while your system fails.  
I'm particularly suspicious of the old kernel you are running and  
how our revised code will handle it.
For now, I would suggest you work with revisions lower than r19391  
- could you please confirm that r19390 or earlier works?

Thanks
Ralph
On Aug 22, 2008, at 7:21 AM, Camille Coti wrote:


OK, thank you!

Camille

Ralph Castain wrote:
Okay, I'll look into it. I suspect the problem is due to the  
redefinition of the paffinity API to clarify physical vs logical  
processors - more than likely, the maffinity interface suffers  
from the same problem we had to correct over there.
We'll report back later with an estimate of how quickly this can  
be fixed.

Thanks
Ralph
On Aug 22, 2008, at 7:03 AM, Camille Coti wrote:


Ralph,

I compiled a clean checkout from the trunk (r19392), the problem  
is still the same.


Camille


Ralph Castain wrote:

Hi Camille
What OMPI version are you using? We just changed the paffinity  
module last night, but did nothing to maffinity. However, it is  
possible that the maffinity framework makes some calls into  
paffinity that need to adjust.

So version number would help a great deal in this case.
Thanks
Ralph
On Aug 22, 2008, at 5:23 AM, Camille Coti wrote:

Hello,

I am trying to run applications on a shared-memory machine.  
For the moment I am just trying to run tests on point-to-point  
communications (a  trivial token ring) and collective  
operations (from the SkaMPI tests suite).


It runs smoothly if mpi_paffinity_alone is set to 0. For a  
number of processes which is larger than about 10, global  
communications just don't seem possible. Point-to-point  
communications seem to be OK.


But when I specify  --mca mpi_paffinity_alone 1 in my command  
line, I get the following error:


mbind: Invalid argument

I looked into the code of maffinity/libnuma, and found out the  
error comes from


numa_setlocal_memory(segments[i].mbs_start_addr,
 segments[i].mbs_len);

in maffinity_libnuma_module.c.

The machine I am using is a Linux box running a 2.6.5-7 kernel.

Has anyone experienced a similar problem?

Camille

Re: [OMPI users] problem when mpi_paffinity_alone is set to 1

2008-08-22 Thread Ralph Castain

Short answer is: yes.

Unfortunately, different systems store that info in different places.  
For Linux, we use the PLPA to help us discover the required info.  
Solaris, OSX, and Windows all have their own ways of providing it. The  
paffinity framework detects the type of system we are running on and  
"does the right thing" to get the info.


Where we simply cannot get it, we return an error and let you know  
that we cannot support processor affinity on this machine. You can  
still execute, of course - you just can't set mpi_paffinity_alone  
since we can't meet that request on such a system.


Ralph
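
(Side note: the check that plpa-info performs can also be done from C. The sketch below is only an illustration and assumes the PLPA 1.2 API exposes plpa_have_topology_information() with the signature shown, as its documentation suggests -- please verify the name and prototype against the installed plpa.h before relying on it.)

/* Illustrative sketch only -- ASSUMED API: plpa_have_topology_information()
 * as suggested by the plpa-info output earlier in this thread; check
 * <plpa.h> for the exact prototype.  Link with -lplpa. */
#include <stdio.h>
#include <plpa.h>

int main(void)
{
    int supported = 0;

    if (plpa_have_topology_information(&supported) != 0) {
        fprintf(stderr, "PLPA could not query the kernel\n");
        return 1;
    }
    printf("Kernel topology support: %s\n", supported ? "yes" : "no");
    /* On the 2.6.5 kernel discussed here the expected answer is "no",
     * which is why socket/core counts come back as "unknown". */
    return 0;
}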


On Aug 22, 2008, at 8:01 AM, Mi Yan wrote:


Ralph,

How does OpenMPI pick up the map between physical vs. logical
processors? Does OMPI look into "/sys/devices/system/node/node"
for the cpu topology?



Thanks,
Mi Yan
Ralph Castain <r...@lanl.gov>
Sent by: users-boun...@open-mpi.org
Date: 08/22/2008 09:16 AM
To: Open MPI Users <us...@open-mpi.org>
Please respond to: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] problem when mpi_paffinity_alone is set to 1

Okay, I'll look into it. I suspect the problem is due to the
redefinition of the paffinity API to clarify physical vs logical
processors - more than likely, the maffinity interface suffers from
the same problem we had to correct over there.

We'll report back later with an estimate of how quickly this can be
fixed.

Thanks
Ralph

On Aug 22, 2008, at 7:03 AM, Camille Coti wrote:

>
> Ralph,
>
> I compiled a clean checkout from the trunk (r19392), the problem is
> still the same.
>
> Camille
>
>
> Ralph Castain wrote:
>> Hi Camille
>> What OMPI version are you using? We just changed the paffinity
>> module last night, but did nothing to maffinity. However, it is
>> possible that the maffinity framework makes some calls into
>> paffinity that need to adjust.
>> So version number would help a great deal in this case.
>> Thanks
>> Ralph
>> On Aug 22, 2008, at 5:23 AM, Camille Coti wrote:
>>> Hello,
>>>
>>> I am trying to run applications on a shared-memory machine. For
>>> the moment I am just trying to run tests on point-to-point
>>> communications (a  trivial token ring) and collective operations
>>> (from the SkaMPI tests suite).
>>>
>>> It runs smoothly if mpi_paffinity_alone is set to 0. For a number
>>> of processes which is larger than about 10, global communications
>>> just don't seem possible. Point-to-point communications seem to be
>>> OK.
>>>
>>> But when I specify  --mca mpi_paffinity_alone 1 in my command
>>> line, I get the following error:
>>>
>>> mbind: Invalid argument
>>>
>>> I looked into the code of maffinity/libnuma, and found out the
>>> error comes from
>>>
>>>   numa_setlocal_memory(segments[i].mbs_start_addr,
>>>segments[i].mbs_len);
>>>
>>> in maffinity_libnuma_module.c.
>>>
>>> The machine I am using is a Linux box running a 2.6.5-7 kernel.
>>>
>>> Has anyone experienced a similar problem?
>>>
>>> Camille




Re: [OMPI users] problem when mpi_paffinity_alone is set to 1

2008-08-22 Thread Camille Coti


Actually, I have tried with several versions, since you were working on
the affinity thing. I tried with revision 19103 a couple of weeks
ago, and the problem was already there.


Part of /proc/cpuinfo is below:
processor  : 0
vendor : GenuineIntel
arch   : IA-64
family : Itanium 2
model  : 0
revision   : 7
archrev: 0
features   : branchlong
cpu number : 0
cpu regs   : 4
cpu MHz: 900.00
itc MHz: 900.00
BogoMIPS   : 1325.40
siblings   : 1

The machine is a 60-way Altix machine, so you have 60 times this 
information in /proc/cpuinfo (yes, 60, not 64).


Camille



Ralph Castain wrote:
I believe I have found the problem, and it may indeed relate to the 
change in paffinity. By any chance, do you have unfilled sockets on that 
machine? Could you provide the output from something like "cat 
/proc/cpuinfo" (or the equiv for your system) so we could see what 
physical processors and sockets are present?


If I'm correct as to the problem, here is the issue. OMPI has (until 
now) always assumed that the #logical processors (or sockets, or cores) 
was the same as the #physical processors (or sockets, or cores). As a 
result, several key subsystems were written without making any 
distinction as to which (logical vs physical) they were referring to. 
This was no problem until we recently encountered systems with "holes" 
in their system - a processor turned "off", or a socket unpopulated, etc.


In this case, the local processor id no longer matches the physical 
processor id (ditto for sockets and cores). We adjusted the paffinity 
subsystem to deal with it - took much more effort than we would have 
liked, and exposed lots of inconsistencies in how the base operating 
systems handle such situations.


Unfortunately, having gotten that straightened out, it is possible that 
you have uncovered a similar inconsistency in logical vs physical in 
another subsystem. I have asked better eyes than mine to take a look at 
that now to confirm - if so, it could take us a little while to fix.


My request for info was aimed at helping us to determine why your system 
is seeing this problem, but our tests didn't. We have tested the revised 
paffinity on both completely filled and on at least one system with 
"holes", but differences in OS levels, processor types, etc could have 
caused our tests to pass while your system fails. I'm particularly 
suspicious of the old kernel you are running and how our revised code 
will handle it.


For now, I would suggest you work with revisions lower than r19391 - 
could you please confirm that r19390 or earlier works?


Thanks
Ralph

On Aug 22, 2008, at 7:21 AM, Camille Coti wrote:



OK, thank you!

Camille

Ralph Castain wrote:
Okay, I'll look into it. I suspect the problem is due to the 
redefinition of the paffinity API to clarify physical vs logical 
processors - more than likely, the maffinity interface suffers from 
the same problem we had to correct over there.
We'll report back later with an estimate of how quickly this can be 
fixed.

Thanks
Ralph
On Aug 22, 2008, at 7:03 AM, Camille Coti wrote:


Ralph,

I compiled a clean checkout from the trunk (r19392), the problem is 
still the same.


Camille


Ralph Castain wrote:

Hi Camille
What OMPI version are you using? We just changed the paffinity 
module last night, but did nothing to maffinity. However, it is 
possible that the maffinity framework makes some calls into 
paffinity that need to adjust.

So version number would help a great deal in this case.
Thanks
Ralph
On Aug 22, 2008, at 5:23 AM, Camille Coti wrote:

Hello,

I am trying to run applications on a shared-memory machine. For 
the moment I am just trying to run tests on point-to-point 
communications (a  trivial token ring) and collective operations 
(from the SkaMPI tests suite).


It runs smoothly if mpi_paffinity_alone is set to 0. For a number 
of processes which is larger than about 10, global communications 
just don't seem possible. Point-to-point communications seem to be 
OK.


But when I specify  --mca mpi_paffinity_alone 1 in my command 
line, I get the following error:


mbind: Invalid argument

I looked into the code of maffinity/libnuma, and found out the 
error comes from


 numa_setlocal_memory(segments[i].mbs_start_addr,
  segments[i].mbs_len);

in maffinity_libnuma_module.c.

The machine I am using is a Linux box running a 2.6.5-7 kernel.

Has anyone experienced a similar problem?

Camille

Re: [OMPI users] problem when mpi_paffinity_alone is set to 1

2008-08-22 Thread Mi Yan

Ralph,

How does OpenMPI pick up the map between physical vs. logical
processors? Does OMPI look into "/sys/devices/system/node/node" for
the cpu topology?


Thanks,
Mi Yan
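
(For reference, the sysfs tree Mi Yan mentions can be walked with a few lines of C. The snippet below is a hypothetical illustration, not OMPI code -- per Ralph's reply elsewhere in the thread, Open MPI goes through PLPA on Linux rather than reading sysfs directly. It only assumes the long-standing /sys/devices/system/node/nodeN/cpumap files.)

/* Hypothetical illustration, not OMPI code: list NUMA nodes and the raw
 * CPU mask each one exports via /sys/devices/system/node/nodeN/cpumap. */
#include <stdio.h>
#include <string.h>
#include <dirent.h>

int main(void)
{
    const char *base = "/sys/devices/system/node";
    DIR *dir = opendir(base);
    struct dirent *ent;

    if (dir == NULL) {
        perror(base);          /* tree may be absent on old/non-NUMA kernels */
        return 1;
    }
    while ((ent = readdir(dir)) != NULL) {
        char path[512], mask[512];
        FILE *f;

        /* Only look at entries named node0, node1, ... */
        if (strncmp(ent->d_name, "node", 4) != 0 || ent->d_name[4] == '\0') {
            continue;
        }
        snprintf(path, sizeof(path), "%s/%s/cpumap", base, ent->d_name);
        f = fopen(path, "r");
        if (f == NULL) {
            continue;
        }
        if (fgets(mask, sizeof(mask), f) != NULL) {
            printf("%s: cpumap %s", ent->d_name, mask);  /* mask has '\n' */
        }
        fclose(f);
    }
    closedir(dir);
    return 0;
}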


   
From: Ralph Castain
Sent by: users-bounces@open-mpi.org
Date: 08/22/2008 09:16 AM
To: Open MPI Users
Please respond to: Open MPI Users
Subject: Re: [OMPI users] problem when mpi_paffinity_alone is set to 1




Okay, I'll look into it. I suspect the problem is due to the
redefinition of the paffinity API to clarify physical vs logical
processors - more than likely, the maffinity interface suffers from
the same problem we had to correct over there.

We'll report back later with an estimate of how quickly this can be
fixed.

Thanks
Ralph

On Aug 22, 2008, at 7:03 AM, Camille Coti wrote:

>
> Ralph,
>
> I compiled a clean checkout from the trunk (r19392), the problem is
> still the same.
>
> Camille
>
>
> Ralph Castain wrote:
>> Hi Camille
>> What OMPI version are you using? We just changed the paffinity
>> module last night, but did nothing to maffinity. However, it is
>> possible that the maffinity framework makes some calls into
>> paffinity that need to adjust.
>> So version number would help a great deal in this case.
>> Thanks
>> Ralph
>> On Aug 22, 2008, at 5:23 AM, Camille Coti wrote:
>>> Hello,
>>>
>>> I am trying to run applications on a shared-memory machine. For
>>> the moment I am just trying to run tests on point-to-point
>>> communications (a  trivial token ring) and collective operations
>>> (from the SkaMPI tests suite).
>>>
>>> It runs smoothly if mpi_paffinity_alone is set to 0. For a number
>>> of processes which is larger than about 10, global communications
>>> just don't seem possible. Point-to-point communications seem to be
>>> OK.
>>>
>>> But when I specify  --mca mpi_paffinity_alone 1 in my command
>>> line, I get the following error:
>>>
>>> mbind: Invalid argument
>>>
>>> I looked into the code of maffinity/libnuma, and found out the
>>> error comes from
>>>
>>>   numa_setlocal_memory(segments[i].mbs_start_addr,
>>>segments[i].mbs_len);
>>>
>>> in maffinity_libnuma_module.c.
>>>
>>> The machine I am using is a Linux box running a 2.6.5-7 kernel.
>>>
>>> Has anyone experienced a similar problem?
>>>
>>> Camille


Re: [OMPI users] problem when mpi_paffinity_alone is set to 1

2008-08-22 Thread STUART PURVES
Back on Mon 1st Sept

If action is required before then ...
please contact Rob Giddings  (Catia/VPM/HDMS issues)

For Nastran/CAE technical S/W, Chris Catchpole can help.
For Elecricad, Chris Toyne can help.


Re: [OMPI users] problem when mpi_paffinity_alone is set to 1

2008-08-22 Thread Ralph Castain
I believe I have found the problem, and it may indeed relate to the  
change in paffinity. By any chance, do you have unfilled sockets on  
that machine? Could you provide the output from something like
"cat /proc/cpuinfo" (or the equiv for your system) so we could see what
physical processors and sockets are present?


If I'm correct as to the problem, here is the issue. OMPI has (until  
now) always assumed that the #logical processors (or sockets, or  
cores) was the same as the #physical processors (or sockets, or  
cores). As a result, several key subsystems were written without  
making any distinction as to which (logical vs physical) they were  
referring to. This was no problem until we recently encountered  
systems with "holes" in their system - a processor turned "off", or a  
socket unpopulated, etc.


In this case, the local processor id no longer matches the physical  
processor id (ditto for sockets and cores). We adjusted the paffinity  
subsystem to deal with it - took much more effort than we would have  
liked, and exposed lots of inconsistencies in how the base operating  
systems handle such situations.


Unfortunately, having gotten that straightened out, it is possible  
that you have uncovered a similar inconsistency in logical vs physical  
in another subsystem. I have asked better eyes than mine to take a  
look at that now to confirm - if so, it could take us a little while  
to fix.


My request for info was aimed at helping us to determine why your  
system is seeing this problem, but our tests didn't. We have tested  
the revised paffinity on both completely filled and on at least one  
system with "holes", but differences in OS levels, processor types,  
etc could have caused our tests to pass while your system fails. I'm  
particularly suspicious of the old kernel you are running and how our  
revised code will handle it.


For now, I would suggest you work with revisions lower than r19391 -  
could you please confirm that r19390 or earlier works?


Thanks
Ralph

On Aug 22, 2008, at 7:21 AM, Camille Coti wrote:



OK, thank you!

Camille

Ralph Castain wrote:
Okay, I'll look into it. I suspect the problem is due to the  
redefinition of the paffinity API to clarify physical vs logical  
processors - more than likely, the maffinity interface suffers from  
the same problem we had to correct over there.
We'll report back later with an estimate of how quickly this can be  
fixed.

Thanks
Ralph
On Aug 22, 2008, at 7:03 AM, Camille Coti wrote:


Ralph,

I compiled a clean checkout from the trunk (r19392), the problem  
is still the same.


Camille


Ralph Castain wrote:

Hi Camille
What OMPI version are you using? We just changed the paffinity  
module last night, but did nothing to maffinity. However, it is  
possible that the maffinity framework makes some calls into  
paffinity that need to adjust.

So version number would help a great deal in this case.
Thanks
Ralph
On Aug 22, 2008, at 5:23 AM, Camille Coti wrote:

Hello,

I am trying to run applications on a shared-memory machine. For  
the moment I am just trying to run tests on point-to-point  
communications (a  trivial token ring) and collective operations  
(from the SkaMPI tests suite).


It runs smoothly if mpi_paffinity_alone is set to 0. For a  
number of processes which is larger than about 10, global  
communications just don't seem possible. Point-to-point  
communications seem to be OK.


But when I specify  --mca mpi_paffinity_alone 1 in my command  
line, I get the following error:


mbind: Invalid argument

I looked into the code of maffinity/libnuma, and found out the  
error comes from


 numa_setlocal_memory(segments[i].mbs_start_addr,
  segments[i].mbs_len);

in maffinity_libnuma_module.c.

The machine I am using is a Linux box running a 2.6.5-7 kernel.

Has anyone experienced a similar problem?

Camille





Re: [OMPI users] problem when mpi_paffinity_alone is set to 1

2008-08-22 Thread Camille Coti


OK, thank you!

Camille

Ralph Castain wrote:
Okay, I'll look into it. I suspect the problem is due to the 
redefinition of the paffinity API to clarify physical vs logical 
processors - more than likely, the maffinity interface suffers from the 
same problem we had to correct over there.


We'll report back later with an estimate of how quickly this can be fixed.

Thanks
Ralph

On Aug 22, 2008, at 7:03 AM, Camille Coti wrote:



Ralph,

I compiled a clean checkout from the trunk (r19392), the problem is 
still the same.


Camille


Ralph Castain wrote:

Hi Camille
What OMPI version are you using? We just changed the paffinity module 
last night, but did nothing to maffinity. However, it is possible 
that the maffinity framework makes some calls into paffinity that 
need to adjust.

So version number would help a great deal in this case.
Thanks
Ralph
On Aug 22, 2008, at 5:23 AM, Camille Coti wrote:

Hello,

I am trying to run applications on a shared-memory machine. For the 
moment I am just trying to run tests on point-to-point 
communications (a  trivial token ring) and collective operations 
(from the SkaMPI tests suite).


It runs smoothly if mpi_paffinity_alone is set to 0. For a number of 
processes which is larger than about 10, global communications just 
don't seem possible. Point-to-point communications seem to be OK.


But when I specify  --mca mpi_paffinity_alone 1 in my command line, 
I get the following error:


mbind: Invalid argument

I looked into the code of maffinity/libnuma, and found out the error 
comes from


  numa_setlocal_memory(segments[i].mbs_start_addr,
   segments[i].mbs_len);

in maffinity_libnuma_module.c.

The machine I am using is a Linux box running a 2.6.5-7 kernel.

Has anyone experienced a similar problem?

Camille




Re: [OMPI users] problem when mpi_paffinity_alone is set to 1

2008-08-22 Thread Ralph Castain
Okay, I'll look into it. I suspect the problem is due to the  
redefinition of the paffinity API to clarify physical vs logical  
processors - more than likely, the maffinity interface suffers from  
the same problem we had to correct over there.


We'll report back later with an estimate of how quickly this can be  
fixed.


Thanks
Ralph

On Aug 22, 2008, at 7:03 AM, Camille Coti wrote:



Ralph,

I compiled a clean checkout from the trunk (r19392), the problem is  
still the same.


Camille


Ralph Castain wrote:

Hi Camille
What OMPI version are you using? We just changed the paffinity  
module last night, but did nothing to maffinity. However, it is  
possible that the maffinity framework makes some calls into  
paffinity that need to adjust.

So version number would help a great deal in this case.
Thanks
Ralph
On Aug 22, 2008, at 5:23 AM, Camille Coti wrote:

Hello,

I am trying to run applications on a shared-memory machine. For  
the moment I am just trying to run tests on point-to-point  
communications (a  trivial token ring) and collective operations  
(from the SkaMPI tests suite).


It runs smoothly if mpi_paffinity_alone is set to 0. For a number  
of processes which is larger than about 10, global communications  
just don't seem possible. Point-to-point communications seem to be  
OK.


But when I specify  --mca mpi_paffinity_alone 1 in my command  
line, I get the following error:


mbind: Invalid argument

I looked into the code of maffinity/libnuma, and found out the  
error comes from


  numa_setlocal_memory(segments[i].mbs_start_addr,
   segments[i].mbs_len);

in maffinity_libnuma_module.c.

The machine I am using is a Linux box running a 2.6.5-7 kernel.

Has anyone experienced a similar problem?

Camille





Re: [OMPI users] problem when mpi_paffinity_alone is set to 1

2008-08-22 Thread Camille Coti


Ralph,

I compiled a clean checkout from the trunk (r19392), the problem is 
still the same.


Camille


Ralph Castain wrote:

Hi Camille

What OMPI version are you using? We just changed the paffinity module 
last night, but did nothing to maffinity. However, it is possible that 
the maffinity framework makes some calls into paffinity that need to 
adjust.


So version number would help a great deal in this case.

Thanks
Ralph

On Aug 22, 2008, at 5:23 AM, Camille Coti wrote:


Hello,

I am trying to run applications on a shared-memory machine. For the 
moment I am just trying to run tests on point-to-point communications 
(a  trivial token ring) and collective operations (from the SkaMPI 
tests suite).


It runs smoothly if mpi_paffinity_alone is set to 0. For a number of 
processes which is larger than about 10, global communications just 
don't seem possible. Point-to-point communications seem to be OK.


But when I specify  --mca mpi_paffinity_alone 1 in my command line, I 
get the following error:


mbind: Invalid argument

I looked into the code of maffinity/libnuma, and found out the error 
comes from


   numa_setlocal_memory(segments[i].mbs_start_addr,
segments[i].mbs_len);

in maffinity_libnuma_module.c.

The machine I am using is a Linux box running a 2.6.5-7 kernel.

Has anyone experienced a similar problem?

Camille




Re: [OMPI users] problem when mpi_paffinity_alone is set to 1

2008-08-22 Thread Ralph Castain

Hi Camille

What OMPI version are you using? We just changed the paffinity module  
last night, but did nothing to maffinity. However, it is possible that  
the maffinity framework makes some calls into paffinity that need to  
adjust.


So version number would help a great deal in this case.

Thanks
Ralph

On Aug 22, 2008, at 5:23 AM, Camille Coti wrote:


Hello,

I am trying to run applications on a shared-memory machine. For the  
moment I am just trying to run tests on point-to-point  
communications (a  trivial token ring) and collective operations  
(from the SkaMPI tests suite).


It runs smoothly if mpi_paffinity_alone is set to 0. For a number of  
processes which is larger than about 10, global communications just  
don't seem possible. Point-to-point communications seem to be OK.


But when I specify  --mca mpi_paffinity_alone 1 in my command line,  
I get the following error:


mbind: Invalid argument

I looked into the code of maffinity/libnuma, and found out the error  
comes from


   numa_setlocal_memory(segments[i].mbs_start_addr,
segments[i].mbs_len);

in maffinity_libnuma_module.c.

The machine I am using is a Linux box running a 2.6.5-7 kernel.

Has anyone experienced a similar problem?

Camille
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users