from:"Brock Palen"

Re: [hwloc-users] hwloc Python3 Bindings - Correctly Grab number cores available

2020-08-31 Thread Brock Palen

Thanks,

yeah I was looking for an API that would take into consideration most
cases, like I find with hwloc-bind --get   where I can find the number the
process has access to.  Wether is cgroups,  other sorts of affinity setting
etc.

Brock Palen
IG: brockpalen1984
www.umich.edu/~brockp
Director Advanced Research Computing - TS
bro...@umich.edu
(734)936-1985


On Mon, Aug 31, 2020 at 12:37 PM Guy Streeter 
wrote:

> I forgot that the cpuset value is still available in cgroups v2. You
> would want the cpuset.cpus.effective value.
> More information is available here:
> https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
>
> On Mon, Aug 31, 2020 at 11:19 AM Guy Streeter 
> wrote:
> >
> > As I said, cgroups doesn't limit the group to a number of cores, it
> > limits processing time, either as an absolute amount or as a share of
> > what is available.
> > A docker process can be restricted to a set of cores, but that is done
> > with cpu affinity, not cgroups.
> >
> > You could try to figure out an equivalency. For instance if you are
> > using cpu.shares to limit the cgroups, then figure the ratio of a
> > cgroup's share to the shares of all the cgroups at that level, and
> > apply that ratio to the number of available cores to get an estimated
> > number of threads you should start.
> >
> > On Mon, Aug 31, 2020 at 10:40 AM Brock Palen  wrote:
> > >
> > > Sorry if wasn't clear, I'm trying to find out what is available to my
> process before it starts up threads.  If the user is jailed in a cgroup
> (docker, slurm, other)  and the program tries to start 36 threads, when it
> only has access to 4 cores, it's probably not a huge deal, but not
> desirable.
> > >
> > > I do allow the user to specify number of threads, but would like to
> automate it for least astonishment.
> > >
> > > Brock Palen
> > > IG: brockpalen1984
> > > www.umich.edu/~brockp
> > > Director Advanced Research Computing - TS
> > > bro...@umich.edu
> > > (734)936-1985
> > >
> > >
> > > On Mon, Aug 31, 2020 at 11:34 AM Guy Streeter 
> wrote:
> > >>
> > >> My very basic understanding of cgroups is that it can be used to limit
> > >> cpu processing time for a group, and to ensure fair distribution of
> > >> processing time within the group, but I don't know of a way to use
> > >> cgroups to limit the number of CPUs available to a cgroup.
> > >>
> > >> On Mon, Aug 31, 2020 at 8:56 AM Brock Palen  wrote:
> > >> >
> > >> > Hello,
> > >> >
> > >> > I have a small utility,  it is currently using
> multiprocess.cpu_count()
> > >> > Which currently ignores cgroups etc.
> > >> >
> > >> > I see https://gitlab.com/guystreeter/python-hwloc
> > >> > But appears stale,
> > >> >
> > >> > How would you detect number of threads that are safe to start in a
> cgroup from Python3 ?
> > >> >
> > >> > Thanks!
> > >> >
> > >> > Brock Palen
> > >> > IG: brockpalen1984
> > >> > www.umich.edu/~brockp
> > >> > Director Advanced Research Computing - TS
> > >> > bro...@umich.edu
> > >> > (734)936-1985
> > >> > ___
> > >> > hwloc-users mailing list
> > >> > hwloc-users@lists.open-mpi.org
> > >> > https://lists.open-mpi.org/mailman/listinfo/hwloc-users
> > >> ___
> > >> hwloc-users mailing list
> > >> hwloc-users@lists.open-mpi.org
> > >> https://lists.open-mpi.org/mailman/listinfo/hwloc-users
> > >
> > > ___
> > > hwloc-users mailing list
> > > hwloc-users@lists.open-mpi.org
> > > https://lists.open-mpi.org/mailman/listinfo/hwloc-users
> ___
> hwloc-users mailing list
> hwloc-users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/hwloc-users
>
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users

Re: [hwloc-users] hwloc Python3 Bindings - Correctly Grab number cores available

2020-08-31 Thread Brock Palen

Sorry if wasn't clear, I'm trying to find out what is available to my
process before it starts up threads.  If the user is jailed in a cgroup
(docker, slurm, other)  and the program tries to start 36 threads, when it
only has access to 4 cores, it's probably not a huge deal, but not
desirable.

I do allow the user to specify number of threads, but would like to
automate it for least astonishment.

Brock Palen
IG: brockpalen1984
www.umich.edu/~brockp
Director Advanced Research Computing - TS
bro...@umich.edu
(734)936-1985


On Mon, Aug 31, 2020 at 11:34 AM Guy Streeter 
wrote:

> My very basic understanding of cgroups is that it can be used to limit
> cpu processing time for a group, and to ensure fair distribution of
> processing time within the group, but I don't know of a way to use
> cgroups to limit the number of CPUs available to a cgroup.
>
> On Mon, Aug 31, 2020 at 8:56 AM Brock Palen  wrote:
> >
> > Hello,
> >
> > I have a small utility,  it is currently using  multiprocess.cpu_count()
> > Which currently ignores cgroups etc.
> >
> > I see https://gitlab.com/guystreeter/python-hwloc
> > But appears stale,
> >
> > How would you detect number of threads that are safe to start in a
> cgroup from Python3 ?
> >
> > Thanks!
> >
> > Brock Palen
> > IG: brockpalen1984
> > www.umich.edu/~brockp
> > Director Advanced Research Computing - TS
> > bro...@umich.edu
> > (734)936-1985
> > ___
> > hwloc-users mailing list
> > hwloc-users@lists.open-mpi.org
> > https://lists.open-mpi.org/mailman/listinfo/hwloc-users
> ___
> hwloc-users mailing list
> hwloc-users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/hwloc-users
>
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users

[hwloc-users] hwloc Python3 Bindings - Correctly Grab number cores available

2020-08-31 Thread Brock Palen

Hello,

I have a small utility,  it is currently using  multiprocess.cpu_count()
Which currently ignores cgroups etc.

I see https://gitlab.com/guystreeter/python-hwloc
But appears stale,

How would you detect number of threads that are safe to start in a cgroup
from Python3 ?

Thanks!

Brock Palen
IG: brockpalen1984
www.umich.edu/~brockp
Director Advanced Research Computing - TS
bro...@umich.edu
(734)936-1985
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users

Re: [hwloc-users] Selecting real cores vs HT cores

2014-12-11 Thread Brock Palen

Right and that is what I figured, they quote performance metrics.  I'm almost 
trying to divine what mapping they use and if its static, 1:1 mapping. 

Thanks for the thoughts.

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985



> On Dec 11, 2014, at 4:41 PM, Jeff Squyres (jsquyres)  
> wrote:
> 
> On Dec 11, 2014, at 1:36 PM, Brock Palen  wrote:
> 
>> Ok let me expand then.  I don't have control over the bios.
>> 
>> The testing I am doing resides on a cloud provider and from our testing it 
>> appears that it has HT enabled.  It is ambiguous though to me what I see vs 
>> how they allocate on their hypervisor. 
> 
> Oh, if you're in a hypervisor, then what you're seeing has zero correlation 
> to reality.
> 
> If it's an HPC cloud provider, they *likely* paired cores in the hypervisor 
> with real/physical cores.  More specifically: they *probably* paired hyper 
> threads in the hypervisor with real/physical hyper threads (i.e., so that the 
> lstopo in the hypervisor is equivalent to lstopo outside the hypervisor).
> 
> But you'll need to ask them, because modern VMs let you do whatever you want 
> in terms of mapping VM cores/HTs to physical cores/HTs.
> 
> Consider: you can run dozens on web server VMs on a machine with 10 cores.  
> Each VM will say that it has, say, 1 or 2 cores.  But clearly, the sum of 
> number of cores in the VMs is larger than the total number of physical cores.
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
> Link to this post: 
> http://www.open-mpi.org/community/lists/hwloc-users/2014/12/1129.php

Re: [hwloc-users] Selecting real cores vs HT cores

2014-12-11 Thread Brock Palen

Ok let me expand then.  I don't have control over the bios.

The testing I am doing resides on a cloud provider and from our testing it 
appears that it has HT enabled.  It is ambiguous though to me what I see vs how 
they allocate on their hypervisor. 

I want to see if this has any effect. given the providers advertised CPU types 
they use vs my bare metal ones of the same types everything feels 'half as 
fast'  Thus the question about HT.

Here is the lstopo from the provider:

lstopo-no-graphics

Machine (7484MB)
  Socket L#0 + L3 L#0 (25MB)
L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
  PU L#0 (P#0)
  PU L#1 (P#2)
L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
  PU L#2 (P#1)
  PU L#3 (P#3)
  HostBridge L#0
PCI 8086:7010
PCI 1013:00b8
PCI 8086:10ed
  Net L#0 "eth0"


Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985



> On Dec 11, 2014, at 4:12 PM, Jeff Squyres (jsquyres)  
> wrote:
> 
> I'm not sure you're asking a well-formed question.
> 
> When the BIOS is set to enable hyper threading, then several resources on the 
> core are split when the machine is booted up (e.g., some of the queue depths 
> for various processing units in the core are half the length that they are 
> when hyperthreading is disabled in the BIOS).
> 
> Hence, running a process on a core that only uses a single hyperthread (when 
> HT is enabled) is not quite the same thing as booting up with HT disabled and 
> running that same job on the core.
> 
> Make sense?
> 
> Meaning: if you want to test HT vs. non-HT performance, you really need to 
> change the BIOS settings and reboot, sorry.
> 
> Also, note that if you have HT enabled and you run a single-threaded app 
> bound to a core, it will only use 1 of those HTs -- the other HT will be 
> largely dormant. Meaning: don't expect that running a single-threaded app on 
> a core that has HT enabled will magically take advantage of some performance 
> benefit of aggressive automatic parallelization.  You really need multiple 
> threads in a process to get performance advantages out of HT.
> 
> 
> 
> On Dec 11, 2014, at 12:51 PM, Brock Palen  wrote:
> 
>> When a system has HT enabled is one core presented the real one and one the 
>> fake partner?  Or is that not the case?
>> 
>> If wanting to test behavior without messing with the bios how do I select 
>> just the 'real cores'  if this is the case?   
>> 
>> I am looking for the equivelent of 
>> 
>> hwloc-bind ALLREALCORES  my.exe
>> 
>> Doing some performance study type things.
>> 
>> Thanks,
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> XSEDE Campus Champion
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
>> ___
>> hwloc-users mailing list
>> hwloc-us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/hwloc-users/2014/12/1126.php
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
> Link to this post: 
> http://www.open-mpi.org/community/lists/hwloc-users/2014/12/1127.php

[hwloc-users] Selecting real cores vs HT cores

2014-12-11 Thread Brock Palen

When a system has HT enabled is one core presented the real one and one the 
fake partner?  Or is that not the case?

If wanting to test behavior without messing with the bios how do I select just 
the 'real cores'  if this is the case?   

I am looking for the equivelent of 

hwloc-bind ALLREALCORES  my.exe

Doing some performance study type things.

Thanks,

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985

Re: [hwloc-users] Using hwloc to map GPU layout on system

2014-02-14 Thread Brock Palen


On Feb 7, 2014, at 9:45 AM, Brice Goglin  wrote:

> Le 06/02/2014 21:31, Brock Palen a écrit :
>> Actually that did turn out to help. The nvml# devices appear to be numbered 
>> in the way that CUDA_VISABLE_DEVICES sees them, while the cuda# devices are 
>> in the order that PBS and nvidia-smi see them.
> 
> By the way, did you have CUDA_VISIBLE_DEVICES set during the lstopo below? 
> Was it set to 2,3,0,1 ? That would explain the reordering.

It was not set, and I have double checked it just now to be sure.

> 
> I am not sure in which order you want to do things in the end. One way that 
> could help is:
> * Get the locality of each GPU by doing CUDA_VISIBLE_DEVICES=x (for x in 
> 0..number of gpus-1). Each iteration gives a single GPU in hwloc, and you can 
> retrieve the corresponding locality from the cuda0 object.
> * Once you know which GPUs you want based on the locality info, take the 
> corresponding #x and put them in CUDA_VISIBLE_DEVICES=x,y before you run your 
> program. hwloc will create cuda0 for x and cuda1 for y.

The cuda ID's match the order if you run nvidia-smi  (which gives you PCI 
addresses)

The nvml id's  match the order in which they start.  That is 
CUDA_VISIBLE_DEVICES=0, cudaSetDevice(0) matches nvml0  which matches id 2 for 
CoProc cuda2 and for nvidia-smi id 2.

This appears to be very consistent between reboots.
te
> 
> If you don't set CUDA_VISIBLE_DEVICES, cuda* objects are basically 
> out-of-order. nvml objects are (a bit less likely) ordered by PCI bus is 
> (lstopo -v would confirm that).

Yes the nvml and what is ordering is by ascending PCI ID,  nvidia-smi shows 
this:

[root@nyx7500 ~]# nvidia-smi | grep Tesla
|   0  Tesla K20Xm Off  | :09:00.0 Off |0 |
|   1  Tesla K20Xm Off  | :0A:00.0 Off |0 |
|   2  Tesla K20Xm Off  | :0D:00.0 Off |0 |
|   3  Tesla K20Xm Off  | :0E:00.0 Off |0 |
|   4  Tesla K20Xm Off  | :28:00.0 Off |0 |
|   5  Tesla K20Xm Off  | :2B:00.0 Off |0 |
|   6  Tesla K20Xm Off  | :30:00.0 Off |0 |
|   7  Tesla K20Xm Off  | :33:00.0 Off |0 |

[root@nyx7500 ~]# lstopo -v
Machine (P#0 total=67073288KB DMIProductName="ProLiant SL270s Gen8   " 
DMIProductVersion= DMIProductSerial="USE3267A92  " 
DMIProductUUID=36353439-3437-5553-4533-323637413932 DMIBoardVendor=HP 
DMIBoardName= DMIBoardVersion= DMIBoardSerial="USE3267A92  " 
DMIBoardAssetTag="" DMIChassisVendor=HP DMIChassisType=25 
DMIChassisVersion= DMIChassisSerial="USE3267A90  " DMIChassisAssetTag=" 
   " DMIBIOSVendor=HP DMIBIOSVersion=P75 DMIBIOSDate=09/18/2013 DMISysVendor=HP 
Backend=Linux LinuxCgroup=/ OSName=Linux OSRelease=2.6.32-358.23.2.el6.x86_64 
OSVersion="#1 SMP Sat Sep 14 05:32:37 EDT 2013" 
HostName=nyx7500.engin.umich.edu Architecture=x86_64)
  NUMANode L#0 (P#0 local=33518860KB total=33518860KB)
Socket L#0 (P#0 CPUModel="Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz" 
CPUVendor=GenuineIntel CPUModelNumber=45 CPUFamilyNumber=6)
  L3Cache L#0 (size=20480KB linesize=64 ways=20)
L2Cache L#0 (size=256KB linesize=64 ways=8)
  L1dCache L#0 (size=32KB linesize=64 ways=8)
L1iCache L#0 (size=32KB linesize=64 ways=8)
  Core L#0 (P#0)
PU L#0 (P#0)
L2Cache L#1 (size=256KB linesize=64 ways=8)
  L1dCache L#1 (size=32KB linesize=64 ways=8)
L1iCache L#1 (size=32KB linesize=64 ways=8)
  Core L#1 (P#1)
PU L#1 (P#1)
L2Cache L#2 (size=256KB linesize=64 ways=8)
  L1dCache L#2 (size=32KB linesize=64 ways=8)
L1iCache L#2 (size=32KB linesize=64 ways=8)
  Core L#2 (P#2)
PU L#2 (P#2)
L2Cache L#3 (size=256KB linesize=64 ways=8)
  L1dCache L#3 (size=32KB linesize=64 ways=8)
L1iCache L#3 (size=32KB linesize=64 ways=8)
  Core L#3 (P#3)
PU L#3 (P#3)
L2Cache L#4 (size=256KB linesize=64 ways=8)
  L1dCache L#4 (size=32KB linesize=64 ways=8)
L1iCache L#4 (size=32KB linesize=64 ways=8)
  Core L#4 (P#4)
PU L#4 (P#4)
L2Cache L#5 (size=256KB linesize=64 ways=8)
  L1dCache L#5 (size=32KB linesize=64 ways=8)
L1iCache L#5 (size=32KB linesize=64 ways=8)
  Core L#5 (P#5)
PU L#5 (P#5)
L2Cache L#6 (size=256KB linesize=64 ways=8)
  L1dCache L#6 (size=32KB linesize=64 ways=8)
L1iCache L#6 (size=32KB linesize=64 ways=8)
  Core L#6 (P#6)
PU L#6 (P#6)
L2Cache L#7 (size

Re: [hwloc-users] Using hwloc to map GPU layout on system

2014-02-06 Thread Brock Palen

Actually that did turn out to help. The nvml# devices appear to be numbered in 
the way that CUDA_VISABLE_DEVICES sees them, while the cuda# devices are in the 
order that PBS and nvidia-smi see them.

  PCIBridge
PCIBridge
  PCIBridge
PCI 10de:1021
  CoProc L#2 "cuda0"
  GPU L#3 "nvml2"
  PCIBridge
PCI 10de:1021
  CoProc L#4 "cuda1"
  GPU L#5 "nvml3"
  PCIBridge
PCIBridge
  PCIBridge
PCI 10de:1021
  CoProc L#6 "cuda2"
  GPU L#7 "nvml0"
  PCIBridge
PCI 10de:1021
  CoProc L#8 "cuda3"
  GPU L#9 "nvml1"


Right now I am trying to create a python script that will take the XML output 
of lstopo and give me just the cuda and nvml devices in order. 

I dont' know if some value are deterministic though.  Could I  ignore the 
CoProc line and just use the:

  GPU L#3 "nvml2"
  GPU L#5 "nvml3"
  GPU L#7 "nvml0"
  GPU L#9 "nvml1"

Is the L# always going to be in the oder I would expect?  Because then I 
already have my map then. 

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985



On Feb 5, 2014, at 1:19 AM, Brice Goglin  wrote:

> Hello Brock,
> 
> Some people reported the same issue in the past and that's why we added the 
> "nvml" objects. CUDA reorders devices by "performance". Batch-schedulers are 
> somehow supposed to use "nvml" for managing GPUs without actually using them 
> with CUDA directly. And the "nvml" order is the "normal" order.
> 
> You need "tdk" (https://developer.nvidia.com/tesla-deployment-kit) to get 
> nvml library and development headers installed. Then hwloc can build its 
> "nvml" backend. Once ready, you'll see a hwloc "cudaX" and a hwloc "nvmlY" 
> object in each NVIDIA PCI devices, and you can get their locality as usual.
> 
> Does this help?
> 
> Brice
> 
> 
> 
> Le 05/02/2014 05:25, Brock Palen a écrit :
>> We are trying to build a system to mask users to the GPU's they were 
>> assigned by our batch system (torque).
>> 
>> The batch system sets the GPU's into thread exclusive mode when assigned to 
>> a job, so we want the GPU that the batch system assigns to be the one set in 
>> CUDA_VISIBLE_DEVICES,
>> 
>> Problem is on our nodes what the batch system sees as gpu 0  is not the same 
>> GPU that CUDA_VISIBLE_DEVICES sees as 0.   Actually 0  is 2.
>> 
>> You can see this behavior is you run 
>> 
>> nvidia-smi  and look at the PCI ID's of the devices.  You can then look at 
>> the PCI ID's outputed by deviceQuery from the SDK examples and see they are 
>> in a different order.
>> 
>> The ID's you would set in CUDA_VISIBLE_DEVICES matches the order that 
>> deviceQuery sees, not the order that nvida-smi sees.
>> 
>> Example (All values turned to decimal to match deviceQuery):
>> 
>> nvidia-smi order: 9, 10, 13, 14, 40, 43, 48, 51
>> dviceQuery order: 13, 14, 9, 10, 40, 43, 48, 51
>> 
>> 
>> Can hwloc help me with this?  Right now I am hacking a script based on the 
>> output of the two commands, and making a map, between the two and then set 
>> CUDA_VISIBLE_DEVICES
>> 
>> Any ideas would be great. Later as we currently also use CPU sets, we want 
>> to pass GPU locality information to the scheduler to make decisions to match 
>> GPU-> CPU Socket information, as performance of threads across QPI domains 
>> is very poor. 
>> 
>> Thanks
>> 
>> Machine (64GB)
>>   NUMANode L#0 (P#0 32GB)
>> Socket L#0 + L3 L#0 (20MB)
>>   L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 
>> (P#0)
>>   L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 
>> (P#1)
>>   L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 
>> (P#2)
>>   L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 
>> (P#3)
>>   L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 
>> (P#4)
>>   L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 
>> (P#5)
>>   L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 
>> (P#6)
>>   L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 
>> (P#7)
>> HostBridge L#0
>>

[hwloc-users] Using hwloc to map GPU layout on system

2014-02-04 Thread Brock Palen

We are trying to build a system to mask users to the GPU's they were assigned 
by our batch system (torque).

The batch system sets the GPU's into thread exclusive mode when assigned to a 
job, so we want the GPU that the batch system assigns to be the one set in 
CUDA_VISIBLE_DEVICES,

Problem is on our nodes what the batch system sees as gpu 0  is not the same 
GPU that CUDA_VISIBLE_DEVICES sees as 0.   Actually 0  is 2.

You can see this behavior is you run 

nvidia-smi  and look at the PCI ID's of the devices.  You can then look at the 
PCI ID's outputed by deviceQuery from the SDK examples and see they are in a 
different order.

The ID's you would set in CUDA_VISIBLE_DEVICES matches the order that 
deviceQuery sees, not the order that nvida-smi sees.

Example (All values turned to decimal to match deviceQuery):

nvidia-smi order: 9, 10, 13, 14, 40, 43, 48, 51
dviceQuery order: 13, 14, 9, 10, 40, 43, 48, 51


Can hwloc help me with this?  Right now I am hacking a script based on the 
output of the two commands, and making a map, between the two and then set 
CUDA_VISIBLE_DEVICES

Any ideas would be great. Later as we currently also use CPU sets, we want to 
pass GPU locality information to the scheduler to make decisions to match GPU-> 
CPU Socket information, as performance of threads across QPI domains is very 
poor. 

Thanks

Machine (64GB)
  NUMANode L#0 (P#0 32GB)
Socket L#0 + L3 L#0 (20MB)
  L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
  L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
  L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
  L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
  L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
  L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
  L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
  L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
HostBridge L#0
  PCIBridge
PCI 1000:0087
  Block L#0 "sda"
  Block L#1 "sdb"
  PCIBridge
PCIBridge
  PCIBridge
PCI 10de:1021
  CoProc L#2 "cuda0"
  PCIBridge
PCI 10de:1021
  CoProc L#3 "cuda1"
  PCIBridge
PCIBridge
  PCIBridge
PCI 10de:1021
  CoProc L#4 "cuda2"
  PCIBridge
PCI 10de:1021
  CoProc L#5 "cuda3"
  PCIBridge
PCI 8086:1521
  Net L#6 "eth0"
PCI 8086:1521
  Net L#7 "eth1"
  PCIBridge
PCI 102b:0533
  PCI 8086:1d02
  NUMANode L#1 (P#1 32GB)
Socket L#1 + L3 L#1 (20MB)
  L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
  L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
  L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 
(P#10)
  L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 
(P#11)
  L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 
(P#12)
  L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 
(P#13)
  L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 
(P#14)
  L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 
(P#15)
HostBridge L#12
  PCIBridge
PCIBridge
  PCIBridge
PCI 15b3:1003
  Net L#8 "eth2"
  Net L#9 "ib0"
  Net L#10 "eoib0"
  OpenFabrics L#11 "mlx4_0"
  PCIBridge
PCIBridge
  PCIBridge
PCI 10de:1021
  CoProc L#12 "cuda4"
  PCIBridge
PCI 10de:1021
  CoProc L#13 "cuda5"
  PCIBridge
PCIBridge
  PCIBridge
PCI 10de:1021
  CoProc L#14 "cuda6"
  PCIBridge
PCI 10de:1021
  CoProc L#15 "cuda7"


Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985





signature.asc
Description: Message signed with OpenPGP using GPGMail

Re: [hwloc-users] Strange binding issue on 40 core nodes and cgroups

2012-11-06 Thread Brock Palen

Chis,

If you assume your Cpusets are correct, and you are not doing any hybrid 
thread+mpi I found the problem is avoided if you enable -bind-to-core with 
openmpi 1.6.x  

We just don't enable binding by default on our setup and thus far no users have 
been bit by this. 

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734)936-1985



On Nov 5, 2012, at 9:00 PM, Christopher Samuel wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> On 06/11/12 08:57, Brock Palen wrote:
> 
>> Ok more information (had to build newer hwloc)  My job today only
>> 2 processes are running at half speed and they indeed are sharing
>> the same core:
> 
> We've seen the same occasionally using CentOS5/RHEL5 with jobs running
> under Torque with cpusets enabled.
> 
> Never been able to explain it and the most recent case was someone
> using a home compiled version of NAMD, the problem disappeared when
> they started using our provided builds.
> 
> I was fixing up the running problem jobs by hand by assigning procs to
> individual cores on the nodes with cpusets.  :-/
> 
> cheers,
> Chris
> - -- 
> Christopher SamuelSenior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/  http://twitter.com/vlsci
> 
> -BEGIN PGP SIGNATURE-
> Version: GnuPG v1.4.11 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://www.enigmail.net/
> 
> iEYEARECAAYFAlCYb1sACgkQO2KABBYQAh/OGACeNL7bow7z26El31zIg16q+tPw
> toIAnigL5SHhZXM42DGY3M2Ewt6PUNIk
> =/bNA
> -END PGP SIGNATURE-
> ___
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users

Re: [hwloc-users] Strange binding issue on 40 core nodes and cgroups

2012-11-05 Thread Brock Palen

Ok more information (had to build newer hwloc)  My job today only 2 processes 
are running at half speed and they indeed are sharing the same core:

[root@nyx7000 ~]# for x in `cat /tmp/pids `; do echo -n "$x  "; hwloc-bind 
--get-last-cpu-location --pid $x; done | sort -k 2
1164  0x0001,0x0
1158  0x0010,0x0
1165  0x0010,0x0
1167  0x0020
1157  0x0200
1159  0x0400
1160  0x2000
1163  0x4000
1166  0x0002
1161  0x0004
1168  0x0020
1162  0x0040

 1157 brockp20   0 1885m 1.8g  456 R 99.6  0.2   9:49.55 stream 
  
 1159 brockp20   0 1885m 1.8g  456 R 99.6  0.2   8:10.91 stream 
  
 1161 brockp20   0 1885m 1.8g  456 R 99.6  0.2   9:49.55 stream 
  
 1162 brockp20   0 1885m 1.8g  456 R 99.6  0.2   9:49.54 stream 
  
 1163 brockp20   0 1885m 1.8g  456 R 99.6  0.2   9:49.55 stream 
  
 1164 brockp20   0 1885m 1.8g  456 R 99.6  0.2   9:49.53 stream 
  
 1160 brockp20   0 1885m 1.8g  456 R 97.7  0.2   9:49.54 stream 
  
 1166 brockp20   0 1885m 1.8g  456 R 97.7  0.2   9:49.53 stream 
  
 1167 brockp20   0 1885m 1.8g  456 R 97.7  0.2   9:49.46 stream 
  
 1168 brockp20   0 1885m 1.8g  456 R 97.7  0.2   8:10.86 stream 
  
 1158 brockp20   0 1885m 1.8g  456 R 48.9  0.2   4:54.78 stream 
  
 1165 brockp20   0 1885m 1.8g  456 R 48.9  0.2   4:54.76 stream 
  


This is very strange. Is there a way to ask hwloc to show me all processes that 
are using a given cpu?




Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734)936-1985



On Nov 2, 2012, at 4:30 PM, Brice Goglin wrote:

> Le 02/11/2012 21:22, Brice Goglin a écrit :
>> hwloc-bind --get-last-cpu-location --pid  should give the same
>> info but it seems broken on my machine right now, going to debug.
> 
> Actually, that works fine once you try it on a non-multithreaded program
> that uses all cores :)
> 
> So you can use top or hwloc-bind --get-last-cpu-location --pid  to
> find out where each process runs.
> 
> Brice
> 
> ___
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users

[hwloc-users] Strange binding issue on 40 core nodes and cgroups

2012-11-02 Thread Brock Palen

This isn't a hwloc problem exactly, but maybe you can shed some insight.

We have some 4 socket 10 core = 40 core nodes, HT off:

depth 0:1 Machine (type #1)
 depth 1:   4 NUMANodes (type #2)
  depth 2:  4 Sockets (type #3)
   depth 3: 4 Caches (type #4)
depth 4:40 Caches (type #4)
 depth 5:   40 Caches (type #4)
  depth 6:  40 Cores (type #5)
   depth 7: 40 PUs (type #6)


We run rhel 6.3  we use torque to create cgroups for jobs.  I get the following 
cgroup for this job  all 12 cores for the job are on one node:
cat /dev/cpuset/torque/8845236.nyx.engin.umich.edu/cpus 
0-1,4-5,8,12,16,20,24,28,32,36

Not all nicely spaced, but 12 cores

I then start a code, even a simple serial code with openmpi 1.6.0 on all 12 
cores:
mpirun ./stream

45521 brockp20   0 1885m 1.8g  456 R 100.0  0.2   4:02.72 stream
 
45522 brockp20   0 1885m 1.8g  456 R 100.0  0.2   1:46.08 stream
 
45525 brockp20   0 1885m 1.8g  456 R 100.0  0.2   4:02.72 stream
 
45526 brockp20   0 1885m 1.8g  456 R 100.0  0.2   1:46.07 stream
 
45527 brockp20   0 1885m 1.8g  456 R 100.0  0.2   4:02.71 stream
 
45528 brockp20   0 1885m 1.8g  456 R 100.0  0.2   4:02.71 stream
 
45532 brockp20   0 1885m 1.8g  456 R 100.0  0.2   1:46.05 stream
 
45529 brockp20   0 1885m 1.8g  456 R 99.2  0.2   4:02.70 stream 
 
45530 brockp20   0 1885m 1.8g  456 R 99.2  0.2   4:02.70 stream 
 
45531 brockp20   0 1885m 1.8g  456 R 33.6  0.2   1:20.89 stream 
 
45523 brockp20   0 1885m 1.8g  456 R 32.8  0.2   1:20.90 stream 
 
45524 brockp20   0 1885m 1.8g  456 R 32.8  0.2   1:20.89 stream   

Note the processes that are not running at 100% cpu, 

hwloc-bind  --get --pid 45523
0x0011,0x1133


hwloc-calc 0x0011,0x1133 --intersect PU
0,1,2,3,4,5,6,7,8,9,10,11

So all ranks in the job should see all 12 cores.  The same cgroup is reported 
by /proc//cgroup

Not only that I can make things work by forcing binding in the mpi launcher:
mpirun -bind-to-core ./stream

46886 brockp20   0 1885m 1.8g  456 R 99.8  0.2   0:15.49 stream 
 
46887 brockp20   0 1885m 1.8g  456 R 99.8  0.2   0:15.49 stream 
 
46888 brockp20   0 1885m 1.8g  456 R 99.8  0.2   0:15.48 stream 
 
46889 brockp20   0 1885m 1.8g  456 R 99.8  0.2   0:15.49 stream 
 
46890 brockp20   0 1885m 1.8g  456 R 99.8  0.2   0:15.48 stream 
 
46891 brockp20   0 1885m 1.8g  456 R 99.8  0.2   0:15.48 stream 
 
46892 brockp20   0 1885m 1.8g  456 R 99.8  0.2   0:15.47 stream 
 
46893 brockp20   0 1885m 1.8g  456 R 99.8  0.2   0:15.47 stream 
 
46894 brockp20   0 1885m 1.8g  456 R 99.8  0.2   0:15.47 stream 
 
46895 brockp20   0 1885m 1.8g  456 R 99.8  0.2   0:15.47 stream 
 
46896 brockp20   0 1885m 1.8g  456 R 99.8  0.2   0:15.46 stream 
 
46897 brockp20   0 1885m 1.8g  456 R 99.8  0.2   0:15.46 stream 

Things are now working as expected, and I should stress this is inside the same 
torque job and cgroup that I started with.

A multi threaded version of the code does use close to 12 cores as expected.

If I cervumvent out batch system and the cgroups a normal mpirun ./stream  does 
start 12 processes that consume a full 100% core. 

Thoughts?  This is really odd linux scheduler behavior.

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734)936-1985

Re: [hwloc-users] HWLoc Documentation pages 404's

2012-08-10 Thread Brock Palen

Yep very odd,

Looks like torque wrote a wrapper then for some hwloc functions.

BTW working with cgroups/cpusets in our resource manager  hwloc-info --pid  is 
_wonderful_  

I think I am good to go.

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734)936-1985



On Aug 10, 2012, at 5:14 PM, Jeff Squyres wrote:

> I don't know why Google is pointing you there...
> 
> I went back as far as as hwloc 1.3 and I cannot find a function named 
> hwloc_bitmap_displaylist() -- that's probably why you can't find any 
> reference to it in the docs.  :-)
> 
> 
> 
> On Aug 10, 2012, at 4:55 PM, Brock Palen wrote:
> 
>> Google is giving me this url:
>> www.open-mpi.org/projects/hwloc//doc/v1.5/a2.php
>> 
>> When i searched for hwloc_bitmap_displaylist()   (for which I can find 
>> nothing nor a manpage :-) )
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
>> On Aug 10, 2012, at 4:26 PM, Jeff Squyres wrote:
>> 
>>> Try looking here:
>>> 
>>> http://www.open-mpi.org/projects/hwloc/doc/
>>> 
>>> You have an extra "projects" in your URL.  How did you get to that URL?  Do 
>>> we have a bug in our web pages somewhere?
>>> 
>>> 
>>> On Aug 10, 2012, at 3:56 PM, Brock Palen wrote:
>>> 
>>>> http://www.open-mpi.org/projects/projects/hwloc/doc/
>>>> 
>>>> Oh noooss!!!
>>>> 
>>>> Brock Palen
>>>> www.umich.edu/~brockp
>>>> CAEN Advanced Computing
>>>> bro...@umich.edu
>>>> (734)936-1985
>>>> 
>>>> 
>>>> 
>>>> ___
>>>> hwloc-users mailing list
>>>> hwloc-us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
>>> 
>>> 
>>> -- 
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to: 
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 
>>> 
>>> ___
>>> hwloc-users mailing list
>>> hwloc-us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
>> 
>> 
>> ___
>> hwloc-users mailing list
>> hwloc-us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users

Re: [hwloc-users] HWLoc Documentation pages 404's

2012-08-10 Thread Brock Palen

Google is giving me this url:
www.open-mpi.org/projects/hwloc//doc/v1.5/a2.php

When i searched for hwloc_bitmap_displaylist()   (for which I can find nothing 
nor a manpage :-) )

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734)936-1985



On Aug 10, 2012, at 4:26 PM, Jeff Squyres wrote:

> Try looking here:
> 
>  http://www.open-mpi.org/projects/hwloc/doc/
> 
> You have an extra "projects" in your URL.  How did you get to that URL?  Do 
> we have a bug in our web pages somewhere?
> 
> 
> On Aug 10, 2012, at 3:56 PM, Brock Palen wrote:
> 
>> http://www.open-mpi.org/projects/projects/hwloc/doc/
>> 
>> Oh noooss!!!
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
>> ___
>> hwloc-users mailing list
>> hwloc-us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users

[hwloc-users] HWLoc Documentation pages 404's

2012-08-10 Thread Brock Palen

http://www.open-mpi.org/projects/projects/hwloc/doc/

Oh noooss!!!

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734)936-1985

[hwloc-users] hwloc featured podcast released

2010-05-12 Thread Brock Palen


Thank you to Samuel for being on the show with Jeff and I.

You can find the show at:
http://www.rce-cast.com/Podcast/rce-33-hwloc-portable-hardware-locality.html

There you can subscribe by iTunes or by RSS feed.

Thank you again to the hwloc community, I am now using lstopo  
regularly to see the layout of our many nodes.  If you would like to  
see something on the rce-cast.com podcast, please contact me off list.


Thanks!

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985

Re: [hwloc-users] howloc with scalemp

2010-04-07 Thread Brock Palen




Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



On Apr 7, 2010, at 4:51 PM, Brice Goglin wrote:


Brock Palen wrote:

[brockp@nyx0809 INTEL]$ lstopo -
System(79GB)
 Misc0
   Node#0(10GB) + Socket#1 + L3(8192KB)
 L2(256KB) + L1(32KB) + Core#0 + P#0
 L2(256KB) + L1(32KB) + Core#1 + P#1
 L2(256KB) + L1(32KB) + Core#2 + P#2
 L2(256KB) + L1(32KB) + Core#3 + P#3
   Node#1(10GB) + Socket#0 + L3(8192KB)
 L2(256KB) + L1(32KB) + Core#0 + P#4
 L2(256KB) + L1(32KB) + Core#1 + P#5
 L2(256KB) + L1(32KB) + Core#2 + P#6
 L2(256KB) + L1(32KB) + Core#3 + P#7
 Misc0
   Node#2(10GB) + Socket#3 + L3(8192KB)
 L2(256KB) + L1(32KB) + Core#0 + P#8
 L2(256KB) + L1(32KB) + Core#1 + P#9
 L2(256KB) + L1(32KB) + Core#2 + P#10
 L2(256KB) + L1(32KB) + Core#3 + P#11
   Node#3(10GB) + Socket#2 + L3(8192KB)
 L2(256KB) + L1(32KB) + Core#0 + P#12
 L2(256KB) + L1(32KB) + Core#1 + P#13
 L2(256KB) + L1(32KB) + Core#2 + P#14
 L2(256KB) + L1(32KB) + Core#3 + P#15
 Misc0
   Node#4(10GB) + Socket#5 + L3(8192KB)
 L2(256KB) + L1(32KB) + Core#0 + P#16
 L2(256KB) + L1(32KB) + Core#1 + P#17
 L2(256KB) + L1(32KB) + Core#2 + P#18
 L2(256KB) + L1(32KB) + Core#3 + P#19
   Node#5(10GB) + Socket#4 + L3(8192KB)
 L2(256KB) + L1(32KB) + Core#0 + P#20
 L2(256KB) + L1(32KB) + Core#1 + P#21
 L2(256KB) + L1(32KB) + Core#2 + P#22
 L2(256KB) + L1(32KB) + Core#3 + P#23
 Misc0
   Node#6(10GB) + Socket#7 + L3(8192KB)
 L2(256KB) + L1(32KB) + Core#0 + P#24
 L2(256KB) + L1(32KB) + Core#1 + P#25
 L2(256KB) + L1(32KB) + Core#2 + P#26
 L2(256KB) + L1(32KB) + Core#3 + P#27
   Node#7(10GB) + Socket#6 + L3(8192KB)
 L2(256KB) + L1(32KB) + Core#0 + P#28
 L2(256KB) + L1(32KB) + Core#1 + P#29
 L2(256KB) + L1(32KB) + Core#2 + P#30
 L2(256KB) + L1(32KB) + Core#3 + P#31

I don't know why they are all labeled Misc0  but it does see the  
extra

layer.

If you want other information let me know.


Great, there are probably some distance information in sysfs.

Can you send the output of
   cat /sys/devices/system/node/node*/distance


Sure:
[root@nyx0809 ~]# cat /sys/devices/system/node/node*/distance
10 20 254 254 254 254 254 254
20 10 254 254 254 254 254 254
254 254 10 20 254 254 254 254
254 254 20 10 254 254 254 254
254 254 254 254 10 20 254 254
254 254 254 254 20 10 254 254
254 254 254 254 254 254 10 20
254 254 254 254 254 254 20 10




Brice

___
hwloc-users mailing list
hwloc-us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users

Re: [hwloc-users] howloc with scalemp

2010-04-07 Thread Brock Palen


Brice Goglin wrote:


Brock Palen wrote:

has anyone done work with hwloc on scalemp systems?  They provide
their own tool numabind, but we are looking for a more generic
solution to process placement and control that works well inside our
MPI library (openMPI in most cases).

Any input on this would be great!


Hello Brock,


From what I remember, ScaleMP uses an hypervisor on each node that
virtually merges all of them into a fake big shared-memory machine.  
Then

a vanilla Linux kernel runs on top of it. So hwloc should just see
regular cores and NUMA node information, assuming the virtual "merged"
hardware reports all necessary information to the OS.



running lstopo 0.9.3  it appears that howloc does see the extra layer  
of complexity:


[brockp@nyx0809 INTEL]$ lstopo -
System(79GB)
  Misc0
Node#0(10GB) + Socket#1 + L3(8192KB)
  L2(256KB) + L1(32KB) + Core#0 + P#0
  L2(256KB) + L1(32KB) + Core#1 + P#1
  L2(256KB) + L1(32KB) + Core#2 + P#2
  L2(256KB) + L1(32KB) + Core#3 + P#3
Node#1(10GB) + Socket#0 + L3(8192KB)
  L2(256KB) + L1(32KB) + Core#0 + P#4
  L2(256KB) + L1(32KB) + Core#1 + P#5
  L2(256KB) + L1(32KB) + Core#2 + P#6
  L2(256KB) + L1(32KB) + Core#3 + P#7
  Misc0
Node#2(10GB) + Socket#3 + L3(8192KB)
  L2(256KB) + L1(32KB) + Core#0 + P#8
  L2(256KB) + L1(32KB) + Core#1 + P#9
  L2(256KB) + L1(32KB) + Core#2 + P#10
  L2(256KB) + L1(32KB) + Core#3 + P#11
Node#3(10GB) + Socket#2 + L3(8192KB)
  L2(256KB) + L1(32KB) + Core#0 + P#12
  L2(256KB) + L1(32KB) + Core#1 + P#13
  L2(256KB) + L1(32KB) + Core#2 + P#14
  L2(256KB) + L1(32KB) + Core#3 + P#15
  Misc0
Node#4(10GB) + Socket#5 + L3(8192KB)
  L2(256KB) + L1(32KB) + Core#0 + P#16
  L2(256KB) + L1(32KB) + Core#1 + P#17
  L2(256KB) + L1(32KB) + Core#2 + P#18
  L2(256KB) + L1(32KB) + Core#3 + P#19
Node#5(10GB) + Socket#4 + L3(8192KB)
  L2(256KB) + L1(32KB) + Core#0 + P#20
  L2(256KB) + L1(32KB) + Core#1 + P#21
  L2(256KB) + L1(32KB) + Core#2 + P#22
  L2(256KB) + L1(32KB) + Core#3 + P#23
  Misc0
Node#6(10GB) + Socket#7 + L3(8192KB)
  L2(256KB) + L1(32KB) + Core#0 + P#24
  L2(256KB) + L1(32KB) + Core#1 + P#25
  L2(256KB) + L1(32KB) + Core#2 + P#26
  L2(256KB) + L1(32KB) + Core#3 + P#27
Node#7(10GB) + Socket#6 + L3(8192KB)
  L2(256KB) + L1(32KB) + Core#0 + P#28
  L2(256KB) + L1(32KB) + Core#1 + P#29
  L2(256KB) + L1(32KB) + Core#2 + P#30
  L2(256KB) + L1(32KB) + Core#3 + P#31

I don't know why they are all labeled Misc0  but it does see the extra  
layer.


If you want other information let me know.


There's a bit of ScaleMP code in the Linux kernel, but it does pretty
much nothing, it does not seem to add anything to /proc or /sys for
instance. So I am not sure hwloc could get some specialized  
knowledge of

ScaleMP machines. Maybe their custom numabind tool knows that ScaleMP
machines only works on machines with some well-defined
types/counts/numbering of processors and NUMA nodes, and thus uses  
this

information to group sockets/NUMA-nodes depending on their physical
distance.

Brice

___
hwloc-users mailing list
hwloc-us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users

[hwloc-users] howloc with scalemp

2010-04-07 Thread Brock Palen

has anyone done work with hwloc on scalemp systems?  They provide  
their own tool numabind, but we are looking for a more generic  
solution to process placement and control that works well inside our  
MPI library (openMPI in most cases).


Any input on this would be great!

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985

Re: [hwloc-users] hwloc 0.9.3 not showing opt275 caches correctly?

2010-01-25 Thread Brock Palen


yes they all show up as 1024K

cat /sys/devices/system/cpu/cpu*/cache/index*/size
1024K
1024K
1024K
1024K
1024K
1024K
1024K
1024K
1024K
1024K
1024K
1024K

Thanks for the input.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



On Jan 23, 2010, at 2:22 PM, Samuel Thibault wrote:


Hello,

Brock Palen, le Sat 23 Jan 2010 13:51:09 -0500, a écrit :

System(7870MB)
 Node#0(3906MB) + Socket#0
   L2(1024KB) + L1(1024KB) + Core#0 + P#0
   L2(1024KB) + L1(1024KB) + Core#1 + P#1
 Node#1(4040MB) + Socket#1
   L2(1024KB) + L1(1024KB) + Core#0 + P#2
   L2(1024KB) + L1(1024KB) + Core#1 + P#3

If I am reading the AMD docs right, the L1 cache for each core should
be smaller, and in two parts, (data and instruction cache)
Also appears that L2 should be shared, as far as I can tell, it is  
not

shared in this case.

Am I looking at this wrong?


No, that's the right interpretation of lstopo's output.

However, the bug probably lies into your kernel's code. Could you  
check


/sys/devices/system/cpu/cpu*/cache/index*/size

?

Samuel
___
hwloc-users mailing list
hwloc-us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users

[hwloc-users] hwloc 0.9.3 not showing opt275 caches correctly?

2010-01-23 Thread Brock Palen


lstopo on our opteron 275  login node gives the following output


System(7870MB)
  Node#0(3906MB) + Socket#0
L2(1024KB) + L1(1024KB) + Core#0 + P#0
L2(1024KB) + L1(1024KB) + Core#1 + P#1
  Node#1(4040MB) + Socket#1
L2(1024KB) + L1(1024KB) + Core#0 + P#2
L2(1024KB) + L1(1024KB) + Core#1 + P#3

If I am reading the AMD docs right, the L1 cache for each core should  
be smaller, and in two parts, (data and instruction cache)
Also appears that L2 should be shared, as far as I can tell, it is not  
shared in this case.


Am I looking at this wrong?


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985

Re: [hwloc-users] hwloc Python3 Bindings - Correctly Grab number cores available

Re: [hwloc-users] hwloc Python3 Bindings - Correctly Grab number cores available

[hwloc-users] hwloc Python3 Bindings - Correctly Grab number cores available

Re: [hwloc-users] Selecting real cores vs HT cores

Re: [hwloc-users] Selecting real cores vs HT cores

[hwloc-users] Selecting real cores vs HT cores

Re: [hwloc-users] Using hwloc to map GPU layout on system

Re: [hwloc-users] Using hwloc to map GPU layout on system

[hwloc-users] Using hwloc to map GPU layout on system

Re: [hwloc-users] Strange binding issue on 40 core nodes and cgroups

Re: [hwloc-users] Strange binding issue on 40 core nodes and cgroups

[hwloc-users] Strange binding issue on 40 core nodes and cgroups

Re: [hwloc-users] HWLoc Documentation pages 404's

Re: [hwloc-users] HWLoc Documentation pages 404's

[hwloc-users] HWLoc Documentation pages 404's

[hwloc-users] hwloc featured podcast released

Re: [hwloc-users] howloc with scalemp

Re: [hwloc-users] howloc with scalemp

[hwloc-users] howloc with scalemp

Re: [hwloc-users] hwloc 0.9.3 not showing opt275 caches correctly?

[hwloc-users] hwloc 0.9.3 not showing opt275 caches correctly?

21 matches

Site Navigation

Mail list logo

Footer information