Re: [hwloc-users] unusual memory binding results

2019-01-29 Thread Biddiscombe, John A.
The answer is "no", I don't have root access, but I suspect that would be the 
right fix: it is currently set to [always], and either madvise or never would 
be good options. If it is of interest, I'll ask someone to try it and report 
back on what happens.

-Original Message-
From: Brice Goglin  
Sent: 29 January 2019 15:39
To: Biddiscombe, John A. ; Hardware locality user list 

Subject: Re: [hwloc-users] unusual memory binding results

Only the one in brackets is set, others are unset alternatives.

If you write "madvise" in that file, it'll become "always [madvise] never".

Brice
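
(A minimal sketch of that write from C, assuming root access; it changes the
system-wide default, exactly as echoing into the file would.)

#include <stdio.h>

FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "w");
if (f) {
    fputs("madvise", f);   /* file then reads "always [madvise] never" */
    fclose(f);
}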


Le 29/01/2019 à 15:36, Biddiscombe, John A. a écrit :
> On the 8 numa node machine
>
> $cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never
>
> is set already, so I'm not really sure what should go in there to disable it.
>
> JB
>
> -Original Message-
> From: Brice Goglin 
> Sent: 29 January 2019 15:29
> To: Biddiscombe, John A. ; Hardware locality user 
> list 
> Subject: Re: [hwloc-users] unusual memory binding results
>
> Oh, that's very good to know. I guess lots of people using first touch will 
> be affected by this issue. We may want to add a hwloc memory flag doing 
> something similar.
>
> Do you have root access to verify that writing "never" or "madvise" in 
> /sys/kernel/mm/transparent_hugepage/enabled fixes the issue too?
>
> Brice
>
>
>
> Le 29/01/2019 à 14:02, Biddiscombe, John A. a écrit :
>> Brice
>>
>> madvise(addr, n * sizeof(T), MADV_NOHUGEPAGE)
>>
>> seems to make things behave much more sensibly. I had no idea it was a 
>> thing, but one of my colleagues pointed me to it.
>>
>> Problem seems to be solved for now. Thank you very much for your insights 
>> and suggestions/help.
>>
>> JB
>>
>> -Original Message-
>> From: Brice Goglin 
>> Sent: 29 January 2019 10:35
>> To: Biddiscombe, John A. ; Hardware locality user 
>> list 
>> Subject: Re: [hwloc-users] unusual memory binding results
>>
>> Crazy idea: 512 pages could be replaced with a single 2MB huge page.
>> You're not requesting huge pages in your allocation but some systems 
>> have transparent huge pages enabled by default (e.g. RHEL
>> https://access.redhat.com/solutions/46111)
>>
>> This could explain why 512 pages get allocated on the same node, but it 
>> wouldn't explain crazy patterns you've seen in the past.
>>
>> Brice
>>
>>
>>
>>
>> Le 29/01/2019 à 10:23, Biddiscombe, John A. a écrit :
>>> I simplified things and instead of writing to a 2D array, I allocate a 1D 
>>> array of bytes and touch pages in a linear fashion.
>>> Then I call syscall(__NR_move_pages, ...) and retrieve a status array for 
>>> each page in the data.
>>>
>>> When I allocate 511 pages and touch alternate pages on alternate 
>>> numa nodes
>>>
>>> Numa page binding 511
>>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
>>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
>>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
>>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
>>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
>>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
>>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
>>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
>>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
>>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
>>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
>>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
>>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
>>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
>>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
>>>
>>> but as soon as I increase to 512 pages, it breaks.
>>>
>>> Numa page binding 512
>>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

Re: [hwloc-users] unusual memory binding results

2019-01-29 Thread Biddiscombe, John A.
On the 8 numa node machine

$cat /sys/kernel/mm/transparent_hugepage/enabled 
[always] madvise never

is set already, so I'm not really sure what should go in there to disable it.

JB

-Original Message-
From: Brice Goglin  
Sent: 29 January 2019 15:29
To: Biddiscombe, John A. ; Hardware locality user list 

Subject: Re: [hwloc-users] unusual memory binding results

Oh, that's very good to know. I guess lots of people using first touch will be 
affected by this issue. We may want to add a hwloc memory flag doing something 
similar.

Do you have root access to verify that writing "never" or "madvise" in 
/sys/kernel/mm/transparent_hugepage/enabled fixes the issue too?

Brice



Le 29/01/2019 à 14:02, Biddiscombe, John A. a écrit :
> Brice
>
> madvise(addr, n * sizeof(T), MADV_NOHUGEPAGE)
>
> seems to make things behave much more sensibly. I had no idea it was a thing, 
> but one of my colleagues pointed me to it.
>
> Problem seems to be solved for now. Thank you very much for your insights and 
> suggestions/help.
>
> JB
>
> -Original Message-
> From: Brice Goglin 
> Sent: 29 January 2019 10:35
> To: Biddiscombe, John A. ; Hardware locality user 
> list 
> Subject: Re: [hwloc-users] unusual memory binding results
>
> Crazy idea: 512 pages could be replaced with a single 2MB huge page.
> You're not requesting huge pages in your allocation but some systems 
> have transparent huge pages enabled by default (e.g. RHEL
> https://access.redhat.com/solutions/46111)
>
> This could explain why 512 pages get allocated on the same node, but it 
> wouldn't explain crazy patterns you've seen in the past.
>
> Brice
>
>
>
>
> Le 29/01/2019 à 10:23, Biddiscombe, John A. a écrit :
>> I simplified things and instead of writing to a 2D array, I allocate a 1D 
>> array of bytes and touch pages in a linear fashion.
>> Then I call syscall(__NR_move_pages, ...) and retrieve a status array for 
>> each page in the data.
>>
>> When I allocate 511 pages and touch alternate pages on alternate numa 
>> nodes
>>
>> Numa page binding 511
>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
>>
>> but as soon as I increase to 512 pages, it breaks.
>>
>> Numa page binding 512
>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>
>> On the 8 numa node machine it sometimes gives the right answer even with 512 
>> pages.
>>
>> Still baffled
>>
>> JB
>>
>> -Original Message-
>> From: hwloc-users  On Behalf Of 
>> Biddiscombe, John A.
>> Sent: 28 January 2019 16:14
>> To: Brice 

Re: [hwloc-users] unusual memory binding results

2019-01-29 Thread Biddiscombe, John A.
Brice

madvise(addr, n * sizeof(T), MADV_NOHUGEPAGE)

seems to make things behave much more sensibly. I had no idea it was a thing, 
but one of my colleagues pointed me to it.
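
(For reference, a minimal sketch of the allocation pattern that worked here;
the page size and helper name are illustrative, not the actual HPX code.)

#include <stdlib.h>
#include <sys/mman.h>

void *alloc_no_thp(size_t bytes)
{
    void *p = NULL;
    /* madvise() needs a page-aligned start address */
    if (posix_memalign(&p, 4096, bytes) != 0)
        return NULL;
    /* stop the kernel from collapsing this range into 2MB transparent huge
       pages, so first-touch binds individual 4KB pages */
    madvise(p, bytes, MADV_NOHUGEPAGE);
    return p;
}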

Problem seems to be solved for now. Thank you very much for your insights and 
suggestions/help.

JB

-Original Message-
From: Brice Goglin  
Sent: 29 January 2019 10:35
To: Biddiscombe, John A. ; Hardware locality user list 

Subject: Re: [hwloc-users] unusual memory binding results

Crazy idea: 512 pages could be replaced with a single 2MB huge page.
You're not requesting huge pages in your allocation but some systems have 
transparent huge pages enabled by default (e.g. RHEL
https://access.redhat.com/solutions/46111)

This could explain why 512 pages get allocated on the same node, but it 
wouldn't explain crazy patterns you've seen in the past.

Brice




Le 29/01/2019 à 10:23, Biddiscombe, John A. a écrit :
> I simplified things and instead of writing to a 2D array, I allocate a 1D 
> array of bytes and touch pages in a linear fashion.
> Then I call syscall(__NR_move_pages, ...) and retrieve a status array for each 
> page in the data.
>
> When I allocate 511 pages and touch alternate pages on alternate numa 
> nodes
>
> Numa page binding 511
> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 
> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 
> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 
> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 
> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 
> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 
> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 
> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
>
> but as soon as I increase to 512 pages, it breaks.
>
> Numa page binding 512
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>
> On the 8 numa node machine it sometimes gives the right answer even with 512 
> pages.
>
> Still baffled
>
> JB
>
> -Original Message-
> From: hwloc-users  On Behalf Of 
> Biddiscombe, John A.
> Sent: 28 January 2019 16:14
> To: Brice Goglin 
> Cc: Hardware locality user list 
> Subject: Re: [hwloc-users] unusual memory binding results
>
> Brice
>
>> Can you print the pattern before and after thread 1 touched its pages, or 
>> even in the middle ?
>> It looks like somebody is touching too many pages here.
> Experimenting with different threads touching one or more pages, I get 
> unpredictable results
>
> here on the 8 numa node device, the result is perfect. I am only 
> allowing thread 3 and 7 to write a single memory location
>
> get_numa_domain() 8 Domain Numa pattern
> 
> 
> 
> 3---
> 
> 
> 
> 7---
> 
>
> 
> Contents of memory locations
> 0 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0
> 26 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0
> 63 0 0 0 0 0 0 0
> 
>
> you can see that core 26 (numa domain 3) wrote to memory, and so did 
> core 63 

Re: [hwloc-users] unusual memory binding results

2019-01-29 Thread Biddiscombe, John A.
I wondered something similar. The crazy patterns usually happen on columns of 
the 2D matrix and as it is column major, it does loosely fit the idea (most of 
the time).

I will play some more (though I'm fed up with it now).

JB

-Original Message-
From: Brice Goglin  
Sent: 29 January 2019 10:35
To: Biddiscombe, John A. ; Hardware locality user list 

Subject: Re: [hwloc-users] unusual memory binding results

Crazy idea: 512 pages could be replaced with a single 2MB huge page.
You're not requesting huge pages in your allocation but some systems have 
transparent huge pages enabled by default (e.g. RHEL
https://access.redhat.com/solutions/46111)

This could explain why 512 pages get allocated on the same node, but it 
wouldn't explain crazy patterns you've seen in the past.

Brice




Le 29/01/2019 à 10:23, Biddiscombe, John A. a écrit :
> I simplified things and instead of writing to a 2D array, I allocate a 1D 
> array of bytes and touch pages in a linear fashion.
> Then I call syscall(__NR_move_pages, ...) and retrieve a status array for each 
> page in the data.
>
> When I allocate 511 pages and touch alternate pages on alternate numa 
> nodes
>
> Numa page binding 511
> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 
> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 
> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 
> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 
> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 
> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 
> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 
> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
>
> but as soon as I increase to 512 pages, it breaks.
>
> Numa page binding 512
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>
> On the 8 numa node machine it sometimes gives the right answer even with 512 
> pages.
>
> Still baffled
>
> JB
>
> -Original Message-
> From: hwloc-users  On Behalf Of 
> Biddiscombe, John A.
> Sent: 28 January 2019 16:14
> To: Brice Goglin 
> Cc: Hardware locality user list 
> Subject: Re: [hwloc-users] unusual memory binding results
>
> Brice
>
>> Can you print the pattern before and after thread 1 touched its pages, or 
>> even in the middle ?
>> It looks like somebody is touching too many pages here.
> Experimenting with different threads touching one or more pages, I get 
> unpredictable results
>
> here on the 8 numa node device, the result is perfect. I am only 
> allowing thread 3 and 7 to write a single memory location
>
> get_numa_domain() 8 Domain Numa pattern
> 
> 
> 
> 3---
> 
> 
> 
> 7---
> 
>
> 
> Contents of memory locations
> 0 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0
> 26 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0
> 63 0 0 0 0 0 0 0
> 
>
> you can see that core 26 (numa domain 3) wrote to memory, and so did 
> core 63 (domain 7)
>
> Now I run it a second time 

Re: [hwloc-users] unusual memory binding results

2019-01-29 Thread Biddiscombe, John A.
I simplified things and instead of writing to a 2D array, I allocate a 1D array 
of bytes and touch pages in a linear fashion.
Then I call syscall(__NR_move_pages, ...) and retrieve a status array for each 
page in the data.
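
(The call boils down to something like the sketch below; the wrapper name is
mine, just for illustration.)

#include <sys/syscall.h>
#include <unistd.h>
#include <stddef.h>

/* status[i] becomes the numa node of pages[i], or a negative errno
   (-ENOENT if the page is not allocated yet); nodes==NULL means "query only" */
long query_page_nodes(void **pages, int *status, unsigned long count)
{
    return syscall(__NR_move_pages, 0 /* this process */, count,
                   pages, NULL, status, 0);
}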

When I allocate 511 pages and touch alternate pages on alternate numa nodes

Numa page binding 511
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0

but as soon as I increase to 512 pages, it breaks.

Numa page binding 512
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

On the 8 numa node machine it sometimes gives the right answer even with 512 
pages.

Still baffled

JB

-Original Message-
From: hwloc-users  On Behalf Of 
Biddiscombe, John A.
Sent: 28 January 2019 16:14
To: Brice Goglin 
Cc: Hardware locality user list 
Subject: Re: [hwloc-users] unusual memory binding results

Brice

>Can you print the pattern before and after thread 1 touched its pages, or even 
>in the middle ?
>It looks like somebody is touching too many pages here.

Experimenting with different threads touching one or more pages, I get 
unpredictable results

here on the 8 numa node device, the result is perfect. I am only allowing 
threads 3 and 7 to write a single memory location

get_numa_domain() 8 Domain Numa pattern



3---



7---



Contents of memory locations
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
26 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
63 0 0 0 0 0 0 0 


you can see that core 26 (numa domain 3) wrote to memory, and so did core 63 
(domain 7)

Now I run it a second time and look, it's rubbish

get_numa_domain() 8 Domain Numa pattern
3---
3---
3---
3---
3---
3---
3---
3---



Contents of memory locations
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
26 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
63 0 0 0 0 0 0 0 


after allowing the data to be read by a random thread

3777
3777
3777
3777
3777
3777
3777
3777

I'm baffled.

JB

___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users


Re: [hwloc-users] unusual memory binding results

2019-01-28 Thread Biddiscombe, John A.
Brice

>Can you print the pattern before and after thread 1 touched its pages, or even 
>in the middle ?
>It looks like somebody is touching too many pages here.

Experimenting with different threads touching one or more pages, I get 
unpredictable results

here on the 8 numa node device, the result is perfect. I am only allowing 
threads 3 and 7 to write a single memory location

get_numa_domain() 8 Domain Numa pattern



3---



7---



Contents of memory locations
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
26 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
63 0 0 0 0 0 0 0 


you can see that core 26 (numa domain 3) wrote to memory, and so did core 63 
(domain 7)

Now I run it a second time and look, it's rubbish

get_numa_domain() 8 Domain Numa pattern
3---
3---
3---
3---
3---
3---
3---
3---



Contents of memory locations
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
26 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
63 0 0 0 0 0 0 0 


after allowing the data to be read by a random thread

3777
3777
3777
3777
3777
3777
3777
3777

I'm baffled.

JB

___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users


Re: [hwloc-users] unusual memory binding results

2019-01-28 Thread Biddiscombe, John A.
Brice

I might have been using the wrong params to hwloc_get_area_memlocation in my 
original version, but I bypassed it and have been calling

int get_numa_domain(void *page)
{
    HPX_ASSERT((std::size_t(page) & 4095) == 0);

    void *pages[1] = { page };
    int  status[1] = { -1 };
    if (syscall(__NR_move_pages, 0, 1, pages, nullptr, status, 0) == 0)
    {
        if (status[0] >= 0 && status[0] <= HPX_HAVE_MAX_NUMA_DOMAIN_COUNT) {
            return status[0];
        }
        return -1;
    }
    throw std::runtime_error("Failed to get numa node for page");
}

this function instead. Just testing one page address at a time. I still see 
this kind of pattern
00101101010010101001010101011010011011010101110101110111010101010101
00101101010010101001010101011010011011010101110101110111010101010101
00101101010010101001010101011010011011010101110101110111010101010101
00101101010010101001010101011010011011010101110101110111010101010101
00101101010010101001010101011010011011010101110101110111010101010101
00101101010010101001010101011010011011010101110101110111010101010101
00101101010010101001010101011010011011010101110101110111010101010101
00101101010010101001010101011010011011010101110101110111010101010101
00101101010010101001010101011010011011010101110101110111010101010101
00101101010010101001010101011010011011010101110101110111010101010101
00101101010010101001010101011010011011010101110101110111010101010101
when I should see
01010101010101010101010101010101010101010101010101010101010101010101010101010101
10101010101010101010101010101010101010101010101010101010101010101010101010101010
01010101010101010101010101010101010101010101010101010101010101010101010101010101
10101010101010101010101010101010101010101010101010101010101010101010101010101010
01010101010101010101010101010101010101010101010101010101010101010101010101010101
10101010101010101010101010101010101010101010101010101010101010101010101010101010
01010101010101010101010101010101010101010101010101010101010101010101010101010101
10101010101010101010101010101010101010101010101010101010101010101010101010101010
01010101010101010101010101010101010101010101010101010101010101010101010101010101
10101010101010101010101010101010101010101010101010101010101010101010101010101010

I am deeply troubled by this and can't think of what to try next. I can see 
that the memory contents hold the correct CPU ID of the thread that touched the 
memory, so either the syscall is wrong or the kernel is doing something else. 
I welcome any suggestions on what might be wrong.

Thanks for trying to help.

JB

-Original Message-
From: Brice Goglin  
Sent: 26 January 2019 10:19
To: Biddiscombe, John A. 
Cc: Hardware locality user list 
Subject: Re: [hwloc-users] unusual memory binding results

Le 25/01/2019 à 23:16, Biddiscombe, John A. a écrit :
>> move_pages() returning 0 with -14 in the status array? As opposed to 
>> move_pages() returning -1 with errno set to 14, which would definitely be a 
>> bug in hwloc.
> I think it was move_pages returning zero with -14 in the status array, and 
> then hwloc returning 0 with an empty nodeset (which I then messed up by 
> calling get bitmap first and assuming 0 meant numa node zero and not checking 
> for an empty nodeset).
>
> I'm not sure why I get -EFAULT status rather than -ENOENT, but that's what I'm 
> seeing in the status field when I pass the pointer returned from the 
> alloc_membind call.

The only reason I see for getting -EFAULT there would be that you pass the 
buffer to move_pages (what hwloc_get_area_memlocation() wants, a start pointer 
and length) instead of a pointer to an array of page addresses (move_pages 
wants a void** pointing to individual pages).

Brice


___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users

Re: [hwloc-users] unusual memory binding results

2019-01-25 Thread Biddiscombe, John A.
Dear List/Brice

I experimented with disabling the memory touch on threads except for N=1,2,3,4 
etc and found a problem in hwloc, which is that the function 
hwloc_get_area_memlocation was returning '0' when the status of the memory null 
move operation was -14 (#define EFAULT 14 /* Bad address */). This was when I 
call get area memlocation immediately after allocating and then 'not' touching. 
I think if the status is an error, then the function should probably return -1, 
but anyway. I'll file a bug and send a patch if this is considered to be a bug.

I then modified the test routine to write the value returned from sched_getcpu 
into the touched memory location to verify that the thread binding was doing 
the right thing. The output below from the AMD 8 numanode machine looks good 
with threads 0,8,16 etc each touching memory which follows the pattern expected 
from the 8 numanode test. My get_numa_domain function, however, does not reflect 
the right numanode. It looks correct for the first column (matrices are stored 
in column-major order), but after that it falls to pieces. In this test, I'm 
allocating tiles as 512x512 doubles, so 4096 bytes per tile column, giving one 
tile column per page and 512 pages per tile. All the memory locations check out 
and the patterns seem fine, but the call to 
// edited version of the one in hwloc source
syscall(__NR_move_pages, 0, 1, pages, nullptr, status, 0) == 0) 
is not returning the numanode that I expect to see from the first touch when it 
is enabled.

Either the syscall is wrong, or the first touch/nexttouch doesn't work (could 
the alloc routine be wrong?)
hwloc_alloc_membind(topo, len, bitmap->get_bmp(),
(hwloc_membind_policy_t)(policy),
flags | HWLOC_MEMBIND_BYNODESET);
where the nodeset should match the numanode mask (I will double check that 
right now).

Any ideas on what to try next?

Thanks

JB

get_numa_domain() 8 Domain Numa pattern
00740640
10740640
20740640
30740640
40740640
50740640
60740640
70740640



Contents of memory locations = sched_getcpu()
0 8 16 24 32 40 48 56 
8 16 24 32 40 48 56 0 
16 24 32 40 48 56 0 8 
24 32 40 48 56 0 8 16 
32 40 48 56 0 8 16 24 
40 48 56 0 8 16 24 32 
48 56 0 8 16 24 32 40 
56 0 8 16 24 32 40 48 



Expected 8 Domain Numa pattern
01234567
12345670
23456701
34567012
45670123
56701234
67012345
70123456

___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users


Re: [hwloc-users] unusual memory binding results

2019-01-21 Thread Biddiscombe, John A.
Brice

Apologies, I didn't explain it very well. I do make sure that if the tile size 
256*8 < 4096 (the page size), then I double the number of tiles per page; I just 
wanted to keep the explanation simple. 

here are some code snippets to give you the flavour of it 

initializing the helper struct

matrix_numa_binder(std::size_t Ncols, std::size_t Nrows,
                   std::size_t Ntile, std::size_t Ntiles_per_domain,
                   std::size_t Ncolprocs=1, std::size_t Nrowprocs=1,
                   std::string pool_name="default")
  : cols_(Ncols), rows_(Nrows),
    tile_size_(Ntile), tiles_per_domain_(Ntiles_per_domain),
    colprocs_(Ncolprocs), rowprocs_(Nrowprocs)
{
    using namespace hpx::compute::host;
    binding_helper::pool_name_ = pool_name;
    const int CACHE_LINE_SIZE = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    const int PAGE_SIZE       = sysconf(_SC_PAGE_SIZE);
    const int ALIGNMENT       = std::max(PAGE_SIZE, CACHE_LINE_SIZE);
    const int ELEMS_ALIGN     = (ALIGNMENT/sizeof(T));
    rows_page_        = ELEMS_ALIGN;
    leading_dim_      = ELEMS_ALIGN*((rows_*sizeof(T) + ALIGNMENT-1)/ALIGNMENT);
    tiles_per_domain_ = std::max(tiles_per_domain_, ELEMS_ALIGN/tile_size_);
}

operator called by allocator which returns the domain index to bind a page to

virtual std::size_t operator ()(
    const T * const base_ptr, const T * const page_ptr,
    const std::size_t pagesize, const std::size_t domains) const override
{
    std::size_t offset = (page_ptr - base_ptr);
    std::size_t col    = (offset / leading_dim_);
    std::size_t row    = (offset % leading_dim_);
    std::size_t index  = (col / (tile_size_ * tiles_per_domain_));

    if ((tile_size_*tiles_per_domain_*sizeof(T)) >= pagesize) {
        index += (row / (tile_size_ * tiles_per_domain_));
    }
    else {
        HPX_ASSERT(0);
    }
    return index % domains;
}

this function is called by each thread (one per numa domain) and if the domain 
returned by the page query matches the domain ID of the thread/task then the 
first memory location on the page is written to

for (size_type i=0; i<num_pages; ++i) {  // loop header reconstructed; the "<...>" text was stripped by the archive
    std::size_t dom = this->operator()(p, page_ptr, pagesize, nodesets.size());
    if (dom == numa_domain) {
        // trigger a memory read and rewrite without changing contents
        volatile char* vaddr = (volatile char*) page_ptr;
        *vaddr = T(0); // *vaddr;
    }
    page_ptr += pageN;
}

All of this has been debugged quite extensively and I can write numbers to 
memory and read them back and the patterns always match the domains expected.

This function is called after all data is written to attempt to verify (and 
display the patterns above)

int topology::get_numa_domain(const void *addr) const
{
#if HWLOC_API_VERSION >= 0x00010b06
    hpx_hwloc_bitmap_wrapper *nodeset = topology::bitmap_storage_.get();
    if (nullptr == nodeset)
    {
        hwloc_bitmap_t nodeset_ = hwloc_bitmap_alloc();
        topology::bitmap_storage_.reset(new hpx_hwloc_bitmap_wrapper(nodeset_));
        nodeset = topology::bitmap_storage_.get();
    }
    //
    hwloc_nodeset_t ns =
        reinterpret_cast<hwloc_nodeset_t>(nodeset->get_bmp());

    int ret = hwloc_get_area_memlocation(topo, addr, 1, ns,
        HWLOC_MEMBIND_BYNODESET);
    if (ret < 0) {
        std::string msg(strerror(errno));
        HPX_THROW_EXCEPTION(kernel_error
            , "hpx::threads::topology::get_numa_domain"
            , "hwloc_get_area_memlocation failed " + msg);
        return -1;
    }
    // this uses hwloc directly
    //int bit = hwloc_bitmap_first(ns);
    //return bit
    // this uses an alternative method, both give the same result AFAICT
    threads::mask_type mask = bitmap_to_mask(ns, HWLOC_OBJ_NUMANODE);
    return static_cast<int>(threads::find_first(mask));
#else
    return 0;
#endif
}

Thanks for taking the time to look it over

JB
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users


Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

2017-11-15 Thread Biddiscombe, John A.
Running my test on another machine (fedora 7.2) I get a 
hwloc_get_area_memlocation failure with strerror = "Function not implemented".

Does this mean that the OS has not implemented it? (I'm using hwloc 1.11.8 - on 
the primary test machine I used 1.11.7.) Am I doomed - or will things magically 
work if I upgrade to hwloc 2.0?

Thanks

JB


-Original Message-
From: hwloc-users [mailto:hwloc-users-boun...@lists.open-mpi.org] On Behalf Of 
Biddiscombe, John A.
Sent: 13 November 2017 15:37
To: Hardware locality user list <hwloc-users@lists.open-mpi.org>
Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

It's working and I'm seeing the binding pattern I hoped for.

Thanks again

JB


From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of Brice 
Goglin [brice.gog...@inria.fr]
Sent: 13 November 2017 15:32
To: Hardware locality user list
Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

The doc is wrong, flags are used, only for BY_NODESET. I actually fixed that in 
git very recently.

Brice



Le 13/11/2017 07:24, Biddiscombe, John A. a écrit :
> In the documentation for get_area_memlocation it says "If 
> HWLOC_MEMBIND_BYNODESET is specified, set is considered a nodeset. Otherwise 
> it's a cpuset."
>
> but it also says "Flags are currently unused."
>
> so where should the BY_NODESET policy be used? Does it have to be used with 
> the original alloc call?
>
> thanks
>
> JB
>
> 
> From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf 
> of Biddiscombe, John A. [biddi...@cscs.ch]
> Sent: 13 November 2017 14:59
> To: Hardware locality user list
> Subject: Re: [hwloc-users] question about 
> hwloc_set_area_membind_nodeset
>
> Brice
>
> aha. thanks. I knew I'd seen a function for that, but couldn't remember what 
> it was.
>
> Cheers
>
> JB
> 
> From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf 
> of Brice Goglin [brice.gog...@inria.fr]
> Sent: 13 November 2017 14:57
> To: Hardware locality user list
> Subject: Re: [hwloc-users] question about 
> hwloc_set_area_membind_nodeset
>
> Use get_area_memlocation()
>
> membind() returns where the pages are *allowed* to go (anywhere)
> memlocation() returns where the pages are actually allocated.
>
> Brice
>
>
>
>
> Le 13/11/2017 06:52, Biddiscombe, John A. a écrit :
>> Thank you to you both.
>>
>> I modified the allocator to allocate one large block using 
>> hwloc_alloc and then use one thread per numa domain to  touch each 
>> page according to the tiling pattern - unfortunately, I hadn't 
>> appreciated that now hwloc_get_area_membind_nodeset always returns 
>> the full machine numa mask - and not the numa domain that the page 
>> was touched by (I guess it only gives the expected answer when 
>> set_area_membind is used first)
>>
>> I had hoped to use a dynamic query of the pages (using the first one of a 
>> given tile) to schedule each task that operates on a given tile to run on 
>> the numa node that touched it.
>>
>> I can work around this by using a matrix offset calculation to get the numa 
>> node, but if there's a way of querying the page directly - then please let 
>> me know.
>>
>> Thanks
>>
>> JB
>> 
>> From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf 
>> of Samuel Thibault [samuel.thiba...@inria.fr]
>> Sent: 12 November 2017 10:48
>> To: Hardware locality user list
>> Subject: Re: [hwloc-users] question about 
>> hwloc_set_area_membind_nodeset
>>
>> Brice Goglin, on dim. 12 nov. 2017 05:19:37 +0100, wrote:
>>> That's likely what's happening. Each set_area() may be creating a 
>>> new "virtual memory area". The kernel tries to merge them with 
>>> neighbors if they go to the same NUMA node. Otherwise it creates a new VMA.
>> Mmmm, that sucks. Ideally we'd have a way to ask the kernel not to 
>> strictly bind the memory, but just to allocate on a given memory 
>> node, and just hope that the allocation will not go away (e.g. due to 
>> swapping), which thus doesn't need a VMA to record the information. 
>> As you describe below, first-touch achieves that but it's not 
>> necessarily so convenient.
>>
>>> I can't find the exact limit but it's something like 64k so I guess 
>>> you're exhausting that.
>> It's sysctl vm.max_map_count
>>
>>> Question 2 : Is there a better way of achiev

Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

2017-11-13 Thread Biddiscombe, John A.
In the documentation for get_area_memlocation it says
"If HWLOC_MEMBIND_BYNODESET is specified, set is considered a nodeset. 
Otherwise it's a cpuset."

but it also says "Flags are currently unused."

so where should the BY_NODESET policy be used? Does it have to be used with the 
original alloc call?

thanks

JB


From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of 
Biddiscombe, John A. [biddi...@cscs.ch]
Sent: 13 November 2017 14:59
To: Hardware locality user list
Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

Brice

aha. thanks. I knew I'd seen a function for that, but couldn't remember what it 
was.

Cheers

JB

From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of Brice 
Goglin [brice.gog...@inria.fr]
Sent: 13 November 2017 14:57
To: Hardware locality user list
Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

Use get_area_memlocation()

membind() returns where the pages are *allowed* to go (anywhere)
memlocation() returns where the pages are actually allocated.

Brice
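
(A minimal sketch of that query, assuming a loaded topology "topo" and a
page-aligned "addr"; variable names are illustrative.)

hwloc_bitmap_t nset = hwloc_bitmap_alloc();
if (hwloc_get_area_memlocation(topo, addr, 1, nset,
                               HWLOC_MEMBIND_BYNODESET) == 0) {
    int node = hwloc_bitmap_first(nset);  /* -1 if the page is not allocated yet */
    /* ... */
}
hwloc_bitmap_free(nset);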




Le 13/11/2017 06:52, Biddiscombe, John A. a écrit :
> Thank you to you both.
>
> I modified the allocator to allocate one large block using hwloc_alloc and 
> then use one thread per numa domain to  touch each page according to the 
> tiling pattern - unfortunately, I hadn't appreciated that now
> hwloc_get_area_membind_nodeset
> always returns the full machine numa mask - and not the numa domain that the 
> page was touched by (I guess it only gives the expected answer when 
> set_area_membind is used first)
>
> I had hoped to use a dynamic query of the pages (using the first one of a 
> given tile) to schedule each task that operates on a given tile to run on the 
> numa node that touched it.
>
> I can work around this by using a matrix offset calculation to get the numa 
> node, but if there's a way of querying the page directly - then please let me 
> know.
>
> Thanks
>
> JB
> 
> From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of 
> Samuel Thibault [samuel.thiba...@inria.fr]
> Sent: 12 November 2017 10:48
> To: Hardware locality user list
> Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset
>
> Brice Goglin, on dim. 12 nov. 2017 05:19:37 +0100, wrote:
>> That's likely what's happening. Each set_area() may be creating a new 
>> "virtual
>> memory area". The kernel tries to merge them with neighbors if they go to the
>> same NUMA node. Otherwise it creates a new VMA.
> Mmmm, that sucks. Ideally we'd have a way to ask the kernel not to
> strictly bind the memory, but just to allocate on a given memory
> node, and just hope that the allocation will not go away (e.g. due to
> swapping), which thus doesn't need a VMA to record the information. As
> you describe below, first-touch achieves that but it's not necessarily
> so convenient.
>
>> I can't find the exact limit but it's something like 64k so I guess
>> you're exhausting that.
> It's sysctl vm.max_map_count
>
>> Question 2 : Is there a better way of achieving the result I'm looking 
>> for
>> (such as a call to membind with a stride of some kind to say put N pages 
>> in
>> a row on each domain in alternation).
>>
>>
>> Unfortunately, the interleave policy doesn't have a stride argument. It's one
>> page on node 0, one page on node 1, etc.
>>
>> The only idea I have is to use the first-touch policy: Make sure your buffer
>> isn't is physical memory yet, and have a thread on node 0 read the "0" pages,
>> and another thread on node 1 read the "1" page.
> Or "next-touch" if that was to ever get merged into mainline Linux :)
>
> Samuel
> ___
> hwloc-users mailing list
> hwloc-users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/hwloc-users
> ___
> hwloc-users mailing list
> hwloc-users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/hwloc-users

___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users


Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

2017-11-13 Thread Biddiscombe, John A.
Brice

aha. thanks. I knew I'd seen a function for that, but couldn't remember what it 
was.

Cheers

JB

From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of Brice 
Goglin [brice.gog...@inria.fr]
Sent: 13 November 2017 14:57
To: Hardware locality user list
Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

Use get_area_memlocation()

membind() returns where the pages are *allowed* to go (anywhere)
memlocation() returns where the pages are actually allocated.

Brice




Le 13/11/2017 06:52, Biddiscombe, John A. a écrit :
> Thank you to you both.
>
> I modified the allocator to allocate one large block using hwloc_alloc and 
> then use one thread per numa domain to  touch each page according to the 
> tiling pattern - unfortunately, I hadn't appreciated that now
> hwloc_get_area_membind_nodeset
> always returns the full machine numa mask - and not the numa domain that the 
> page was touched by (I guess it only gives the expected answer when 
> set_area_membind is used first)
>
> I had hoped to use a dynamic query of the pages (using the first one of a 
> given tile) to schedule each task that operates on a given tile to run on the 
> numa node that touched it.
>
> I can work around this by using a matrix offset calculation to get the numa 
> node, but if there's a way of querying the page directly - then please let me 
> know.
>
> Thanks
>
> JB
> 
> From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of 
> Samuel Thibault [samuel.thiba...@inria.fr]
> Sent: 12 November 2017 10:48
> To: Hardware locality user list
> Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset
>
> Brice Goglin, on dim. 12 nov. 2017 05:19:37 +0100, wrote:
>> That's likely what's happening. Each set_area() may be creating a new 
>> "virtual
>> memory area". The kernel tries to merge them with neighbors if they go to the
>> same NUMA node. Otherwise it creates a new VMA.
> Mmmm, that sucks. Ideally we'd have a way to ask the kernel not to
> strictly bind the memory, but just to allocate on a given memory
> node, and just hope that the allocation will not go away (e.g. due to
> swapping), which thus doesn't need a VMA to record the information. As
> you describe below, first-touch achieves that but it's not necessarily
> so convenient.
>
>> I can't find the exact limit but it's something like 64k so I guess
>> you're exhausting that.
> It's sysctl vm.max_map_count
>
>> Question 2 : Is there a better way of achieving the result I'm looking 
>> for
>> (such as a call to membind with a stride of some kind to say put N pages 
>> in
>> a row on each domain in alternation).
>>
>>
>> Unfortunately, the interleave policy doesn't have a stride argument. It's one
>> page on node 0, one page on node 1, etc.
>>
>> The only idea I have is to use the first-touch policy: Make sure your buffer
>> isn't is physical memory yet, and have a thread on node 0 read the "0" pages,
>> and another thread on node 1 read the "1" page.
> Or "next-touch" if that was to ever get merged into mainline Linux :)
>
> Samuel
> ___
> hwloc-users mailing list
> hwloc-users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/hwloc-users
> ___
> hwloc-users mailing list
> hwloc-users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/hwloc-users

___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users


Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

2017-11-13 Thread Biddiscombe, John A.
Thank you to you both.

I modified the allocator to allocate one large block using hwloc_alloc and then 
use one thread per numa domain to touch each page according to the tiling 
pattern - unfortunately, I hadn't appreciated that hwloc_get_area_membind_nodeset 
always returns the full machine numa mask, and not the numa domain that the 
page was touched by (I guess it only gives the expected answer when 
set_area_membind is used first).
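
(A minimal sketch of that per-domain touch loop, assuming a loaded hwloc
topology "topo"; the function name and arguments are illustrative, not the
real allocator code.)

#include <hwloc.h>

static void touch_pages_for_node(hwloc_topology_t topo, unsigned node,
                                 unsigned nnodes, char *base,
                                 size_t pagesize, size_t npages)
{
    /* run the calling thread near the target NUMA node */
    hwloc_obj_t numa = hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE, node);
    hwloc_set_cpubind(topo, numa->cpuset, HWLOC_CPUBIND_THREAD);

    /* first touch of every nnodes-th page allocates it on this node */
    for (size_t i = node; i < npages; i += nnodes)
        base[i * pagesize] = 0;
}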

I had hoped to use a dynamic query of the pages (using the first one of a given 
tile) to schedule each task that operates on a given tile to run on the numa 
node that touched it.

I can work around this by using a matrix offset calculation to get the numa 
node, but if there's a way of querying the page directly - then please let me 
know.

Thanks

JB 

From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of Samuel 
Thibault [samuel.thiba...@inria.fr]
Sent: 12 November 2017 10:48
To: Hardware locality user list
Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

Brice Goglin, on dim. 12 nov. 2017 05:19:37 +0100, wrote:
> That's likely what's happening. Each set_area() may be creating a new "virtual
> memory area". The kernel tries to merge them with neighbors if they go to the
> same NUMA node. Otherwise it creates a new VMA.

Mmmm, that sucks. Ideally we'd have a way to ask the kernel not to
strictly bind the memory, but just to allocate on a given memory
node, and just hope that the allocation will not go away (e.g. due to
swapping), which thus doesn't need a VMA to record the information. As
you describe below, first-touch achieves that but it's not necessarily
so convenient.

> I can't find the exact limit but it's something like 64k so I guess
> you're exhausting that.

It's sysctl vm.max_map_count

> Question 2 : Is there a better way of achieving the result I'm looking for
> (such as a call to membind with a stride of some kind to say put N pages 
> in
> a row on each domain in alternation).
>
>
> Unfortunately, the interleave policy doesn't have a stride argument. It's one
> page on node 0, one page on node 1, etc.
>
> The only idea I have is to use the first-touch policy: Make sure your buffer
> isn't is physical memory yet, and have a thread on node 0 read the "0" pages,
> and another thread on node 1 read the "1" page.

Or "next-touch" if that was to ever get merged into mainline Linux :)

Samuel
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users


Re: [hwloc-users] HWLOC_VERSION

2017-10-30 Thread Biddiscombe, John A.
Brice

Aha, OK, thanks. I tried 0x00010b03, but my 1.11.7 build only has 0x00010b00 
set, so I assumed patch releases were not marked in there.

I'll see what I can do about a configure time check.
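
(One way to do that check, sketched below: compile-test the flag itself rather
than relying on HWLOC_API_VERSION. Illustrative only, not the actual
build-system code.)

#include <hwloc.h>
int main(void)
{
    /* compiles only if the 1.11.3+ nodeset membind API is present */
    hwloc_membind_flags_t f = HWLOC_MEMBIND_BYNODESET;
    (void) f;
    return 0;
}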

JB


-Original Message-
From: hwloc-users [mailto:hwloc-users-boun...@lists.open-mpi.org] On Behalf Of 
Brice Goglin
Sent: 30 October 2017 09:26
To: hwloc-users@lists.open-mpi.org
Subject: Re: [hwloc-users] HWLOC_VERSION

Hello

It should have been 0x00010b03 but I forgot to increase it unfortunately (and 
again in 1.11.6).
I need to add this to my release-TODO-list.

The upcoming 1.11.9 will have the proper HWLOC_API_VERSION (0x00010b06 unless 
we add something) so that people can at least check for these features in newer 
releases...

If you have some configure checks for hwloc, you could add something there to 
work around the issue.

Sorry
Brice







Le 30/10/2017 09:09, Biddiscombe, John A. a écrit :
> Dear list
>
> According to the release notes, the HWLOC_MEMBIND_BYNODESET flag was 
> added in 1.11.3 - if I protect some code with
>
> #if HWLOC_API_VERSION >= 0x00010b00
> then versions 1.11.0, 1.11.1, 1.11.2 still cause build failures.
>
> is there some VERSION flag that distinguishes between the patch version 
> releases?
>
> thanks
>
> JB
>
>

___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users


[hwloc-users] HWLOC_VERSION

2017-10-30 Thread Biddiscombe, John A.
Dear list

According to the release notes, the HWLOC_MEMBIND_BYNODESET flag was added in 
1.11.3 - if I protect some code with 

#if HWLOC_API_VERSION >= 0x00010b00
then versions 1.11.0, 1.11.1, 1.11.2 still cause build failures.

is there some VERSION flag that distinguishes between the patch version 
releases?

thanks

JB


-- 
John Biddiscombe,email:biddisco @.at.@ cscs.ch
http://www.cscs.ch/
CSCS, Swiss National Supercomputing Centre  | Tel:  +41 (91) 610.82.07
Via Trevano 131, 6900 Lugano, Switzerland   | Fax:  +41 (91) 610.82.82

___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users


Re: [hwloc-users] BGQ question.

2014-03-28 Thread Biddiscombe, John A.
Just as a follow up to this thread. I spoke with someone from IBM and they tell 
me that 2 cores of 4 hardware threads each are hidden from the kernel (how do 
they do that?) and used for the custom HS4 cards we have installed on the IO 
nodes, which explains why I see only 60 instead of 68 threads. the 2 bgvrnic 
tasks I see spinning at 100% run on threads 58/59 and service the connection 
from ION to CN.

It looks as though everything is reporting as expected - as long as I compile 
hwloc on the ION itself, it seems to be correct.

Thanks and sorry for any misunderstanding

JB


> -Original Message-
> From: hwloc-users [mailto:hwloc-users-boun...@open-mpi.org] On Behalf
> Of Chris Samuel
> Sent: 26 March 2014 13:42
> To: Hardware locality user list
> Subject: Re: [hwloc-users] BGQ question.
> 
> On Wed, 26 Mar 2014 11:56:08 AM Biddiscombe, John A. wrote:
> 
> > I can’t test this as the system is down for maintenance, but if memory
> > serves me correctly, the GCC compiled lstopo also showed 60 cores
> > instead of 64/68.
> 
> It can only report what the kernel reports and it appears your kernel is not
> reporting the same number of cores on an IO node as ours.
> 
> It would be interesting to compare kernel version and boot command line.
> 
> Ours are:
> 
> -bash-4.1# uname -a
> Linux r00-id-j01.pcf.vlsci.unimelb.edu.au 2.6.32-
> 279.14.1.bgq.el6_V1R2M0_36.ppc64 #1 SMP Tue Jun 11 15:50:53 CDT 2013
> ppc64 ppc64 ppc64 GNU/Linux
> 
> 
> -bash-4.1# cat /proc/cmdline
> root=/dev/ram0 rdinit=/init raid=noautodetect loglevel=5
> 
> 
> This is the end of our /proc/cpuinfo showing 68 hardware threads
> (17 cores exposed).
> 
> -bash-4.1# tail -n 9 /proc/cpuinfo
> 
> processor   : 67
> cpu : A2 (Blue Gene/Q)
> clock   : 1600.00MHz
> revision: 2.0 (pvr 0049 0200)
> 
> timebase: 16
> platform: Blue Gene/Q
> model   : ibm,bluegeneq
> 
> 
> > I am not certain if this gcc was in any was ‘special’ for bgq.
> 
> There is a GCC cross compiler, but it's not the /usr/bin/gcc one.
> 
> cheers!
> Chris
> --
>  Christopher SamuelSenior Systems Administrator
>  VLSCI - Victorian Life Sciences Computation Initiative
>  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
>  http://www.vlsci.org.au/  http://twitter.com/vlsci
> 
> ___
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users


Re: [hwloc-users] BGQ question.

2014-03-25 Thread Biddiscombe, John A.
Chris

> Out of interest, why on an I/O node?

I'm targeting the BGQ BGAS nodes with flash cards installed. We've done tests 
with GPFS mounted on the flash and are trying to get comparable results with an 
in-house driver.

JB



Re: [hwloc-users] BGQ question.

2014-03-25 Thread Biddiscombe, John A.
Brice

Looking at /proc/cpuinfo on the io node itself, I see only 60 cores listed. I 
wonder if they’ve reserved one socket of 4 cores for IO purposes and in fact 
hwloc is seeing the correct information.

Attached is the foo zip of the run just now (assuming it doesn’t bounce)

JB

From: Brice Goglin [mailto:brice.gog...@inria.fr]
Sent: 25 March 2014 09:28
To: Biddiscombe, John A.; Hardware locality user list
Subject: Re: [hwloc-users] BGQ question.

Can you run hwloc-gather-topology foo and send the resulting foo.tar.bz2 ?
If the tarball is too big, feel free to send it to me in a private mail.

Brice



Le 25/03/2014 08:55, Biddiscombe, John A. a écrit :
Brice,

Correct : The IO nodes are running a  full linux install (RHE 6.4) on the same 
hardware as the CNK nodes.

On vesta I do not have an account and I am not certain the IO nodes are 
available for direct login. I’m using the BGQ at CSCS which is an EPFL machine. 
The IO nodes are open for some special projects where we are trying to 
customise the IO.

JB

From: Brice Goglin [mailto:brice.gog...@inria.fr]
Sent: 25 March 2014 08:43
To: Hardware locality user list; Biddiscombe, John A.
Subject: Re: [hwloc-users] BGQ question.

Wait, I missed the "io node" part of your first mail. The bgq support is for 
compute nodes running cnk. Are io nodes running linux on same hardware as the 
compute nodes?

I have an account on vesta. Where should I logon to have a look?
Brice


On 25 mars 2014 08:12:58 UTC+01:00, "Biddiscombe, John A." 
<biddi...@cscs.ch> wrote:
Brice,


lstopo --whole-system


gives the same output and setting env var BG_THREADMODEL=2 does not appear to 
make any visible difference.


my configure command for compiling hwloc had no special options,
./configure --prefix=/gpfs/bbp.cscs.ch/home/biddisco/apps/clang/hwloc-1.8.1


should I rerun with something set?


Thanks


JB




From: hwloc-users [mailto:hwloc-users-boun...@open-mpi.org] On Behalf Of Brice 
Goglin
Sent: 25 March 2014 08:04
To: Hardware locality user list
Subject: Re: [hwloc-users] BGQ question.


Le 25/03/2014 07:51, Biddiscombe, John A. a écrit :
I’m compiling hwloc using clang (bgclang++11 from ANL) to run on IO nodes af a 
BGQ. It seems to have compiled ok, and when I run lstopo, I get an output like 
this (below), which looks reasonable, but there are 15 sockets instead of 16. 
I’m a little worried because the first time I compiled, I had problems where 
apps would report an error from HWLOC on start and tell me to set 
HWLOC_FORCE_BGQ=1. when I did set this env var, it would then report that 
“topology became empty” and the app would segfault due to the unexpected return 
from hwloc presumably.

Can you give a bit more details on what you did there? I'd like to check if 
that case should be better supported or not.


I wiped everything and recompiled (not sure what I did differently), and now it 
behaves more sensibly, but with 15 instead of 16 sockets.

Should I be worried?

The topology detection is hardwired, so you shouldn't be worried on the
hardware side.
The problem could be related to how you reserved resources before running 
lstopo.
Does lstopo --whole-system see more sockets?
Does BG_THREADMODEL=2 help?

Brice








foo.tar.bz2
Description: foo.tar.bz2


Re: [hwloc-users] BGQ question.

2014-03-25 Thread Biddiscombe, John A.
Brice,

Correct: the IO nodes are running a full Linux install (RHEL 6.4) on the same
hardware as the CNK nodes.

On vesta I do not have an account, and I am not certain the IO nodes are
available for direct login. I’m using the BGQ at CSCS, which is an EPFL machine.
The IO nodes are open for some special projects where we are trying to
customise the IO.

JB

From: Brice Goglin [mailto:brice.gog...@inria.fr]
Sent: 25 March 2014 08:43
To: Hardware locality user list; Biddiscombe, John A.
Subject: Re: [hwloc-users] BGQ question.

Wait, I missed the "io node" part of your first mail. The BGQ support is for
compute nodes running CNK. Are the IO nodes running Linux on the same hardware
as the compute nodes?

I have an account on vesta. Where should I log on to have a look?
Brice

On 25 March 2014 08:12:58 UTC+01:00, "Biddiscombe, John A."
<biddi...@cscs.ch> wrote:
Brice,


lstopo --whole-system


gives the same output and setting env var BG_THREADMODEL=2 does not appear to 
make any visible difference.


my configure command for compiling hwloc had no special options,
./configure --prefix=/gpfs/bbp.cscs.ch/home/biddisco/apps/clang/hwloc-1.8.1


should I rerun with something set?


Thanks


JB




From: hwloc-users [mailto:hwloc-users-boun...@open-mpi.org] On Behalf Of Brice 
Goglin
Sent: 25 March 2014 08:04
To: Hardware locality user list
Subject: Re: [hwloc-users] BGQ question.


Le 25/03/2014 07:51, Biddiscombe, John A. a écrit :
I’m compiling hwloc using clang (bgclang++11 from ANL) to run on the IO nodes of
a BGQ. It seems to have compiled OK, and when I run lstopo, I get an output like
the one below, which looks reasonable, but there are 15 sockets instead of 16.
I’m a little worried because the first time I compiled, I had problems where
apps would report an error from HWLOC on start and tell me to set
HWLOC_FORCE_BGQ=1. When I did set this env var, it would then report that
“topology became empty” and the app would segfault, presumably due to the
unexpected return from hwloc.

Can you give a bit more details on what you did there? I'd like to check if 
that case should be better supported or not.

I wiped everything and recompiled (not sure what I did differently), and now it 
behaves more sensibly, but with 15 instead of 16 sockets.

Should I be worried?

The topology detection is hardwired, so you shouldn't be worried on the
hardware side.
The problem could be related to how you reserved resources before running 
lstopo.
Does lstopo --whole-system see more sockets?
Does BG_THREADMODEL=2 help?

Brice





Re: [hwloc-users] BGQ question.

2014-03-25 Thread Biddiscombe, John A.
Brice,

lstopo --whole-system

gives the same output and setting env var BG_THREADMODEL=2 does not appear to 
make any visible difference.

my configure command for compiling hwloc had no special options,
./configure --prefix=/gpfs/bbp.cscs.ch/home/biddisco/apps/clang/hwloc-1.8.1

should I rerun with something set?

Thanks

JB


From: hwloc-users [mailto:hwloc-users-boun...@open-mpi.org] On Behalf Of Brice 
Goglin
Sent: 25 March 2014 08:04
To: Hardware locality user list
Subject: Re: [hwloc-users] BGQ question.

Le 25/03/2014 07:51, Biddiscombe, John A. a écrit :
I'm compiling hwloc using clang (bgclang++11 from ANL) to run on the IO nodes of
a BGQ. It seems to have compiled OK, and when I run lstopo, I get an output like
the one below, which looks reasonable, but there are 15 sockets instead of 16.
I'm a little worried because the first time I compiled, I had problems where
apps would report an error from HWLOC on start and tell me to set
HWLOC_FORCE_BGQ=1. When I did set this env var, it would then report that
"topology became empty" and the app would segfault, presumably due to the
unexpected return from hwloc.

Can you give a bit more details on what you did there? I'd like to check if 
that case should be better supported or not.


I wiped everything and recompiled (not sure what I did differently), and now it 
behaves more sensibly, but with 15 instead of 16 sockets.

Should I be worried?

The topology detection is hardwired, so you shouldn't be worried on the
hardware side.
The problem could be related to how you reserved resources before running 
lstopo.
Does lstopo --whole-system see more sockets?
Does BG_THREADMODEL=2 help?

Brice


[hwloc-users] BGQ question.

2014-03-25 Thread Biddiscombe, John A.
I'm compiling hwloc using clang (bgclang++11 from ANL) to run on the IO nodes of
a BGQ. It seems to have compiled OK, and when I run lstopo, I get an output like
the one below, which looks reasonable, but there are 15 sockets instead of 16.
I'm a little worried because the first time I compiled, I had problems where
apps would report an error from HWLOC on start and tell me to set
HWLOC_FORCE_BGQ=1. When I did set this env var, it would then report that
"topology became empty" and the app would segfault, presumably due to the
unexpected return from hwloc.
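
As an aside, a minimal sketch (hwloc 1.x assumed; the error-handling policy is
illustrative, not what these apps actually do) of guarding against a failed or
empty topology so the result is a clean error rather than a segfault:

    /* guarded_topology.c - fail cleanly if hwloc detection fails or is empty */
    #include <hwloc.h>
    #include <stdio.h>
    #include <stdlib.h>

    static hwloc_topology_t load_topology_or_die(void)
    {
        hwloc_topology_t topo;
        if (hwloc_topology_init(&topo) < 0) {
            fprintf(stderr, "hwloc_topology_init failed\n");
            exit(EXIT_FAILURE);
        }
        if (hwloc_topology_load(topo) < 0) {
            fprintf(stderr, "hwloc_topology_load failed\n");
            hwloc_topology_destroy(topo);
            exit(EXIT_FAILURE);
        }
        /* treat a topology with no PUs as "empty" and refuse to continue */
        if (hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU) <= 0) {
            fprintf(stderr, "topology contains no PUs, aborting\n");
            hwloc_topology_destroy(topo);
            exit(EXIT_FAILURE);
        }
        return topo;
    }

    int main(void)
    {
        hwloc_topology_t topo = load_topology_or_die();
        printf("PUs detected: %d\n",
               hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU));
        hwloc_topology_destroy(topo);
        return 0;
    }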

I wiped everything and recompiled (not sure what I did differently), and now it 
behaves more sensibly, but with 15 instead of 16 sockets.

Should I be worried?

Thanks

JB


Machine (15GB)
  Socket L#0 + L1d L#0 (16KB) + L1i L#0 (16KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#1)
PU L#2 (P#2)
PU L#3 (P#3)
  Socket L#1 + L1d L#1 (16KB) + L1i L#1 (16KB) + Core L#1
PU L#4 (P#4)
PU L#5 (P#5)
PU L#6 (P#6)
PU L#7 (P#7)
  Socket L#2 + L1d L#2 (16KB) + L1i L#2 (16KB) + Core L#2
PU L#8 (P#8)
PU L#9 (P#9)
PU L#10 (P#10)
PU L#11 (P#11)
  Socket L#3 + L1d L#3 (16KB) + L1i L#3 (16KB) + Core L#3
PU L#12 (P#12)
PU L#13 (P#13)
PU L#14 (P#14)
PU L#15 (P#15)
  Socket L#4 + L1d L#4 (16KB) + L1i L#4 (16KB) + Core L#4
PU L#16 (P#16)
PU L#17 (P#17)
PU L#18 (P#18)
PU L#19 (P#19)
  Socket L#5 + L1d L#5 (16KB) + L1i L#5 (16KB) + Core L#5
PU L#20 (P#20)
PU L#21 (P#21)
PU L#22 (P#22)
PU L#23 (P#23)
  Socket L#6 + L1d L#6 (16KB) + L1i L#6 (16KB) + Core L#6
PU L#24 (P#24)
PU L#25 (P#25)
PU L#26 (P#26)
PU L#27 (P#27)
  Socket L#7 + L1d L#7 (16KB) + L1i L#7 (16KB) + Core L#7
PU L#28 (P#28)
PU L#29 (P#29)
PU L#30 (P#30)
PU L#31 (P#31)
  Socket L#8 + L1d L#8 (16KB) + L1i L#8 (16KB) + Core L#8
PU L#32 (P#32)
PU L#33 (P#33)
PU L#34 (P#34)
PU L#35 (P#35)
  Socket L#9 + L1d L#9 (16KB) + L1i L#9 (16KB) + Core L#9
PU L#36 (P#36)
PU L#37 (P#37)
PU L#38 (P#38)
PU L#39 (P#39)
  Socket L#10 + L1d L#10 (16KB) + L1i L#10 (16KB) + Core L#10
PU L#40 (P#40)
PU L#41 (P#41)
PU L#42 (P#42)
PU L#43 (P#43)
  Socket L#11 + L1d L#11 (16KB) + L1i L#11 (16KB) + Core L#11
PU L#44 (P#44)
PU L#45 (P#45)
PU L#46 (P#46)
PU L#47 (P#47)
  Socket L#12 + L1d L#12 (16KB) + L1i L#12 (16KB) + Core L#12
PU L#48 (P#48)
PU L#49 (P#49)
PU L#50 (P#50)
PU L#51 (P#51)
  Socket L#13 + L1d L#13 (16KB) + L1i L#13 (16KB) + Core L#13
PU L#52 (P#52)
PU L#53 (P#53)
PU L#54 (P#54)
PU L#55 (P#55)
  Socket L#14 + L1d L#14 (16KB) + L1i L#14 (16KB) + Core L#14
PU L#56 (P#56)
PU L#57 (P#57)
PU L#58 (P#58)
PU L#59 (P#59)
  HostBridge L#0
PCIBridge
  PCI 1014:0023

--
John Biddiscombe, email: biddisco @.at.@ cscs.ch
http://www.cscs.ch/
CSCS, Swiss National Supercomputing Centre  | Tel:  +41 (91) 610.82.07
Via Trevano 131, 6900 Lugano, Switzerland   | Fax:  +41 (91) 610.82.82