Re: [hwloc-users] unusual memory binding results
The answer is "no", I don't have root access, but I suspect that that would be the right fix if it is currently set to [always] and either madvise or never would be good options. If it is of interest, I'll ask someone to try it and report back on what happens. -Original Message- From: Brice Goglin Sent: 29 January 2019 15:39 To: Biddiscombe, John A. ; Hardware locality user list Subject: Re: [hwloc-users] unusual memory binding results Only the one in brackets is set, others are unset alternatives. If you write "madvise" in that file, it'll become "always [madvise] never". Brice Le 29/01/2019 à 15:36, Biddiscombe, John A. a écrit : > On the 8 numa node machine > > $cat /sys/kernel/mm/transparent_hugepage/enabled > [always] madvise never > > is set already, so I'm not really sure what should go in there to disable it. > > JB > > -Original Message- > From: Brice Goglin > Sent: 29 January 2019 15:29 > To: Biddiscombe, John A. ; Hardware locality user > list > Subject: Re: [hwloc-users] unusual memory binding results > > Oh, that's very good to know. I guess lots of people using first touch will > be affected by this issue. We may want to add a hwloc memory flag doing > something similar. > > Do you have root access to verify that writing "never" or "madvise" in > /sys/kernel/mm/transparent_hugepage/enabled fixes the issue too? > > Brice > > > > Le 29/01/2019 à 14:02, Biddiscombe, John A. a écrit : >> Brice >> >> madvise(addr, n * sizeof(T), MADV_NOHUGEPAGE) >> >> seems to make things behave much more sensibly. I had no idea it was a >> thing, but one of my colleagues pointed me to it. >> >> Problem seems to be solved for now. Thank you very much for your insights >> and suggestions/help. >> >> JB >> >> -Original Message- >> From: Brice Goglin >> Sent: 29 January 2019 10:35 >> To: Biddiscombe, John A. ; Hardware locality user >> list >> Subject: Re: [hwloc-users] unusual memory binding results >> >> Crazy idea: 512 pages could be replaced with a single 2MB huge page. >> You're not requesting huge pages in your allocation but some systems >> have transparent huge pages enabled by default (e.g. RHEL >> https://access.redhat.com/solutions/46111) >> >> This could explain why 512 pages get allocated on the same node, but it >> wouldn't explain crazy patterns you've seen in the past. >> >> Brice >> >> >> >> >> Le 29/01/2019 à 10:23, Biddiscombe, John A. a écrit : >>> I simplified things and instead of writing to a 2D array, I allocate a 1D >>> array of bytes and touch pages in a linear fashion. >>> Then I call syscall(NR)move_pages, ) and retrieve a status array for >>> each page in the data. 
>>> >>> When I allocate 511 pages and touch alternate pages on alternate >>> numa nodes >>> >>> Numa page binding 511 >>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 >>> 0 >>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 >>> 1 >>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 >>> 0 >>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 >>> 1 >>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 >>> 0 >>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 >>> 1 >>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 >>> 0 >>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 >>> 1 >>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 >>> 0 >>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 >>> 1 >>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 >>> 0 >>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 >>> 1 >>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 >>> 0 >>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 >>> 1 >>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 >>> >>> but as soon as I increase to 512 pages, it breaks. >>> >>> Numa page binding 512 >>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >>> 0 >>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >>> 0 >>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
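For reference, the transparent-hugepage state discussed above can be checked from a program as well as from the shell; a minimal sketch (the sysfs path is the one quoted in the thread, the parsing is illustrative):

#include <fstream>
#include <iostream>
#include <string>

int main() {
    // The token in brackets is the active mode, e.g. "[always] madvise never"
    // means THP may back any anonymous mapping; "madvise" restricts it to
    // ranges explicitly marked with madvise(MADV_HUGEPAGE).
    std::ifstream thp("/sys/kernel/mm/transparent_hugepage/enabled");
    std::string line;
    if (!std::getline(thp, line)) {
        std::cerr << "transparent hugepage setting not available\n";
        return 1;
    }
    std::size_t open = line.find('[');
    std::size_t close = line.find(']');
    if (open != std::string::npos && close != std::string::npos && close > open)
        std::cout << "active THP mode: "
                  << line.substr(open + 1, close - open - 1) << "\n";
    return 0;
}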
Re: [hwloc-users] unusual memory binding results
On the 8 numa node machine $cat /sys/kernel/mm/transparent_hugepage/enabled [always] madvise never is set already, so I'm not really sure what should go in there to disable it. JB -Original Message- From: Brice Goglin Sent: 29 January 2019 15:29 To: Biddiscombe, John A. ; Hardware locality user list Subject: Re: [hwloc-users] unusual memory binding results Oh, that's very good to know. I guess lots of people using first touch will be affected by this issue. We may want to add a hwloc memory flag doing something similar. Do you have root access to verify that writing "never" or "madvise" in /sys/kernel/mm/transparent_hugepage/enabled fixes the issue too? Brice Le 29/01/2019 à 14:02, Biddiscombe, John A. a écrit : > Brice > > madvise(addr, n * sizeof(T), MADV_NOHUGEPAGE) > > seems to make things behave much more sensibly. I had no idea it was a thing, > but one of my colleagues pointed me to it. > > Problem seems to be solved for now. Thank you very much for your insights and > suggestions/help. > > JB > > -Original Message- > From: Brice Goglin > Sent: 29 January 2019 10:35 > To: Biddiscombe, John A. ; Hardware locality user > list > Subject: Re: [hwloc-users] unusual memory binding results > > Crazy idea: 512 pages could be replaced with a single 2MB huge page. > You're not requesting huge pages in your allocation but some systems > have transparent huge pages enabled by default (e.g. RHEL > https://access.redhat.com/solutions/46111) > > This could explain why 512 pages get allocated on the same node, but it > wouldn't explain crazy patterns you've seen in the past. > > Brice > > > > > Le 29/01/2019 à 10:23, Biddiscombe, John A. a écrit : >> I simplified things and instead of writing to a 2D array, I allocate a 1D >> array of bytes and touch pages in a linear fashion. >> Then I call syscall(NR)move_pages, ) and retrieve a status array for >> each page in the data. >> >> When I allocate 511 pages and touch alternate pages on alternate numa >> nodes >> >> Numa page binding 511 >> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 >> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 >> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 >> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 >> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 >> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 >> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 >> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 >> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 >> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 >> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 >> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 >> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 >> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 >> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 >> >> but as soon as I increase to 512 pages, it breaks. 
>> >> Numa page binding 512 >> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> >> On the 8 numa node machine it sometimes gives the right answer even with 512 >> pages. >> >> Still baffled >> >> JB >> >> -Original Message- >> From: hwloc-users On Behalf Of >> Biddiscombe, John A. >> Sent: 28 January 2019 16:14 >> To: Brice
Re: [hwloc-users] unusual memory binding results
Brice madvise(addr, n * sizeof(T), MADV_NOHUGEPAGE) seems to make things behave much more sensibly. I had no idea it was a thing, but one of my colleagues pointed me to it. Problem seems to be solved for now. Thank you very much for your insights and suggestions/help. JB -Original Message- From: Brice Goglin Sent: 29 January 2019 10:35 To: Biddiscombe, John A. ; Hardware locality user list Subject: Re: [hwloc-users] unusual memory binding results Crazy idea: 512 pages could be replaced with a single 2MB huge page. You're not requesting huge pages in your allocation but some systems have transparent huge pages enabled by default (e.g. RHEL https://access.redhat.com/solutions/46111) This could explain why 512 pages get allocated on the same node, but it wouldn't explain crazy patterns you've seen in the past. Brice Le 29/01/2019 à 10:23, Biddiscombe, John A. a écrit : > I simplified things and instead of writing to a 2D array, I allocate a 1D > array of bytes and touch pages in a linear fashion. > Then I call syscall(NR)move_pages, ) and retrieve a status array for each > page in the data. > > When I allocate 511 pages and touch alternate pages on alternate numa > nodes > > Numa page binding 511 > 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 > 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 > 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 > 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 > 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 > 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 > 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 > 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 > 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 > 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 > 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 > 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 > 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 > 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 > 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 > > but as soon as I increase to 512 pages, it breaks. > > Numa page binding 512 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > > On the 8 numa node machine it sometimes gives the right answer even with 512 > pages. 
> > Still baffled > > JB > > -Original Message- > From: hwloc-users On Behalf Of > Biddiscombe, John A. > Sent: 28 January 2019 16:14 > To: Brice Goglin > Cc: Hardware locality user list > Subject: Re: [hwloc-users] unusual memory binding results > > Brice > >> Can you print the pattern before and after thread 1 touched its pages, or >> even in the middle ? >> It looks like somebody is touching too many pages here. > Experimenting with different threads touching one or more pages, I get > unpredicatable results > > here on the 8 numa node device, the result is perfect. I am only > allowing thread 3 and 7 to write a single memory location > > get_numa_domain() 8 Domain Numa pattern > > > > 3--- > > > > 7--- > > > > Contents of memory locations > 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 > 26 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 > 63 0 0 0 0 0 0 0 > > > you can see that core 26 (numa domain 3) wrote to memory, and so did > core 63
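For reference, a minimal sketch of the MADV_NOHUGEPAGE approach described above: the advice has to be applied after allocating but before any page is first touched, otherwise the kernel may already have backed the range with a 2MB huge page (the helper name and error handling are illustrative, not the code from the thread):

#include <sys/mman.h>
#include <unistd.h>
#include <cstdlib>
#include <stdexcept>

// Allocate page-aligned memory and opt it out of transparent huge pages,
// so that first-touch placement is decided per 4KB page rather than per 2MB.
void* alloc_no_thp(std::size_t bytes)
{
    void* ptr = nullptr;
    std::size_t pagesize = static_cast<std::size_t>(sysconf(_SC_PAGE_SIZE));
    if (posix_memalign(&ptr, pagesize, bytes) != 0)
        throw std::runtime_error("posix_memalign failed");
    // Must happen before the first touch of any page in the range.
    if (madvise(ptr, bytes, MADV_NOHUGEPAGE) != 0)
        throw std::runtime_error("madvise(MADV_NOHUGEPAGE) failed");
    return ptr;
}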
Re: [hwloc-users] unusual memory binding results
I wondered something similar. The crazy patterns usually happen on columns of the 2D matrix and as it is column major, it does loosely fit the idea (most of the time). I will play some more (though I'm fed up with it now). JB -Original Message- From: Brice Goglin Sent: 29 January 2019 10:35 To: Biddiscombe, John A. ; Hardware locality user list Subject: Re: [hwloc-users] unusual memory binding results Crazy idea: 512 pages could be replaced with a single 2MB huge page. You're not requesting huge pages in your allocation but some systems have transparent huge pages enabled by default (e.g. RHEL https://access.redhat.com/solutions/46111) This could explain why 512 pages get allocated on the same node, but it wouldn't explain crazy patterns you've seen in the past. Brice Le 29/01/2019 à 10:23, Biddiscombe, John A. a écrit : > I simplified things and instead of writing to a 2D array, I allocate a 1D > array of bytes and touch pages in a linear fashion. > Then I call syscall(NR)move_pages, ) and retrieve a status array for each > page in the data. > > When I allocate 511 pages and touch alternate pages on alternate numa > nodes > > Numa page binding 511 > 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 > 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 > 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 > 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 > 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 > 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 > 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 > 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 > 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 > 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 > 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 > 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 > 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 > 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 > 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 > > but as soon as I increase to 512 pages, it breaks. > > Numa page binding 512 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > > On the 8 numa node machine it sometimes gives the right answer even with 512 > pages. > > Still baffled > > JB > > -Original Message- > From: hwloc-users On Behalf Of > Biddiscombe, John A. 
> Sent: 28 January 2019 16:14 > To: Brice Goglin > Cc: Hardware locality user list > Subject: Re: [hwloc-users] unusual memory binding results > > Brice > >> Can you print the pattern before and after thread 1 touched its pages, or >> even in the middle ? >> It looks like somebody is touching too many pages here. > Experimenting with different threads touching one or more pages, I get > unpredicatable results > > here on the 8 numa node device, the result is perfect. I am only > allowing thread 3 and 7 to write a single memory location > > get_numa_domain() 8 Domain Numa pattern > > > > 3--- > > > > 7--- > > > > Contents of memory locations > 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 > 26 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 > 63 0 0 0 0 0 0 0 > > > you can see that core 26 (numa domain 3) wrote to memory, and so did > core 63 (domain 8) > > Now I run it a second time
Re: [hwloc-users] unusual memory binding results
I simplified things and instead of writing to a 2D array, I allocate a 1D array of bytes and touch pages in a linear fashion. Then I call syscall(NR)move_pages, ) and retrieve a status array for each page in the data. When I allocate 511 pages and touch alternate pages on alternate numa nodes Numa page binding 511 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 but as soon as I increase to 512 pages, it breaks. Numa page binding 512 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 On the 8 numa node machine it sometimes gives the right answer even with 512 pages. Still baffled JB -Original Message- From: hwloc-users On Behalf Of Biddiscombe, John A. Sent: 28 January 2019 16:14 To: Brice Goglin Cc: Hardware locality user list Subject: Re: [hwloc-users] unusual memory binding results Brice >Can you print the pattern before and after thread 1 touched its pages, or even >in the middle ? >It looks like somebody is touching too many pages here. Experimenting with different threads touching one or more pages, I get unpredicatable results here on the 8 numa node device, the result is perfect. 
I am only allowing thread 3 and 7 to write a single memory location get_numa_domain() 8 Domain Numa pattern 3--- 7--- Contents of memory locations 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 26 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 63 0 0 0 0 0 0 0 you can see that core 26 (numa domain 3) wrote to memory, and so did core 63 (domain 8) Now I run it a second time and look, its rubbish get_numa_domain() 8 Domain Numa pattern 3--- 3--- 3--- 3--- 3--- 3--- 3--- 3--- Contents of memory locations 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 26 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 63 0 0 0 0 0 0 0 after allowing the data to be read by a random thread 3777 3777 3777 3777 3777 3777 3777 3777 I'm baffled. JB ___ hwloc-users mailing list hwloc-users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/hwloc-users ___ hwloc-users mailing list hwloc-users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/hwloc-users
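A self-contained sketch of the per-page query behind the "Numa page binding" dumps above: move_pages() with a null node array acts as a pure status query, and it must be given an array of individual page addresses rather than the base pointer of the buffer (the helper is illustrative, not the exact code from the thread):

#include <sys/syscall.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

// Return, for each page of [base, base + npages * pagesize), the NUMA node it
// currently resides on. Entries < 0 are -errno, e.g. -ENOENT for pages that
// have never been touched and therefore have no physical backing yet.
std::vector<int> page_numa_nodes(void* base, std::size_t npages)
{
    std::size_t pagesize = static_cast<std::size_t>(sysconf(_SC_PAGE_SIZE));
    std::vector<void*> pages(npages);
    std::vector<int> status(npages, -1);
    for (std::size_t i = 0; i < npages; ++i)
        pages[i] = static_cast<char*>(base) + i * pagesize;
    // pid 0 = current process, nodes = nullptr = query only, flags = 0.
    if (syscall(SYS_move_pages, 0, static_cast<unsigned long>(npages),
                pages.data(), nullptr, status.data(), 0) != 0)
        std::perror("move_pages");
    return status;
}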
Re: [hwloc-users] unusual memory binding results
Brice

>Can you print the pattern before and after thread 1 touched its pages, or even in the middle ?
>It looks like somebody is touching too many pages here.

Experimenting with different threads touching one or more pages, I get unpredictable results.

Here on the 8 numa node device, the result is perfect. I am only allowing threads 3 and 7 to write a single memory location

get_numa_domain() 8 Domain Numa pattern
3---
7---

Contents of memory locations
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
26 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
63 0 0 0 0 0 0 0

you can see that core 26 (numa domain 3) wrote to memory, and so did core 63 (domain 7)

Now I run it a second time and look, it's rubbish

get_numa_domain() 8 Domain Numa pattern
3---
3---
3---
3---
3---
3---
3---
3---

Contents of memory locations
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
26 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
63 0 0 0 0 0 0 0

after allowing the data to be read by a random thread

3777 3777 3777 3777 3777 3777 3777 3777

I'm baffled.

JB
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users
Re: [hwloc-users] unusual memory binding results
Brice I might have been using the wrong params to hwloc_get_area_memlocation in my original version, but I bypassed it and have been calling int get_numa_domain(void *page) { HPX_ASSERT( (std::size_t(page) & 4095) ==0 ); void *pages[1] = { page }; int status[1] = { -1 }; if (syscall(__NR_move_pages, 0, 1, pages, nullptr, status, 0) == 0) { if (status[0]>=0 && status[0]<=HPX_HAVE_MAX_NUMA_DOMAIN_COUNT) { return status[0]; } return -1; } throw std::runtime_error("Failed to get numa node for page"); } this function instead. Just testing one page address at a time. I still see this kind of pattern 00101101010010101001010101011010011011010101110101110111010101010101 00101101010010101001010101011010011011010101110101110111010101010101 00101101010010101001010101011010011011010101110101110111010101010101 00101101010010101001010101011010011011010101110101110111010101010101 00101101010010101001010101011010011011010101110101110111010101010101 00101101010010101001010101011010011011010101110101110111010101010101 00101101010010101001010101011010011011010101110101110111010101010101 00101101010010101001010101011010011011010101110101110111010101010101 00101101010010101001010101011010011011010101110101110111010101010101 00101101010010101001010101011010011011010101110101110111010101010101 00101101010010101001010101011010011011010101110101110111010101010101 when I should see 01010101010101010101010101010101010101010101010101010101010101010101010101010101 10101010101010101010101010101010101010101010101010101010101010101010101010101010 01010101010101010101010101010101010101010101010101010101010101010101010101010101 10101010101010101010101010101010101010101010101010101010101010101010101010101010 01010101010101010101010101010101010101010101010101010101010101010101010101010101 10101010101010101010101010101010101010101010101010101010101010101010101010101010 01010101010101010101010101010101010101010101010101010101010101010101010101010101 10101010101010101010101010101010101010101010101010101010101010101010101010101010 01010101010101010101010101010101010101010101010101010101010101010101010101010101 10101010101010101010101010101010101010101010101010101010101010101010101010101010 I am deeply troubled by this and can't think of what to try next since I can see the memory contents hold the correct CPU ID of the thread that touched the memory, so either the syscall is wrong, or the kernel is doing something else. I welcome any suggestions on what might be wrong. Thanks for trying to help. JB -Original Message- From: Brice Goglin Sent: 26 January 2019 10:19 To: Biddiscombe, John A. Cc: Hardware locality user list Subject: Re: [hwloc-users] unusual memory binding results Le 25/01/2019 à 23:16, Biddiscombe, John A. a écrit : >> move_pages() returning 0 with -14 in the status array? As opposed to >> move_pages() returning -1 with errno set to 14, which would definitely be a >> bug in hwloc. > I think it was move_pages returning zero with -14 in the status array, and > then hwloc returning 0 with an empty nodeset (which I then messed up by > calling get bitmap first and assuming 0 meant numa node zero and not checking > for an empty nodeset). > > I'm not sure why I get -EFAULT status rather than -NOENT, but that's what I'm > seeing in the status field when I pass the pointer returned from the > alloc_membind call. 
The only reason I see for getting -EFAULT there would be that you pass the buffer to move_pages (what hwloc_get_area_memlocation() wants, a start pointer and length) instead of a pointer to an array of page addresses (move_pages wants a void** pointing to individual pages). Brice ___ hwloc-users mailing list hwloc-users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/hwloc-users
Re: [hwloc-users] unusual memory binding results
Dear List/Brice

I experimented with disabling the memory touch on threads except for N=1,2,3,4 etc and found a problem in hwloc, which is that the function hwloc_get_area_memlocation was returning '0' when the status of the memory null move operation was -14 (#define EFAULT 14 /* Bad address */). This was when I call get area memlocation immediately after allocating and then 'not' touching. I think if the status is an error, then the function should probably return -1, but anyway. I'll file a bug and send a patch if this is considered to be a bug.

I then modified the test routine to write the value returned from sched_getcpu into the touched memory location, to verify that the thread binding was doing the right thing. The output below from the AMD 8 numanode machine looks good, with threads 0,8,16 etc each touching memory which follows the pattern expected from the 8 numanode test.

My get_numa_domain function, however, does not reflect the right numanode. It looks correct for the first column (matrices are stored in column major order), but after that it falls to pieces. In this test, I'm allocating tiles as 512x512 doubles, so 4096 bytes per tile column, giving one tile column per page, and I do 512 pages per tile. All the memory locations check out and the patterns seem fine, but the call to

// edited version of the one in hwloc source
syscall(__NR_move_pages, 0, 1, pages, nullptr, status, 0)

is not returning the numanode that I expect to see from the first touch when it is enabled. Either the syscall is wrong, or the first touch/next touch doesn't work (could the alloc routine be wrong?)

hwloc_alloc_membind(topo, len, bitmap->get_bmp(), (hwloc_membind_policy_t)(policy), flags | HWLOC_MEMBIND_BYNODESET);

where the nodeset should match the numanode mask (I will double check that right now).

Any ideas on what to try next?

Thanks

JB

get_numa_domain() 8 Domain Numa pattern
00740640
10740640
20740640
30740640
40740640
50740640
60740640
70740640

Contents of memory locations = sched_getcpu()
0 8 16 24 32 40 48 56
8 16 24 32 40 48 56 0
16 24 32 40 48 56 0 8
24 32 40 48 56 0 8 16
32 40 48 56 0 8 16 24
40 48 56 0 8 16 24 32
48 56 0 8 16 24 32 40
56 0 8 16 24 32 40 48

Expected 8 Domain Numa pattern
01234567
12345670
23456701
34567012
45670123
56701234
67012345
70123456
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users
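For comparison, a minimal sketch of allocating memory bound to a single NUMA node through the nodeset interface used in the snippet above; HWLOC_MEMBIND_BIND is used here for illustration, whereas a first-touch test would allocate with a default policy and let the touching thread decide placement:

#include <hwloc.h>

// Allocate 'len' bytes whose pages are bound to the NUMA node with OS index
// 'node_index', passing a nodeset (HWLOC_MEMBIND_BYNODESET) rather than a cpuset.
void* alloc_on_node(hwloc_topology_t topo, size_t len, unsigned node_index)
{
    hwloc_bitmap_t nodeset = hwloc_bitmap_alloc();
    hwloc_bitmap_only(nodeset, node_index);
    void* ptr = hwloc_alloc_membind(topo, len, nodeset,
                                    HWLOC_MEMBIND_BIND,
                                    HWLOC_MEMBIND_BYNODESET);
    hwloc_bitmap_free(nodeset);
    return ptr; // nullptr on failure
}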
Re: [hwloc-users] unusual memory binding results
Brice Apologies, I didn't explain it very well, I do make sure that if the tile size 256*8 < 4096 (pagesize), then I double the number of tiles per page, I just wanted to keep the explanation simple. here are some code snippets to give you the flavour of it initializing the helper sruct matrix_numa_binder(std::size_t Ncols, std::size_t Nrows, std::size_t Ntile, std::size_t Ntiles_per_domain, std::size_t Ncolprocs=1, std::size_t Nrowprocs=1, std::string pool_name="default" ) : cols_(Ncols), rows_(Nrows), tile_size_(Ntile), tiles_per_domain_(Ntiles_per_domain), colprocs_(Ncolprocs), rowprocs_(Nrowprocs) { using namespace hpx::compute::host; binding_helper::pool_name_ = pool_name; const int CACHE_LINE_SIZE = sysconf (_SC_LEVEL1_DCACHE_LINESIZE); const int PAGE_SIZE = sysconf(_SC_PAGE_SIZE); const int ALIGNMENT = std::max(PAGE_SIZE,CACHE_LINE_SIZE); const int ELEMS_ALIGN = (ALIGNMENT/sizeof(T)); rows_page_= ELEMS_ALIGN; leading_dim_ = ELEMS_ALIGN*((rows_*sizeof(T) + ALIGNMENT-1)/ALIGNMENT); tiles_per_domain_ = std::max(tiles_per_domain_, ELEMS_ALIGN/tile_size_); } operator called by allocator which returns the domain index to bind a page to virtual std::size_t operator ()( const T * const base_ptr, const T * const page_ptr, const std::size_t pagesize, const std::size_t domains) const override { std::size_t offset = (page_ptr - base_ptr); std::size_t col = (offset / leading_dim_); std::size_t row = (offset % leading_dim_); std::size_t index = (col / (tile_size_ * tiles_per_domain_)); if ((tile_size_*tiles_per_domain_*sizeof(T))>=pagesize) { index += (row / (tile_size_ * tiles_per_domain_)); } else { HPX_ASSERT(0); } return index % domains; } this function is called by each thread (one per numa domain) and if the domain returned by the page query matches the domain ID of the thread/task then the first memory location on the page is written to for (size_type i=0; ioperator()(p, page_ptr, pagesize, nodesets.size()); if (dom==numa_domain) { // trigger a memory read and rewrite without changing contents volatile char* vaddr = (volatile char*) page_ptr; *vaddr = T(0); // *vaddr; } page_ptr += pageN; } All of this has been debugged quite extensively and I can write numbers to memory and read them back and the patterns always match the domains expected. This function is called after all data is written to attempt to verify (and display the patterns above) int topology::get_numa_domain(const void *addr) const { #if HWLOC_API_VERSION >= 0x00010b06 hpx_hwloc_bitmap_wrapper *nodeset = topology::bitmap_storage_.get(); if (nullptr == nodeset) { hwloc_bitmap_t nodeset_ = hwloc_bitmap_alloc(); topology::bitmap_storage_.reset(new hpx_hwloc_bitmap_wrapper(nodeset_)); nodeset = topology::bitmap_storage_.get(); } // hwloc_nodeset_t ns = reinterpret_cast(nodeset->get_bmp()); int ret = hwloc_get_area_memlocation(topo, addr, 1, ns, HWLOC_MEMBIND_BYNODESET); if (ret<0) { std::string msg(strerror(errno)); HPX_THROW_EXCEPTION(kernel_error , "hpx::threads::topology::get_numa_domain" , "hwloc_get_area_memlocation failed " + msg); return -1; } // this uses hwloc directly //int bit = hwloc_bitmap_first(ns); //return bit // this uses an alternative method, both give the same result AFAICT threads::mask_type mask = bitmap_to_mask(ns, HWLOC_OBJ_NUMANODE); return static_cast(threads::find_first(mask)); #else return 0; #endif } Thanks for taking the time to look it over JB ___ hwloc-users mailing list hwloc-users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/hwloc-users
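A self-contained restatement of the page-to-domain index calculation described above (column-major layout, each column padded to a whole number of pages; the names follow the snippet and the guard mirrors its logic, so this is illustrative rather than the exact code):

#include <cstddef>

// Map the start of a page to the NUMA domain that should own it, for a
// column-major matrix whose padded column length is 'leading_dim' elements.
template <typename T>
std::size_t page_to_domain(const T* base_ptr, const T* page_ptr,
                           std::size_t leading_dim,      // padded column length, in elements
                           std::size_t tile_size,        // tile edge, in elements
                           std::size_t tiles_per_domain,
                           std::size_t pagesize,         // in bytes
                           std::size_t domains)
{
    std::size_t offset = static_cast<std::size_t>(page_ptr - base_ptr);
    std::size_t col = offset / leading_dim;
    std::size_t row = offset % leading_dim;
    std::size_t index = col / (tile_size * tiles_per_domain);
    // Rows only select different domains within a column when a domain's
    // block of tiles covers at least one whole page.
    if (tile_size * tiles_per_domain * sizeof(T) >= pagesize)
        index += row / (tile_size * tiles_per_domain);
    return index % domains;
}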
Re: [hwloc-users] question about hwloc_set_area_membind_nodeset
Running my test on another machine (fedora 7.2) I get a hwloc_get_area_memlocation fail with strerror = "Function not implemented" Does this mean that the OS has not implemented it (I'm using 1.11.8 hwloc version - on the primary test machine I used 1.11.17) - am I doomed? - or will things magically work if I upgrade to hwloc 2.0 etc etc Thanks JB -Original Message- From: hwloc-users [mailto:hwloc-users-boun...@lists.open-mpi.org] On Behalf Of Biddiscombe, John A. Sent: 13 November 2017 15:37 To: Hardware locality user list <hwloc-users@lists.open-mpi.org> Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset It's working and I'm seeing the binding pattern I hoped for. Thanks again JB From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of Brice Goglin [brice.gog...@inria.fr] Sent: 13 November 2017 15:32 To: Hardware locality user list Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset The doc is wrong, flags are used, only for BY_NODESET. I actually fixed that in git very recently. Brice Le 13/11/2017 07:24, Biddiscombe, John A. a écrit : > In the documentation for get_area_memlocation it says "If > HWLOC_MEMBIND_BYNODESET is specified, set is considered a nodeset. Otherwise > it's a cpuset." > > but it also says "Flags are currently unused." > > so where should the BY_NODESET policy be used? Does it have to be used with > the original alloc call? > > thanks > > JB > > > From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf > of Biddiscombe, John A. [biddi...@cscs.ch] > Sent: 13 November 2017 14:59 > To: Hardware locality user list > Subject: Re: [hwloc-users] question about > hwloc_set_area_membind_nodeset > > Brice > > aha. thanks. I knew I'd seen a function for that, but couldn't remember what > it was. > > Cheers > > JB > > From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf > of Brice Goglin [brice.gog...@inria.fr] > Sent: 13 November 2017 14:57 > To: Hardware locality user list > Subject: Re: [hwloc-users] question about > hwloc_set_area_membind_nodeset > > Use get_area_memlocation() > > membind() returns where the pages are *allowed* to go (anywhere) > memlocation() returns where the pages are actually allocated. > > Brice > > > > > Le 13/11/2017 06:52, Biddiscombe, John A. a écrit : >> Thank you to you both. >> >> I modified the allocator to allocate one large block using >> hwloc_alloc and then use one thread per numa domain to touch each >> page according to the tiling pattern - unfortunately, I hadn't >> appreciated that now hwloc_get_area_membind_nodeset always returns >> the full machine numa mask - and not the numa domain that the page >> was touched by (I guess it only gives the expected answer when >> set_area_membind is used first) >> >> I had hoped to use a dynamic query of the pages (using the first one of a >> given tile) to schedule each task that operates on a given tile to run on >> the numa node that touched it. >> >> I can work around this by using a matrix offset calculation to get the numa >> node, but if there's a way of querying the page directly - then please let >> me know. >> >> Thanks >> >> JB >> >> From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf >> of Samuel Thibault [samuel.thiba...@inria.fr] >> Sent: 12 November 2017 10:48 >> To: Hardware locality user list >> Subject: Re: [hwloc-users] question about >> hwloc_set_area_membind_nodeset >> >> Brice Goglin, on dim. 12 nov. 2017 05:19:37 +0100, wrote: >>> That's likely what's happening. 
Each set_area() may be creating a >>> new "virtual memory area". The kernel tries to merge them with >>> neighbors if they go to the same NUMA node. Otherwise it creates a new VMA. >> Mmmm, that sucks. Ideally we'd have a way to ask the kernel not to >> strictly bind the memory, but just to allocate on a given memory >> node, and just hope that the allocation will not go away (e.g. due to >> swapping), which thus doesn't need a VMA to record the information. >> As you describe below, first-touch achieves that but it's not >> necessarily so convenient. >> >>> I can't find the exact limit but it's something like 64k so I guess >>> you're exhausting that. >> It's sysctl vm.max_map_count >> >>> Question 2 : Is there a better way of achiev
Re: [hwloc-users] question about hwloc_set_area_membind_nodeset
In the documentation for get_area_memlocation it says "If HWLOC_MEMBIND_BYNODESET is specified, set is considered a nodeset. Otherwise it's a cpuset." but it also says "Flags are currently unused." so where should the BY_NODESET policy be used? Does it have to be used with the original alloc call? thanks JB From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of Biddiscombe, John A. [biddi...@cscs.ch] Sent: 13 November 2017 14:59 To: Hardware locality user list Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset Brice aha. thanks. I knew I'd seen a function for that, but couldn't remember what it was. Cheers JB From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of Brice Goglin [brice.gog...@inria.fr] Sent: 13 November 2017 14:57 To: Hardware locality user list Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset Use get_area_memlocation() membind() returns where the pages are *allowed* to go (anywhere) memlocation() returns where the pages are actually allocated. Brice Le 13/11/2017 06:52, Biddiscombe, John A. a écrit : > Thank you to you both. > > I modified the allocator to allocate one large block using hwloc_alloc and > then use one thread per numa domain to touch each page according to the > tiling pattern - unfortunately, I hadn't appreciated that now > hwloc_get_area_membind_nodeset > always returns the full machine numa mask - and not the numa domain that the > page was touched by (I guess it only gives the expected answer when > set_area_membind is used first) > > I had hoped to use a dynamic query of the pages (using the first one of a > given tile) to schedule each task that operates on a given tile to run on the > numa node that touched it. > > I can work around this by using a matrix offset calculation to get the numa > node, but if there's a way of querying the page directly - then please let me > know. > > Thanks > > JB > > From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of > Samuel Thibault [samuel.thiba...@inria.fr] > Sent: 12 November 2017 10:48 > To: Hardware locality user list > Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset > > Brice Goglin, on dim. 12 nov. 2017 05:19:37 +0100, wrote: >> That's likely what's happening. Each set_area() may be creating a new >> "virtual >> memory area". The kernel tries to merge them with neighbors if they go to the >> same NUMA node. Otherwise it creates a new VMA. > Mmmm, that sucks. Ideally we'd have a way to ask the kernel not to > strictly bind the memory, but just to allocate on a given memory > node, and just hope that the allocation will not go away (e.g. due to > swapping), which thus doesn't need a VMA to record the information. As > you describe below, first-touch achieves that but it's not necessarily > so convenient. > >> I can't find the exact limit but it's something like 64k so I guess >> you're exhausting that. > It's sysctl vm.max_map_count > >> Question 2 : Is there a better way of achieving the result I'm looking >> for >> (such as a call to membind with a stride of some kind to say put N pages >> in >> a row on each domain in alternation). >> >> >> Unfortunately, the interleave policy doesn't have a stride argument. It's one >> page on node 0, one page on node 1, etc. >> >> The only idea I have is to use the first-touch policy: Make sure your buffer >> isn't is physical memory yet, and have a thread on node 0 read the "0" pages, >> and another thread on node 1 read the "1" page. 
> Or "next-touch" if that was to ever get merged into mainline Linux :)
>
> Samuel
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users
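To make the answer above concrete, a minimal sketch of querying where an already-touched range actually resides, with the flags argument carrying HWLOC_MEMBIND_BYNODESET so the result set is a nodeset (the helper name is illustrative):

#include <hwloc.h>

// Return the OS index of the first NUMA node holding pages of [addr, addr+len),
// or -1 if the query fails or the pages have not been physically allocated yet.
int area_numa_node(hwloc_topology_t topo, const void* addr, size_t len)
{
    hwloc_bitmap_t nodeset = hwloc_bitmap_alloc();
    int node = -1;
    if (hwloc_get_area_memlocation(topo, addr, len, nodeset,
                                   HWLOC_MEMBIND_BYNODESET) == 0)
        node = hwloc_bitmap_first(nodeset); // -1 when the nodeset is empty
    hwloc_bitmap_free(nodeset);
    return node;
}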
Re: [hwloc-users] question about hwloc_set_area_membind_nodeset
Brice aha. thanks. I knew I'd seen a function for that, but couldn't remember what it was. Cheers JB From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of Brice Goglin [brice.gog...@inria.fr] Sent: 13 November 2017 14:57 To: Hardware locality user list Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset Use get_area_memlocation() membind() returns where the pages are *allowed* to go (anywhere) memlocation() returns where the pages are actually allocated. Brice Le 13/11/2017 06:52, Biddiscombe, John A. a écrit : > Thank you to you both. > > I modified the allocator to allocate one large block using hwloc_alloc and > then use one thread per numa domain to touch each page according to the > tiling pattern - unfortunately, I hadn't appreciated that now > hwloc_get_area_membind_nodeset > always returns the full machine numa mask - and not the numa domain that the > page was touched by (I guess it only gives the expected answer when > set_area_membind is used first) > > I had hoped to use a dynamic query of the pages (using the first one of a > given tile) to schedule each task that operates on a given tile to run on the > numa node that touched it. > > I can work around this by using a matrix offset calculation to get the numa > node, but if there's a way of querying the page directly - then please let me > know. > > Thanks > > JB > > From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of > Samuel Thibault [samuel.thiba...@inria.fr] > Sent: 12 November 2017 10:48 > To: Hardware locality user list > Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset > > Brice Goglin, on dim. 12 nov. 2017 05:19:37 +0100, wrote: >> That's likely what's happening. Each set_area() may be creating a new >> "virtual >> memory area". The kernel tries to merge them with neighbors if they go to the >> same NUMA node. Otherwise it creates a new VMA. > Mmmm, that sucks. Ideally we'd have a way to ask the kernel not to > strictly bind the memory, but just to allocate on a given memory > node, and just hope that the allocation will not go away (e.g. due to > swapping), which thus doesn't need a VMA to record the information. As > you describe below, first-touch achieves that but it's not necessarily > so convenient. > >> I can't find the exact limit but it's something like 64k so I guess >> you're exhausting that. > It's sysctl vm.max_map_count > >> Question 2 : Is there a better way of achieving the result I'm looking >> for >> (such as a call to membind with a stride of some kind to say put N pages >> in >> a row on each domain in alternation). >> >> >> Unfortunately, the interleave policy doesn't have a stride argument. It's one >> page on node 0, one page on node 1, etc. >> >> The only idea I have is to use the first-touch policy: Make sure your buffer >> isn't is physical memory yet, and have a thread on node 0 read the "0" pages, >> and another thread on node 1 read the "1" page. > Or "next-touch" if that was to ever get merged into mainline Linux :) > > Samuel > ___ > hwloc-users mailing list > hwloc-users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/hwloc-users > ___ > hwloc-users mailing list > hwloc-users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/hwloc-users ___ hwloc-users mailing list hwloc-users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/hwloc-users ___ hwloc-users mailing list hwloc-users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/hwloc-users
Re: [hwloc-users] question about hwloc_set_area_membind_nodeset
Thank you to you both. I modified the allocator to allocate one large block using hwloc_alloc and then use one thread per numa domain to touch each page according to the tiling pattern - unfortunately, I hadn't appreciated that now hwloc_get_area_membind_nodeset always returns the full machine numa mask - and not the numa domain that the page was touched by (I guess it only gives the expected answer when set_area_membind is used first) I had hoped to use a dynamic query of the pages (using the first one of a given tile) to schedule each task that operates on a given tile to run on the numa node that touched it. I can work around this by using a matrix offset calculation to get the numa node, but if there's a way of querying the page directly - then please let me know. Thanks JB From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of Samuel Thibault [samuel.thiba...@inria.fr] Sent: 12 November 2017 10:48 To: Hardware locality user list Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset Brice Goglin, on dim. 12 nov. 2017 05:19:37 +0100, wrote: > That's likely what's happening. Each set_area() may be creating a new "virtual > memory area". The kernel tries to merge them with neighbors if they go to the > same NUMA node. Otherwise it creates a new VMA. Mmmm, that sucks. Ideally we'd have a way to ask the kernel not to strictly bind the memory, but just to allocate on a given memory node, and just hope that the allocation will not go away (e.g. due to swapping), which thus doesn't need a VMA to record the information. As you describe below, first-touch achieves that but it's not necessarily so convenient. > I can't find the exact limit but it's something like 64k so I guess > you're exhausting that. It's sysctl vm.max_map_count > Question 2 : Is there a better way of achieving the result I'm looking for > (such as a call to membind with a stride of some kind to say put N pages > in > a row on each domain in alternation). > > > Unfortunately, the interleave policy doesn't have a stride argument. It's one > page on node 0, one page on node 1, etc. > > The only idea I have is to use the first-touch policy: Make sure your buffer > isn't is physical memory yet, and have a thread on node 0 read the "0" pages, > and another thread on node 1 read the "1" page. Or "next-touch" if that was to ever get merged into mainline Linux :) Samuel ___ hwloc-users mailing list hwloc-users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/hwloc-users ___ hwloc-users mailing list hwloc-users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/hwloc-users
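A minimal sketch of the first-touch scheme described above: one thread per NUMA node, bound with hwloc, touches the pages assigned to that node so the kernel places them locally. The round-robin assignment of 'stride' pages per node is illustrative; any tiling pattern can be substituted.

#include <hwloc.h>
#include <cstddef>
#include <thread>
#include <vector>

// One worker thread per NUMA node; each thread binds itself to the cores of
// its node and writes to the first byte of every page assigned to that node.
void first_touch_per_node(hwloc_topology_t topo, char* base,
                          std::size_t npages, std::size_t pagesize,
                          std::size_t stride)
{
    int nnodes = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE);
    std::vector<std::thread> workers;
    for (int n = 0; n < nnodes; ++n) {
        workers.emplace_back([=]() {
            hwloc_obj_t node =
                hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE,
                                      static_cast<unsigned>(n));
            // Run this thread on the cores attached to NUMA node 'n'.
            hwloc_set_cpubind(topo, node->cpuset, HWLOC_CPUBIND_THREAD);
            for (std::size_t p = 0; p < npages; ++p) {
                if ((p / stride) % static_cast<std::size_t>(nnodes)
                        == static_cast<std::size_t>(n)) {
                    volatile char* addr = base + p * pagesize;
                    *addr = *addr; // the write forces allocation on this node
                }
            }
        });
    }
    for (auto& t : workers) t.join();
}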
Re: [hwloc-users] HWLOC_VERSION
Brice aha. ok thanks I tried 0x00010b03 but my 1.11.7 build only has 0x00010b00 set so I assumed patch releases were not marked in there. I'll see what I can do about a configure time check. JB -Original Message- From: hwloc-users [mailto:hwloc-users-boun...@lists.open-mpi.org] On Behalf Of Brice Goglin Sent: 30 October 2017 09:26 To: hwloc-users@lists.open-mpi.org Subject: Re: [hwloc-users] HWLOC_VERSION Hello It should have been 0x00010b03 but I forgot to increase it unfortunately (and again in 1.11.6). I need to add this to my release-TODO-list. The upcoming 1.11.9 will have the proper HWLOC_API_VERSION (0x00010b06 unless we had something) so that people can at least check for these features in newer releases... If you have some configure checks for hwloc, you could add something there to workaround the issue. Sorry Brice Le 30/10/2017 09:09, Biddiscombe, John A. a écrit : > Dear list > > According to the release notes Add HWLOC_MEMBIND_BYNODESET flag was > added in 1.11.3 - if I protect some code with > > #if HWLOC_API_VERSION >= 0x00010b00 > then versions 1.11.0, 1.11.1, 1.11.2 still cause build failures. > > is there some VERSION flag that distinguishes between the patch version > releases? > > thanks > > JB > > ___ hwloc-users mailing list hwloc-users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/hwloc-users ___ hwloc-users mailing list hwloc-users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/hwloc-users
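One way to do the configure-time check mentioned above without relying on HWLOC_API_VERSION is a small compile test: a probe source that only builds when the installed hwloc provides the nodeset interface (how the probe is wired into the build system is left open):

// Probe program for a configure-time check: compiles only when the installed
// hwloc provides HWLOC_MEMBIND_BYNODESET (added in 1.11.3), independently of
// what HWLOC_API_VERSION claims.
#include <hwloc.h>

int main()
{
    hwloc_membind_flags_t flag = HWLOC_MEMBIND_BYNODESET;
    (void)flag;
    return 0;
}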
[hwloc-users] HWLOC_VERSION
Dear list According to the release notes Add HWLOC_MEMBIND_BYNODESET flag was added in 1.11.3 - if I protect some code with #if HWLOC_API_VERSION >= 0x00010b00 then versions 1.11.0, 1.11.1, 1.11.2 still cause build failures. is there some VERSION flag that distinguishes between the patch version releases? thanks JB -- John Biddiscombe,email:biddisco @.at.@ cscs.ch http://www.cscs.ch/ CSCS, Swiss National Supercomputing Centre | Tel: +41 (91) 610.82.07 Via Trevano 131, 6900 Lugano, Switzerland | Fax: +41 (91) 610.82.82 ___ hwloc-users mailing list hwloc-users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/hwloc-users
Re: [hwloc-users] BGQ question.
Just as a follow up to this thread. I spoke with someone from IBM and they tell me that 2 cores of 4 hardware threads each are hidden from the kernel (how do they do that?) and used for the custom HS4 cards we have installed on the IO nodes, which explains why I see only 60 instead of 68 threads. the 2 bgvrnic tasks I see spinning at 100% run on threads 58/59 and service the connection from ION to CN. It looks as though everything is reporting as expected - as long as I compile hwloc on the ION itself, it seems to be correct. Thanks and sorry for any misunderstanding JB > -Original Message- > From: hwloc-users [mailto:hwloc-users-boun...@open-mpi.org] On Behalf > Of Chris Samuel > Sent: 26 March 2014 13:42 > To: Hardware locality user list > Subject: Re: [hwloc-users] BGQ question. > > On Wed, 26 Mar 2014 11:56:08 AM Biddiscombe, John A. wrote: > > > I can’t test this as the system is down for maintenance, but if memory > > serves me correctly, the GCC compiled lstopo also showed 60 cores > > instead of 64/68. > > It can only report what the kernel reports and it appears your kernel is not > reporting the same number of cores on an IO node as ours. > > It would be interesting to compare kernel version and boot command line. > > Ours are: > > -bash-4.1# uname -a > Linux r00-id-j01.pcf.vlsci.unimelb.edu.au 2.6.32- > 279.14.1.bgq.el6_V1R2M0_36.ppc64 #1 SMP Tue Jun 11 15:50:53 CDT 2013 > ppc64 ppc64 ppc64 GNU/Linux > > > -bash-4.1# cat /proc/cmdline > root=/dev/ram0 rdinit=/init raid=noautodetect loglevel=5 > > > This is the end of our /proc/cpuinfo showing 68 hardware threads > (17 cores exposed). > > -bash-4.1# tail -n 9 /proc/cpuinfo > > processor : 67 > cpu : A2 (Blue Gene/Q) > clock : 1600.00MHz > revision: 2.0 (pvr 0049 0200) > > timebase: 16 > platform: Blue Gene/Q > model : ibm,bluegeneq > > > > I am not certain if this gcc was in any was ‘special’ for bgq. > > There is a GCC cross compiler, but it's not the /usr/bin/gcc one. > > cheers! > Chris > -- > Christopher SamuelSenior Systems Administrator > VLSCI - Victorian Life Sciences Computation Initiative > Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 > http://www.vlsci.org.au/ http://twitter.com/vlsci > > ___ > hwloc-users mailing list > hwloc-us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
Re: [hwloc-users] BGQ question.
Chris > Out of interest, why on an I/O node? I'm targeting the BGQ BGAS nodes with flash cards installed. We've done tests with GPFS mounted on the flash and are trying to get comparable results with an in-house driver. JB
Re: [hwloc-users] BGQ question.
Brice Looking at /proc/cpuinfo on the io node itself, I see only 60 cores listed. I wonder if they’ve reserved one socket of 4 cores for IO purposes and in fact hwloc is seeing the correct information. Attached is the foo zip of the run just now (assuming it doesn’t bounce) JB From: Brice Goglin [mailto:brice.gog...@inria.fr] Sent: 25 March 2014 09:28 To: Biddiscombe, John A.; Hardware locality user list Subject: Re: [hwloc-users] BGQ question. Can you run hwloc-gather-topology foo and send the resulting foo.tar.bz2 ? If the tarball is too bug, feel free to send it to me in a private mail. Brice Le 25/03/2014 08:55, Biddiscombe, John A. a écrit : Brice, Correct : The IO nodes are running a full linux install (RHE 6.4) on the same hardware as the CNK nodes. On vesta I do not have an account and I am not certain the IO nodes are available for direct login. I’m using the BGQ at CSCS which is an EPFL machine. The IO nodes are open for some special projects where we are trying to customise the IO. JB From: Brice Goglin [mailto:brice.gog...@inria.fr] Sent: 25 March 2014 08:43 To: Hardware locality user list; Biddiscombe, John A. Subject: Re: [hwloc-users] BGQ question. Wait, I missed the "io node" part of your first mail. The bgq support is for compute nodes running cnk. Are io nodes running linux on same hardware as the compute nodes? I have an account on vesta. Where should I logon to have a look? Brice On 25 mars 2014 08:12:58 UTC+01:00, "Biddiscombe, John A." <biddi...@cscs.ch<mailto:biddi...@cscs.ch>> wrote: Brice, lstopo --whole-system gives the same output and setting env var BG_THREADMODEL=2 does not appear to make any visible difference. my configure command for compiling hwloc had no special options, ./configure --prefix=/gpfs/bbp.cscs.ch/home/biddisco/apps/clang/hwloc-1.8.1 should I rerun with something set? Thanks JB From: hwloc-users [mailto:hwloc-users-boun...@open-mpi.org] On Behalf Of Brice Goglin Sent: 25 March 2014 08:04 To: Hardware locality user list Subject: Re: [hwloc-users] BGQ question. Le 25/03/2014 07:51, Biddiscombe, John A. a écrit : I’m compiling hwloc using clang (bgclang++11 from ANL) to run on IO nodes af a BGQ. It seems to have compiled ok, and when I run lstopo, I get an output like this (below), which looks reasonable, but there are 15 sockets instead of 16. I’m a little worried because the first time I compiled, I had problems where apps would report an error from HWLOC on start and tell me to set HWLOC_FORCE_BGQ=1. when I did set this env var, it would then report that “topology became empty” and the app would segfault due to the unexpected return from hwloc presumably. Can you give a bit more details on what you did there? I'd like to check if that case should be better supported or not. I wiped everything and recompiled (not sure what I did differently), and now it behaves more sensibly, but with 15 instead of 16 sockets. Should IO be worried? The topology detection is hardwired so you shouldn't worried on the hardware side. The problem could be related to how you reserved resources before running lstopo. Does lstopo --whole-system see more sockets? Does BG_THREADMODEL=2 help? Brice hwloc-users mailing list hwloc-us...@open-mpi.org<mailto:hwloc-us...@open-mpi.org> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users foo.tar.bz2 Description: foo.tar.bz2
Re: [hwloc-users] BGQ question.
Brice, Correct : The IO nodes are running a full linux install (RHE 6.4) on the same hardware as the CNK nodes. On vesta I do not have an account and I am not certain the IO nodes are available for direct login. I’m using the BGQ at CSCS which is an EPFL machine. The IO nodes are open for some special projects where we are trying to customise the IO. JB From: Brice Goglin [mailto:brice.gog...@inria.fr] Sent: 25 March 2014 08:43 To: Hardware locality user list; Biddiscombe, John A. Subject: Re: [hwloc-users] BGQ question. Wait, I missed the "io node" part of your first mail. The bgq support is for compute nodes running cnk. Are io nodes running linux on same hardware as the compute nodes? I have an account on vesta. Where should I logon to have a look? Brice On 25 mars 2014 08:12:58 UTC+01:00, "Biddiscombe, John A." <biddi...@cscs.ch<mailto:biddi...@cscs.ch>> wrote: Brice, lstopo --whole-system gives the same output and setting env var BG_THREADMODEL=2 does not appear to make any visible difference. my configure command for compiling hwloc had no special options, ./configure --prefix=/gpfs/bbp.cscs.ch/home/biddisco/apps/clang/hwloc-1.8.1 should I rerun with something set? Thanks JB From: hwloc-users [mailto:hwloc-users-boun...@open-mpi.org] On Behalf Of Brice Goglin Sent: 25 March 2014 08:04 To: Hardware locality user list Subject: Re: [hwloc-users] BGQ question. Le 25/03/2014 07:51, Biddiscombe, John A. a écrit : I’m compiling hwloc using clang (bgclang++11 from ANL) to run on IO nodes af a BGQ. It seems to have compiled ok, and when I run lstopo, I get an output like this (below), which looks reasonable, but there are 15 sockets instead of 16. I’m a little worried because the first time I compiled, I had problems where apps would report an error from HWLOC on start and tell me to set HWLOC_FORCE_BGQ=1. when I did set this env var, it would then report that “topology became empty” and the app would segfault due to the unexpected return from hwloc presumably. Can you give a bit more details on what you did there? I'd like to check if that case should be better supported or not. I wiped everything and recompiled (not sure what I did differently), and now it behaves more sensibly, but with 15 instead of 16 sockets. Should IO be worried? The topology detection is hardwired so you shouldn't worried on the hardware side. The problem could be related to how you reserved resources before running lstopo. Does lstopo --whole-system see more sockets? Does BG_THREADMODEL=2 help? Brice hwloc-users mailing list hwloc-us...@open-mpi.org<mailto:hwloc-us...@open-mpi.org> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
Re: [hwloc-users] BGQ question.
Brice, lstopo --whole-system gives the same output and setting env var BG_THREADMODEL=2 does not appear to make any visible difference. my configure command for compiling hwloc had no special options, ./configure --prefix=/gpfs/bbp.cscs.ch/home/biddisco/apps/clang/hwloc-1.8.1 should I rerun with something set? Thanks JB From: hwloc-users [mailto:hwloc-users-boun...@open-mpi.org] On Behalf Of Brice Goglin Sent: 25 March 2014 08:04 To: Hardware locality user list Subject: Re: [hwloc-users] BGQ question. Le 25/03/2014 07:51, Biddiscombe, John A. a écrit : I'm compiling hwloc using clang (bgclang++11 from ANL) to run on IO nodes af a BGQ. It seems to have compiled ok, and when I run lstopo, I get an output like this (below), which looks reasonable, but there are 15 sockets instead of 16. I'm a little worried because the first time I compiled, I had problems where apps would report an error from HWLOC on start and tell me to set HWLOC_FORCE_BGQ=1. when I did set this env var, it would then report that "topology became empty" and the app would segfault due to the unexpected return from hwloc presumably. Can you give a bit more details on what you did there? I'd like to check if that case should be better supported or not. I wiped everything and recompiled (not sure what I did differently), and now it behaves more sensibly, but with 15 instead of 16 sockets. Should IO be worried? The topology detection is hardwired so you shouldn't worried on the hardware side. The problem could be related to how you reserved resources before running lstopo. Does lstopo --whole-system see more sockets? Does BG_THREADMODEL=2 help? Brice
[hwloc-users] BGQ question.
I'm compiling hwloc using clang (bgclang++11 from ANL) to run on IO nodes af a BGQ. It seems to have compiled ok, and when I run lstopo, I get an output like this (below), which looks reasonable, but there are 15 sockets instead of 16. I'm a little worried because the first time I compiled, I had problems where apps would report an error from HWLOC on start and tell me to set HWLOC_FORCE_BGQ=1. when I did set this env var, it would then report that "topology became empty" and the app would segfault due to the unexpected return from hwloc presumably. I wiped everything and recompiled (not sure what I did differently), and now it behaves more sensibly, but with 15 instead of 16 sockets. Should IO be worried? Thanks JB Machine (15GB) Socket L#0 + L1d L#0 (16KB) + L1i L#0 (16KB) + Core L#0 PU L#0 (P#0) PU L#1 (P#1) PU L#2 (P#2) PU L#3 (P#3) Socket L#1 + L1d L#1 (16KB) + L1i L#1 (16KB) + Core L#1 PU L#4 (P#4) PU L#5 (P#5) PU L#6 (P#6) PU L#7 (P#7) Socket L#2 + L1d L#2 (16KB) + L1i L#2 (16KB) + Core L#2 PU L#8 (P#8) PU L#9 (P#9) PU L#10 (P#10) PU L#11 (P#11) Socket L#3 + L1d L#3 (16KB) + L1i L#3 (16KB) + Core L#3 PU L#12 (P#12) PU L#13 (P#13) PU L#14 (P#14) PU L#15 (P#15) Socket L#4 + L1d L#4 (16KB) + L1i L#4 (16KB) + Core L#4 PU L#16 (P#16) PU L#17 (P#17) PU L#18 (P#18) PU L#19 (P#19) Socket L#5 + L1d L#5 (16KB) + L1i L#5 (16KB) + Core L#5 PU L#20 (P#20) PU L#21 (P#21) PU L#22 (P#22) PU L#23 (P#23) Socket L#6 + L1d L#6 (16KB) + L1i L#6 (16KB) + Core L#6 PU L#24 (P#24) PU L#25 (P#25) PU L#26 (P#26) PU L#27 (P#27) Socket L#7 + L1d L#7 (16KB) + L1i L#7 (16KB) + Core L#7 PU L#28 (P#28) PU L#29 (P#29) PU L#30 (P#30) PU L#31 (P#31) Socket L#8 + L1d L#8 (16KB) + L1i L#8 (16KB) + Core L#8 PU L#32 (P#32) PU L#33 (P#33) PU L#34 (P#34) PU L#35 (P#35) Socket L#9 + L1d L#9 (16KB) + L1i L#9 (16KB) + Core L#9 PU L#36 (P#36) PU L#37 (P#37) PU L#38 (P#38) PU L#39 (P#39) Socket L#10 + L1d L#10 (16KB) + L1i L#10 (16KB) + Core L#10 PU L#40 (P#40) PU L#41 (P#41) PU L#42 (P#42) PU L#43 (P#43) Socket L#11 + L1d L#11 (16KB) + L1i L#11 (16KB) + Core L#11 PU L#44 (P#44) PU L#45 (P#45) PU L#46 (P#46) PU L#47 (P#47) Socket L#12 + L1d L#12 (16KB) + L1i L#12 (16KB) + Core L#12 PU L#48 (P#48) PU L#49 (P#49) PU L#50 (P#50) PU L#51 (P#51) Socket L#13 + L1d L#13 (16KB) + L1i L#13 (16KB) + Core L#13 PU L#52 (P#52) PU L#53 (P#53) PU L#54 (P#54) PU L#55 (P#55) Socket L#14 + L1d L#14 (16KB) + L1i L#14 (16KB) + Core L#14 PU L#56 (P#56) PU L#57 (P#57) PU L#58 (P#58) PU L#59 (P#59) HostBridge L#0 PCIBridge PCI 1014:0023 -- John Biddiscombe,email:biddisco @.at.@ cscs.ch http://www.cscs.ch/ CSCS, Swiss National Supercomputing Centre | Tel: +41 (91) 610.82.07 Via Trevano 131, 6900 Lugano, Switzerland | Fax: +41 (91) 610.82.82