Can you try again with the touching disabled in one thread, to check
whether the other thread touched only its own pages? (The status of the
other pages should then be -2, i.e. -ENOENT.)
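
For reference, a minimal sketch of such a check, assuming Linux and the
raw move_pages syscall already used further down in this thread. The
helper names page_addresses and query_page_nodes are hypothetical, not
part of hwloc or HPX. Passing a null nodes array makes move_pages only
report where each page currently is, without moving anything.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical helper: one page-aligned address for every page that
// overlaps [buf, buf + len). page_size must be a power of two.
std::vector<void*> page_addresses(void* buf, std::size_t len,
                                  std::size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    std::vector<void*> pages;
    if (len == 0) return pages;

    const std::uintptr_t mask = ~(static_cast<std::uintptr_t>(page_size) - 1);
    const std::uintptr_t start = reinterpret_cast<std::uintptr_t>(buf);
    const std::uintptr_t first = start & mask;
    const std::uintptr_t last = (start + len - 1) & mask;
    const std::size_t count = (last - first) / page_size + 1;

    pages.reserve(count);
    for (std::size_t i = 0; i < count; ++i)
        pages.push_back(reinterpret_cast<void*>(first + i * page_size));
    return pages;
}

// Hypothetical helper: query-only move_pages call on the calling process.
// Returns false if the syscall itself failed (errno is then set);
// otherwise status[i] holds a node number (>= 0) or a negative errno.
bool query_page_nodes(const std::vector<void*>& pages,
                      std::vector<int>& status)
{
    status.assign(pages.size(), -1);
    if (pages.empty()) return true;
    long rc = syscall(__NR_move_pages, 0L, pages.size(), pages.data(),
                      static_cast<int*>(nullptr), status.data(), 0L);
    return rc == 0;
}
```

With only one thread touching, the pages it touched should report a node
number, and the untouched ones should report -ENOENT. A -EFAULT entry
instead means the address is not mapped (or is a zero page).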

Recent kernels can migrate memory at runtime (CONFIG_NUMA_BALANCING),
but this should only happen when the kernel detects that some thread
does a lot of remote accesses, which shouldn't be the case here, at
least at the beginning of the program.
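
To rule this out, one can look at the kernel.numa_balancing sysctl. Below
is a minimal sketch, assuming the usual /proc/sys/kernel/numa_balancing
file (it only exists on kernels built with CONFIG_NUMA_BALANCING). The
function names are hypothetical.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: parse the sysctl value from a stream.
// Returns -1 if no integer could be read.
int parse_numa_balancing(std::istream& in)
{
    int value = -1;
    if (!(in >> value)) return -1;
    return value;
}

// Hypothetical helper: 0 means automatic balancing is off, non-zero
// means it is on, -1 means the file is missing or unreadable.
int read_numa_balancing(
    const std::string& path = "/proc/sys/kernel/numa_balancing")
{
    assert(!path.empty());
    std::ifstream in(path);
    if (!in) return -1;
    return parse_numa_balancing(in);
}
```

If read_numa_balancing() returns a non-zero value, rerunning the test
with the sysctl set to 0 would show whether automatic migration plays
any role in the pattern.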

Brice



On 28/01/2019 at 10:35, Biddiscombe, John A. wrote:
> Brice
>
> I might have been using the wrong params to hwloc_get_area_memlocation in my 
> original version, but I bypassed it and have been calling
>
>         int get_numa_domain(void *page)
>         {
>             HPX_ASSERT( (std::size_t(page) & 4095) ==0 );
>
>             void *pages[1] = { page };
>             int  status[1] = { -1 };
>             if (syscall(__NR_move_pages, 0, 1, pages, nullptr, status, 0) == 0) {
>                 if (status[0]>=0 && status[0]<=HPX_HAVE_MAX_NUMA_DOMAIN_COUNT) {
>                     return status[0];
>                 }
>                 return -1;
>             }
>             throw std::runtime_error("Failed to get numa node for page");
>         }
>
> this function instead. Just testing one page address at a time. I still see 
> this kind of pattern
> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
> when I should see
> 01010101010101010101010101010101010101010101010101010101010101010101010101010101
> 10101010101010101010101010101010101010101010101010101010101010101010101010101010
> 01010101010101010101010101010101010101010101010101010101010101010101010101010101
> 10101010101010101010101010101010101010101010101010101010101010101010101010101010
> 01010101010101010101010101010101010101010101010101010101010101010101010101010101
> 10101010101010101010101010101010101010101010101010101010101010101010101010101010
> 01010101010101010101010101010101010101010101010101010101010101010101010101010101
> 10101010101010101010101010101010101010101010101010101010101010101010101010101010
> 01010101010101010101010101010101010101010101010101010101010101010101010101010101
> 10101010101010101010101010101010101010101010101010101010101010101010101010101010
>
> I am deeply troubled by this and can't think of what to try next since I can 
> see the memory contents hold the correct CPU ID of the thread that touched 
> the memory, so either the syscall is wrong, or the kernel is doing something 
> else. I welcome any suggestions on what might be wrong.
>
> Thanks for trying to help.
>
> JB
>
> -----Original Message-----
> From: Brice Goglin <brice.gog...@inria.fr> 
> Sent: 26 January 2019 10:19
> To: Biddiscombe, John A. <biddi...@cscs.ch>
> Cc: Hardware locality user list <hwloc-users@lists.open-mpi.org>
> Subject: Re: [hwloc-users] unusual memory binding results
>
> On 25/01/2019 at 23:16, Biddiscombe, John A. wrote:
>>> move_pages() returning 0 with -14 in the status array? As opposed to 
>>> move_pages() returning -1 with errno set to 14, which would definitely be a 
>>> bug in hwloc.
>> I think it was move_pages returning zero with -14 in the status array, and 
>> then hwloc returning 0 with an empty nodeset (which I then messed up by 
>> calling get bitmap first and assuming 0 meant numa node zero and not 
>> checking for an empty nodeset).
>>
>> I'm not sure why I get -EFAULT status rather than -ENOENT, but that's what 
>> I'm seeing in the status field when I pass the pointer returned from the 
>> alloc_membind call.
> The only reason I see for getting -EFAULT there would be that you pass the 
> buffer to move_pages (what hwloc_get_area_memlocation() wants, a start 
> pointer and length) instead of a pointer to an array of page addresses 
> (move_pages wants a void** pointing to individual pages).
>
> Brice
>
>
