On 28/01/2019 at 11:28, Biddiscombe, John A. wrote:
> If I disable thread 0 and allow thread 1 then I get this pattern on 1 machine 
> (clearly wrong)
> 11111111111111111111111111111111111111111111111111111111111111111111111111111111
> 11111111111111111111111111111111111111111111111111111111111111111111111111111111
> 11111111111111111111111111111111111111111111111111111111111111111111111111111111
> 11111111111111111111111111111111111111111111111111111111111111111111111111111111
> 11111111111111111111111111111111111111111111111111111111111111111111111111111111


Can you print the pattern before and after thread 1 touches its pages,
or even in the middle?

It looks like somebody is touching too many pages here.
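
A minimal sketch of how such a before/after dump could be produced, assuming
Linux and the raw move_pages syscall already used in this thread. The helper
names (status_char, format_pattern, query_pages) are hypothetical and are not
part of hwloc or HPX.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>
#include <sys/syscall.h>
#include <unistd.h>

// One character per page: a digit for NUMA nodes 0-9, '-' for any negative
// status (e.g. -2 ENOENT or -14 EFAULT), '+' for node ids above 9.
char status_char(int status)
{
    if (status < 0) return '-';
    if (status > 9) return '+';
    return static_cast<char>('0' + status);
}

// Lay the per-page characters out in rows of `width` pages.
std::string format_pattern(const std::vector<int>& status, std::size_t width)
{
    assert(width > 0);
    std::string out;
    for (std::size_t i = 0; i < status.size(); ++i) {
        out += status_char(status[i]);
        if ((i + 1) % width == 0 || i + 1 == status.size())
            out += '\n';
    }
    return out;
}

// Query (without moving) the node of each page in a page-aligned buffer.
// A null `nodes` argument makes move_pages() only fill in the status array.
// Returns false if the syscall itself fails (errno is then set).
bool query_pages(void* base, std::size_t num_pages, std::size_t page_size,
                 std::vector<int>& status)
{
    std::vector<void*> pages(num_pages);
    for (std::size_t i = 0; i < num_pages; ++i)
        pages[i] = static_cast<char*>(base) + i * page_size;
    status.assign(num_pages, -1);
    // pid 0 means the calling process; flags are 0 because nothing is moved.
    return syscall(__NR_move_pages, 0L, num_pages, pages.data(),
                   nullptr, status.data(), 0L) == 0;
}
```

Calling query_pages() on the buffer before thread 1 runs and again afterwards,
and printing format_pattern(status, 80) each time, would show which pages went
from '-' to a node digit in between.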

Brice


> and on another I get
> -1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1
> 1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-
> -1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1
> 1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-
> -1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1
> 1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-
> which is correct because the '-' is a negative status. I will run again and 
> see if it's -14 or -2
>
> JB
>
>
> -----Original Message-----
> From: Brice Goglin <brice.gog...@inria.fr> 
> Sent: 28 January 2019 10:56
> To: Biddiscombe, John A. <biddi...@cscs.ch>
> Cc: Hardware locality user list <hwloc-users@lists.open-mpi.org>
> Subject: Re: [hwloc-users] unusual memory binding results
>
> Can you try again with the touching disabled in one thread, to check whether
> the other thread only touched its own pages? (The others' status should be
> -2 (ENOENT).)
>
> Recent kernels have ways to migrate memory at runtime
> (CONFIG_NUMA_BALANCING) but this should only occur when it detects that some 
> thread does a lot of remote access, which shouldn't be the case here, at 
> least at the beginning of the program.
>
> Brice
>
>
>
> On 28/01/2019 at 10:35, Biddiscombe, John A. wrote:
>> Brice
>>
>> I might have been using the wrong params to hwloc_get_area_memlocation 
>> in my original version, but I bypassed it and have been calling
>>
>>         int get_numa_domain(void *page)
>>         {
>>             HPX_ASSERT( (std::size_t(page) & 4095) ==0 );
>>
>>             void *pages[1] = { page };
>>             int  status[1] = { -1 };
>>             if (syscall(__NR_move_pages, 0, 1, pages, nullptr, status, 0) == 0) {
>>                 if (status[0]>=0 && status[0]<=HPX_HAVE_MAX_NUMA_DOMAIN_COUNT) {
>>                     return status[0];
>>                 }
>>                 return -1;
>>             }
>>             throw std::runtime_error("Failed to get numa node for page");
>>         }
>>
>> this function instead. Just testing one page address at a time. I 
>> still see this kind of pattern
>> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
>> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
>> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
>> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
>> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
>> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
>> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
>> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
>> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
>> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
>> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
>> when I should see
>> 0101010101010101010101010101010101010101010101010101010101010101010101
>> 0101010101
>> 1010101010101010101010101010101010101010101010101010101010101010101010
>> 1010101010
>> 0101010101010101010101010101010101010101010101010101010101010101010101
>> 0101010101
>> 1010101010101010101010101010101010101010101010101010101010101010101010
>> 1010101010
>> 0101010101010101010101010101010101010101010101010101010101010101010101
>> 0101010101
>> 1010101010101010101010101010101010101010101010101010101010101010101010
>> 1010101010
>> 0101010101010101010101010101010101010101010101010101010101010101010101
>> 0101010101
>> 1010101010101010101010101010101010101010101010101010101010101010101010
>> 1010101010
>> 0101010101010101010101010101010101010101010101010101010101010101010101
>> 0101010101
>> 1010101010101010101010101010101010101010101010101010101010101010101010
>> 1010101010
>>
>> I am deeply troubled by this and can't think of what to try next since I can 
>> see the memory contents hold the correct CPU ID of the thread that touched 
>> the memory, so either the syscall is wrong, or the kernel is doing something 
>> else. I welcome any suggestions on what might be wrong.
>>
>> Thanks for trying to help.
>>
>> JB
>>
>> -----Original Message-----
>> From: Brice Goglin <brice.gog...@inria.fr>
>> Sent: 26 January 2019 10:19
>> To: Biddiscombe, John A. <biddi...@cscs.ch>
>> Cc: Hardware locality user list <hwloc-users@lists.open-mpi.org>
>> Subject: Re: [hwloc-users] unusual memory binding results
>>
>> On 25/01/2019 at 23:16, Biddiscombe, John A. wrote:
>>>> move_pages() returning 0 with -14 in the status array? As opposed to 
>>>> move_pages() returning -1 with errno set to 14, which would definitely be 
>>>> a bug in hwloc.
>>> I think it was move_pages returning zero with -14 in the status array, and 
>>> then hwloc returning 0 with an empty nodeset (which I then messed up by 
>>> calling get bitmap first and assuming 0 meant numa node zero and not 
>>> checking for an empty nodeset).
>>>
>>> I'm not sure why I get -EFAULT status rather than -ENOENT, but that's what 
>>> I'm seeing in the status field when I pass the pointer returned from the 
>>> alloc_membind call.
>> The only reason I see for getting -EFAULT there would be that you pass the 
>> buffer to move_pages (what hwloc_get_area_memlocation() wants, a start 
>> pointer and length) instead of a pointer to an array of page addresses 
>> (move_pages wants a void** pointing to individual pages).
>>
>> Brice
>>
>>
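
The thread distinguishes move_pages() itself failing (returning -1 with errno
set) from a successful call that leaves a negative value in the status array.
A rough sketch of decoding those per-page values follows. The function name
describe_status is hypothetical; the meanings are taken from the move_pages(2)
man page, which lists -EFAULT for a zero page as well as for an unmapped area.

```cpp
#include <cerrno>
#include <string>

// Turn one entry of the move_pages() status array into readable text.
// Non-negative values are NUMA node ids; negative values are negated errno
// codes that describe that single page, not the syscall as a whole.
std::string describe_status(int status)
{
    if (status >= 0)
        return "node " + std::to_string(status);
    switch (-status) {
    case ENOENT: return "ENOENT: page is not present";
    case EFAULT: return "EFAULT: zero page, or area not mapped by the process";
    case EACCES: return "EACCES: page is mapped by multiple processes";
    case EBUSY:  return "EBUSY: page is currently busy";
    case ENOMEM: return "ENOMEM: no memory available on the target node";
    default:     return "errno " + std::to_string(-status);
    }
}
```

With this, the -2 and -14 values mentioned above would read as "page is not
present" and "zero page, or area not mapped" respectively.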
