Can you try again with the touching disabled in one thread, to check whether the other thread only touched its own pages? (The status of the other pages should be -2, i.e. -ENOENT.)
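To make that check easier to read, here is a minimal sketch of how the status array could be interpreted. It assumes 4 KiB pages and move_pages() called in query mode (nodes == nullptr). The names PageState, classify_status, page_addresses and count_state are made up for illustration; they are not hwloc or HPX API.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper types and names, for illustration only.
enum class PageState { OnNode, NotTouched, BadAddress, OtherError };

// Interpret one entry of the status array filled by move_pages()
// when it is called with nodes == nullptr (query mode).
PageState classify_status(int status)
{
    if (status >= 0)       return PageState::OnNode;      // NUMA node number
    if (status == -ENOENT) return PageState::NotTouched;  // page not present
    if (status == -EFAULT) return PageState::BadAddress;  // zero page or unmapped
    return PageState::OtherError;                         // e.g. -EBUSY, -EACCES
}

// Build one address per page covering [start, start + length).
// The resulting array (pages.data()) is the form move_pages() expects
// in its `pages` argument, as opposed to a buffer start and length.
std::vector<void*> page_addresses(void* start, std::size_t length,
                                  std::size_t page_size = 4096)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0); // power of two
    std::vector<void*> pages;
    if (length == 0) return pages;
    const auto base  = reinterpret_cast<std::uintptr_t>(start);
    const auto mask  = static_cast<std::uintptr_t>(page_size) - 1;
    const auto first = base & ~mask;                      // round down to page start
    for (auto addr = first; addr < base + length; addr += page_size)
        pages.push_back(reinterpret_cast<void*>(addr));
    return pages;
}

// Count how many status entries fall into a given state.
std::size_t count_state(const std::vector<int>& status, PageState wanted)
{
    std::size_t n = 0;
    for (int s : status)
        if (classify_status(s) == wanted) ++n;
    return n;
}
```

With touching disabled in one thread, count_state(status, PageState::NotTouched) should then equal the number of pages that thread would normally have touched.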
Recent kernels can migrate memory at runtime (CONFIG_NUMA_BALANCING), but this should only happen when the kernel detects that a thread performs many remote accesses, which shouldn't be the case here, at least at the beginning of the program.

Brice


On 28/01/2019 at 10:35, Biddiscombe, John A. wrote:
> Brice
>
> I might have been using the wrong params to hwloc_get_area_memlocation in my
> original version, but I bypassed it and have been calling
>
> int get_numa_domain(void *page)
> {
>     HPX_ASSERT( (std::size_t(page) & 4095) ==0 );
>
>     void *pages[1] = { page };
>     int status[1] = { -1 };
>     if (syscall(__NR_move_pages, 0, 1, pages, nullptr, status, 0) == 0) {
>         if (status[0]>=0 && status[0]<=HPX_HAVE_MAX_NUMA_DOMAIN_COUNT) {
>             return status[0];
>         }
>         return -1;
>     }
>     throw std::runtime_error("Failed to get numa node for page");
> }
>
> this function instead. Just testing one page address at a time. I still see
> this kind of pattern
> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
> when I should see
> 01010101010101010101010101010101010101010101010101010101010101010101010101010101
> 10101010101010101010101010101010101010101010101010101010101010101010101010101010
> 01010101010101010101010101010101010101010101010101010101010101010101010101010101
> 10101010101010101010101010101010101010101010101010101010101010101010101010101010
> 01010101010101010101010101010101010101010101010101010101010101010101010101010101
> 10101010101010101010101010101010101010101010101010101010101010101010101010101010
> 01010101010101010101010101010101010101010101010101010101010101010101010101010101
> 10101010101010101010101010101010101010101010101010101010101010101010101010101010
> 01010101010101010101010101010101010101010101010101010101010101010101010101010101
> 10101010101010101010101010101010101010101010101010101010101010101010101010101010
>
> I am deeply troubled by this and can't think of what to try next since I can
> see the memory contents hold the correct CPU ID of the thread that touched
> the memory, so either the syscall is wrong, or the kernel is doing something
> else. I welcome any suggestions on what might be wrong.
>
> Thanks for trying to help.
>
> JB
>
> -----Original Message-----
> From: Brice Goglin <brice.gog...@inria.fr>
> Sent: 26 January 2019 10:19
> To: Biddiscombe, John A. <biddi...@cscs.ch>
> Cc: Hardware locality user list <hwloc-users@lists.open-mpi.org>
> Subject: Re: [hwloc-users] unusual memory binding results
>
> On 25/01/2019 at 23:16, Biddiscombe, John A. wrote:
>>> move_pages() returning 0 with -14 in the status array? As opposed to
>>> move_pages() returning -1 with errno set to 14, which would definitely be a
>>> bug in hwloc.
>> I think it was move_pages returning zero with -14 in the status array, and
>> then hwloc returning 0 with an empty nodeset (which I then messed up by
>> calling get bitmap first and assuming 0 meant numa node zero and not
>> checking for an empty nodeset).
>>
>> I'm not sure why I get -EFAULT status rather than -ENOENT, but that's what
>> I'm seeing in the status field when I pass the pointer returned from the
>> alloc_membind call.
> The only reason I see for getting -EFAULT there would be that you pass the
> buffer to move_pages (what hwloc_get_area_memlocation() wants, a start
> pointer and length) instead of a pointer to an array of page addresses
> (move_pages wants a void** pointing to individual pages).
>
> Brice
>
>
_______________________________________________
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users