On 28/01/2019 at 11:28, Biddiscombe, John A. wrote:
> If I disable thread 0 and allow thread 1 then I get this pattern on 1 machine
> (clearly wrong)
> 11111111111111111111111111111111111111111111111111111111111111111111111111111111
> 11111111111111111111111111111111111111111111111111111111111111111111111111111111
> 11111111111111111111111111111111111111111111111111111111111111111111111111111111
> 11111111111111111111111111111111111111111111111111111111111111111111111111111111
> 11111111111111111111111111111111111111111111111111111111111111111111111111111111

Can you print the pattern before and after thread 1 touched its pages, or
even in the middle? It looks like somebody is touching too many pages here.

Brice

> and on another I get
> -1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1
> 1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-
> -1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1
> 1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-
> -1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1
> 1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-
> which is correct because the '-' is a negative status. I will run again and
> see if it's -14 or -2
>
> JB
>
>
> -----Original Message-----
> From: Brice Goglin <brice.gog...@inria.fr>
> Sent: 28 January 2019 10:56
> To: Biddiscombe, John A. <biddi...@cscs.ch>
> Cc: Hardware locality user list <hwloc-users@lists.open-mpi.org>
> Subject: Re: [hwloc-users] unusual memory binding results
>
> Can you try again disabling the touching in one thread to check whether the
> other thread only touched its own pages? (others' status should be
> -2 (ENOENT))
>
> Recent kernels have ways to migrate memory at runtime
> (CONFIG_NUMA_BALANCING) but this should only occur when it detects that some
> thread does a lot of remote access, which shouldn't be the case here, at
> least at the beginning of the program.
>
> Brice
>
>
>
> On 28/01/2019 at 10:35, Biddiscombe, John A. wrote:
>> Brice
>>
>> I might have been using the wrong params to hwloc_get_area_memlocation
>> in my original version, but I bypassed it and have been calling
>>
>> int get_numa_domain(void *page)
>> {
>>     HPX_ASSERT( (std::size_t(page) & 4095) ==0 );
>>
>>     void *pages[1] = { page };
>>     int status[1] = { -1 };
>>     if (syscall(__NR_move_pages, 0, 1, pages, nullptr, status, 0) == 0) {
>>         if (status[0]>=0 && status[0]<=HPX_HAVE_MAX_NUMA_DOMAIN_COUNT) {
>>             return status[0];
>>         }
>>         return -1;
>>     }
>>     throw std::runtime_error("Failed to get numa node for page");
>> }
>>
>> this function instead. Just testing one page address at a time. I
>> still see this kind of pattern
>> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
>> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
>> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
>> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
>> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
>> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
>> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
>> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
>> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
>> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
>> 00101101010111101010100101010101101001101101010111010111011101010100000101010000
>> when I should see
>> 01010101010101010101010101010101010101010101010101010101010101010101010101010101
>> 10101010101010101010101010101010101010101010101010101010101010101010101010101010
>> 01010101010101010101010101010101010101010101010101010101010101010101010101010101
>> 10101010101010101010101010101010101010101010101010101010101010101010101010101010
>> 01010101010101010101010101010101010101010101010101010101010101010101010101010101
>> 10101010101010101010101010101010101010101010101010101010101010101010101010101010
>> 01010101010101010101010101010101010101010101010101010101010101010101010101010101
>> 10101010101010101010101010101010101010101010101010101010101010101010101010101010
>> 01010101010101010101010101010101010101010101010101010101010101010101010101010101
>> 10101010101010101010101010101010101010101010101010101010101010101010101010101010
>>
>> I am deeply troubled by this and can't think of what to try next since I can
>> see the memory contents hold the correct CPU ID of the thread that touched
>> the memory, so either the syscall is wrong, or the kernel is doing something
>> else. I welcome any suggestions on what might be wrong.
>>
>> Thanks for trying to help.
>>
>> JB
>>
>> -----Original Message-----
>> From: Brice Goglin <brice.gog...@inria.fr>
>> Sent: 26 January 2019 10:19
>> To: Biddiscombe, John A. <biddi...@cscs.ch>
>> Cc: Hardware locality user list <hwloc-users@lists.open-mpi.org>
>> Subject: Re: [hwloc-users] unusual memory binding results
>>
>> On 25/01/2019 at 23:16, Biddiscombe, John A. wrote:
>>>> move_pages() returning 0 with -14 in the status array? As opposed to
>>>> move_pages() returning -1 with errno set to 14, which would definitely be
>>>> a bug in hwloc.
>>> I think it was move_pages returning zero with -14 in the status array, and
>>> then hwloc returning 0 with an empty nodeset (which I then messed up by
>>> calling get bitmap first and assuming 0 meant numa node zero and not
>>> checking for an empty nodeset).
>>>
>>> I'm not sure why I get -EFAULT status rather than -ENOENT, but that's what
>>> I'm seeing in the status field when I pass the pointer returned from the
>>> alloc_membind call.
>> The only reason I see for getting -EFAULT there would be that you pass the
>> buffer to move_pages (what hwloc_get_area_memlocation() wants, a start
>> pointer and length) instead of a pointer to an array of page addresses
>> (move_pages wants a void** pointing to individual pages).
>>
>> Brice
>>
>>
_______________________________________________
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users
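
The distinction discussed in the thread (move_pages wants an array with one
address per page, not a start pointer and a length) can be sketched roughly as
follows. This is only an illustration, assuming Linux and a page size supplied
by the caller; the helper names page_addresses, query_page_nodes and
render_pattern are hypothetical and are not part of hwloc or HPX. With the
nodes argument passed as a null pointer, move_pages moves nothing and only
reports, per page, either the NUMA node or a negative errno such as
-2 (ENOENT) or -14 (EFAULT).

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <string>
#include <system_error>
#include <vector>

#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical helper: build the void*[] that move_pages expects,
// one entry per page of the range [base, base + bytes).
std::vector<void*> page_addresses(void* base, std::size_t bytes,
                                  std::size_t page_size)
{
    // The start of the range is assumed to be page aligned.
    assert(reinterpret_cast<std::uintptr_t>(base) % page_size == 0);
    std::vector<void*> pages;
    char* p = static_cast<char*>(base);
    for (std::size_t off = 0; off < bytes; off += page_size)
        pages.push_back(p + off);
    return pages;
}

// Hypothetical helper: query-only call. Each status entry ends up as the
// NUMA node of the page (>= 0) or a negative errno for that page.
std::vector<int> query_page_nodes(const std::vector<void*>& pages)
{
    std::vector<int> status(pages.size(), -1);
    if (pages.empty())
        return status;
    long rc = syscall(SYS_move_pages, 0L, pages.size(), pages.data(),
                      nullptr, status.data(), 0L);
    if (rc != 0)  // the call itself failed, as opposed to a per-page status
        throw std::system_error(errno, std::generic_category(), "move_pages");
    return status;
}

// Hypothetical helper: one character per page, in the style of the patterns
// in this thread. A digit is the node, '-' is any negative status, and '+'
// stands for a node number of 10 or more.
std::string render_pattern(const std::vector<int>& status)
{
    std::string out;
    for (int s : status) {
        if (s < 0)
            out += '-';
        else if (s < 10)
            out += static_cast<char>('0' + s);
        else
            out += '+';
    }
    return out;
}
```

For a buffer returned by the allocator, something like
render_pattern(query_page_nodes(page_addresses(buf, bytes, page_size))) gives
one character per page. Calling it before and after each thread touches its
pages is one way to see which thread populated which page.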