
> Usually you would rather allocate and bind at the same time so that the
> memory doesn't need to be migrated when bound. However, if you do not touch
> the memory after allocation, pages are not actually physically allocated,
> hence there's nothing to migrate. Might work but keep this in mind.

I need all the data in one allocation, so that is why I opted to allocate
and then bind via the area function. The way I understand it is that by
using the memory binding policy HWLOC_MEMBIND_BIND with
hwloc_set_area_membind() the pages will actually get allocated on the
specified cores. If that is not the case I suppose the best solution would
be to just touch the allocated data with my threads.
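
For reference, the "touch the data with my threads" fallback could look roughly like the sketch below. It relies on the default first-touch behaviour mentioned in the quoted reply (pages are physically allocated near the thread that first writes them). The helper name first_touch is made up for illustration, and the sketch assumes each worker would already be bound to its core (for example with hwloc_set_cpubind()), which is not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: every worker writes to its own contiguous slice of an
// untouched buffer, so the pages of that slice are first touched by that
// worker. Assumption: in a real program each worker is bound to its core
// before the first write; no binding is performed in this sketch.
void first_touch(unsigned char* data, std::size_t total, std::size_t nthreads)
{
    if (nthreads == 0) return;

    // Slice size in bytes, rounded up so that all bytes are covered.
    const std::size_t chunk = (total + nthreads - 1) / nthreads;

    std::vector<std::thread> workers;
    workers.reserve(nthreads);
    for (std::size_t tid = 0; tid < nthreads; ++tid) {
        workers.emplace_back([=] {
            const std::size_t begin = std::min(tid * chunk, total);
            const std::size_t end = std::min(begin + chunk, total);
            // Writing the slice is what triggers physical allocation.
            for (std::size_t k = begin; k < end; ++k)
                data[k] = static_cast<unsigned char>(tid);
        });
    }
    for (auto& w : workers) w.join();
}
```

Physical placement happens per page, so in practice the slice size should be a multiple of the page size; otherwise a page shared by two slices ends up wherever its first writer ran.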

> Can you print memory binding like below instead of printing only the first
> PU in the set returned by get_area_membind?
>     char *s;
>     hwloc_bitmap_asprintf(&s, set);
>     /* s is now a C string of the bitmap, use it in your std::cout line */

I tried that and now get_area_membind returns that all memory is bound to

> People often do the contrary. They bind threads, and then they have
> threads allocate/touch memory so that buffers are physically allocated near
> the related threads (automatic by default). It works well when the number
> of threads is known in advance. You place one thread per core, they never
> move. As long as memory is big enough to store the data nearby, everybody's
> happy. If the number of threads varies at runtime, and/or if they need to
> move, things become more difficult.
> Your approach is also correct. In the end, it's rather a question of
> whether your code is data-centric or compute-centric, and whether
> imbalances may require moving things during the execution. Moving threads
> is usually cheaper. But oversubscribing cores with multiple threads is
> usually a bad idea, that's likely why people place one thread per core
> first.
My code is rather data-bound. My main motivation for binding the threads
is that I do not want hyperthreading on the cores and that I want to
keep all threads that operate on the same data within one L3 cache.
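
To make the placement concrete: the allocator quoted further down picks the logical PU index as (i + rank * threadcount) * 2. Below is a minimal sketch of just that arithmetic. It assumes two PUs (hyperthreads) per core that are adjacent in hwloc's logical ordering, so that stepping by 2 selects one PU per core. The function name pu_logical_index is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper mirroring the expression from the quoted allocator:
//   logid = (i + rank * threadcount) * 2
// Assumption: each core exposes two PUs that are adjacent in logical order,
// so every second logical PU belongs to a different core.
constexpr std::size_t pu_logical_index(std::size_t rank,
                                       std::size_t thread,
                                       std::size_t threads_per_rank)
{
    return (thread + rank * threads_per_rank) * 2;
}
```

With 4 threads per rank, rank 0 gets logical PUs 0, 2, 4, 6 and rank 1 gets 8, 10, 12, 14.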

> And send the output of lstopo on your machine so that I can understand it.
The machine has two sockets with 64 cores each. Cores 0-7 share one L3
cache, as do cores 8-15, and so on.
The output of lstopo is quite large, but if my description does not suffice
I can send it.
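
In case it helps, here is the layout as I described it, written out as a small sketch. It only encodes my description (64 cores per socket, 8 consecutive cores per L3) and additionally assumes that core numbers run contiguously across both sockets, which only the lstopo output could confirm. The names CorePlace, place_of_core and same_l3 are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative only: where a core sits in the machine described above.
// Assumptions (not confirmed by lstopo output): cores are numbered
// contiguously, 64 per socket, and every 8 consecutive cores share an L3.
struct CorePlace {
    std::size_t socket;    // 0 or 1 on the described machine
    std::size_t l3_group;  // machine-wide index of the shared L3 cache
};

constexpr std::size_t kCoresPerSocket = 64;
constexpr std::size_t kCoresPerL3 = 8;

constexpr CorePlace place_of_core(std::size_t core)
{
    return CorePlace{core / kCoresPerSocket, core / kCoresPerL3};
}

// True if two cores share an L3 cache under the assumptions above.
constexpr bool same_l3(std::size_t a, std::size_t b)
{
    return place_of_core(a).l3_group == place_of_core(b).l3_group;
}
```

Under these assumptions cores 0 and 7 share an L3 while cores 7 and 8 do not, and core 64 is the first core of the second socket.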

Thanks for your time


On Tue, 1 March 2022 at 15:42, Brice Goglin <

> On 01/03/2022 at 15:17, Mike wrote:
> Dear list,
> I have a program that utilizes Openmpi + multithreading and I want the
> freedom to decide on which hardware cores my threads should run. By using
> hwloc_set_cpubind() that already works, so now I also want to bind memory
> to the hardware cores. But I just can't get it to work.
> Basically, I wrote the memory binding into my allocator, so the memory
> will be allocated and then bound.
> Hello
> Usually you would rather allocate and bind at the same time so that the
> memory doesn't need to be migrated when bound. However, if you do not touch
> the memory after allocation, pages are not actually physically allocated,
> hence there's nothing to migrate. Might work but keep this in mind.
> I use hwloc 2.4.1, run the code on a Linux system and I did check with
> “hwloc-info --support” if hwloc_set_area_membind() and
> hwloc_get_area_membind() are supported and they are.
> Here is a snippet of my code, which runs through without any error. But
> the hwloc_get_area_membind() always returns that all memory is bound to PU
> 0, when I think it should be bound to different PUs. Am I missing something?
> Can you print memory binding like below instead of printing only the first
> PU in the set returned by get_area_membind?
> char *s;
> hwloc_bitmap_asprintf(&s, set);
> /* s is now a C string of the bitmap, use it in your std::cout line */
> And send the output of lstopo on your machine so that I can understand it.
> Or you could print the smallest object that contains the binding by
> calling hwloc_get_obj_covering_cpuset(topology, set). It returns an object
> whose type may be printed as a C-string with
> hwloc_obj_type_string(obj->type).
> You may also do the same before set_area_membind() if you want to verify
> that you're binding where you really want.
> T* allocate(size_t n, hwloc_topology_t topology, int rank)
> {
>   // allocate memory
>   T* t = (T*)hwloc_alloc(topology, sizeof(T) * n);
>   // elements per thread
>   size_t ept = 1024;
>   hwloc_bitmap_t set;
>   size_t offset = 0;
>   size_t threadcount= 4;
>   set = hwloc_bitmap_alloc();
>   if(!set) {
>     fprintf(stderr, "failed to allocate a bitmap\n");
>   }
>   // bind memory to every thread
>   for(size_t i = 0;i < threadcount; i++)
>   {
>     // logical index of where to bind the memory
>     auto logid = (i +rank * threadcount) * 2;
>     auto logobj = hwloc_get_obj_by_type(topology, HWLOC_OBJ_PU, logid);
>     hwloc_bitmap_only(set, logobj->os_index);
>     //set the memory binding
>     // I use HWLOC_MEMBIND_BIND as policy so I do not have to touch the
>     // memory first to allocate it
>     auto err = hwloc_set_area_membind(topology, t + offset, sizeof(T)
>     if(err < 0)
>       std::cout << "Error: memory binding failed" <<std::endl;
>     // print out data of first set
>     auto ii =hwloc_bitmap_first(set);
>     auto obj = hwloc_get_pu_obj_by_os_index(topology, ii);
>     std::cout << "Rank=" << rank << " Tid=" << i
>               << " on PU logical index=" << obj->logical_index
>               << " (OS/physical index " << ii << ")" << std::endl;
>   // checking if memory is bound to the correct hardware core
>     hwloc_membind_policy_t policy;
>     hwloc_bitmap_zero(set);
>     err = hwloc_get_area_membind(topology, t + offset, sizeof(T) *
>     if(err < 0)
>       std::cout << "Error: getting memory binding failed"<< std::endl;
>     // print out data of hwloc_get_area_membind()
>     ii= hwloc_bitmap_first(set);
>     obj = hwloc_get_pu_obj_by_os_index(topology, ii);
>     std::cout << "Rank=" << rank << " Tid=" << i
>               << " actually on PU logical index=" << obj->logical_index
>               << " (OS/physical index " << ii << ")" << std::endl;
>     // increase memory offset
>     offset += ept;
>   }
>   hwloc_bitmap_free(set);
>   return t;
> }
> Something that might be unrelated, but I still wanted to ask: from chapter
> 6 of the documentation I gather that a sort of best practice for binding
> threads and memory is to first allocate the memory, then bind the memory,
> and finally do the CPU binding. Am I correct in assuming this?
> People often do the contrary. They bind threads, and then they have
> threads allocate/touch memory so that buffers are physically allocated near
> the related threads (automatic by default). It works well when the number
> of threads is known in advance. You place one thread per core, they never
> move. As long as memory is big enough to store the data nearby, everybody's
> happy. If the number of threads varies at runtime, and/or if they need to
> move, things become more difficult.
> Your approach is also correct. In the end, it's rather a question of
> whether your code is data-centric or compute-centric, and whether
> imbalances may require moving things during the execution. Moving threads
> is usually cheaper. But oversubscribing cores with multiple threads is
> usually a bad idea, that's likely why people place one thread per core
> first.
> If there's a need to improve the doc about this, please let me know.
> Brice
> _______________________________________________
> hwloc-users mailing list
> hwloc-users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/hwloc-users