On 01/03/2022 at 15:17, Mike wrote:

Dear list,

I have a program that uses Open MPI plus multithreading, and I want the freedom to decide which hardware cores my threads run on. That already works with hwloc_set_cpubind(), so now I also want to bind memory to those hardware cores, but I just can't get it to work.

Basically, I wrote the memory binding into my allocator, so the memory will be allocated and then bound.


Hello

Usually you would rather allocate and bind at the same time, so that the memory doesn't need to be migrated when bound. However, if you do not touch the memory after allocation, pages are not actually physically allocated, hence there's nothing to migrate. It might work, but keep this in mind.
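If you prefer the allocate-and-bind-in-one-call route, hwloc_alloc_membind() does that. A minimal sketch (len and set are just placeholders for your buffer size and target location):

/* allocate 'len' bytes already bound to the location described by 'set',
 * so nothing needs to be migrated afterwards */
void *buf = hwloc_alloc_membind(topology, len, set,
                                HWLOC_MEMBIND_BIND, HWLOC_MEMBIND_STRICT);
if (!buf) {
  /* with STRICT, allocation fails if the binding cannot be enforced */
}
/* ... use buf ... */
hwloc_free(topology, buf, len);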


I use hwloc 2.4.1 and run the code on a Linux system, and I checked with “hwloc-info --support” that hwloc_set_area_membind() and hwloc_get_area_membind() are supported; they are.

Here is a snippet of my code, which runs through without any error. But hwloc_get_area_membind() always reports that all memory is bound to PU 0, whereas I think it should be bound to different PUs. Am I missing something?


Can you print the memory binding as shown below, instead of printing only the first PU in the set returned by get_area_membind()?

char *s;
hwloc_bitmap_asprintf(&s, set);
/* s is now a C string describing the whole bitmap, use it in your std::cout line */
std::cout << "membind set = " << s << std::endl;
free(s); /* the string is allocated by hwloc, free it when done */

And send the output of lstopo on your machine so that I can understand it.

Or you could print the smallest object that contains the binding by calling hwloc_get_obj_covering_cpuset(topology, set). It returns an object whose type may be printed as a C-string with hwloc_obj_type_string(obj->type).
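For instance, something like this (just a sketch, the variable name is illustrative):

hwloc_obj_t cover = hwloc_get_obj_covering_cpuset(topology, set);
if (cover)
  std::cout << "binding covered by " << hwloc_obj_type_string(cover->type)
            << " L#" << cover->logical_index << std::endl;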

You may also do the same before set_area_membind() if you want to verify that you're binding where you really want.



T* allocate(size_t n, hwloc_topology_t topology, int rank)
{
  // allocate memory
  T* t = (T*)hwloc_alloc(topology, sizeof(T) * n);
  // elements per thread
  size_t ept = 1024;
  hwloc_bitmap_t set;
  size_t offset = 0;
  size_t threadcount = 4;

  set = hwloc_bitmap_alloc();
  if(!set) {
    fprintf(stderr, "failed to allocate a bitmap\n");
  }
  // bind memory to every thread
  for(size_t i = 0; i < threadcount; i++)
  {
    // logical index of where to bind the memory
    auto logid = (i + rank * threadcount) * 2;
    auto logobj = hwloc_get_obj_by_type(topology, HWLOC_OBJ_PU, logid);
    hwloc_bitmap_only(set, logobj->os_index);
    // set the memory binding
    // I use HWLOC_MEMBIND_BIND as policy so I do not have to touch the memory first to allocate it
    auto err = hwloc_set_area_membind(topology, t + offset, sizeof(T) * ept, set,
                                      HWLOC_MEMBIND_BIND, HWLOC_MEMBIND_STRICT | HWLOC_MEMBIND_THREAD);
    if(err < 0)
      std::cout << "Error: memory binding failed" << std::endl;

    // print out data of first set
    auto ii = hwloc_bitmap_first(set);
    auto obj = hwloc_get_pu_obj_by_os_index(topology, ii);
    std::cout << "Rank=" << rank << " Tid=" << i << " on PU logical index=" << obj->logical_index
              << " (OS/physical index " << ii << ")" << std::endl;

    // checking if memory is bound to the correct hardware core
    hwloc_membind_policy_t policy;
    hwloc_bitmap_zero(set);
    err = hwloc_get_area_membind(topology, t + offset, sizeof(T) * ept, set, &policy,
                                 HWLOC_MEMBIND_STRICT | HWLOC_MEMBIND_THREAD);
    if(err < 0)
      std::cout << "Error: getting memory binding failed" << std::endl;

    // print out data of hwloc_get_area_membind()
    ii = hwloc_bitmap_first(set);
    obj = hwloc_get_pu_obj_by_os_index(topology, ii);
    std::cout << "Rank=" << rank << " Tid=" << i << " actually on PU logical index=" << obj->logical_index
              << " (OS/physical index " << ii << ")" << std::endl;

    // increase memory offset
    offset += ept;
  }
  hwloc_bitmap_free(set);
  return t;
}

Something that might be unrelated, but I still wanted to ask: from chapter 6 of the documentation I gather that a sort of best practice for binding threads and memory is to first allocate the memory, then bind the memory, and finally do the CPU binding. Am I correct in assuming this?



People often do the contrary: they bind threads first, and then have the threads allocate/touch the memory, so that buffers are physically allocated near the related threads (this is automatic by default, via first-touch on Linux). It works well when the number of threads is known in advance: you place one thread per core and they never move. As long as local memory is big enough to store the data nearby, everybody's happy. If the number of threads varies at runtime, and/or if threads need to move, things become more difficult.
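For what it's worth, a minimal sketch of that usual pattern inside one thread could look like this (names like core and tid are illustrative, not taken from your code):

/* bind the calling thread to a single core, then touch the buffer so that
 * first-touch physically allocates the pages near that core */
hwloc_obj_t core = hwloc_get_obj_by_type(topology, HWLOC_OBJ_CORE, tid);
hwloc_cpuset_t cset = hwloc_bitmap_dup(core->cpuset);
hwloc_bitmap_singlify(cset); /* keep a single PU so the thread cannot move */
hwloc_set_cpubind(topology, cset, HWLOC_CPUBIND_THREAD);
T* buf = (T*)hwloc_alloc(topology, sizeof(T) * n);
memset(buf, 0, sizeof(T) * n); /* first touch happens here */
hwloc_bitmap_free(cset);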

Your approach is also correct. In the end, it's rather a question of whether your code is data-centric or compute-centric, and whether imbalances may require moving things during execution. Moving threads is usually cheaper. But oversubscribing cores with multiple threads is usually a bad idea, which is likely why people place one thread per core first.

If there's a need to improve the doc about this, please let me know.

Brice

