> Hello,
>
> Usually you would rather allocate and bind at the same time so that the
> memory doesn't need to be migrated when bound. However, if you do not touch
> the memory after allocation, pages are not actually physically allocated,
> hence there's nothing to migrate. Might work but keep this in mind.
I need all the data in one allocation, so that is why I opted to allocate
and then bind via the area function. The way I understand it is that by
using the memory binding policy HWLOC_MEMBIND_BIND with
hwloc_set_area_membind(), the pages will actually get allocated on the
specified cores. If that is not the case, I suppose the best solution would
be to just touch the allocated data with my threads.

> Can you print memory binding like below instead of printing only the first
> PU in the set returned by get_area_membind?
>
> char *s;
> hwloc_bitmap_asprintf(&s, set);
> /* s is now a C string of the bitmap, use it in your std::cout line */

I tried that, and now get_area_membind returns that all memory is bound to
0xffffffff,0xffffffff,,,0xffffffff,0xffffffff

> People often do the contrary. They bind threads, and then they have
> threads allocate/touch memory so that buffers are physically allocated near
> the related threads (automatic by default). It works well when the number
> of threads is known in advance. You place one thread per core, they never
> move. As long as memory is big enough to store the data nearby, everybody's
> happy. If the number of threads varies at runtime, and/or if they need to
> move, things become more difficult.
>
> Your approach is also correct. In the end, it's rather a question of
> whether your code is data-centric or compute-centric, and whether
> imbalances may require moving things during the execution. Moving threads
> is usually cheaper. But oversubscribing cores with multiple threads is
> usually a bad idea; that's likely why people place one thread per core
> first.

My code is rather data-bound, and my main motivation for binding the
threads is that I did not want hyperthreading on cores and I want to keep
all threads that operate on the same data in one L3 cache.

> And send the output of lstopo on your machine so that I can understand it.

The machine has two sockets, and each socket has 64 cores.
Cores 0-7 share one L3 cache, as do cores 8-15, and so on. The output of
lstopo is quite large, but if my description does not suffice I can send it.

Thanks for your time,
Mike

On Tue, 1 Mar 2022 at 15:42, Brice Goglin <brice.gog...@inria.fr> wrote:

> On 01/03/2022 at 15:17, Mike wrote:
>
> > Dear list,
> >
> > I have a program that utilizes Open MPI + multithreading and I want the
> > freedom to decide on which hardware cores my threads should run. By using
> > hwloc_set_cpubind() that already works, so now I also want to bind memory
> > to the hardware cores. But I just can't get it to work.
> >
> > Basically, I wrote the memory binding into my allocator, so the memory
> > will be allocated and then bound.
>
> Hello,
>
> Usually you would rather allocate and bind at the same time so that the
> memory doesn't need to be migrated when bound. However, if you do not touch
> the memory after allocation, pages are not actually physically allocated,
> hence there's nothing to migrate. Might work but keep this in mind.
>
> > I use hwloc 2.4.1, run the code on a Linux system, and I did check with
> > "hwloc-info --support" whether hwloc_set_area_membind() and
> > hwloc_get_area_membind() are supported, and they are.
> >
> > Here is a snippet of my code, which runs through without any error. But
> > hwloc_get_area_membind() always returns that all memory is bound to PU
> > 0, when I think it should be bound to different PUs. Am I missing
> > something?
>
> Can you print memory binding like below instead of printing only the first
> PU in the set returned by get_area_membind?
>
> char *s;
> hwloc_bitmap_asprintf(&s, set);
> /* s is now a C string of the bitmap, use it in your std::cout line */
>
> And send the output of lstopo on your machine so that I can understand it.
>
> Or you could print the smallest object that contains the binding by
> calling hwloc_get_obj_covering_cpuset(topology, set). It returns an object
> whose type may be printed as a C string with
> hwloc_obj_type_string(obj->type).
> You may also do the same before set_area_membind() if you want to verify
> that you're binding where you really want.
>
> > T* allocate(size_t n, hwloc_topology_t topology, int rank)
> > {
> >   // allocate memory
> >   T* t = (T*)hwloc_alloc(topology, sizeof(T) * n);
> >   // elements per thread
> >   size_t ept = 1024;
> >   hwloc_bitmap_t set;
> >   size_t offset = 0;
> >   size_t threadcount = 4;
> >
> >   set = hwloc_bitmap_alloc();
> >   if(!set) {
> >     fprintf(stderr, "failed to allocate a bitmap\n");
> >   }
> >   // bind memory for every thread
> >   for(size_t i = 0; i < threadcount; i++)
> >   {
> >     // logical index of where to bind the memory
> >     auto logid = (i + rank * threadcount) * 2;
> >     auto logobj = hwloc_get_obj_by_type(topology, HWLOC_OBJ_PU, logid);
> >     hwloc_bitmap_only(set, logobj->os_index);
> >     // set the memory binding
> >     // I use HWLOC_MEMBIND_BIND as policy so I do not have to touch the
> >     // memory first to allocate it
> >     auto err = hwloc_set_area_membind(topology, t + offset,
> >                                       sizeof(T) * ept, set,
> >                                       HWLOC_MEMBIND_BIND,
> >                                       HWLOC_MEMBIND_STRICT | HWLOC_MEMBIND_THREAD);
> >     if(err < 0)
> >       std::cout << "Error: memory binding failed" << std::endl;
> >
> >     // print out data of first set
> >     auto ii = hwloc_bitmap_first(set);
> >     auto obj = hwloc_get_pu_obj_by_os_index(topology, ii);
> >     std::cout << "Rank=" << rank << " Tid=" << i
> >               << " on PU logical index=" << obj->logical_index
> >               << " (OS/physical index " << ii << ")" << std::endl;
> >
> >     // checking if memory is bound to the correct hardware core
> >     hwloc_membind_policy_t policy;
> >     hwloc_bitmap_zero(set);
> >     err = hwloc_get_area_membind(topology, t + offset,
> >                                  sizeof(T) * ept, set, &policy,
> >                                  HWLOC_MEMBIND_STRICT | HWLOC_MEMBIND_THREAD);
> >     if(err < 0)
> >       std::cout << "Error: getting memory binding failed" << std::endl;
> >
> >     // print out data of hwloc_get_area_membind()
> >     ii = hwloc_bitmap_first(set);
> >     obj = hwloc_get_pu_obj_by_os_index(topology, ii);
> >     std::cout << "Rank=" << rank << " Tid=" << i
> >               << " actually on PU logical index=" << obj->logical_index
> >               << " (OS/physical index " << ii << ")" << std::endl;
> >
> >     // increase memory offset
> >     offset += ept;
> >   }
> >   hwloc_bitmap_free(set);
> >   return t;
> > }
> >
> > Something that might be unrelated, but I still wanted to ask: from
> > chapter 6 of the documentation I gather that a sort of best practice for
> > binding threads and memory is to first allocate the memory, then bind the
> > memory, and finally do the CPU binding. Am I correct in assuming this?
>
> People often do the contrary. They bind threads, and then they have
> threads allocate/touch memory so that buffers are physically allocated near
> the related threads (automatic by default). It works well when the number
> of threads is known in advance. You place one thread per core, they never
> move. As long as memory is big enough to store the data nearby, everybody's
> happy. If the number of threads varies at runtime, and/or if they need to
> move, things become more difficult.
>
> Your approach is also correct. In the end, it's rather a question of
> whether your code is data-centric or compute-centric, and whether
> imbalances may require moving things during the execution. Moving threads
> is usually cheaper. But oversubscribing cores with multiple threads is
> usually a bad idea; that's likely why people place one thread per core
> first.
>
> If there's a need to improve the doc about this, please let me know.
>
> Brice
>
> _______________________________________________
> hwloc-users mailing list
> hwloc-users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/hwloc-users
_______________________________________________
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users