On 02/03/2022 at 11:38, Mike wrote:

    If you print the set that is built before calling
    set_area_membind, you should only see 4 bits in there, right?
    (since threadcount=4 in your code)

    I'd say 0xf for rank0, 0xf0 for rank1, etc.

    set_area_membind() will translate that into a single NUMA node,
    before asking the kernel to bind. Later, get_area_membind translates
    the single NUMA node back into a set that contains all PUs of the
    NUMA node.

    That said, I am not sure I understand what threadcount means in
    your code. Are you calling the allocate function multiple times
    with many different ranks? (MPI ranks?)

The allocator function is called once for every MPI rank, and threadcount is the number of threads that run on one MPI rank. I build the set so that only 1 bit is set before calling set_area_membind, so that the memory can only be bound to the specified hardware core. Basically, I call set_area_membind once for every thread on an MPI rank. After the allocation, I call hwloc_set_cpubind with a set that again has only 1 bit set, so that (if all works properly) an area of memory and a software thread are both bound to one specific hardware core.
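To make the masks discussed above concrete, here is a minimal sketch in plain C (no hwloc calls) that models them with 64-bit integers. It assumes PUs are numbered contiguously and that every NUMA node owns 16 consecutive PUs; the constant PUS_PER_NODE and the function names rank_mask, thread_mask and membind_round_trip are hypothetical and only serve the illustration. The round-trip function mimics the behaviour described in the quoted text: the requested set is reduced to the NUMA node(s) covering it, and reading it back yields all PUs of those nodes.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION (not from the thread): 16 consecutive PUs per NUMA node. */
#define PUS_PER_NODE 16u

/* Per-rank mask with threadcount consecutive bits: 0xf, 0xf0, ... */
uint64_t rank_mask(unsigned rank, unsigned threadcount)
{
    assert(threadcount > 0 && threadcount < 64);
    assert(rank * threadcount + threadcount <= 64);
    uint64_t block = (UINT64_C(1) << threadcount) - 1;
    return block << (rank * threadcount);
}

/* Single-bit mask for one thread of one rank. */
uint64_t thread_mask(unsigned rank, unsigned threadcount, unsigned thread)
{
    unsigned pu = rank * threadcount + thread;
    assert(thread < threadcount && pu < 64);
    return UINT64_C(1) << pu;
}

/* Model of set_area_membind followed by get_area_membind: every NUMA
 * node that intersects the requested PUs comes back with all its PUs. */
uint64_t membind_round_trip(uint64_t requested)
{
    uint64_t result = 0;
    for (unsigned node = 0; node < 64 / PUS_PER_NODE; node++) {
        uint64_t node_mask =
            ((UINT64_C(1) << PUS_PER_NODE) - 1) << (node * PUS_PER_NODE);
        if (requested & node_mask)
            result |= node_mask;
    }
    return result;
}
```

Under these assumptions, a single-bit request such as thread_mask(1, 4, 2) reads back as the full mask of its NUMA node, so a wider mask after get_area_membind does not by itself indicate a failed binding.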

Can you display both masks, the one before set_area_membind and the one after get_area_membind, and send the entire output of all processes and threads? If you can prefix each line with the PID, it'd help a lot :)
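One possible way to produce such PID-prefixed lines is sketched below in plain C. The helper format_mask_line and its line layout are hypothetical; the sketch takes the mask as an already formatted string, which in an hwloc program would typically come from hwloc_bitmap_asprintf (not exercised here).

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: writes "[PID] thread T STAGE: MASK" into buf.
 * Returns the snprintf result (length the full line needs). */
int format_mask_line(char *buf, size_t len, int thread,
                     const char *stage, const char *mask_str)
{
    assert(buf != NULL && stage != NULL && mask_str != NULL);
    return snprintf(buf, len, "[%ld] thread %d %s: %s",
                    (long)getpid(), thread, stage, mask_str);
}
```

Calling it once with a stage label such as "before set_area_membind" and once with "after get_area_membind", then printing each buffer with puts, gives one line per thread and stage that can be grouped by process afterwards.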



hwloc-users mailing list
