On 07/07/2017 20:38, David Solt wrote:
> We are using the hwloc api to identify GPUs on our cluster. While we
> are able to "discover" the GPUs, other information about them does not
> appear to be getting filled in. See below for example:

> (gdb) p *obj->attr
> $20 = {
>   cache = {
>     size = 1,
>     depth = 0,
>     linesize = 0,
>     associativity = 0,
>     type = HWLOC_OBJ_CACHE_UNIFIED
>   },
>   group = {
>     depth = 1
>   },
>   pcidev = {
>     domain = 1,
>     bus = 0 '\000',
>     dev = 0 '\000',
>     func = 0 '\000',
>     class_id = 0,
>     vendor_id = 0,
>     device_id = 0,
>     subvendor_id = 0,
>     subdevice_id = 0,
>     revision = 0 '\000',
>     linkspeed = 0
>   },
>   bridge = {
>     upstream = {
>       pci = {
>         domain = 1,
>         bus = 0 '\000',
>         dev = 0 '\000',
>         func = 0 '\000',
>         class_id = 0,
>         vendor_id = 0,
>         device_id = 0,
>         subvendor_id = 0,
>         subdevice_id = 0,
>         revision = 0 '\000',
>         linkspeed = 0
>       }
>     },
>     upstream_type = HWLOC_OBJ_BRIDGE_HOST,
>     downstream = {
>       pci = {
>         domain = 0,
>         secondary_bus = 0 '\000',
>         subordinate_bus = 0 '\000'
>       }
>     },
>     downstream_type = HWLOC_OBJ_BRIDGE_HOST,
>     depth = 0
>   },
>   osdev = {
>     type = HWLOC_OBJ_OSDEV_GPU
>   }
> }

> The name is generally just "cardX". 


Hello

attr is a union, so only the "osdev" portion above matters. An "osdev"
can be a lot of different things, so instead of putting all possible
attributes in a struct, we use info key/value pairs (hwloc_obj->infos).
But those "cardX" devices are the GPUs reported by the Linux kernel DRM
subsystem; we don't have much information about them anyway.
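
For instance, here is a minimal sketch of reading those key/value pairs
(assuming an already-loaded topology and an osdev object "obj"; the
"GPUModel" key only exists on devices whose backend provides it):

  #include <stdio.h>
  #include <hwloc.h>

  /* Sketch: print the info key/value pairs of an OS device object.
   * hwloc_obj_get_info_by_name() looks up a single key;
   * obj->infos / obj->infos_count expose the whole list. */
  static void print_osdev_infos(hwloc_obj_t obj)
  {
    unsigned i;
    const char *model = hwloc_obj_get_info_by_name(obj, "GPUModel");
    if (model)
      printf("%s: GPUModel=%s\n", obj->name, model);
    for (i = 0; i < obj->infos_count; i++)
      printf("  %s=%s\n", obj->infos[i].name, obj->infos[i].value);
  }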

If you're looking at a Power machine, I am going to assume you care
about CUDA devices. Those are "osdev" objects of type "COPROC" instead
of "GPU", and they have many more attributes. Here's what I see on one
of our machines:

  PCI 10de:1094 (P#540672 busid=0000:84:00.0 class=0302(3D) PCIVendor="NVIDIA Corporation" PCIDevice="Tesla M2075 Dual-Slot Computing Processor Module") "NVIDIA Corporation Tesla M2075 Dual-Slot Computing Processor Module"
    Co-Processor L#5 (CoProcType=CUDA Backend=CUDA GPUVendor="NVIDIA Corporation" GPUModel="Tesla M2075" CUDAGlobalMemorySize=5428224 CUDAL2CacheSize=768 CUDAMultiProcessors=14 CUDACoresPerMP=32 CUDASharedMemorySizePerMP=48) "cuda2"
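
For example, this is roughly how you would enumerate those CUDA
co-processors with the helper API (a sketch, reusing the
"machine_topology" variable from your snippet below):

  /* Sketch: walk all OS devices and keep the CUDA co-processors.
   * Requires the topology to be loaded with
   * HWLOC_TOPOLOGY_FLAG_IO_DEVICES, as in your code below. */
  hwloc_obj_t osdev = NULL;
  while ((osdev = hwloc_get_next_osdev(machine_topology, osdev)) != NULL) {
    const char *model;
    if (osdev->attr->osdev.type != HWLOC_OBJ_OSDEV_COPROC)
      continue; /* skips "cardX" GPU osdevs, "nvmlX", ":0.0", ... */
    model = hwloc_obj_get_info_by_name(osdev, "GPUModel");
    printf("%s model=%s\n", osdev->name, model ? model : "(unknown)");
  }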


On recent kernels, you would see both a "cardX" GPU osdev and a "cudaX"
COPROC osdev in the PCI device. There can even be "nvmlX" and ":0.0"
osdevs if you have the nvml and nvctrl libraries. Those are basically
different ways to talk to the GPU (Linux kernel DRM, CUDA, etc.).
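
Note that the zeroed vendor_id/device_id in your gdb dump are expected:
you printed the "pcidev" member of the attr union on an osdev object,
where it is meaningless. The real IDs live on the enclosing PCI device
object, which you can reach through the parent pointers. A sketch:

  /* Sketch: walk up from an osdev to its enclosing PCI device
   * to read the real vendor/device IDs. */
  hwloc_obj_t pci = osdev->parent;
  while (pci && pci->type != HWLOC_OBJ_PCI_DEVICE)
    pci = pci->parent;
  if (pci)
    printf("%s is in PCI device %04x:%04x\n", osdev->name,
           pci->attr->pcidev.vendor_id, pci->attr->pcidev.device_id);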

Given that I have never seen anybody use "cardX" for placing tasks/data
near a GPU, I am wondering if we should disable those by default. Or
maybe rename "GPU" to something that wouldn't attract people as much,
maybe "DRM".

> Does this mean that the cards are not configured correctly? Or is
> there an additional flag that needs to be set to get this information?


Make sure "cuda" appears in the summary at the end of the configure.

> Currently the code does:

>   hwloc_topology_init(&machine_topology);
>   hwloc_topology_set_flags(machine_topology,
>                            HWLOC_TOPOLOGY_FLAG_IO_DEVICES);
>   hwloc_topology_load(machine_topology);

> And this is enough to identify the CPUs and GPUs, but any additional
> information, particularly the device and vendor IDs, seems to not be
> there.

> I tried this with the most recent release (1.11.7) and saw the same
> results.   

> We tried this on a variety of PowerPC machines and I think even some
> x86_64 machines with similar results.   

> Thoughts?
> Dave

BTW, it looks like you're not going to the OMPI dev meeting next week.
I'll be there if one of your colleagues wants to discuss this face to face.

Brice

