Containers can be assigned multiple GPUs, so I assume you're thinking of putting these metrics in a repeated message? (similar to DiskStatistics)
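
As a rough sketch of the repeated-message shape (the message name, the `gpu_statistics` field name, and all field numbers here are placeholders and have not been checked against the current mesos.proto; the inner fields are copied from the proposal quoted below):

```protobuf
// Sketch only: GPUStatistics, gpu_statistics, and the field numbers
// are hypothetical and not taken from mesos.proto.
message GPUStatistics {
  optional int32 gpu_idx = 1;
  optional string gpu_uuid = 2;
  optional string device_name = 3;
  optional uint64 gpu_memory_used_mb = 4;
  optional uint64 gpu_memory_total_mb = 5;
  optional double gpu_usage = 6;
  optional int32 gpu_temperature = 7;
  optional int32 gpu_frequency_MHz = 8;
  optional int32 gpu_power_used_W = 9;
}

message ResourceStatistics {
  // ... existing fields elided ...

  // One entry per GPU assigned to the container.
  repeated GPUStatistics gpu_statistics = 50;
}
```

Grouping the per-device fields this way keeps each GPU's index, UUID, and readings together, which flat optional fields on ResourceStatistics cannot do once a container holds more than one GPU.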

It seems to me we should probably make this Nvidia specific (e.g. NvidiaGPUStatistics). In the past we thought generalizing this would be good, but there's only Nvidia support at the moment and we haven't been able to make sure that other GPU libraries provide the same information.

For each metric, can you also include the relevant NVML calls for obtaining the information?

Can you also highlight what cadvisor provides, to make sure we don't miss anything? From my read of their code, it seems to be a subset of what you listed?

https://github.com/google/cadvisor/blob/e310755a36728b457fcc1de6b54bb4c6cb38f031/accelerators/nvidia.go#L216-L246

On Fri, Mar 22, 2019 at 6:58 AM Jorge Machado <jom...@me.com.invalid> wrote:

> Another way would be to just use cadvisor.
>
> > On 22 Mar 2019, at 08:35, Jorge Machado <jom...@me.com.INVALID> wrote:
> >
> > Hi Mesos devs,
> >
> > In our use case we need to get GPU resource usage per task from Mesos
> > and build Grafana dashboards for it. To get the metrics into Grafana we
> > will send them to Prometheus; the main problem is how to get the
> > metrics in a reliable way.
> > I am proposing the following:
> >
> > Change mesos.proto and mesos.proto under v1 and, on the
> > ResourceStatistics message, add:
> >
> > //GPU statistics for each container
> > optional int32 gpu_idx = 50;
> > optional string gpu_uuid = 51;
> > optional string device_name = 52;
> > optional uint64 gpu_memory_used_mb = 53;
> > optional uint64 gpu_memory_total_mb = 54;
> > optional double gpu_usage = 55;
> > optional int32 gpu_temperature = 56;
> > optional int32 gpu_frequency_MHz = 57;
> > optional int32 gpu_power_used_W = 58;
> >
> > For starters I would like to change NvidiaGpuIsolatorProcess in
> > isolator.cpp and add the NVML calls there for the usage method. As I'm
> > new to this I need some guidelines please.
> >
> > My questions:
> >
> > Does the NvidiaGpuIsolatorProcess already run inside the container or
> > just outside in the agent? (I'm assuming outside)
> > From what I saw, the CPU metrics are gathered inside the container; for
> > the GPU we could do it in the NvidiaGpuIsolatorProcess and get the
> > metrics via the host.
> > Anything more that I should check?
> >
> > Thanks a lot
> >
> > Jorge Machado
> > www.jmachado.me