Thanks for the feedback, BenM! Jorge, would you mind addressing BenM's comments above and putting the proposal in a Google doc?
We could discuss this proposal in the next Containerization WG meeting on April 4th (please add an agenda item and link your proposal): https://docs.google.com/document/d/1z55a7tLZFoRWVuUxz1FZwgxkHeugtc2nHR89skFXSpU/edit#heading=h.978qjujkxfvu

-Gilbert

On Fri, Mar 22, 2019 at 12:19 PM Benjamin Mahler <[email protected]> wrote:

> Containers can be assigned multiple GPUs, so I assume you're thinking of
> putting these metrics in a repeated message? (Similar to DiskStatistics.)
>
> It has seemed to me we should probably make this Nvidia-specific (e.g.
> NvidiaGPUStatistics). In the past we thought generalizing this would be
> good, but there's only Nvidia support at the moment and we haven't been
> able to make sure that other GPU libraries provide the same information.
>
> For each metric, can you also include the relevant calls from NVML for
> obtaining the information? Can you also highlight what cAdvisor provides,
> to make sure we don't miss anything? From my read of their code, it seems
> to be a subset of what you listed:
>
> https://github.com/google/cadvisor/blob/e310755a36728b457fcc1de6b54bb4c6cb38f031/accelerators/nvidia.go#L216-L246
>
> On Fri, Mar 22, 2019 at 6:58 AM Jorge Machado <[email protected]> wrote:
>
> > Another way would be to just use cAdvisor.
> >
> > > On 22 Mar 2019, at 08:35, Jorge Machado <[email protected]> wrote:
> > >
> > > Hi Mesos devs,
> > >
> > > In our use case for Mesos we need GPU resource usage per task so that
> > > we can build Grafana dashboards on top of it. To get the metrics into
> > > Grafana we will ship them to Prometheus; the main problem is how to
> > > collect the metrics in a reliable way. I'm proposing the following:
> > >
> > > Change mesos.proto (and the mesos.proto under v1) by adding the
> > > following fields to the ResourceStatistics message:
> > >
> > > // GPU statistics for each container.
> > > optional int32 gpu_idx = 50;
> > > optional string gpu_uuid = 51;
> > > optional string device_name = 52;
> > > optional uint64 gpu_memory_used_mb = 53;
> > > optional uint64 gpu_memory_total_mb = 54;
> > > optional double gpu_usage = 55;
> > > optional int32 gpu_temperature = 56;
> > > optional int32 gpu_frequency_mhz = 57;
> > > optional int32 gpu_power_used_w = 58;
> > >
> > > For starters I would like to change NvidiaGpuIsolatorProcess in
> > > isolator.cpp and make the NVML calls in its usage() method. As I'm
> > > new to this I need some guidance, please. My questions:
> > >
> > > 1. Does the NvidiaGpuIsolatorProcess already run inside the container,
> > > or outside on the agent? (I'm assuming outside.)
> > > 2. From what I saw, the CPU metrics are gathered inside the container;
> > > for the GPU we could do it in the NvidiaGpuIsolatorProcess and get
> > > the metrics via the host.
> > > 3. Anything more that I should check?
> > >
> > > Thanks a lot
> > >
> > > Jorge Machado
> > > www.jmachado.me
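For reference, below is a minimal C++ sketch of the NVML calls that could back each proposed field, assuming the sampler links directly against libnvidia-ml (compile with -lnvidia-ml). The GpuSample struct and sample() helper are hypothetical names for illustration, not existing Mesos code; per BenM's comment, the fields would more likely end up in a repeated NvidiaGPUStatistics message than as flat ResourceStatistics fields.

    // Sketch: one NVML getter per proposed field. Error handling is
    // reduced to skipping fields whose query fails.
    #include <nvml.h>

    #include <iostream>
    #include <string>

    struct GpuSample  // hypothetical; mirrors the proposed proto fields
    {
      std::string uuid;                  // gpu_uuid
      std::string deviceName;            // device_name
      unsigned long long memoryUsedMb;   // gpu_memory_used_mb
      unsigned long long memoryTotalMb;  // gpu_memory_total_mb
      double usage;                      // gpu_usage (0.0 - 1.0)
      unsigned int temperature;          // gpu_temperature (Celsius)
      unsigned int frequencyMhz;         // gpu_frequency_mhz (SM clock)
      unsigned int powerUsedW;           // gpu_power_used_w
    };

    bool sample(unsigned int index, GpuSample* out)  // index -> gpu_idx
    {
      nvmlDevice_t device;
      if (nvmlDeviceGetHandleByIndex(index, &device) != NVML_SUCCESS) {
        return false;
      }

      char uuid[NVML_DEVICE_UUID_BUFFER_SIZE];
      if (nvmlDeviceGetUUID(device, uuid, sizeof(uuid)) == NVML_SUCCESS) {
        out->uuid = uuid;
      }

      char name[NVML_DEVICE_NAME_BUFFER_SIZE];
      if (nvmlDeviceGetName(device, name, sizeof(name)) == NVML_SUCCESS) {
        out->deviceName = name;
      }

      nvmlMemory_t memory;  // NVML reports bytes; convert to MB.
      if (nvmlDeviceGetMemoryInfo(device, &memory) == NVML_SUCCESS) {
        out->memoryUsedMb = memory.used / (1024 * 1024);
        out->memoryTotalMb = memory.total / (1024 * 1024);
      }

      nvmlUtilization_t utilization;  // NVML reports percentages.
      if (nvmlDeviceGetUtilizationRates(device, &utilization) == NVML_SUCCESS) {
        out->usage = utilization.gpu / 100.0;
      }

      nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, &out->temperature);
      nvmlDeviceGetClockInfo(device, NVML_CLOCK_SM, &out->frequencyMhz);

      unsigned int milliwatts;  // NVML reports milliwatts; convert to W.
      if (nvmlDeviceGetPowerUsage(device, &milliwatts) == NVML_SUCCESS) {
        out->powerUsedW = milliwatts / 1000;
      }

      return true;
    }

    int main()
    {
      if (nvmlInit() != NVML_SUCCESS) {
        return 1;
      }

      GpuSample s{};
      if (sample(0, &s)) {
        std::cout << s.deviceName << ": " << s.memoryUsedMb << "/"
                  << s.memoryTotalMb << " MB, " << (s.usage * 100)
                  << "% util" << std::endl;
      }

      nvmlShutdown();
      return 0;
    }

Note the unit conversions: nvmlDeviceGetMemoryInfo returns bytes and nvmlDeviceGetPowerUsage returns milliwatts, so the proposed *_mb and *_w fields need explicit scaling at collection time.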
