Hi Mesos devs, 

For our Mesos use case we need GPU resource usage per task so that we can build 
Grafana dashboards from it. To get the metrics into Grafana we will send them to 
Prometheus; the main problem is how to collect the metrics in a reliable way. 
I am proposing the following: 

Change mesos.proto and the mesos.proto under v1, adding the following fields 
to the ResourceStatistics message: 

//GPU statistics for each container
optional int32 gpu_idx = 50;
optional string gpu_uuid = 51;
optional string device_name = 52;
optional uint64 gpu_memory_used_mb = 53;
optional uint64 gpu_memory_total_mb = 54;
optional double gpu_usage = 55;
optional int32 gpu_temperature = 56;
optional int32 gpu_frequency_MHz = 57;
optional int32 gpu_power_used_W = 58;

As a first step I would like to change NvidiaGpuIsolatorProcess in isolator.cpp 
and make the NVML calls from its usage method. As I'm new to this, I would 
appreciate some guidance. 
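
To make the idea concrete, here is a minimal C++ sketch of the unit conversion 
such a usage method would need. It assumes the raw readings arrive in NVML's 
native units (memory in bytes, power in milliwatts, utilization in percent). 
RawGpuSample, GpuStatistics and toStatistics are hypothetical names for 
illustration, not existing Mesos or NVML types. Treating gpu_usage as a 
percentage and the _mb fields as MiB is also an assumption, since the proposal 
above does not fix either. 

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical holder for raw per-device readings, in the units NVML reports.
// In a real isolator these values would come from NVML calls such as
// nvmlDeviceGetMemoryInfo, nvmlDeviceGetUtilizationRates,
// nvmlDeviceGetTemperature, nvmlDeviceGetClockInfo and nvmlDeviceGetPowerUsage.
struct RawGpuSample {
  unsigned int index;
  std::string uuid;
  std::string name;
  unsigned long long memoryUsedBytes;
  unsigned long long memoryTotalBytes;
  unsigned int utilizationPercent;  // 0..100
  unsigned int temperatureCelsius;
  unsigned int clockMHz;
  unsigned int powerMilliwatts;
};

// Hypothetical plain struct mirroring the fields proposed for
// ResourceStatistics (not the generated protobuf class).
struct GpuStatistics {
  int32_t gpu_idx;
  std::string gpu_uuid;
  std::string device_name;
  uint64_t gpu_memory_used_mb;
  uint64_t gpu_memory_total_mb;
  double gpu_usage;
  int32_t gpu_temperature;
  int32_t gpu_frequency_MHz;
  int32_t gpu_power_used_W;
};

// Converts raw readings into the units used by the proposed fields.
GpuStatistics toStatistics(const RawGpuSample& raw) {
  assert(raw.utilizationPercent <= 100);

  constexpr unsigned long long kBytesPerMiB = 1024ULL * 1024ULL;

  GpuStatistics stats;
  stats.gpu_idx = static_cast<int32_t>(raw.index);
  stats.gpu_uuid = raw.uuid;
  stats.device_name = raw.name;
  // Assumption: "mb" means MiB; integer division truncates partial MiB.
  stats.gpu_memory_used_mb = raw.memoryUsedBytes / kBytesPerMiB;
  stats.gpu_memory_total_mb = raw.memoryTotalBytes / kBytesPerMiB;
  // Assumption: gpu_usage keeps the percentage as reported (0..100).
  stats.gpu_usage = static_cast<double>(raw.utilizationPercent);
  stats.gpu_temperature = static_cast<int32_t>(raw.temperatureCelsius);
  stats.gpu_frequency_MHz = static_cast<int32_t>(raw.clockMHz);
  // Milliwatts to watts, rounded to the nearest watt.
  stats.gpu_power_used_W =
      static_cast<int32_t>((raw.powerMilliwatts + 500U) / 1000U);
  return stats;
}
```

Keeping the conversion separate from the NVML calls would let it be unit 
tested on machines without a GPU. 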

My questions:  

Does the NvidiaGpuIsolatorProcess already run inside the container, or 
outside in the agent? (I'm assuming outside.) 
From what I saw, the CPU metrics are gathered inside the container; for the 
GPU we could do it in the NvidiaGpuIsolatorProcess and get the metrics via 
the host. 
Is there anything else I should check? 

Thanks a lot

Jorge Machado
www.jmachado.me




