Thanks for the feedback, BenM!

Jorge, would you mind addressing BenM's comment above and putting the proposal
into a Google Doc?

We could discuss this proposal at the next Containerization WG meeting on April
4th (please add an agenda item and link your proposal):
https://docs.google.com/document/d/1z55a7tLZFoRWVuUxz1FZwgxkHeugtc2nHR89skFXSpU/edit#heading=h.978qjujkxfvu

-Gilbert

On Fri, Mar 22, 2019 at 12:19 PM Benjamin Mahler <benjamin.mah...@gmail.com>
wrote:

> Containers can be assigned multiple GPUs, so I assume you're thinking of
> putting these metrics in a repeated message? (similar to DiskStatistics)
>
> It seems to me we should probably make this Nvidia-specific (e.g.
> NvidiaGPUStatistics). In the past we thought generalizing this would be
> good, but there's only Nvidia support at the moment and we haven't been
> able to make sure that other GPU libraries provide the same information.
>
> For each metric can you also include the relevant calls from NVML for
> obtaining the information? Can you also highlight what cadvisor provides to
> make sure we don't miss anything? From my read of their code, it seems to
> be a subset of what you listed?
>
> https://github.com/google/cadvisor/blob/e310755a36728b457fcc1de6b54bb4c6cb38f031/accelerators/nvidia.go#L216-L246
>
> On Fri, Mar 22, 2019 at 6:58 AM Jorge Machado <jom...@me.com.invalid>
> wrote:
>
> > Another way would be to just use cadvisor.
> >
> > > On 22 Mar 2019, at 08:35, Jorge Machado <jom...@me.com.INVALID> wrote:
> > >
> > > Hi Mesos devs,
> > >
> > > > In our Mesos use case, we need to get GPU resource usage per task
> > > > and build Grafana dashboards for it. To get the metrics into Grafana,
> > > > we will send them to Prometheus; the main problem is how to collect
> > > > the metrics in a reliable way.
> > > > I am proposing the following:
> > >
> > > > Change mesos.proto and the mesos.proto under v1, adding the following
> > > > to the ResourceStatistics message:
> > >
> > > > // GPU statistics for each container
> > > optional int32 gpu_idx = 50;
> > > optional string gpu_uuid = 51;
> > > optional string device_name = 52;
> > > optional uint64 gpu_memory_used_mb = 53;
> > > optional uint64 gpu_memory_total_mb = 54;
> > > optional double gpu_usage = 55;
> > > optional int32 gpu_temperature = 56;
> > > optional int32 gpu_frequency_MHz = 57;
> > > optional int32 gpu_power_used_W = 58;
> > >
> > > > For starters, I would like to change NvidiaGpuIsolatorProcess in
> > > > isolator.cpp and make the NVML calls there, in the usage method. As
> > > > I'm new to this, I would appreciate some guidance.
> > >
> > > My questions:
> > >
> > > > Does NvidiaGpuIsolatorProcess already run inside the container, or
> > > > just outside, in the agent? (I'm assuming outside.)
> > > > From what I saw, the CPU metrics are gathered inside the container;
> > > > for the GPU we could do it in NvidiaGpuIsolatorProcess and get the
> > > > metrics via the host.
> > > > Is there anything else I should check?
> > >
> > > Thanks a lot
> > >
> > > Jorge Machado
> > > www.jmachado.me
> > >
> > >
> > >
> > >
> > >
> >
> >
>
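
To make the shape of the proposal concrete, here is a minimal Python sketch; it is not Mesos code. It assumes the per-GPU fields from Jorge's list are grouped into one record per device (the repeated-message idea BenM raises) and renders them in the Prometheus text exposition format, since Prometheus is where the metrics are meant to end up. The names GpuStats, escape_label and to_prometheus, the label names, and the choice to reuse the proposed field names as metric names are all assumptions made for illustration, and the sample values are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GpuStats:
    """One record per GPU (hypothetical mirror of the proposed fields)."""
    gpu_idx: int
    gpu_uuid: str
    device_name: str
    gpu_memory_used_mb: int
    gpu_memory_total_mb: int
    gpu_usage: float        # unit not stated in the proposal
    gpu_temperature: int    # unit not stated in the proposal
    gpu_frequency_mhz: int
    gpu_power_used_w: int


# Numeric fields that become one sample line each.
METRIC_FIELDS = (
    "gpu_memory_used_mb",
    "gpu_memory_total_mb",
    "gpu_usage",
    "gpu_temperature",
    "gpu_frequency_mhz",
    "gpu_power_used_w",
)


def escape_label(value: str) -> str:
    """Escape backslash, double quote and newline in a label value."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n"))


def to_prometheus(container_id: str, gpus: List[GpuStats]) -> str:
    """Render one sample line per metric per GPU for a single container."""
    lines = []
    for gpu in gpus:
        # Identifying fields become labels, so each GPU gets its own series.
        labels = ",".join(
            '%s="%s"' % (key, escape_label(str(val)))
            for key, val in (
                ("container_id", container_id),
                ("gpu_idx", gpu.gpu_idx),
                ("gpu_uuid", gpu.gpu_uuid),
                ("device_name", gpu.device_name),
            )
        )
        for name in METRIC_FIELDS:
            lines.append("%s{%s} %s" % (name, labels, getattr(gpu, name)))
    return ("\n".join(lines) + "\n") if lines else ""


if __name__ == "__main__":
    # Made-up values for a container that was assigned two GPUs.
    sample = [
        GpuStats(0, "GPU-aaaa", "Example GPU", 1024, 8192, 0.5, 60, 1200, 75),
        GpuStats(1, "GPU-bbbb", "Example GPU", 2048, 8192, 0.9, 70, 1300, 90),
    ]
    print(to_prometheus("container-1", sample), end="")
```

Keeping one record per device means a container with several GPUs yields several labelled series instead of overwriting a single set of flat fields, which is the concern behind the repeated-message question.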
