Adding Robert Alexander to the list, since he and I worked together on
the NVIDIA plug-in.

Thanks,

Bernard

On Tue, Jul 10, 2012 at 12:06 PM, Peter Phaal <peter.ph...@gmail.com> wrote:
> Nigel,
>
> A simple option would be to use Host sFlow agents to export the core
> metrics from your Windows servers and use gmetric to add the GPU
> metrics.
>
> You could combine code from the python GPU module and gmetric
> implementations to produce a self contained script for exporting GPU
> metrics:
>
> https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia
> https://github.com/ganglia/ganglia_contrib
>
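
A minimal sketch of such a self-contained script, for illustration only. It assumes the pynvml NVML bindings are installed and that a gmetric binary is on the PATH; neither is confirmed for the Windows hosts discussed here. The metric names (gpu0_temp and so on) and the helper names are hypothetical and not taken from the Ganglia module.

```python
import shutil
import subprocess


def gmetric_command(name, value, metric_type, units=""):
    """Build the argv for one gmetric invocation (gmetric sends one metric per call)."""
    cmd = ["gmetric", "--name", name, "--value", str(value), "--type", metric_type]
    if units:
        cmd += ["--units", units]
    return cmd


def read_gpu_metrics():
    """Return (name, value, type, units) tuples; [] if NVML cannot be used."""
    try:
        import pynvml  # assumption: NVIDIA's Python NVML bindings are installed
    except ImportError:
        return []
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return []
    metrics = []
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            try:
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                temp = pynvml.nvmlDeviceGetTemperature(
                    handle, pynvml.NVML_TEMPERATURE_GPU)
                util = pynvml.nvmlDeviceGetUtilizationRates(handle)
                mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            except pynvml.NVMLError:
                continue  # skip a device that does not support a query
            prefix = "gpu%d" % i  # hypothetical naming scheme
            metrics += [
                (prefix + "_temp", temp, "uint32", "C"),
                (prefix + "_util", util.gpu, "uint32", "%"),
                (prefix + "_mem_util", util.memory, "uint32", "%"),
                (prefix + "_mem_used", mem.used // 1024, "double", "KB"),
            ]
    finally:
        pynvml.nvmlShutdown()
    return metrics


def export(dry_run=False):
    """Send each GPU metric with gmetric, or print the commands instead."""
    for name, value, metric_type, units in read_gpu_metrics():
        cmd = gmetric_command(name, value, metric_type, units)
        if dry_run or shutil.which("gmetric") is None:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    export()
```

Run from a scheduler at the desired polling interval. On a host without NVML or gmetric the script sends nothing and only prints whatever commands it could build.
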
> Longer term, it would make sense to extend Host sFlow to use the
> C-based NVML API to extract and export metrics. This would be
> straightforward - the Host sFlow agent uses native C APIs on the
> platforms it supports to extract metrics.
>
> What would take some thought is developing a standard set of summary
> metrics to characterize GPU performance. Once the set of metrics is
> agreed on, adding them to the sFlow agent is pretty trivial.
>
> Currently the Ganglia python module exports the following metrics -
> are they the right set? Anything missing? It would be great to get
> involvement from the broader Ganglia community to capture best
> practice from anyone running large GPU clusters, as well as getting
> input from NVIDIA about the key metrics.
>
> * gpu_num
> * gpu_driver
> * gpu_type
> * gpu_uuid
> * gpu_pci_id
> * gpu_mem_total
> * gpu_graphics_speed
> * gpu_sm_speed
> * gpu_mem_speed
> * gpu_max_graphics_speed
> * gpu_max_sm_speed
> * gpu_max_mem_speed
> * gpu_temp
> * gpu_util
> * gpu_mem_util
> * gpu_mem_used
> * gpu_fan
> * gpu_power_usage
> * gpu_perf_state
> * gpu_ecc_mode
>
> As far as scalability is concerned, you should find that moving to
> sFlow as the measurement transport reduces network traffic since all
> the metrics for a node are transported in a single UDP datagram
> (rather than a datagram per metric when using gmond as the agent). The
> other consideration is that sFlow is unicast, so if you are using a
> multicast Ganglia setup then this involves restructuring your
> configuration.
>
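
A rough illustration of the datagram arithmetic described above. The node and metric counts are hypothetical, and the model assumes every metric is sent once per polling interval, which simplifies how gmond actually schedules its metrics.

```python
def datagrams_per_interval(nodes, metrics_per_node, one_datagram_per_node):
    """Count datagrams sent per polling interval across the cluster.

    one_datagram_per_node=True models the sFlow case described above (all of
    a node's metrics in a single UDP datagram); False models one datagram per
    metric.
    """
    if one_datagram_per_node:
        return nodes
    return nodes * metrics_per_node


# Hypothetical cluster: 1000 nodes, 30 metrics each.
per_metric = datagrams_per_interval(1000, 30, one_datagram_per_node=False)
per_node = datagrams_per_interval(1000, 30, one_datagram_per_node=True)
print(per_metric, per_node)  # 30000 1000
```
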
> You still need to have at least one gmond instance, but it acts as an
> sFlow aggregator and is mute:
> http://blog.sflow.com/2011/07/ganglia-32-released.html
>
> Peter
>
> On Tue, Jul 10, 2012 at 8:36 AM, Nigel LEACH
> <nigel.le...@uk.bnpparibas.com> wrote:
>> Hello Bernard, I was coming to that conclusion. I’ve been trying to compile
>> on various combinations of Cygwin, Windows, and hardware this afternoon, but
>> without success so far. I still have a few more tests to do, though.
>>
>>
>>
>> The GPU plugin is my only reason for upgrading from our current 3.1.7, and
>> we use nothing else esoteric. We do have Linux blades, but all of
>> our Teslas are hosted on Windows. The entire estate is quite large, so we
>> would need to ensure sFlow scales. There is no reason to think it won’t,
>> but I have little experience with it.
>>
>>
>>
>> Regards
>>
>> Nigel
>>
>>
>>
>> From: bern...@vanhpc.org [mailto:bern...@vanhpc.org]
>> Sent: 10 July 2012 16:19
>> To: Nigel LEACH
>> Cc: neil.mckee...@gmail.com; ganglia-general@lists.sourceforge.net
>>
>>
>> Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin
>>
>>
>>
>> Hi Nigel:
>>
>>
>>
>> Perhaps other developers could chime in, but I'm not sure whether the latest
>> version can be compiled under Windows; at least, I am not aware of any
>> testing having been done.
>>
>>
>>
>> Going forward I would like to encourage users to use hsflowd under Windows.
>> I'm talking to the developers to see if we can add support for GPU
>> monitoring.  Do you have any other requirements besides that?
>>
>>
>>
>> Thanks,
>>
>>
>>
>> Bernard
>>
>> On Tuesday, July 10, 2012, Nigel LEACH wrote:
>>
>> Hi Neil, Many thanks for the swift reply.
>>
>>
>>
>> I want to take a look at sFlow, but it isn’t a prerequisite.
>>
>>
>>
>> Anyway, I disabled sFlow, and (separately) included the patch you sent. Both
>> fixes appeared successful. For now I am going with your patch, and sFlow
>> enabled.
>>
>>
>>
>> I say “appeared successful”, as make was error free, and a gmond.exe was
>> created. However, it doesn’t appear to work out of the box. I created a
>> default gmond.conf
>>
>>
>>
>> ./gmond --default_config > /usr/local/etc/gmond.conf
>>
>>
>>
>> and then simply ran gmond. It started a process, but nothing was listening
>> on port 8649. Running in debug mode I get this:
>>
>>
>>
>> $ ./gmond -d 10
>> loaded module: core_metrics
>> loaded module: cpu_module
>> loaded module: disk_module
>> loaded module: load_module
>> loaded module: mem_module
>> loaded module: net_module
>> loaded module: proc_module
>> loaded module: sys_module
>>
>>
>>
>>
>>
>> and nothing further.
>>
>>
>>
>> I have done little investigation yet, so unless there is anything obvious I
>> am missing, I’ll continue to troubleshoot.
>>
>>
>>
>> Regards
>>
>> Nigel
>>
>>
>>
>>
>>
>> From: neil.mckee...@gmail.com [mailto:neil.mckee...@gmail.com]
>> Sent: 09 July 2012 18:15
>> To: Nigel LEACH
>> Cc: ganglia-general@lists.sourceforge.net
>> Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin
>>
>>
>>
>> You could try adding "--disable-sflow" as another configure option.   (Or
>> were you planning to use sFlow agents such as hsflowd?).
>>
>>
>>
>> Neil
>>
>>
>>
>>
>>
>> On Jul 9, 2012, at 3:50 AM, Nigel LEACH wrote:
>>
>>
>>
>> Ganglia 3.4.0
>> Windows 2008 R2 Enterprise
>> Cygwin 1.5.25
>> IBM iDataPlex dx360 with Tesla M2070
>> Confuse 2.7
>>
>>
>>
>> I’m trying to use the Ganglia Python modules to monitor a Windows-based GPU
>> cluster, but am having problems getting gmond to compile. This ‘configure’
>> completes successfully:
>>
>>
>>
>> ./configure --with-libconfuse=/usr/local --without-libpcre --enable-static-build
>>
>>
>>
>> but ‘make’ fails. This is the tail of standard output:
>>
>>
>>
>> mv -f .deps/g25_config.Tpo .deps/g25_config.Po
>> gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -DCYGWIN -I/usr/include/apr-1 -I/usr/include/apr-1 -I../lib -I../include/ -I../libmetrics -D_LARGEFILE64_SOURCE -DSFLOW -g -O2 -I/usr/local/include -fno-strict-aliasing -Wall -MT core_metrics.o -MD -MP -MF .deps/core_metrics.Tpo -c -o core_metrics.o core_metrics.c
>> mv -f .deps/core_metrics.Tpo .deps/core_metrics.Po
>> gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -DCYGWIN -I/usr/include/apr-1 -I/usr/include/apr-1 -I../lib -I../include/ -I../libmetrics -D_LARGEFILE64_SOURCE -DSFLOW -g -O2 -I/usr/local/include -fno-strict-aliasing -Wall -MT sflow.o -MD -MP -MF .deps/sflow.Tpo -c -o sflow.o sflow.c
>> sflow.c: In function `process_struct_JVM':
>> sflow.c:1033: warning: comparison is always true due to limited range of data type
>>
>>
>> ___________________________________________________________
>> This e-mail may contain confidential and/or privileged information. If you
>> are not the intended recipient (or have received this e-mail in error)
>> please notify the sender immediately and delete this e-mail. Any
>> unauthorised copying, disclosure or distribution of the material in this
>> e-mail is prohibited.
>>
>> Please refer to http://www.bnpparibas.co.uk/en/email-disclaimer/ for
>> additional disclosures.
>>

_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general