Thanks for the updates, Peter and Bernard.

I have been unable to get gmond 3.4 working under Cygwin; my latest errors are 
in parsing gm_protocol_xdr.c. I don't know whether we should follow this up. It 
would be nice to have a Windows gmond, but my only reason for upgrading is the 
GPU metrics.

I take your point about re-using the existing GPU module and gmetric; 
unfortunately, I don't have experience with Python. My plan is to write 
something in C to export the NVML metrics, with various output options. We will 
then decide whether to call this new code from our existing gmond 3.1 via 
gmetric, from the new gmond 3.4 (if we get it working), or from one of our 
existing third-party tools, ITRS Geneos.
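
To make that concrete, here is a rough sketch of the sort of C exporter I have 
in mind. It is only a sketch, not working code: it uses the documented NVML 
calls (nvmlInit, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex, 
nvmlDeviceGetUtilizationRates, nvmlDeviceGetTemperature), error handling is 
minimal, and the gmetric shell-out is just one of the output options I 
mentioned:

/* gpumetrics.c - rough sketch of an NVML exporter, not production code.
 * Build: gcc -o gpumetrics gpumetrics.c -lnvidia-ml */
#include <stdio.h>
#include <stdlib.h>
#include <nvml.h>

int main(void)
{
    nvmlReturn_t rc = nvmlInit();
    if (rc != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit failed: %s\n", nvmlErrorString(rc));
        return 1;
    }

    unsigned int count = 0;
    nvmlDeviceGetCount(&count);

    for (unsigned int i = 0; i < count; i++) {
        nvmlDevice_t dev;
        nvmlUtilization_t util;
        unsigned int temp;

        if (nvmlDeviceGetHandleByIndex(i, &dev) != NVML_SUCCESS)
            continue;

        if (nvmlDeviceGetUtilizationRates(dev, &util) == NVML_SUCCESS &&
            nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp)
                == NVML_SUCCESS) {
            /* Output option 1: plain text for a third-party collector. */
            printf("gpu%u_util=%u gpu%u_temp=%u\n", i, util.gpu, i, temp);

            /* Output option 2: feed an existing gmond 3.1 by shelling
             * out to gmetric. */
            char cmd[256];
            snprintf(cmd, sizeof(cmd),
                     "gmetric --name gpu%u_util --value %u --type uint16"
                     " --units %%", i, util.gpu);
            system(cmd);
        }
    }

    nvmlShutdown();
    return 0;
}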

As regards your list of metrics, it is pretty definitive, but I will probably 
also export the following (a rough sketch of the calls is below):

* total ECC errors - nvmlDeviceGetTotalEccErrors
* individual ECC errors - nvmlDeviceGetDetailedEccErrors
* active compute processes - nvmlDeviceGetComputeRunningProcesses
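
Roughly, those three calls would look like this for a single device handle 
(obtained as in the sketch above). The ECC enum names differ between NVML 
versions (older headers use NVML_SINGLE_BIT_ECC and NVML_DOUBLE_BIT_ECC where 
newer ones use the corrected/uncorrected memory error types), so take this as 
a sketch against a recent nvml.h:

#include <stdio.h>
#include <nvml.h>

/* Sketch: the three additional queries for one device handle. */
static void report_extra_metrics(nvmlDevice_t dev)
{
    unsigned long long total = 0;
    nvmlEccErrorCounts_t counts;
    nvmlProcessInfo_t procs[32];
    unsigned int nprocs = 32;

    /* Total ECC errors (volatile counter, i.e. since last driver load). */
    if (nvmlDeviceGetTotalEccErrors(dev, NVML_MEMORY_ERROR_TYPE_CORRECTED,
                                    NVML_VOLATILE_ECC, &total) == NVML_SUCCESS)
        printf("gpu_ecc_errors_total=%llu\n", total);

    /* Individual ECC errors by location: L1, L2, device memory, registers. */
    if (nvmlDeviceGetDetailedEccErrors(dev, NVML_MEMORY_ERROR_TYPE_CORRECTED,
                                       NVML_VOLATILE_ECC, &counts) == NVML_SUCCESS)
        printf("gpu_ecc_l1=%llu gpu_ecc_l2=%llu gpu_ecc_mem=%llu gpu_ecc_reg=%llu\n",
               counts.l1Cache, counts.l2Cache,
               counts.deviceMemory, counts.registerFile);

    /* Active compute processes; nprocs comes back with the actual count. */
    if (nvmlDeviceGetComputeRunningProcesses(dev, &nprocs, procs) == NVML_SUCCESS)
        printf("gpu_compute_procs=%u\n", nprocs);
}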

Regards
Nigel  

-----Original Message-----
From: peter.ph...@gmail.com [mailto:peter.ph...@gmail.com] 
Sent: 10 July 2012 20:06
To: Nigel LEACH
Cc: bern...@vanhpc.org; ganglia-general@lists.sourceforge.net
Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin

Nigel,

A simple option would be to use Host sFlow agents to export the core metrics 
from your Windows servers and use gmetric to add the GPU metrics.
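
For example, pushing a single reading in by hand looks like this (gpu_util is 
just an example name; see gmetric --help for the full option list):

gmetric --name gpu_util --value 42 --type uint16 --units %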

You could combine code from the Python GPU module and gmetric implementations 
to produce a self-contained script for exporting GPU metrics:

https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia
https://github.com/ganglia/ganglia_contrib

Longer term, it would make sense to extend Host sFlow to use the C-based NVML 
API to extract and export metrics. This would be straightforward - the Host 
sFlow agent uses native C APIs on the platforms it supports to extract metrics.

What would take some thought is developing a standard set of summary metrics to 
characterize GPU performance. Once the set of metrics is agreed on, adding them 
to the sFlow agent is pretty trivial.

Currently the Ganglia Python module exports the following metrics - are they 
the right set? Anything missing? It would be great to get involvement from the 
broader Ganglia community to capture best practice from anyone running large 
GPU clusters, as well as input from NVIDIA about the key metrics.

* gpu_num
* gpu_driver
* gpu_type
* gpu_uuid
* gpu_pci_id
* gpu_mem_total
* gpu_graphics_speed
* gpu_sm_speed
* gpu_mem_speed
* gpu_max_graphics_speed
* gpu_max_sm_speed
* gpu_max_mem_speed
* gpu_temp
* gpu_util
* gpu_mem_util
* gpu_mem_used
* gpu_fan
* gpu_power_usage
* gpu_perf_state
* gpu_ecc_mode

As far as scalability is concerned, you should find that moving to sFlow as the 
measurement transport reduces network traffic, since all the metrics for a node 
are transported in a single UDP datagram (rather than a datagram per metric 
when using gmond as the agent). The other consideration is that sFlow is 
unicast, so if you are using a multicast Ganglia setup this involves 
restructuring your configuration.

You still need to have at least one gmond instance, but it acts as an sFlow 
aggregator and is mute:
http://blog.sflow.com/2011/07/ganglia-32-released.html
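
My understanding from that post is that the receiving gmond's configuration 
ends up looking something like this (check the post and the sample gmond.conf 
for the authoritative syntax):

globals {
  mute = yes           /* aggregate and report, but take no local measurements */
}
sflow {
  udp_port = 6343      /* default port hsflowd sends sFlow datagrams to */
}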

Peter

On Tue, Jul 10, 2012 at 8:36 AM, Nigel LEACH <nigel.le...@uk.bnpparibas.com> 
wrote:
> Hello Bernard, I was coming to that conclusion. I've been trying to 
> compile on various combinations of Cygwin, Windows, and hardware this 
> afternoon, but without success yet. I've still got a few more tests to do 
> though.
>
> The GPU plugin is my only reason for upgrading from our current 3.1.7, 
> and there is nothing else esoteric we use. We do have Linux blades, 
> but all of our Teslas are hosted on Windows. The entire estate is 
> quite large, so we would need to ensure sFlow scales; no reason to 
> think it won't, but I have little experience with it.
>
> Regards
>
> Nigel
>
> From: bern...@vanhpc.org [mailto:bern...@vanhpc.org]
> Sent: 10 July 2012 16:19
> To: Nigel LEACH
> Cc: neil.mckee...@gmail.com; ganglia-general@lists.sourceforge.net
> Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin
>
> Hi Nigel:
>
> Perhaps other developers could chime in, but I'm not sure whether the latest 
> version can be compiled under Windows; at least, I am not aware of 
> any testing done.
>
> Going forward I would like to encourage users to use hsflowd under Windows.
> I'm talking to the developers to see if we can add support for GPU 
> monitoring.  Do you have any other requirements besides that?
>
> Thanks,
>
> Bernard
>
> On Tuesday, July 10, 2012, Nigel LEACH wrote:
>
> Hi Neil, many thanks for the swift reply.
>
> I want to take a look at sFlow, but it isn't a prerequisite.
>
> Anyway, I disabled sFlow, and (separately) included the patch you 
> sent. Both fixes appeared successful. For now I am going with your 
> patch, and sFlow enabled.
>
> I say "appeared successful", as make was error-free, and a gmond.exe 
> was created. However, it doesn't appear to work out of the box. I 
> created a default gmond.conf:
>
> ./gmond --default_config > /usr/local/etc/gmond.conf
>
> and then simply ran gmond. It started a process, but the listening port 
> (8649) was never opened. Running in debug mode I get this:
>
> $ ./gmond -d 10
> loaded module: core_metrics
> loaded module: cpu_module
> loaded module: disk_module
> loaded module: load_module
> loaded module: mem_module
> loaded module: net_module
> loaded module: proc_module
> loaded module: sys_module
>
> and nothing further.
>
> I have done little investigation yet, so unless there is anything 
> obvious I am missing, I'll continue to troubleshoot.
>
> Regards
>
> Nigel
>
> From: neil.mckee...@gmail.com [mailto:neil.mckee...@gmail.com]
> Sent: 09 July 2012 18:15
> To: Nigel LEACH
> Cc: ganglia-general@lists.sourceforge.net
> Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin
>
> You could try adding "--disable-sflow" as another configure option. (Or 
> were you planning to use sFlow agents such as hsflowd?)
>
> Neil
>
> On Jul 9, 2012, at 3:50 AM, Nigel LEACH wrote:
>
> Ganglia 3.4.0
> Windows 2008 R2 Enterprise
> Cygwin 1.5.25
> IBM iDataPlex dx360 with Tesla M2070
> Confuse 2.7
>
> I'm trying to use the Ganglia Python modules to monitor a Windows-based 
> GPU cluster, but am having problems getting gmond to compile. This 
> 'configure' completes successfully:
>
> ./configure --with-libconfuse=/usr/local --without-libpcre 
> --enable-static-build
>
> but 'make' fails; this is the tail of standard output:
>
> mv -f .deps/g25_config.Tpo .deps/g25_config.Po
> gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -DCYGWIN -I/usr/include/apr-1 -I/usr/include/apr-1 -I../lib -I../include/ -I../libmetrics -D_LARGEFILE64_SOURCE -DSFLOW -g -O2 -I/usr/local/include -fno-strict-aliasing -Wall -MT core_metrics.o -MD -MP -MF .deps/core_metrics.Tpo -c -o core_metrics.o core_metrics.c
> mv -f .deps/core_metrics.Tpo .deps/core_metrics.Po
> gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -DCYGWIN -I/usr/include/apr-1 -I/usr/include/apr-1 -I../lib -I../include/ -I../libmetrics -D_LARGEFILE64_SOURCE -DSFLOW -g -O2 -I/usr/local/include -fno-strict-aliasing -Wall -MT sflow.o -MD -MP -MF .deps/sflow.Tpo -c -o sflow.o sflow.c
> sflow.c: In function `process_struct_JVM':
> sflow.c:1033: warning: comparison is always true due to limited range of data type