Re: [Ganglia-general] Gmond Compilation on Cygwin
Not sure if I fully answered this one. As for testing: the Python module pyNVML (http://packages.python.org/nvidia-ml-py/) is tested on Linux and native Windows (not Cygwin). We do this with each release of the package. As for the Ganglia metrics module, we don't really test that in house. Our QA isn't really set up to run gmond/gmetad/web. I ran some local testing of it myself on Linux, but not Windows.

-Robert

From: Bernard Li [mailto:bern...@vanhpc.org]
Sent: Thursday, July 12, 2012 1:59 PM
To: Robert Alexander
Cc: Nigel LEACH; lozgachev.i...@gmail.com; ganglia-general@lists.sourceforge.net; Peter Phaal
Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin

Hi Robert:

When you said you tested the Python metric modules, did you just test the Python scripts under Windows or did you somehow got gmond compiled under Windows natively with Python support?

Thanks,

Bernard

On Thursday, July 12, 2012, Robert Alexander wrote:

Hey,

A meeting may be a good idea. My schedule is mostly open next week. When are others free? I will brush up on sflow by then.

NVML and the Python metric module are tested at NVIDIA on Windows and Linux, but not within Cygwin. The process will be easier/faster on the NVML side if we keep Cygwin out of the loop.

-Robert

-Original Message-
From: Bernard Li [mailto:bern...@vanhpc.org]
Sent: Thursday, July 12, 2012 10:49 AM
To: Nigel LEACH
Cc: lozgachev.i...@gmail.com; ganglia-general@lists.sourceforge.net; Peter Phaal; Robert Alexander
Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin

Hi Nigel:

Technically you only need 3.1 gmond to have support for the Python metric module. But I'm not sure whether we have ever tested this under Windows.

Peter and Robert: How quickly can we get hsflowd to support GPU metrics collection internally? Should we setup a meeting to discuss this?

Thanks,

Bernard

On Thu, Jul 12, 2012 at 4:05 AM, Nigel LEACH wrote:
> Thanks Ivan, but we have 3.0 and 3.1 gmond running under Cygwin (and using APR), the problem is with the 3.4 spin.
>
> -Original Message-
> From: lozgachev.i...@gmail.com [mailto:lozgachev.i...@gmail.com]
> Sent: 12 July 2012 11:54
> To: Nigel LEACH
> Cc: peter.ph...@gmail.com; ganglia-general@lists.sourceforge.net
> Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin
>
> Hi all,
>
> Maybe it will be interesting. Some time ago I successfully compiled gmond 3.0.7 and 3.1.2 under Cygwin. If you need it I can upload somewhere gmond and 3rd party sources + compilation script.
> Also, I have gmetad 3.0.7 compiled for Windows. In addition, I developed (just for fun) my implementation of gmetad 3.1.2 using .NET and C#.
>
> P. S. I do not know whether it is possible to use these gmond versions to collect statistics from GPUs.
>
> --
> Best regards,
> Ivan.
>
> 2012/7/12 Nigel LEACH:
>> Thanks for the updates Peter and Bernard.
>>
>> I have been unable to get gmond 3.4 working under Cygwin, my latest errors are parsing gm_protocol_xdr.c. I don't know whether we should follow this up, it would be nice to have a Windows gmond, but my only reason for upgrading is the GPU metrics.
>>
>> I take your point about re-using the existing GPU module and gmetric, unfortunately I don't have experience with Python. My plan is to write something in C to export the nvml metrics, with various output options. We will then decide whether to call this new code from existing gmond 3.1 via gmetric, new (if we get it working) gmond 3.4, or one of our existing third party tools - ITRS Geneos.
>> >> As regards your list of metrics they are pretty definitive, but I >> will probably also export >> >> *total ecc errors - nvmlDeviceGetTotalEccErrors) *individual ecc >> errors - nvmlDeviceGetDetailedEccErrors *active compute processes - >> nvmlDeviceGetComputeRunningProcesses >> >> Regards >> Nigel >> >> -Original Message- >> From: peter.ph...@gmail.com >> [mailto:peter.ph...@gmail.com] >> Sent: 10 July 2012 20:06 >> To: Nigel LEACH >> Cc: bern...@vanhpc.org; >> ganglia-general@lists.sourceforge.net >> Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin >> >> Nigel, >> >> A simple option would be to use Host sFlow agents to export the core metrics >> from your Windows servers and use gmetric to send add the GPU metrics. >> >> You could combine code from the python GPU module and gmetric >> implementations to produce a self contained script for exporting GPU >> metrics: >> >> https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidi >> a htt
Re: [Ganglia-general] Gmond Compilation on Cygwin
Hi Robert,

sFlow is a very simple protocol - an sFlow agent periodically sends XDR encoded structures over UDP. Each structure has a tag and a length, making the protocol extensible. In the short term, it would make sense to define an sFlow structure to carry the current NVML metrics and tag it using NVIDIA's IANA assigned vendor number (5703). Something along these lines:

/* NVML statistics */
/* opaque = counter_data; enterprise = 5703, format=1 */
struct nvml_gpu_counters {
  unsigned int device_count;
  unsigned int mem_total;
  unsigned int mem_util;
  ...
}

Additional examples are in the sFlow Host Structures specification (http://www.sflow.org/sflow_host.txt); these are the structures currently being exported by the Host sFlow agent. Extending the Windows Host sFlow agent to export these metrics would involve adding a routine to populate and serialize this structure - pretty straightforward - if you look at the Host sFlow agent source code you will see examples of how the existing structures are handled. For Ganglia to support the new counters, we would need to add a decoder to gmond for the new structure - also straightforward.

Are per device metrics important, or can we roll up the metrics across all the GPUs on a server? With sFlow we generally roll up metrics for each node where possible - the goal is to provide enough detail so that the operations team can tell whether a node is healthy or not, but not so much as to overwhelm the monitoring system and limit scalability. Once a problem is detected, detailed troubleshooting and diagnostics can be performed using point tools on the host.

The metrics currently exposed by the NVML API could be improved - everything appears to be a 1 second gauge. A more robust model for metrics is to maintain monotonic counters so that they can be polled at different frequencies and still produce meaningful results. Counters are also more robust when sending metrics over an unreliable transport like UDP. The receiver calculates the deltas and can easily compensate for lost packets.

Longer term it would be useful to have a discussion to see what metrics best characterize operational performance and are feasible to implement. Counters such as number of threads started, number of busy ticks, number of idle ticks etc. are the type of measurement you want in order to calculate utilizations. Some kind of load average based on the thread run queue would also be interesting.

My calendar is pretty open next week - I am based in San Francisco, so 8am-5pm PST works best.

Peter

On Thu, Jul 12, 2012 at 11:58 AM, Robert Alexander wrote:
> Hey,
>
> A meeting may be a good idea. My schedule is mostly open next week. When are others free? I will brush up on sflow by then.
>
> NVML and the Python metric module are tested at NVIDIA on Windows and Linux, but not within Cygwin. The process will be easier/faster on the NVML side if we keep Cygwin out of the loop.
>
> -Robert
>
> -Original Message-
> From: Bernard Li [mailto:bern...@vanhpc.org]
> Sent: Thursday, July 12, 2012 10:49 AM
> To: Nigel LEACH
> Cc: lozgachev.i...@gmail.com; ganglia-general@lists.sourceforge.net; Peter Phaal; Robert Alexander
> Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin
>
> Hi Nigel:
>
> Technically you only need 3.1 gmond to have support for the Python metric module. But I'm not sure whether we have ever tested this under Windows.
>
> Peter and Robert: How quickly can we get hsflowd to support GPU metrics collection internally? Should we setup a meeting to discuss this?
> > Thanks, > > Bernard > > On Thu, Jul 12, 2012 at 4:05 AM, Nigel LEACH > wrote: >> Thanks Ivan, but we have 3.0 and 3.1 gmond running under Cygwin (and using >> APR), the problem is with the 3.4 spin. >> >> -Original Message- >> From: lozgachev.i...@gmail.com [mailto:lozgachev.i...@gmail.com] >> Sent: 12 July 2012 11:54 >> To: Nigel LEACH >> Cc: peter.ph...@gmail.com; ganglia-general@lists.sourceforge.net >> Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin >> >> Hi all, >> >> Maybe it will be interesting. Some time ago I successfully compiled gmond >> 3.0.7 and 3.1.2 under Cygwin. If you need it I can upload somewhere gmond >> and 3rd party sources + compilation script. >> Also, I have gmetad 3.0.7 compiled for Windows. In additional, I developed >> (just for fun) my implementation of gmetad 3.1.2 using .NET and C#. >> >> P. S. I do not know whether it is possible to use these gmong versions to >> collect statistic from GPU. >> >> -- >> Best regards, >> Ivan. >> >> 2012/7/12 Nigel LEACH : >>> Thanks for the updates Peter and Bernard. >>> >>> I have been unable to g
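To make the "populate and serialize" step Peter describes above concrete, here is a minimal sketch of how such a structure might be XDR-encoded as an sFlow counter record. It is only an illustration: the gpu_util field, the helper names, and the buffer handling are assumptions rather than the Host sFlow agent's actual code, and the final field list would come out of the discussion in this thread.

/* Illustrative only: XDR-encode an "nvml_gpu_counters" record
 * (enterprise 5703, format 1) the way an sFlow agent would. */
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* htonl() */

struct nvml_gpu_counters {
    uint32_t device_count;
    uint32_t mem_total;    /* e.g. MB summed across all GPUs */
    uint32_t mem_util;     /* aggregate memory utilization, percent */
    uint32_t gpu_util;     /* aggregate SM utilization, percent (assumed field) */
};

/* sFlow/XDR encodes values as big-endian 32-bit words. */
static size_t pack_uint32(uint8_t *buf, uint32_t v) {
    uint32_t be = htonl(v);
    memcpy(buf, &be, 4);
    return 4;
}

/* Writes the record into buf and returns the number of bytes used. */
size_t serialize_nvml_gpu_counters(uint8_t *buf, const struct nvml_gpu_counters *c) {
    size_t off = 0;
    off += pack_uint32(buf + off, (5703u << 12) | 1u);  /* data_format: enterprise << 12 | format */
    off += pack_uint32(buf + off, 4 * 4);                /* opaque payload length in bytes */
    off += pack_uint32(buf + off, c->device_count);
    off += pack_uint32(buf + off, c->mem_total);
    off += pack_uint32(buf + off, c->mem_util);
    off += pack_uint32(buf + off, c->gpu_util);
    return off;
}

Following Peter's counters-versus-gauges point, fields like these would ideally be monotonic counters (busy ticks, bytes, error counts) rather than 1-second gauges, so the receiver can take deltas at whatever polling interval it likes and tolerate lost datagrams.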
Re: [Ganglia-general] Gmond Compilation on Cygwin
Hi Robert: When you said you tested the Python metric modules, did you just test the Python scripts under Windows or did you somehow got gmond compiled under Windows natively with Python support? Thanks, Bernard On Thursday, July 12, 2012, Robert Alexander wrote: > Hey, > > A meeting may be a good idea. My schedule is mostly open next week. When > are others free? I will brush up on sflow by then. > > NVML and the Python metric module are tested at NVIDIA on Windows and > Linux, but not within Cygwin. The process will be easier/faster on the > NVML side if we keep Cygwin out of the loop. > > -Robert > > -Original Message- > From: Bernard Li [mailto:bern...@vanhpc.org ] > Sent: Thursday, July 12, 2012 10:49 AM > To: Nigel LEACH > Cc: lozgachev.i...@gmail.com ; > ganglia-general@lists.sourceforge.net ; Peter Phaal; Robert > Alexander > Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin > > Hi Nigel: > > Technically you only need 3.1 gmond to have support for the Python metric > module. But I'm not sure whether we have ever tested this under Windows. > > Peter and Robert: How quickly can we get hsflowd to support GPU metrics > collection internally? Should we setup a meeting to discuss this? > > Thanks, > > Bernard > > On Thu, Jul 12, 2012 at 4:05 AM, Nigel LEACH < > nigel.le...@uk.bnpparibas.com > wrote: > > Thanks Ivan, but we have 3.0 and 3.1 gmond running under Cygwin (and > using APR), the problem is with the 3.4 spin. > > > > -Original Message- > > From: lozgachev.i...@gmail.com [mailto: > lozgachev.i...@gmail.com ] > > Sent: 12 July 2012 11:54 > > To: Nigel LEACH > > Cc: peter.ph...@gmail.com ; > ganglia-general@lists.sourceforge.net > > Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin > > > > Hi all, > > > > Maybe it will be interesting. Some time ago I successfully compiled > gmond 3.0.7 and 3.1.2 under Cygwin. If you need it I can upload somewhere > gmond and 3rd party sources + compilation script. > > Also, I have gmetad 3.0.7 compiled for Windows. In additional, I > developed (just for fun) my implementation of gmetad 3.1.2 using .NET and > C#. > > > > P. S. I do not know whether it is possible to use these gmong versions > to collect statistic from GPU. > > > > -- > > Best regards, > > Ivan. > > > > 2012/7/12 Nigel LEACH >: > >> Thanks for the updates Peter and Bernard. > >> > >> I have been unable to get gmond 3.4 working under Cygwin, my latest > errors are parsing gm_protocol_xdr.c. I don't know whether we should follow > this up, it would be nice to have a Windows gmond, but my only reason for > upgrading are the GPU metrics. > >> > >> I take you point about re-using the existing GPU module and gmetric, > unfortunately I don't have experience with Python. My plan is to write > something in C to export the nvml metrics, with various output options. We > will then decide whether to call this new code from existing gmond 3.1 via > gmetric, new (if we get it working) gmond 3.4, or one of our existing third > party tools - ITRS Geneous. 
> >> > >> As regards your list of metrics they are pretty definitive, but I > >> will probably also export > >> > >> *total ecc errors - nvmlDeviceGetTotalEccErrors) *individual ecc > >> errors - nvmlDeviceGetDetailedEccErrors *active compute processes - > >> nvmlDeviceGetComputeRunningProcesses > >> > >> Regards > >> Nigel > >> > >> -Original Message- > >> From: peter.ph...@gmail.com [mailto: > peter.ph...@gmail.com ] > >> Sent: 10 July 2012 20:06 > >> To: Nigel LEACH > >> Cc: bern...@vanhpc.org ; > ganglia-general@lists.sourceforge.net > >> Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin > >> > >> Nigel, > >> > >> A simple option would be to use Host sFlow agents to export the core > metrics from your Windows servers and use gmetric to send add the GPU > metrics. > >> > >> You could combine code from the python GPU module and gmetric > >> implementations to produce a self contained script for exporting GPU > >> metrics: > >> > >> https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidi > >> a https://github.com/ganglia/ganglia_contrib > >> > >> Longer term, it would make sense to extend Host sFlow to use the > C-based NVML API to extract and export metrics. This would be > straightforward - the Host sFlow agent uses native C APIs on the platforms > it su
Re: [Ganglia-general] Gmond Compilation on Cygwin
Hey, A meeting may be a good idea. My schedule is mostly open next week. When are others free? I will brush up on sflow by then. NVML and the Python metric module are tested at NVIDIA on Windows and Linux, but not within Cygwin. The process will be easier/faster on the NVML side if we keep Cygwin out of the loop. -Robert -Original Message- From: Bernard Li [mailto:bern...@vanhpc.org] Sent: Thursday, July 12, 2012 10:49 AM To: Nigel LEACH Cc: lozgachev.i...@gmail.com; ganglia-general@lists.sourceforge.net; Peter Phaal; Robert Alexander Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin Hi Nigel: Technically you only need 3.1 gmond to have support for the Python metric module. But I'm not sure whether we have ever tested this under Windows. Peter and Robert: How quickly can we get hsflowd to support GPU metrics collection internally? Should we setup a meeting to discuss this? Thanks, Bernard On Thu, Jul 12, 2012 at 4:05 AM, Nigel LEACH wrote: > Thanks Ivan, but we have 3.0 and 3.1 gmond running under Cygwin (and using > APR), the problem is with the 3.4 spin. > > -Original Message- > From: lozgachev.i...@gmail.com [mailto:lozgachev.i...@gmail.com] > Sent: 12 July 2012 11:54 > To: Nigel LEACH > Cc: peter.ph...@gmail.com; ganglia-general@lists.sourceforge.net > Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin > > Hi all, > > Maybe it will be interesting. Some time ago I successfully compiled gmond > 3.0.7 and 3.1.2 under Cygwin. If you need it I can upload somewhere gmond and > 3rd party sources + compilation script. > Also, I have gmetad 3.0.7 compiled for Windows. In additional, I developed > (just for fun) my implementation of gmetad 3.1.2 using .NET and C#. > > P. S. I do not know whether it is possible to use these gmong versions to > collect statistic from GPU. > > -- > Best regards, > Ivan. > > 2012/7/12 Nigel LEACH : >> Thanks for the updates Peter and Bernard. >> >> I have been unable to get gmond 3.4 working under Cygwin, my latest errors >> are parsing gm_protocol_xdr.c. I don't know whether we should follow this >> up, it would be nice to have a Windows gmond, but my only reason for >> upgrading are the GPU metrics. >> >> I take you point about re-using the existing GPU module and gmetric, >> unfortunately I don't have experience with Python. My plan is to write >> something in C to export the nvml metrics, with various output options. We >> will then decide whether to call this new code from existing gmond 3.1 via >> gmetric, new (if we get it working) gmond 3.4, or one of our existing third >> party tools - ITRS Geneous. >> >> As regards your list of metrics they are pretty definitive, but I >> will probably also export >> >> *total ecc errors - nvmlDeviceGetTotalEccErrors) *individual ecc >> errors - nvmlDeviceGetDetailedEccErrors *active compute processes - >> nvmlDeviceGetComputeRunningProcesses >> >> Regards >> Nigel >> >> -Original Message- >> From: peter.ph...@gmail.com [mailto:peter.ph...@gmail.com] >> Sent: 10 July 2012 20:06 >> To: Nigel LEACH >> Cc: bern...@vanhpc.org; ganglia-general@lists.sourceforge.net >> Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin >> >> Nigel, >> >> A simple option would be to use Host sFlow agents to export the core metrics >> from your Windows servers and use gmetric to send add the GPU metrics. 
>> >> You could combine code from the python GPU module and gmetric >> implementations to produce a self contained script for exporting GPU >> metrics: >> >> https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidi >> a https://github.com/ganglia/ganglia_contrib >> >> Longer term, it would make sense to extend Host sFlow to use the C-based >> NVML API to extract and export metrics. This would be straightforward - the >> Host sFlow agent uses native C APIs on the platforms it supports to extract >> metrics. >> >> What would take some thought is developing standard set of summary metrics >> to characterize GPU performance. Once the set of metrics is agreed on, then >> adding them to the sFlow agent is pretty trivial. >> >> Currently the Ganglia python module exports the following metrics - are they >> the right set? Anything missing? It would be great to get involvement from >> the broader Ganglia community to capture best practice from anyone running >> large GPU clusters, as well as getting input from NVIDIA about the key >> metrics. >> >> * gpu_num >> * gpu_driver >> * gpu_type >> * gpu_uuid >>
Re: [Ganglia-general] Gmond Compilation on Cygwin
Hi Nigel: Technically you only need 3.1 gmond to have support for the Python metric module. But I'm not sure whether we have ever tested this under Windows. Peter and Robert: How quickly can we get hsflowd to support GPU metrics collection internally? Should we setup a meeting to discuss this? Thanks, Bernard On Thu, Jul 12, 2012 at 4:05 AM, Nigel LEACH wrote: > Thanks Ivan, but we have 3.0 and 3.1 gmond running under Cygwin (and using > APR), the problem is with the 3.4 spin. > > -Original Message- > From: lozgachev.i...@gmail.com [mailto:lozgachev.i...@gmail.com] > Sent: 12 July 2012 11:54 > To: Nigel LEACH > Cc: peter.ph...@gmail.com; ganglia-general@lists.sourceforge.net > Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin > > Hi all, > > Maybe it will be interesting. Some time ago I successfully compiled gmond > 3.0.7 and 3.1.2 under Cygwin. If you need it I can upload somewhere gmond and > 3rd party sources + compilation script. > Also, I have gmetad 3.0.7 compiled for Windows. In additional, I developed > (just for fun) my implementation of gmetad 3.1.2 using .NET and C#. > > P. S. I do not know whether it is possible to use these gmong versions to > collect statistic from GPU. > > -- > Best regards, > Ivan. > > 2012/7/12 Nigel LEACH : >> Thanks for the updates Peter and Bernard. >> >> I have been unable to get gmond 3.4 working under Cygwin, my latest errors >> are parsing gm_protocol_xdr.c. I don't know whether we should follow this >> up, it would be nice to have a Windows gmond, but my only reason for >> upgrading are the GPU metrics. >> >> I take you point about re-using the existing GPU module and gmetric, >> unfortunately I don't have experience with Python. My plan is to write >> something in C to export the nvml metrics, with various output options. We >> will then decide whether to call this new code from existing gmond 3.1 via >> gmetric, new (if we get it working) gmond 3.4, or one of our existing third >> party tools - ITRS Geneous. >> >> As regards your list of metrics they are pretty definitive, but I will >> probably also export >> >> *total ecc errors - nvmlDeviceGetTotalEccErrors) *individual ecc >> errors - nvmlDeviceGetDetailedEccErrors *active compute processes - >> nvmlDeviceGetComputeRunningProcesses >> >> Regards >> Nigel >> >> -Original Message- >> From: peter.ph...@gmail.com [mailto:peter.ph...@gmail.com] >> Sent: 10 July 2012 20:06 >> To: Nigel LEACH >> Cc: bern...@vanhpc.org; ganglia-general@lists.sourceforge.net >> Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin >> >> Nigel, >> >> A simple option would be to use Host sFlow agents to export the core metrics >> from your Windows servers and use gmetric to send add the GPU metrics. >> >> You could combine code from the python GPU module and gmetric >> implementations to produce a self contained script for exporting GPU >> metrics: >> >> https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia >> https://github.com/ganglia/ganglia_contrib >> >> Longer term, it would make sense to extend Host sFlow to use the C-based >> NVML API to extract and export metrics. This would be straightforward - the >> Host sFlow agent uses native C APIs on the platforms it supports to extract >> metrics. >> >> What would take some thought is developing standard set of summary metrics >> to characterize GPU performance. Once the set of metrics is agreed on, then >> adding them to the sFlow agent is pretty trivial. 
>> >> Currently the Ganglia python module exports the following metrics - are they >> the right set? Anything missing? It would be great to get involvement from >> the broader Ganglia community to capture best practice from anyone running >> large GPU clusters, as well as getting input from NVIDIA about the key >> metrics. >> >> * gpu_num >> * gpu_driver >> * gpu_type >> * gpu_uuid >> * gpu_pci_id >> * gpu_mem_total >> * gpu_graphics_speed >> * gpu_sm_speed >> * gpu_mem_speed >> * gpu_max_graphics_speed >> * gpu_max_sm_speed >> * gpu_max_mem_speed >> * gpu_temp >> * gpu_util >> * gpu_mem_util >> * gpu_mem_used >> * gpu_fan >> * gpu_power_usage >> * gpu_perf_state >> * gpu_ecc_mode >> >> As far as scalability is concerned, you should find that moving to sFlow as >> the measurement transport reduces network traffic since all the metrics for >> a node are transported in a sin
Re: [Ganglia-general] Gmond Compilation on Cygwin
Thanks Ivan, but we have 3.0 and 3.1 gmond running under Cygwin (and using APR), the problem is with the 3.4 spin. -Original Message- From: lozgachev.i...@gmail.com [mailto:lozgachev.i...@gmail.com] Sent: 12 July 2012 11:54 To: Nigel LEACH Cc: peter.ph...@gmail.com; ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin Hi all, Maybe it will be interesting. Some time ago I successfully compiled gmond 3.0.7 and 3.1.2 under Cygwin. If you need it I can upload somewhere gmond and 3rd party sources + compilation script. Also, I have gmetad 3.0.7 compiled for Windows. In additional, I developed (just for fun) my implementation of gmetad 3.1.2 using .NET and C#. P. S. I do not know whether it is possible to use these gmong versions to collect statistic from GPU. -- Best regards, Ivan. 2012/7/12 Nigel LEACH : > Thanks for the updates Peter and Bernard. > > I have been unable to get gmond 3.4 working under Cygwin, my latest errors > are parsing gm_protocol_xdr.c. I don't know whether we should follow this up, > it would be nice to have a Windows gmond, but my only reason for upgrading > are the GPU metrics. > > I take you point about re-using the existing GPU module and gmetric, > unfortunately I don't have experience with Python. My plan is to write > something in C to export the nvml metrics, with various output options. We > will then decide whether to call this new code from existing gmond 3.1 via > gmetric, new (if we get it working) gmond 3.4, or one of our existing third > party tools - ITRS Geneous. > > As regards your list of metrics they are pretty definitive, but I will > probably also export > > *total ecc errors - nvmlDeviceGetTotalEccErrors) *individual ecc > errors - nvmlDeviceGetDetailedEccErrors *active compute processes - > nvmlDeviceGetComputeRunningProcesses > > Regards > Nigel > > -Original Message- > From: peter.ph...@gmail.com [mailto:peter.ph...@gmail.com] > Sent: 10 July 2012 20:06 > To: Nigel LEACH > Cc: bern...@vanhpc.org; ganglia-general@lists.sourceforge.net > Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin > > Nigel, > > A simple option would be to use Host sFlow agents to export the core metrics > from your Windows servers and use gmetric to send add the GPU metrics. > > You could combine code from the python GPU module and gmetric > implementations to produce a self contained script for exporting GPU > metrics: > > https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia > https://github.com/ganglia/ganglia_contrib > > Longer term, it would make sense to extend Host sFlow to use the C-based NVML > API to extract and export metrics. This would be straightforward - the Host > sFlow agent uses native C APIs on the platforms it supports to extract > metrics. > > What would take some thought is developing standard set of summary metrics to > characterize GPU performance. Once the set of metrics is agreed on, then > adding them to the sFlow agent is pretty trivial. > > Currently the Ganglia python module exports the following metrics - are they > the right set? Anything missing? It would be great to get involvement from > the broader Ganglia community to capture best practice from anyone running > large GPU clusters, as well as getting input from NVIDIA about the key > metrics. 
> > * gpu_num > * gpu_driver > * gpu_type > * gpu_uuid > * gpu_pci_id > * gpu_mem_total > * gpu_graphics_speed > * gpu_sm_speed > * gpu_mem_speed > * gpu_max_graphics_speed > * gpu_max_sm_speed > * gpu_max_mem_speed > * gpu_temp > * gpu_util > * gpu_mem_util > * gpu_mem_used > * gpu_fan > * gpu_power_usage > * gpu_perf_state > * gpu_ecc_mode > > As far as scalability is concerned, you should find that moving to sFlow as > the measurement transport reduces network traffic since all the metrics for a > node are transported in a single UDP datagram (rather than a datagram per > metric when using gmond as the agent). The other consideration is that sFlow > is unicast, so if you are using a multicast Ganglia setup then this involves > re-structuring your a configuration. > > You still need to have at least one gmond instance, but it acts as an sFlow > aggregator and is mute: > http://blog.sflow.com/2011/07/ganglia-32-released.html > > Peter > > On Tue, Jul 10, 2012 at 8:36 AM, Nigel LEACH > wrote: >> Hello Bernard, I was coming to that conclusion, I've been trying to >> compile on various combinations of Cygwin, Windows, Hardware this >> afternoon, but without success yet. I've still got a few more tests to do >> though. >> >> >> >> The GPU plugin
Re: [Ganglia-general] Gmond Compilation on Cygwin
Hi all, Maybe it will be interesting. Some time ago I successfully compiled gmond 3.0.7 and 3.1.2 under Cygwin. If you need it I can upload somewhere gmond and 3rd party sources + compilation script. Also, I have gmetad 3.0.7 compiled for Windows. In additional, I developed (just for fun) my implementation of gmetad 3.1.2 using .NET and C#. P. S. I do not know whether it is possible to use these gmong versions to collect statistic from GPU. -- Best regards, Ivan. 2012/7/12 Nigel LEACH : > Thanks for the updates Peter and Bernard. > > I have been unable to get gmond 3.4 working under Cygwin, my latest errors > are parsing gm_protocol_xdr.c. I don't know whether we should follow this up, > it would be nice to have a Windows gmond, but my only reason for upgrading > are the GPU metrics. > > I take you point about re-using the existing GPU module and gmetric, > unfortunately I don't have experience with Python. My plan is to write > something in C to export the nvml metrics, with various output options. We > will then decide whether to call this new code from existing gmond 3.1 via > gmetric, new (if we get it working) gmond 3.4, or one of our existing third > party tools - ITRS Geneous. > > As regards your list of metrics they are pretty definitive, but I will > probably also export > > *total ecc errors - nvmlDeviceGetTotalEccErrors) > *individual ecc errors - nvmlDeviceGetDetailedEccErrors > *active compute processes - nvmlDeviceGetComputeRunningProcesses > > Regards > Nigel > > -Original Message- > From: peter.ph...@gmail.com [mailto:peter.ph...@gmail.com] > Sent: 10 July 2012 20:06 > To: Nigel LEACH > Cc: bern...@vanhpc.org; ganglia-general@lists.sourceforge.net > Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin > > Nigel, > > A simple option would be to use Host sFlow agents to export the core metrics > from your Windows servers and use gmetric to send add the GPU metrics. > > You could combine code from the python GPU module and gmetric implementations > to produce a self contained script for exporting GPU > metrics: > > https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia > https://github.com/ganglia/ganglia_contrib > > Longer term, it would make sense to extend Host sFlow to use the C-based NVML > API to extract and export metrics. This would be straightforward - the Host > sFlow agent uses native C APIs on the platforms it supports to extract > metrics. > > What would take some thought is developing standard set of summary metrics to > characterize GPU performance. Once the set of metrics is agreed on, then > adding them to the sFlow agent is pretty trivial. > > Currently the Ganglia python module exports the following metrics - are they > the right set? Anything missing? It would be great to get involvement from > the broader Ganglia community to capture best practice from anyone running > large GPU clusters, as well as getting input from NVIDIA about the key > metrics. 
> > * gpu_num > * gpu_driver > * gpu_type > * gpu_uuid > * gpu_pci_id > * gpu_mem_total > * gpu_graphics_speed > * gpu_sm_speed > * gpu_mem_speed > * gpu_max_graphics_speed > * gpu_max_sm_speed > * gpu_max_mem_speed > * gpu_temp > * gpu_util > * gpu_mem_util > * gpu_mem_used > * gpu_fan > * gpu_power_usage > * gpu_perf_state > * gpu_ecc_mode > > As far as scalability is concerned, you should find that moving to sFlow as > the measurement transport reduces network traffic since all the metrics for a > node are transported in a single UDP datagram (rather than a datagram per > metric when using gmond as the agent). The other consideration is that sFlow > is unicast, so if you are using a multicast Ganglia setup then this involves > re-structuring your a configuration. > > You still need to have at least one gmond instance, but it acts as an sFlow > aggregator and is mute: > http://blog.sflow.com/2011/07/ganglia-32-released.html > > Peter > > On Tue, Jul 10, 2012 at 8:36 AM, Nigel LEACH > wrote: >> Hello Bernard, I was coming to that conclusion, I've been trying to >> compile on various combinations of Cygwin, Windows, Hardware this >> afternoon, but without success yet. I've still got a few more tests to do >> though. >> >> >> >> The GPU plugin is my only reason for upgrading from our current 3.1.7, >> and there is nothing else esoteric we use. We do have Linux Blades, >> but all of our Tesla's are hosted on Windows. The entire estate is >> quite large, so we would need to ensure sFlow scales, no reason to >> think it won't, but I have little experience with it.. >> >> >
Re: [Ganglia-general] Gmond Compilation on Cygwin
Thanks for the updates Peter and Bernard.

I have been unable to get gmond 3.4 working under Cygwin, my latest errors are parsing gm_protocol_xdr.c. I don't know whether we should follow this up, it would be nice to have a Windows gmond, but my only reason for upgrading is the GPU metrics.

I take your point about re-using the existing GPU module and gmetric, unfortunately I don't have experience with Python. My plan is to write something in C to export the nvml metrics, with various output options. We will then decide whether to call this new code from existing gmond 3.1 via gmetric, new (if we get it working) gmond 3.4, or one of our existing third party tools - ITRS Geneos.

As regards your list of metrics they are pretty definitive, but I will probably also export

* total ecc errors - nvmlDeviceGetTotalEccErrors
* individual ecc errors - nvmlDeviceGetDetailedEccErrors
* active compute processes - nvmlDeviceGetComputeRunningProcesses

Regards
Nigel

-Original Message-
From: peter.ph...@gmail.com [mailto:peter.ph...@gmail.com]
Sent: 10 July 2012 20:06
To: Nigel LEACH
Cc: bern...@vanhpc.org; ganglia-general@lists.sourceforge.net
Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin

Nigel,

A simple option would be to use Host sFlow agents to export the core metrics from your Windows servers and use gmetric to send the GPU metrics.

You could combine code from the python GPU module and gmetric implementations to produce a self contained script for exporting GPU metrics:

https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia
https://github.com/ganglia/ganglia_contrib

Longer term, it would make sense to extend Host sFlow to use the C-based NVML API to extract and export metrics. This would be straightforward - the Host sFlow agent uses native C APIs on the platforms it supports to extract metrics.

What would take some thought is developing a standard set of summary metrics to characterize GPU performance. Once the set of metrics is agreed on, then adding them to the sFlow agent is pretty trivial.

Currently the Ganglia python module exports the following metrics - are they the right set? Anything missing? It would be great to get involvement from the broader Ganglia community to capture best practice from anyone running large GPU clusters, as well as getting input from NVIDIA about the key metrics.

* gpu_num
* gpu_driver
* gpu_type
* gpu_uuid
* gpu_pci_id
* gpu_mem_total
* gpu_graphics_speed
* gpu_sm_speed
* gpu_mem_speed
* gpu_max_graphics_speed
* gpu_max_sm_speed
* gpu_max_mem_speed
* gpu_temp
* gpu_util
* gpu_mem_util
* gpu_mem_used
* gpu_fan
* gpu_power_usage
* gpu_perf_state
* gpu_ecc_mode

As far as scalability is concerned, you should find that moving to sFlow as the measurement transport reduces network traffic since all the metrics for a node are transported in a single UDP datagram (rather than a datagram per metric when using gmond as the agent). The other consideration is that sFlow is unicast, so if you are using a multicast Ganglia setup then this involves re-structuring your configuration.

You still need to have at least one gmond instance, but it acts as an sFlow aggregator and is mute:
http://blog.sflow.com/2011/07/ganglia-32-released.html

Peter

On Tue, Jul 10, 2012 at 8:36 AM, Nigel LEACH wrote:
> Hello Bernard, I was coming to that conclusion, I've been trying to compile on various combinations of Cygwin, Windows, Hardware this afternoon, but without success yet. I've still got a few more tests to do though.
> > > > The GPU plugin is my only reason for upgrading from our current 3.1.7, > and there is nothing else esoteric we use. We do have Linux Blades, > but all of our Tesla's are hosted on Windows. The entire estate is > quite large, so we would need to ensure sFlow scales, no reason to > think it won't, but I have little experience with it.. > > > > Regards > > Nigel > > > > From: bern...@vanhpc.org [mailto:bern...@vanhpc.org] > Sent: 10 July 2012 16:19 > To: Nigel LEACH > Cc: neil.mckee...@gmail.com; ganglia-general@lists.sourceforge.net > > > Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin > > > > Hi Nigel: > > > > Perhaps other developers could chime in but I'm not sure if the latest > version could be compiled under Windows, at least I was not aware of > any testing done. > > > > Going forward I would like to encourage users to use hsflowd under Windows. > I'm talking to the developers to see if we can add support for GPU > monitoring. Do you have any other requirements besides that? > > > > Thanks, > > > > Bernard > > On Tuesday, July 10, 2012, Nigel LEACH wrote: > > Hi Neil, Many thanks for the swift reply. > > > > I want to take a look at sFlow, but it isn'
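Since Nigel's plan is to write something in C against NVML with gmetric as one of the output options, a rough sketch of that shape is below. It assumes the NVML headers and library and the gmetric binary are available; the metric names, the MB conversion, and the ECC enum constants are illustrative, and the ECC enums in particular are spelled differently in newer NVML releases.

/* Sketch of a small C exporter: query NVML and hand values to an existing
 * gmond via the gmetric CLI.  Error handling is abbreviated. */
#include <stdio.h>
#include <stdlib.h>
#include <nvml.h>

static void gmetric_uint(const char *name, unsigned long long value, const char *units) {
    char cmd[256];
    snprintf(cmd, sizeof(cmd),
             "gmetric --name=%s --value=%llu --type=uint32 --units=%s",
             name, value, units);
    system(cmd);   /* one possible output option; others could write XDR or a file */
}

int main(void) {
    unsigned int i, count;
    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetCount(&count) != NVML_SUCCESS) { nvmlShutdown(); return 1; }

    for (i = 0; i < count; i++) {
        nvmlDevice_t dev;
        nvmlUtilization_t util;
        nvmlMemory_t mem;
        unsigned long long ecc = 0;
        char metric[64];

        if (nvmlDeviceGetHandleByIndex(i, &dev) != NVML_SUCCESS) continue;

        if (nvmlDeviceGetUtilizationRates(dev, &util) == NVML_SUCCESS) {
            snprintf(metric, sizeof(metric), "gpu%u_util", i);
            gmetric_uint(metric, util.gpu, "percent");
        }
        if (nvmlDeviceGetMemoryInfo(dev, &mem) == NVML_SUCCESS) {
            snprintf(metric, sizeof(metric), "gpu%u_mem_used", i);
            gmetric_uint(metric, mem.used / (1024 * 1024), "MB");
        }
        /* total ECC errors, as in Nigel's list; enum names depend on NVML version */
        if (nvmlDeviceGetTotalEccErrors(dev, NVML_SINGLE_BIT_ECC,
                                        NVML_VOLATILE_ECC, &ecc) == NVML_SUCCESS) {
            snprintf(metric, sizeof(metric), "gpu%u_ecc_sbe", i);
            gmetric_uint(metric, ecc, "errors");
        }
    }
    nvmlShutdown();
    return 0;
}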
Re: [Ganglia-general] Gmond Compilation on Cygwin
Hey Nigel,

I would be happy to help where I can. I think Peter's approach is a good start. We are updating the Ganglia plug-in with a few more metrics. My dev branch on github has some updates not yet in the trunk.

https://github.com/ralexander/gmond_python_modules/tree/master/gpu/nvidia

In terms of metrics, I can help explain what each means. I expect the usefulness of each to vary based on installation, so hopefully others can contribute their thoughts.

* gpu_num - Useful indirectly.
* gpu_driver - Useful when different machines may have different installed driver versions.
* gpu_type - Marketing name of the GPU.
* gpu_uuid - Globally unique immutable ID for the GPU chip. This is the NVIDIA preferred identifier when SW interfaces with a GPU. On a multi GPU board, each GPU has a unique UUID.
* gpu_pci_id - How the GPU appears on the PCI bus (bus ID).
+ gpu_serial - For Tesla GPUs there is a serial number printed on the board. Note that when there are multiple GPU chips on a single board, they share a common board serial number. When a human needs to grab a particular board, this number works well.
* gpu_mem_total
* gpu_mem_used
Useful for high level application profiling.
* gpu_graphics_speed
+ gpu_max_graphics_speed
* gpu_sm_speed
+ gpu_max_sm_speed
* gpu_mem_speed
+ gpu_max_mem_speed
These are various clock speeds. Faster clocks -> higher performance.
* gpu_perf_state
Similar to CPU pstates. P0 is the fastest performance state. When the pstate is higher than P0 (P1, P2, ...), clock speeds and PCIe bandwidth can be reduced.
* gpu_util
* gpu_mem_util
% of time when the GPU SM or GPU memory was busy over the last second. This is a very coarse grained way to monitor GPU usage, e.g. if only one SM is busy, but it is busy for the entire second, then gpu_util = 100.
* gpu_fan
* gpu_temp
Some GPUs support these. Useful to see how well the GPU is cooled.
* gpu_power_usage
+ gpu_power_man_mode
+ gpu_power_man_limit
GPU power draw. Some GPUs support configurable power limits via power management mode.
* gpu_ecc_mode
Useful to ensure all GPUs are configured the same. Describes if GPU memory error checking and correction is on or off.

If you are only concerned about coarse grained GPU performance, then GPU performance state, utilization and % memory used may work well.

Bernard, thanks for the heads up.

Hope that helps,

Robert Alexander
NVIDIA CUDA Tools Software Engineer

-Original Message-
From: Bernard Li [mailto:bern...@vanhpc.org]
Sent: Tuesday, July 10, 2012 12:32 PM
To: Peter Phaal
Cc: Nigel LEACH; ganglia-general@lists.sourceforge.net; Robert Alexander
Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin

Adding Robert Alexander to the list, since he and I worked together on the NVIDIA plug-in.

Thanks,

Bernard

On Tue, Jul 10, 2012 at 12:06 PM, Peter Phaal wrote:
> Nigel,
>
> A simple option would be to use Host sFlow agents to export the core metrics from your Windows servers and use gmetric to send the GPU metrics.
>
> You could combine code from the python GPU module and gmetric implementations to produce a self contained script for exporting GPU metrics:
>
> https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia
> https://github.com/ganglia/ganglia_contrib
>
> Longer term, it would make sense to extend Host sFlow to use the C-based NVML API to extract and export metrics. This would be straightforward - the Host sFlow agent uses native C APIs on the platforms it supports to extract metrics.
> > What would take some thought is developing standard set of summary > metrics to characterize GPU performance. Once the set of metrics is > agreed on, then adding them to the sFlow agent is pretty trivial. > > Currently the Ganglia python module exports the following metrics - > are they the right set? Anything missing? It would be great to get > involvement from the broader Ganglia community to capture best > practice from anyone running large GPU clusters, as well as getting > input from NVIDIA about the key metrics. > > * gpu_num > * gpu_driver > * gpu_type > * gpu_uuid > * gpu_pci_id > * gpu_mem_total > * gpu_graphics_speed > * gpu_sm_speed > * gpu_mem_speed > * gpu_max_graphics_speed > * gpu_max_sm_speed > * gpu_max_mem_speed > * gpu_temp > * gpu_util > * gpu_mem_util > * gpu_mem_used > * gpu_fan > * gpu_power_usage > * gpu_perf_state > * gpu_ecc_mode > > As far as scalability is concerned, you should find that moving to > sFlow as the measurement transport reduces network traffic since all > the metrics for a node are transported in a single UDP datagram > (rather than a datagram per metric when using gmond as the agent). The > other consideration is that sFlow is unicast, so if you are usi
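For reference, the per-device NVML C calls behind the metrics Robert describes (including the proposed gpu_serial and power management ones) look like the sketch below. The calls are from the public NVML API, but support varies by GPU, so each one can legitimately return NVML_ERROR_NOT_SUPPORTED; the printed metric names are only illustrative.

/* Per-device queries corresponding to Robert's descriptions above. */
#include <stdio.h>
#include <nvml.h>

void print_device_details(nvmlDevice_t dev, unsigned int index) {
    char serial[NVML_DEVICE_SERIAL_BUFFER_SIZE];
    nvmlPstates_t pstate;
    nvmlEnableState_t pm_mode;
    unsigned int clock_mhz, power_mw, limit_mw;

    /* gpu_serial: board-level serial shared by all chips on a multi-GPU board */
    if (nvmlDeviceGetSerial(dev, serial, sizeof(serial)) == NVML_SUCCESS)
        printf("gpu%u_serial %s\n", index, serial);

    /* gpu_perf_state: P0 is fastest; higher P-states mean reduced clocks */
    if (nvmlDeviceGetPerformanceState(dev, &pstate) == NVML_SUCCESS)
        printf("gpu%u_perf_state P%d\n", index, (int)pstate);

    /* gpu_sm_speed / gpu_max_sm_speed: current vs. maximum SM clock */
    if (nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &clock_mhz) == NVML_SUCCESS)
        printf("gpu%u_sm_speed %u MHz\n", index, clock_mhz);
    if (nvmlDeviceGetMaxClockInfo(dev, NVML_CLOCK_SM, &clock_mhz) == NVML_SUCCESS)
        printf("gpu%u_max_sm_speed %u MHz\n", index, clock_mhz);

    /* gpu_power_usage plus the proposed power management mode/limit metrics */
    if (nvmlDeviceGetPowerUsage(dev, &power_mw) == NVML_SUCCESS)
        printf("gpu%u_power_usage %u mW\n", index, power_mw);
    if (nvmlDeviceGetPowerManagementMode(dev, &pm_mode) == NVML_SUCCESS)
        printf("gpu%u_power_man_mode %s\n", index,
               pm_mode == NVML_FEATURE_ENABLED ? "enabled" : "disabled");
    if (nvmlDeviceGetPowerManagementLimit(dev, &limit_mw) == NVML_SUCCESS)
        printf("gpu%u_power_man_limit %u mW\n", index, limit_mw);
}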
Re: [Ganglia-general] Gmond Compilation on Cygwin
Adding Robert Alexander to the list, since he and I worked together on the NVIDIA plug-in. Thanks, Bernard On Tue, Jul 10, 2012 at 12:06 PM, Peter Phaal wrote: > Nigel, > > A simple option would be to use Host sFlow agents to export the core > metrics from your Windows servers and use gmetric to send add the GPU > metrics. > > You could combine code from the python GPU module and gmetric > implementations to produce a self contained script for exporting GPU > metrics: > > https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia > https://github.com/ganglia/ganglia_contrib > > Longer term, it would make sense to extend Host sFlow to use the > C-based NVML API to extract and export metrics. This would be > straightforward - the Host sFlow agent uses native C APIs on the > platforms it supports to extract metrics. > > What would take some thought is developing standard set of summary > metrics to characterize GPU performance. Once the set of metrics is > agreed on, then adding them to the sFlow agent is pretty trivial. > > Currently the Ganglia python module exports the following metrics - > are they the right set? Anything missing? It would be great to get > involvement from the broader Ganglia community to capture best > practice from anyone running large GPU clusters, as well as getting > input from NVIDIA about the key metrics. > > * gpu_num > * gpu_driver > * gpu_type > * gpu_uuid > * gpu_pci_id > * gpu_mem_total > * gpu_graphics_speed > * gpu_sm_speed > * gpu_mem_speed > * gpu_max_graphics_speed > * gpu_max_sm_speed > * gpu_max_mem_speed > * gpu_temp > * gpu_util > * gpu_mem_util > * gpu_mem_used > * gpu_fan > * gpu_power_usage > * gpu_perf_state > * gpu_ecc_mode > > As far as scalability is concerned, you should find that moving to > sFlow as the measurement transport reduces network traffic since all > the metrics for a node are transported in a single UDP datagram > (rather than a datagram per metric when using gmond as the agent). The > other consideration is that sFlow is unicast, so if you are using a > multicast Ganglia setup then this involves re-structuring your a > configuration. > > You still need to have at least one gmond instance, but it acts as an > sFlow aggregator and is mute: > http://blog.sflow.com/2011/07/ganglia-32-released.html > > Peter > > On Tue, Jul 10, 2012 at 8:36 AM, Nigel LEACH > wrote: >> Hello Bernard, I was coming to that conclusion, I’ve been trying to compile >> on various combinations of Cygwin, Windows, Hardware this afternoon, but >> without success yet. I’ve still got a few more tests to do though. >> >> >> >> The GPU plugin is my only reason for upgrading from our current 3.1.7, and >> there is nothing else esoteric we use. We do have Linux Blades, but all of >> our Tesla’s are hosted on Windows. The entire estate is quite large, so we >> would need to ensure sFlow scales, no reason to think it won’t, but I have >> little experience with it.. >> >> >> >> Regards >> >> Nigel >> >> >> >> From: bern...@vanhpc.org [mailto:bern...@vanhpc.org] >> Sent: 10 July 2012 16:19 >> To: Nigel LEACH >> Cc: neil.mckee...@gmail.com; ganglia-general@lists.sourceforge.net >> >> >> Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin >> >> >> >> Hi Nigel: >> >> >> >> Perhaps other developers could chime in but I'm not sure if the latest >> version could be compiled under Windows, at least I was not aware of any >> testing done. >> >> >> >> Going forward I would like to encourage users to use hsflowd under Windows. 
>> I'm talking to the developers to see if we can add support for GPU >> monitoring. Do you have any other requirements besides that? >> >> >> >> Thanks, >> >> >> >> Bernard >> >> On Tuesday, July 10, 2012, Nigel LEACH wrote: >> >> Hi Neil, Many thanks for the swift reply. >> >> >> >> I want to take a look at sFlow, but it isn’t a prerequisite. >> >> >> >> Anyway, I disabled sFlow, and (separately) included the patch you sent. Both >> fixes appeared successful. For now I am going with your patch, and sFlow >> enabled. >> >> >> >> I say “appeared successful”, as make was error free, and a gmond.exe was >> created. However, it doesn’t appear to work out of the box. I created a >> default gmond.conf >> >> >> >> ./gmond --default_config > /usr/local/etc/gmond.conf >> >> >> >
Re: [Ganglia-general] Gmond Compilation on Cygwin
Nigel,

A simple option would be to use Host sFlow agents to export the core metrics from your Windows servers and use gmetric to send the GPU metrics.

You could combine code from the python GPU module and gmetric implementations to produce a self contained script for exporting GPU metrics:

https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia
https://github.com/ganglia/ganglia_contrib

Longer term, it would make sense to extend Host sFlow to use the C-based NVML API to extract and export metrics. This would be straightforward - the Host sFlow agent uses native C APIs on the platforms it supports to extract metrics.

What would take some thought is developing a standard set of summary metrics to characterize GPU performance. Once the set of metrics is agreed on, then adding them to the sFlow agent is pretty trivial.

Currently the Ganglia python module exports the following metrics - are they the right set? Anything missing? It would be great to get involvement from the broader Ganglia community to capture best practice from anyone running large GPU clusters, as well as getting input from NVIDIA about the key metrics.

* gpu_num
* gpu_driver
* gpu_type
* gpu_uuid
* gpu_pci_id
* gpu_mem_total
* gpu_graphics_speed
* gpu_sm_speed
* gpu_mem_speed
* gpu_max_graphics_speed
* gpu_max_sm_speed
* gpu_max_mem_speed
* gpu_temp
* gpu_util
* gpu_mem_util
* gpu_mem_used
* gpu_fan
* gpu_power_usage
* gpu_perf_state
* gpu_ecc_mode

As far as scalability is concerned, you should find that moving to sFlow as the measurement transport reduces network traffic since all the metrics for a node are transported in a single UDP datagram (rather than a datagram per metric when using gmond as the agent). The other consideration is that sFlow is unicast, so if you are using a multicast Ganglia setup then this involves re-structuring your configuration.

You still need to have at least one gmond instance, but it acts as an sFlow aggregator and is mute:
http://blog.sflow.com/2011/07/ganglia-32-released.html

Peter

On Tue, Jul 10, 2012 at 8:36 AM, Nigel LEACH wrote:
> Hello Bernard, I was coming to that conclusion, I’ve been trying to compile on various combinations of Cygwin, Windows, Hardware this afternoon, but without success yet. I’ve still got a few more tests to do though.
>
> The GPU plugin is my only reason for upgrading from our current 3.1.7, and there is nothing else esoteric we use. We do have Linux Blades, but all of our Tesla’s are hosted on Windows. The entire estate is quite large, so we would need to ensure sFlow scales, no reason to think it won’t, but I have little experience with it.
>
> Regards
>
> Nigel
>
> From: bern...@vanhpc.org [mailto:bern...@vanhpc.org]
> Sent: 10 July 2012 16:19
> To: Nigel LEACH
> Cc: neil.mckee...@gmail.com; ganglia-general@lists.sourceforge.net
> Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin
>
> Hi Nigel:
>
> Perhaps other developers could chime in but I'm not sure if the latest version could be compiled under Windows, at least I was not aware of any testing done.
>
> Going forward I would like to encourage users to use hsflowd under Windows. I'm talking to the developers to see if we can add support for GPU monitoring. Do you have any other requirements besides that?
>
> Thanks,
>
> Bernard
>
> On Tuesday, July 10, 2012, Nigel LEACH wrote:
>
> Hi Neil, Many thanks for the swift reply.
>
> I want to take a look at sFlow, but it isn’t a prerequisite.
> > > > Anyway, I disabled sFlow, and (separately) included the patch you sent. Both > fixes appeared successful. For now I am going with your patch, and sFlow > enabled. > > > > I say “appeared successful”, as make was error free, and a gmond.exe was > created. However, it doesn’t appear to work out of the box. I created a > default gmond.conf > > > > ./gmond --default_config > /usr/local/etc/gmond.conf > > > > and then simply ran gmond. It started a process, but no port (8649) was > created. Running in debug mode I get this > > > > $ ./gmond -d 10 > > loaded module: core_metrics > > loaded module: cpu_module > > loaded module: disk_module > > loaded module: load_module > > loaded module: mem_module > > loaded module: net_module > > loaded module: proc_module > > loaded module: sys_module > > > > > > and nothing further. > > > > I have done little investigation yet, so unless there is anything obvious I > am missing, I’ll continue to troubleshoot. > > > > Regards > > Nigel > > > > > > From: neil.mckee...@gmail.com [mailto:neil.mckee...@gmail.com] > Sent: 09 July
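For the aggregator arrangement Peter mentions (a single mute gmond collecting sFlow from hsflowd agents and serving gmetad), the relevant gmond.conf settings look roughly like this. This is only a sketch based on the gmond 3.2+ sFlow support described at the blog link above; the sflow section and port directive in particular are assumptions and should be checked against the documentation for the version actually deployed.

globals {
  deaf = no
  mute = yes            # aggregate only; do not report this host's own metrics
}
udp_recv_channel {
  port = 8649           # unicast from any remaining gmond/gmetric senders
}
tcp_accept_channel {
  port = 8649           # gmetad polls here
}
sflow {
  udp_port = 6343       # default port hsflowd sends to (assumed directive)
}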
Re: [Ganglia-general] Gmond Compilation on Cygwin
Hello Bernard, I was coming to that conclusion, I've been trying to compile on various combinations of Cygwin, Windows, Hardware this afternoon, but without success yet. I've still got a few more tests to do though. The GPU plugin is my only reason for upgrading from our current 3.1.7, and there is nothing else esoteric we use. We do have Linux Blades, but all of our Tesla's are hosted on Windows. The entire estate is quite large, so we would need to ensure sFlow scales, no reason to think it won't, but I have little experience with it.. Regards Nigel From: bern...@vanhpc.org [mailto:bern...@vanhpc.org] Sent: 10 July 2012 16:19 To: Nigel LEACH Cc: neil.mckee...@gmail.com; ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin Hi Nigel: Perhaps other developers could chime in but I'm not sure if the latest version could be compiled under Windows, at least I was not aware of any testing done. Going forward I would like to encourage users to use hsflowd under Windows. I'm talking to the developers to see if we can add support for GPU monitoring. Do you have any other requirements besides that? Thanks, Bernard On Tuesday, July 10, 2012, Nigel LEACH wrote: Hi Neil, Many thanks for the swift reply. I want to take a look at sFlow, but it isn't a prerequisite. Anyway, I disabled sFlow, and (separately) included the patch you sent. Both fixes appeared successful. For now I am going with your patch, and sFlow enabled. I say "appeared successful", as make was error free, and a gmond.exe was created. However, it doesn't appear to work out of the box. I created a default gmond.conf ./gmond --default_config > /usr/local/etc/gmond.conf and then simply ran gmond. It started a process, but no port (8649) was created. Running in debug mode I get this $ ./gmond -d 10 loaded module: core_metrics loaded module: cpu_module loaded module: disk_module loaded module: load_module loaded module: mem_module loaded module: net_module loaded module: proc_module loaded module: sys_module and nothing further. I have done little investigation yet, so unless there is anything obvious I am missing, I'll continue to troubleshoot. Regards Nigel From: neil.mckee...@gmail.com [mailto:neil.mckee...@gmail.com] Sent: 09 July 2012 18:15 To: Nigel LEACH Cc: ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin You could try adding "--disable-sflow" as another configure option. (Or were you planning to use sFlow agents such as hsflowd?). Neil On Jul 9, 2012, at 3:50 AM, Nigel LEACH wrote: Ganglia 3.4.0 Windows 2008 R2 Enterprise Cygwin 1.5.25 IBM iDataPlex dx360 with Tesla M2070 Confuse 2.7 I'm trying to use the Ganglia Python modules to monitor a Windows based GPU cluster, but having problems getting gmond to compile. This 'configure' completes successfully ./configure --with-libconfuse=/usr/local --without-libpcre --enable-static-build but 'make' fails, this is the tail of standard output mv -f .deps/g25_config.Tpo .deps/g25_config.Po gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -DCYGWIN -I/usr/include/apr-1 -I/usr/include/ap r-1-I../lib -I../include/ -I../libmetrics -D_LARGEFILE64_SOURCE -DSFLOW -g -O2 -I/usr/ local/include -fno-strict-aliasing -Wall -MT core_metrics.o -MD -MP -MF .deps/core_metrics .Tpo -c -o core_metrics.o core_metrics.c mv -f .deps/core_metrics.Tpo .deps/core_metrics.Po gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. 
-DCYGWIN -I/usr/include/apr-1 -I/usr/include/ap r-1-I../lib -I../include/ -I../libmetrics -D_LARGEFILE64_SOURCE -DSFLOW -g -O2 -I/usr/ local/include -fno-strict-aliasing -Wall -MT sflow.o -MD -MP -MF .deps/sflow.Tpo -c -o sfl ow.o sflow.c sflow.c: In function `process_struct_JVM': sflow.c:1033: warning: comparison is always true due to limited range of data type
Re: [Ganglia-general] Gmond Compilation on Cygwin
Hi Nigel: Perhaps other developers could chime in, but I'm not sure whether the latest version can be compiled under Windows; at least I'm not aware of any testing having been done. Going forward I would like to encourage users to use hsflowd under Windows. I'm talking to the developers to see if we can add support for GPU monitoring. Do you have any other requirements besides that? Thanks, Bernard On Tuesday, July 10, 2012, Nigel LEACH wrote: > Hi Neil, Many thanks for the swift reply. > I want to take a look at sFlow, but it isn't a prerequisite. > Anyway, I disabled sFlow, and (separately) included the patch you sent. Both fixes appeared successful. For now I am going with your patch, and sFlow enabled. > I say "appeared successful", as make was error free, and a gmond.exe was created. However, it doesn't appear to work out of the box. I created a default gmond.conf > ./gmond --default_config > /usr/local/etc/gmond.conf > and then simply ran gmond. It started a process, but no port (8649) was created. Running in debug mode I get this > $ ./gmond -d 10 > loaded module: core_metrics > loaded module: cpu_module > loaded module: disk_module > loaded module: load_module > loaded module: mem_module > loaded module: net_module > loaded module: proc_module > loaded module: sys_module > and nothing further. > I have done little investigation yet, so unless there is anything obvious I am missing, I'll continue to troubleshoot. > Regards > Nigel > From: neil.mckee...@gmail.com [mailto:neil.mckee...@gmail.com] > Sent: 09 July 2012 18:15 > To: Nigel LEACH > Cc: ganglia-general@lists.sourceforge.net > Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin > You could try adding "--disable-sflow" as another configure option. (Or were you planning to use sFlow agents such as hsflowd?). > Neil > On Jul 9, 2012, at 3:50 AM, Nigel LEACH wrote: > Ganglia 3.4.0 > Windows 2008 R2 Enterprise > Cygwin 1.5.25 > IBM iDataPlex dx360 with Tesla M2070 > Confuse 2.7 > I'm trying to use the Ganglia Python modules to monitor a Windows based GPU cluster, but having problems getting gmond to compile. This 'configure' completes successfully > ./configure --with-libconfuse=/usr/local --without-libpcre --enable-static-build > but 'make' fails, this is the tail of standard output > mv -f .deps/g25_config.Tpo .deps/g25_config.Po > gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -DCYGWIN -I/usr/include/apr-1 -I/usr/include/apr-1 -I../lib -I../include/ -I../libmetrics -D_LARGEFILE64_SOURCE -DSFLOW -g -O2 -I/usr/local/include -fno-strict-aliasing -Wall -MT core_metrics.o -MD -MP -MF .deps/core_metrics.Tpo -c -o core_metrics.o core_metrics.c > mv -f .deps/core_metrics.Tpo .deps/core_metrics.Po > gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -DCYGWIN -I/usr/include/apr-1 -I/usr/include/apr-1 -I../lib -I../include/ -I../libmetrics -D_LARGEFILE64_SOURCE -DSFLOW -g -O2 -I/usr/local/include -fno-strict-aliasing -Wall -MT sflow.o -MD -MP -MF .deps/sflow.Tpo -c -o sflow.o sflow.c > sflow.c: In function `process_struct_JVM': > sflow.c:1033: warning: comparison is always true due to limited range of data type
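For anyone chasing the same symptom (gmond starts, the modules load, but nothing ever listens on 8649), one quick sanity check is a plain TCP connect test against the default tcp_accept_channel. The small C sketch below is only a diagnostic illustration under that assumption; the file name is hypothetical, it is not part of Ganglia, and it assumes the default port 8649 on the local host.

/*
 * check_8649.c -- hypothetical diagnostic sketch (not part of Ganglia):
 * try to connect to gmond's default tcp_accept_channel on localhost:8649
 * and report whether anything is listening.  Assumes IPv4 loopback.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8649);                 /* gmond default port */
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    if (fd >= 0 && connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
        printf("something is listening on 8649\n");
        close(fd);
        return 0;
    }
    printf("nothing listening on 8649 -- gmond is not serving XML\n");
    if (fd >= 0)
        close(fd);
    return 1;
}

If nothing is listening, the next place to look is probably the tcp_accept_channel and udp_recv_channel sections of the generated gmond.conf, and the tail of gmond's debug output for a module or channel failure.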
Re: [Ganglia-general] Gmond Compilation on Cygwin
Hi Neil, Many thanks for the swift reply. I want to take a look at sFlow, but it isn't a prerequisite. Anyway, I disabled sFlow, and (separately) included the patch you sent. Both fixes appeared successful. For now I am going with your patch, and sFlow enabled. I say "appeared successful", as make was error free, and a gmond.exe was created. However, it doesn't appear to work out of the box. I created a default gmond.conf ./gmond --default_config > /usr/local/etc/gmond.conf and then simply ran gmond. It started a process, but no port (8649) was created. Running in debug mode I get this $ ./gmond -d 10 loaded module: core_metrics loaded module: cpu_module loaded module: disk_module loaded module: load_module loaded module: mem_module loaded module: net_module loaded module: proc_module loaded module: sys_module and nothing further. I have done little investigation yet, so unless there is anything obvious I am missing, I'll continue to troubleshoot. Regards Nigel From: neil.mckee...@gmail.com [mailto:neil.mckee...@gmail.com] Sent: 09 July 2012 18:15 To: Nigel LEACH Cc: ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general] Gmond Compilation on Cygwin You could try adding "--disable-sflow" as another configure option. (Or were you planning to use sFlow agents such as hsflowd?). Neil On Jul 9, 2012, at 3:50 AM, Nigel LEACH wrote: Ganglia 3.4.0 Windows 2008 R2 Enterprise Cygwin 1.5.25 IBM iDataPlex dx360 with Tesla M2070 Confuse 2.7 I'm trying to use the Ganglia Python modules to monitor a Windows based GPU cluster, but having problems getting gmond to compile. This 'configure' completes successfully ./configure --with-libconfuse=/usr/local --without-libpcre --enable-static-build but 'make' fails, this is the tail of standard output mv -f .deps/g25_config.Tpo .deps/g25_config.Po gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -DCYGWIN -I/usr/include/apr-1 -I/usr/include/apr-1 -I../lib -I../include/ -I../libmetrics -D_LARGEFILE64_SOURCE -DSFLOW -g -O2 -I/usr/local/include -fno-strict-aliasing -Wall -MT core_metrics.o -MD -MP -MF .deps/core_metrics.Tpo -c -o core_metrics.o core_metrics.c mv -f .deps/core_metrics.Tpo .deps/core_metrics.Po gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -DCYGWIN -I/usr/include/apr-1 -I/usr/include/apr-1 -I../lib -I../include/ -I../libmetrics -D_LARGEFILE64_SOURCE -DSFLOW -g -O2 -I/usr/local/include -fno-strict-aliasing -Wall -MT sflow.o -MD -MP -MF .deps/sflow.Tpo -c -o sflow.o sflow.c sflow.c: In function `process_struct_JVM': sflow.c:1033: warning: comparison is always true due to limited range of data type sflow.c:1034: warning: comparison is always true due to limited range of data type sflow.c:1035: warning: comparison is always true due to limited range of data type sflow.c:1036: warning: comparison is always true due to limited range of data type sflow.c:1037: warning: comparison is always true due to limited range of data type sflow.c:1038: warning: comparison is always true due to limited range of data type sflow.c:1039: warning: comparison is always true due to limited range of data type sflow.c: In function `processCounterSample': sflow.c:1169: warning: unsigned int format, uint32_t arg (arg 4) sflow.c:1169: warning: unsigned int format, uint32_t arg (arg 4) sflow.c: In function `process_sflow_datagram': sflow.c:1348: error: `AF_INET6' undeclared (first use in this function) sflow.c:1348: error: (Each undeclared identifier is reported only once sflow.c:1348: error: for each function it appears in.)
make[3]: *** [sflow.o] Error 1 make[3]: Leaving directory `/var/tmp/ganglia-3.4.0/gmond' make[2]: *** [all-recursive] Error 1 make[2]: Leaving directory `/var/tmp/ganglia-3.4.0/gmond' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory `/var/tmp/ganglia-3.4.0' make: *** [all] Error 2 Has anyone come across this before ? Many Thanks Nigel
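On the AF_INET6 failure itself: the error says the identifier is undeclared, which suggests the Cygwin 1.5 headers in that environment simply do not define AF_INET6 at all. Below is a minimal C sketch of the kind of compile-time guard that can keep IPv6-specific code out of the build on such systems; it only illustrates the technique and is not the actual patch discussed in this thread.

/*
 * Sketch only: guard IPv6-specific handling so a file still compiles on
 * systems whose headers do not define AF_INET6 (e.g. Cygwin 1.5).  This
 * is not the Ganglia patch referred to above; it just shows the idea.
 */
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
#ifdef AF_INET6
    printf("AF_INET6 = %d; IPv6 sFlow datagrams could be handled\n", AF_INET6);
#else
    printf("AF_INET6 not defined; build would have to fall back to IPv4-only handling\n");
#endif
    return 0;
}

The simpler route, as already suggested elsewhere in the thread, is to configure with --disable-sflow so that sflow.c is never compiled at all.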
Re: [Ganglia-general] Gmond Compilation on Cygwin
You could try adding "--disable-sflow" as another configure option. (Or were you planning to use sFlow agents such as hsflowd?). Neil On Jul 9, 2012, at 3:50 AM, Nigel LEACH wrote: > Ganglia 3.4.0 > Windows 2008 R2 Enterprise > Cygwin 1.5.25 > IBM iDataPlex dx360 with Tesla M2070 > Confuse 2.7 > I'm trying to use the Ganglia Python modules to monitor a Windows based GPU cluster, but having problems getting gmond to compile. This 'configure' completes successfully > ./configure --with-libconfuse=/usr/local --without-libpcre --enable-static-build > but 'make' fails, this is the tail of standard output > mv -f .deps/g25_config.Tpo .deps/g25_config.Po > gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -DCYGWIN -I/usr/include/apr-1 -I/usr/include/apr-1 -I../lib -I../include/ -I../libmetrics -D_LARGEFILE64_SOURCE -DSFLOW -g -O2 -I/usr/local/include -fno-strict-aliasing -Wall -MT core_metrics.o -MD -MP -MF .deps/core_metrics.Tpo -c -o core_metrics.o core_metrics.c > mv -f .deps/core_metrics.Tpo .deps/core_metrics.Po > gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -DCYGWIN -I/usr/include/apr-1 -I/usr/include/apr-1 -I../lib -I../include/ -I../libmetrics -D_LARGEFILE64_SOURCE -DSFLOW -g -O2 -I/usr/local/include -fno-strict-aliasing -Wall -MT sflow.o -MD -MP -MF .deps/sflow.Tpo -c -o sflow.o sflow.c > sflow.c: In function `process_struct_JVM': > sflow.c:1033: warning: comparison is always true due to limited range of data type > sflow.c:1034: warning: comparison is always true due to limited range of data type > sflow.c:1035: warning: comparison is always true due to limited range of data type > sflow.c:1036: warning: comparison is always true due to limited range of data type > sflow.c:1037: warning: comparison is always true due to limited range of data type > sflow.c:1038: warning: comparison is always true due to limited range of data type > sflow.c:1039: warning: comparison is always true due to limited range of data type > sflow.c: In function `processCounterSample': > sflow.c:1169: warning: unsigned int format, uint32_t arg (arg 4) > sflow.c:1169: warning: unsigned int format, uint32_t arg (arg 4) > sflow.c: In function `process_sflow_datagram': > sflow.c:1348: error: `AF_INET6' undeclared (first use in this function) > sflow.c:1348: error: (Each undeclared identifier is reported only once > sflow.c:1348: error: for each function it appears in.) > make[3]: *** [sflow.o] Error 1 > make[3]: Leaving directory `/var/tmp/ganglia-3.4.0/gmond' > make[2]: *** [all-recursive] Error 1 > make[2]: Leaving directory `/var/tmp/ganglia-3.4.0/gmond' > make[1]: *** [all-recursive] Error 1 > make[1]: Leaving directory `/var/tmp/ganglia-3.4.0' > make: *** [all] Error 2 > Has anyone come across this before ? > Many Thanks > Nigel