[Ganglia-developers] hsflowd for Windows + Ganglia webfrontend
Hi all:

I just tested hsflowd 1.11 on Windows with the latest code in trunk and ran into some issues with the frontend: certain graphs, such as the load_, cpu_ and mem_ reports, are not showing up. This is probably due to a recent change in the sFlow integration code, under which unsupported metrics are no longer submitted. For instance, the load_* metrics are expensive to calculate on Windows and are therefore not supported. Since these are core metrics used in the frontend reports, would it be better to report them with a value of 0 rather than dropping them? Thoughts?

Also, I think Nick was suggesting that we could try using the Processor Queue Length on Windows for the load metric.

Thanks,
Bernard

-- 
The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: Pinpoint memory and threading errors before they happen. Find and fix more than 250 security defects in the development cycle. Locate bottlenecks in serial and parallel code that limit performance. http://p.sf.net/sfu/intel-dev2devfeb
___
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers
Re: [Ganglia-developers] dmax for python and C gmond modules
Hi Brad:

On Tue, Feb 1, 2011 at 4:51 PM, Brad Nicholes bnicho...@novell.com wrote:
I would probably have to go back and figure out what I was thinking at the time, but I vaguely recall that dmax was hardcoded for all of the standard metrics in the 3.0 version of gmond. So at the time I was probably thinking that exposing dmax just wasn't necessary. That was probably a wrong assumption. Thinking back now, it would have made sense for dmax to be hardcoded in 3.0 because there was no way to add or remove metrics in that version. But in the 3.1 version, where you can, dmax shouldn't have been hardcoded and should have been exposed.

With gmond 3.0, you should be able to add metrics using gmetric. Anyway, I've just filed a feature request to keep better track of this issue: http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=297

Thanks,
Bernard
Re: [Ganglia-developers] hsflowd for Windows + Ganglia webfrontend
That's odd, the CPU and MEM charts are working OK for me. It's just the load-avg that is missing (which causes the host to be marked down).

For troubleshooting, if you intercept the sFlow feed with sflowtool (http://www.inmon.com/technology/sflowTools.php), you can see what the numbers look like. The graphical freeware tool sFlowTrend (http://www.inmon.com/products/sFlowTrend.php) is an option too.

How frequently would we need to poll the Processor Queue Length to get a reasonable load-average estimate? If we could get away with only doing it every few seconds, then it might be worthwhile (perhaps a random delay would be appropriate), but I suspect you might have to poll much faster than that to get something useful?

Neil

On Feb 10, 2011, at 2:50 PM, Bernard Li wrote:
[...]
Re: [Ganglia-developers] hsflowd for Windows + Ganglia webfrontend
Hi Neil:

On Thu, Feb 10, 2011 at 5:05 PM, Neil McKee neil.mc...@inmon.com wrote:
That's odd, the CPU and MEM charts are working OK for me. It's just the load-avg that is missing (which causes the host to be marked down).

You'll likely need to nuke your old RRD files to see the error. For Windows, mem_shared, mem_buffers and cpu_nice are missing, and they are needed by the respective mem_ and cpu_ reports. I guess we could fix the frontend code to include them only if they exist. Do these metrics exist in Windows-land?

How frequently would we need to poll the Processor Queue Length to get a reasonable load-average estimate? If we could get away with only doing it every few seconds, then it might be worthwhile (perhaps a random delay would be appropriate), but I suspect you might have to poll much faster than that to get something useful?

I will defer to Nick for answering that question :-)

Cheers,
Bernard
Re: [Ganglia-developers] hsflowd for Windows + Ganglia webfrontend
P.S. It looks like GMOND_STARTED=0 for the Windows host. Should I file a bug?

Thanks,
Bernard

On Thu, Feb 10, 2011 at 5:21 PM, Bernard Li bern...@vanhpc.org wrote:
[...]
Re: [Ganglia-developers] hsflowd for Windows + Ganglia webfrontend
OK, I cleared out everything under /var/lib/ganglia/rrds/. Now I see the problem.

In the short term, I guess gmond/sflow.c could just submit zeros for the missing metrics instead of leaving them out altogether? In the long term, however, it would seem more elegant to fix this where the RRD is constructed. After all, for most of these metrics, posting a value of 0 is just plain wrong. When the value is really "dunno", "not applicable" or "undefined", leaving it out seems more correct.

Neil

On Feb 10, 2011, at 5:21 PM, Bernard Li wrote:
[...]
Re: [Ganglia-developers] hsflowd for Windows + Ganglia webfrontend
I didn't realize that gmond_started was meant to go with every heartbeat message. Below is a patch.

[root@ganglia gmond]# svn diff sflow.c
Index: sflow.c
===================================================================
--- sflow.c (revision 2471)
+++ sflow.c (working copy)
@@ -398,7 +398,7 @@
 #define SFLOW_OK_COUNTER64(field) (field != (uint64_t)-1)
 
 /* always send a heartbeat */
-process_sflow_uint32(hostdata, SFLOW_M_heartbeat, 0);
+process_sflow_uint32(hostdata, SFLOW_M_heartbeat, (apr_time_as_msec(now) - x.uptime_mS) / 1000);
 
 if(offset_HID) {
 /* sumbit the system fields that we already extracted above */

Neil

On Feb 10, 2011, at 5:38 PM, Bernard Li wrote:
[...]