The 21 MB/sec is impressive. Of course, the byte bandwidth is uninteresting; how many I/Os per second are you executing?
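For a rough sense of scale (and assuming something on the order of 30 metrics per host, which is a guess on my part): 6,000 hosts x 30 metrics every 10 seconds is roughly 18,000 RRD updates per second, each of them an open/read/write/close cycle, which is a far heavier load on the storage than the 21 MB/sec of raw bytes suggests.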

We would also like the I/O and database subsystem to be capable of recording information about disk drives, and it would be nice to monitor each network interface and each disk drive individually rather than only aggregate information across all networks and disk drives. The I/O and database subsystem of Ganglia needs to be greatly revised, not only to keep up with current demands but also to allow the software to grow and be enhanced.

A small part of the current problem stems from Ganglia's cache coherence implementation. To make sure that the node in the cluster from which gmetad reads stats holds a reasonably current copy of all metrics for the cluster, the gmond processes throughout the cluster have to publish data frequently.

It seems to me that instead of having gmond push data frequently to a central location and having gmetad pull from that central location, gmonds should push deltas of their state all the way out to gmetad, with a mechanism for gmetad to resynchronize state with a gmond when a delta has been lost in transit.
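
A rough sketch of what I mean (these structures and names do not exist in gmond/gmetad today; a per-sender sequence number is just one way to detect a lost delta):

/* Hypothetical delta message with loss detection; not existing Ganglia code. */
#include <stdint.h>

#define MAX_NAME 64
#define MAX_VAL  64

struct metric_delta {
    uint64_t seq;               /* per-sender sequence number       */
    char host[MAX_NAME];        /* node the metric belongs to       */
    char name[MAX_NAME];        /* metric name, e.g. "cpu_user"     */
    char value[MAX_VAL];        /* new value as a string            */
};

/* Receiver side (gmetad): apply a delta, or ask the node for a full
 * resync if a sequence number was skipped, i.e. a delta was lost
 * in transit. */
int apply_delta(uint64_t *expected_seq, const struct metric_delta *d,
                void (*request_full_state)(const char *host))
{
    if (d->seq != *expected_seq) {
        request_full_state(d->host);   /* fall back to a full pull */
        *expected_seq = d->seq + 1;
        return -1;
    }
    *expected_seq = d->seq + 1;
    /* ... update the in-memory metric table here ... */
    return 0;
}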

The second part of the problem, as Richard suggests, is compressing the data stream from a node to the database. With the cache coherence problem fixed, we don't need to periodically transmit unchanging data. The existing mechanisms to only transmit data when it changes significantly can be used more heavily and enhanced.

The third part of the problem is adjusting the database subsystem. To provide the current reliability guarantees, we can stream deltas to disk using large sequential I/Os, while maintaining current state in memory that is periodically flushed to disk. This would still leave Richard writing 21 MB/sec or more to disk, but the number of I/O operations would drop dramatically.
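
Concretely, a minimal sketch of the idea (assuming an append-only delta log plus a periodic checkpoint of the in-memory state; the file names and the checkpoint interval are made up):

#include <stdio.h>
#include <time.h>

static FILE  *delta_log;          /* opened elsewhere in append mode */
static time_t last_checkpoint;
#define CHECKPOINT_INTERVAL 300   /* seconds; an arbitrary choice    */

/* One small record per change; the OS coalesces these appends into
 * large sequential writes, so the I/O operation count stays low. */
void log_delta(const char *host, const char *metric, const char *val)
{
    fprintf(delta_log, "%ld %s %s %s\n",
            (long) time(NULL), host, metric, val);
}

/* Periodically write the current in-memory state to disk so the delta
 * log can be truncated and recovery stays bounded. */
void maybe_checkpoint(void (*dump_state)(FILE *))
{
    time_t now = time(NULL);
    FILE *ckpt;

    if (now - last_checkpoint < CHECKPOINT_INTERVAL)
        return;
    fflush(delta_log);                      /* make the log durable first */
    ckpt = fopen("state.checkpoint.tmp", "w");
    if (ckpt == NULL)
        return;
    dump_state(ckpt);                       /* caller dumps its own state */
    fclose(ckpt);
    rename("state.checkpoint.tmp", "state.checkpoint");
    last_checkpoint = now;
}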

We can also attack the problem by looking at the reasons why Richard wants to sample every 10 seconds or why I want to sample data every second. Low frequency sampling is useful for understanding average loads and utilization and for verifying that machines and hardware are basically up and running. High frequency sampling is more useful for diagnosing system problems and for having automated software detect and respond to problems quickly.

It may be possible to separate out functionality that requires high frequency sampling. Consider, for example, a node performing high frequency sampling and transmitting two distinct data streams to gmetad. One data stream contains low frequency signals that describe average behavior of a resource or metric. Another data stream contains infrequent bursts of data when a resource or metric has a particularly interesting set of values (e.g. >= 99% utilization). The aggregate amount of data coming out of a node would then be low while still providing high resolution where it matters.
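
A rough sketch of the per-metric logic on the node (the one-minute averaging window and the 99% threshold are placeholders, and send_avg/send_burst stand in for whatever transport is used):

#include <time.h>

#define AVG_PERIOD   60      /* one averaged point per minute (assumed)   */
#define BURST_THRESH 99.0    /* "interesting" value, e.g. >= 99% utilized */

struct stream_state {
    double sum;              /* running sum for the averaging window      */
    int    count;            /* samples in the current window             */
    time_t window_start;
};

/* Called at the high sample rate; only emits data on the two streams. */
void on_sample(struct stream_state *s, double util,
               void (*send_avg)(double),
               void (*send_burst)(double))
{
    time_t now = time(NULL);

    /* Low-frequency stream: average behaviour of the resource. */
    s->sum += util;
    s->count++;
    if (now - s->window_start >= AVG_PERIOD) {
        send_avg(s->sum / s->count);
        s->sum = 0.0;
        s->count = 0;
        s->window_start = now;
    }

    /* High-resolution stream: only while the value is interesting. */
    if (util >= BURST_THRESH)
        send_burst(util);
}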

Cheers, Chuck



"

my talk at linuxworld went well. i got to spend a few days
hanging out with tobi oetiker (RRDTool), ian berry (cacti), remo rickli (nedi), kees cook (sendpage). tobi and i have been talking about ways to help with performance of RRDTool and reduce disk io.



I have been pondering this too. I now monitor 6,000 mostly
Windows hosts on a couple of ganglia servers at 10 second intervals.
I have sustained I/O rates to the SAN of 21 megabytes per second.
As you know, RRDTool's I/O behaviour is very simple (open, read,
write, close for every metric/file). But at least that is safe
and relatively stateless. If you don't mind losing a few data points
when RRDTool/gmetad suddenly fails, then one could buffer some number
of data points for every host/metric and flush as required.

As RRDTool allows multiple data points per call, the same buffer/flush
behaviour could be coded in Ganglia too.
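
Something like this, for instance (a sketch only: the buffer size is arbitrary and these structures are not existing Ganglia code, but librrd's rrd_update() does take several timestamp:value arguments in one call):

#include <rrd.h>
#include <stdio.h>
#include <time.h>

#define BUF_POINTS 6              /* flush every 6 samples; arbitrary */

struct rrd_buffer {
    char file[256];               /* path to the .rrd file            */
    char points[BUF_POINTS][64];  /* "timestamp:value" strings        */
    int  n;
};

/* Queue one data point; when the buffer is full, flush all of them in
 * a single rrd_update() call: one open/read/write/close for several
 * points instead of one per point. A crash loses at most BUF_POINTS
 * samples per metric. */
void buffer_point(struct rrd_buffer *b, time_t t, const char *value)
{
    snprintf(b->points[b->n], sizeof(b->points[b->n]),
             "%ld:%s", (long) t, value);
    if (++b->n < BUF_POINTS)
        return;

    {
        char *argv[2 + BUF_POINTS];
        int argc = 0, i;

        argv[argc++] = (char *) "update";  /* argv[0] is skipped by getopt */
        argv[argc++] = b->file;
        for (i = 0; i < b->n; i++)
            argv[argc++] = b->points[i];

        rrd_clear_error();
        rrd_update(argc, argv);
        if (rrd_test_error())
            fprintf(stderr, "rrd_update: %s\n", rrd_get_error());
        b->n = 0;
    }
}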

The next idea I have been thinking about is the step size parameter
for the RRDs. As you know, every metric that is not a string will have
an RRD update at the poll rate of the cluster. So one gets gazillions
of updates of stuff like processor clock speed, total memory etc., and
not everything needs polling at the same rate as (say) cpu anyway.

It would be better, I think, to set the step size to be the sample rate
of the metric on the monitored host (a sketch of what that might look
like with librrd follows the list below). There may be a few problems:

1) The metric sample rate is never transmitted to the headnode or
   ganglia server.
2) If different RRDs have different step sizes, then the method of
   defining the RRDs' RRAs needs changing.
3) What step size do you give to the cluster or grid RRDs if the hosts
   don't all sample at the same rate?
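
For illustration, creating an RRD with --step set to the metric's own sample period might look like the following (the DS/RRA layout is just an example, not what gmetad currently writes):

#include <rrd.h>
#include <stdio.h>

/* Create an RRD whose step matches the metric's sample period on the
 * monitored host, with a heartbeat of twice that period. */
int create_metric_rrd(const char *file, int sample_period)
{
    char step[16], ds[64];
    char *argv[8];
    int argc = 0;

    snprintf(step, sizeof(step), "%d", sample_period);
    snprintf(ds, sizeof(ds), "DS:sum:GAUGE:%d:U:U", 2 * sample_period);

    argv[argc++] = (char *) "create";        /* argv[0] skipped by getopt */
    argv[argc++] = (char *) file;
    argv[argc++] = (char *) "--step";
    argv[argc++] = step;
    argv[argc++] = ds;
    argv[argc++] = (char *) "RRA:AVERAGE:0.5:1:576";  /* example RRA only */

    rrd_clear_error();
    rrd_create(argc, argv);
    return rrd_test_error() ? -1 : 0;
}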

My next idea is a bit speculative (read: I am not sure), but for a
metric, if the TN goes up and the VAL does not change, don't call
RRDupdate. Actually, you don't need the time, but the code would need
to be mindful of writing a value before RRD decides the metric has gone
undefined. If a changed value comes along after a period of time with
no updating, then two values need to be written: the last old value at
the current time minus one second, and the new value at the current
time.
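
In code, that might look something like this (a sketch; the per-metric cache and the heartbeat margin are assumptions, and rrd_write stands in for the real update call):

#include <string.h>
#include <time.h>

#define HEARTBEAT 240            /* must match the DS heartbeat (assumed) */

struct metric_cache {
    char   last_val[64];         /* last value actually written           */
    time_t last_write;
    int    skipped;              /* we skipped at least one update        */
};

void maybe_update(struct metric_cache *c, const char *val, time_t now,
                  void (*rrd_write)(time_t when, const char *value))
{
    if (strcmp(val, c->last_val) == 0) {
        /* Unchanged: only write if the heartbeat is about to expire,
         * otherwise RRD would mark the metric as undefined. */
        if (now - c->last_write >= HEARTBEAT - 30) {
            rrd_write(now, val);
            c->last_write = now;
            c->skipped = 0;
        } else {
            c->skipped = 1;
        }
        return;
    }

    /* Changed after a quiet period: backfill the old value one second
     * before the new one so the change is not smeared across the gap. */
    if (c->skipped)
        rrd_write(now - 1, c->last_val);
    rrd_write(now, val);
    strncpy(c->last_val, val, sizeof(c->last_val) - 1);
    c->last_val[sizeof(c->last_val) - 1] = '\0';
    c->last_write = now;
    c->skipped = 0;
}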

My last disturbing idea is to only update a metric when its TN is less
than some value, say 2 x poll rate, or when a timeout is reached. Doing
it that way obviates the need to store the previous values of all
metrics on all hosts.

phew.....
regards,
richard
"


