The 21 MB/sec is impressive. Of course, the byte bandwidth is uninteresting; how many I/Os per second are you executing?
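For a rough sense of scale (and assuming something on the order of 30 metrics per host, which is a guess on my part): 6,000 hosts x 30 metrics every 10 seconds is roughly 18,000 RRD updates per second, each of them an open/read/write/close cycle, which is a far heavier load on the storage than the 21 MB/sec of raw bytes suggests.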

We would also like the I/O and database subsystem to be capable of recording information about disk drives, and it would be nice to monitor each network interface and each disk drive individually rather than only aggregate information across all networks and disk drives. The I/O and database subsystem of Ganglia needs to be greatly revised, not only to keep up with current demands but also to allow the software to grow and be enhanced.

A small part of the current problem stems from Ganglia's cache coherence implementation. To make sure that the node in the cluster from which gmetad reads stats holds a reasonably current copy of all metrics for the cluster, the gmond processes throughout the cluster have to publish data frequently.

It seems to me that instead of having gmond push data frequently to a central location and having gmetad pull from that central location, gmonds should push deltas of their state all the way out to gmetad, with a mechanism for gmetad to resynchronize state with a gmond when a delta has been lost in transit.
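
A rough sketch of what I mean (these structures and names do not exist in gmond/gmetad today; a per-sender sequence number is just one way to detect a lost delta):

/* Hypothetical delta message with loss detection; not existing Ganglia code. */
#include <stdint.h>

#define MAX_NAME 64
#define MAX_VAL  64

struct metric_delta {
    uint64_t seq;               /* per-sender sequence number       */
    char host[MAX_NAME];        /* node the metric belongs to       */
    char name[MAX_NAME];        /* metric name, e.g. "cpu_user"     */
    char value[MAX_VAL];        /* new value as a string            */
};

/* Receiver side (gmetad): apply a delta, or ask the node for a full
 * resync if a sequence number was skipped, i.e. a delta was lost
 * in transit. */
int apply_delta(uint64_t *expected_seq, const struct metric_delta *d,
                void (*request_full_state)(const char *host))
{
    if (d->seq != *expected_seq) {
        request_full_state(d->host);   /* fall back to a full pull */
        *expected_seq = d->seq + 1;
        return -1;
    }
    *expected_seq = d->seq + 1;
    /* ... update the in-memory metric table here ... */
    return 0;
}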

The second part of the problem, as Richard suggests, is compressing the data stream from a node to the database. With the cache coherence problem fixed, we don't need to periodically transmit unchanging data. The existing mechanisms to only transmit data when it changes significantly can be used more heavily and enhanced.

The third part of the problem is adjusting the database subsystem. To provide the current reliability guarantees, we can stream deltas to disk using large sequential I/Os, while maintaining current state in memory that is periodically flushed to disk. This would still leave Richard writing 21 MB/sec or more to disk, but the number of I/O operations would drop dramatically.
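
Concretely, a minimal sketch of the idea (assuming an append-only delta log plus a periodic checkpoint of the in-memory state; the file names and the checkpoint interval are made up):

#include <stdio.h>
#include <time.h>

static FILE  *delta_log;          /* opened elsewhere in append mode */
static time_t last_checkpoint;
#define CHECKPOINT_INTERVAL 300   /* seconds; an arbitrary choice    */

/* One small record per change; the OS coalesces these appends into
 * large sequential writes, so the I/O operation count stays low. */
void log_delta(const char *host, const char *metric, const char *val)
{
    fprintf(delta_log, "%ld %s %s %s\n",
            (long) time(NULL), host, metric, val);
}

/* Periodically write the current in-memory state to disk so the delta
 * log can be truncated and recovery stays bounded. */
void maybe_checkpoint(void (*dump_state)(FILE *))
{
    time_t now = time(NULL);
    FILE *ckpt;

    if (now - last_checkpoint < CHECKPOINT_INTERVAL)
        return;
    fflush(delta_log);                      /* make the log durable first */
    ckpt = fopen("state.checkpoint.tmp", "w");
    if (ckpt == NULL)
        return;
    dump_state(ckpt);                       /* caller dumps its own state */
    fclose(ckpt);
    rename("state.checkpoint.tmp", "state.checkpoint");
    last_checkpoint = now;
}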

We can also attack the problem by looking at the reasons why Richard wants to sample every 10 seconds or why I want to sample data every second. Low frequency sampling is useful for understanding average loads and utilization and for verifying that machines and hardware are basically up and running. High frequency sampling is more useful for diagnosing system problems and for having automated software detect and respond to problems quickly.

It may be possible to separate out functionality that requires high frequency sampling. Consider, for example, a node performing high frequency sampling and transmitting two distinct data streams to gmetad. One data stream contains low frequency signals that describe average behavior of a resource or metric. Another data stream contains infrequent bursts of data when a resource or metric has a particularly interesting set of values (e.g. >= 99% utilization). The aggregate amount of data coming out of a node would then be low while still providing high resolution where it matters.
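
A rough sketch of the per-metric logic on the node (the one-minute averaging window and the 99% threshold are placeholders, and send_avg/send_burst stand in for whatever transport is used):

#include <time.h>

#define AVG_PERIOD   60      /* one averaged point per minute (assumed)   */
#define BURST_THRESH 99.0    /* "interesting" value, e.g. >= 99% utilized */

struct stream_state {
    double sum;              /* running sum for the averaging window      */
    int    count;            /* samples in the current window             */
    time_t window_start;
};

/* Called at the high sample rate; only emits data on the two streams. */
void on_sample(struct stream_state *s, double util,
               void (*send_avg)(double),
               void (*send_burst)(double))
{
    time_t now = time(NULL);

    /* Low-frequency stream: average behaviour of the resource. */
    s->sum += util;
    s->count++;
    if (now - s->window_start >= AVG_PERIOD) {
        send_avg(s->sum / s->count);
        s->sum = 0.0;
        s->count = 0;
        s->window_start = now;
    }

    /* High-resolution stream: only while the value is interesting. */
    if (util >= BURST_THRESH)
        send_burst(util);
}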

Cheers, Chuck



"

my talk at linuxworld went well. i got to spend a few days
hanging out with tobi oetiker (RRDTool), ian berry (cacti), remo rickli (nedi), kees cook (sendpage). tobi and i have been talking about ways to help with performance of RRDTool and reduce disk io.



I have been pondering this too. I now monitor 6,000 mostly
Windows hosts on a couple of ganglia servers at 10 second intervals.
I have sustained I/O rates to the SAN of 21 megabytes per second.
As you know, RRDTool's I/O behaviour is very simple (open, read,
write, close for every metric/file). But at least that is safe
and relatively stateless. If you don't mind losing a few data points
when RRDTool/gmetad suddenly fails, then one could buffer some number
of data points for every host/metric and flush as required.

As RRDTool allows multiple data points per call, the same buffer/flush
behaviour could be coded in Ganglia too.
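
Something like this, for instance (a sketch only: the buffer size is arbitrary and these structures are not existing Ganglia code, but librrd's rrd_update() does take several timestamp:value arguments in one call):

#include <rrd.h>
#include <stdio.h>
#include <time.h>

#define BUF_POINTS 6              /* flush every 6 samples; arbitrary */

struct rrd_buffer {
    char file[256];               /* path to the .rrd file            */
    char points[BUF_POINTS][64];  /* "timestamp:value" strings        */
    int  n;
};

/* Queue one data point; when the buffer is full, flush all of them in
 * a single rrd_update() call: one open/read/write/close for several
 * points instead of one per point. A crash loses at most BUF_POINTS
 * samples per metric. */
void buffer_point(struct rrd_buffer *b, time_t t, const char *value)
{
    snprintf(b->points[b->n], sizeof(b->points[b->n]),
             "%ld:%s", (long) t, value);
    if (++b->n < BUF_POINTS)
        return;

    {
        char *argv[2 + BUF_POINTS];
        int argc = 0, i;

        argv[argc++] = (char *) "update";  /* argv[0] is skipped by getopt */
        argv[argc++] = b->file;
        for (i = 0; i < b->n; i++)
            argv[argc++] = b->points[i];

        rrd_clear_error();
        rrd_update(argc, argv);
        if (rrd_test_error())
            fprintf(stderr, "rrd_update: %s\n", rrd_get_error());
        b->n = 0;
    }
}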

The next idea I have been thinking about is the step size parameter
for the RRDs. As you know, every metric that is not a string will have
an RRD update at the poll rate of the cluster. So one gets gazillions
of updates of stuff like processor clock speed, total memory etc., and
not everything needs polling at the same rate as (say) cpu anyway.

It would be better, I think, to set the step size to be the sample rate
of the metric on the monitored host (a sketch of what that might look
like with librrd follows the list below). There may be a few problems:

1) The metric sample rate is never transmitted to the headnode or
   ganglia server.
2) If different RRDs have different step sizes, then the method of
   defining the RRDs' RRAs needs changing.
3) What step size do you give to the cluster or grid RRDs if the hosts
   don't all sample at the same rate?
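
For illustration, creating an RRD with --step set to the metric's own sample period might look like the following (the DS/RRA layout is just an example, not what gmetad currently writes):

#include <rrd.h>
#include <stdio.h>

/* Create an RRD whose step matches the metric's sample period on the
 * monitored host, with a heartbeat of twice that period. */
int create_metric_rrd(const char *file, int sample_period)
{
    char step[16], ds[64];
    char *argv[8];
    int argc = 0;

    snprintf(step, sizeof(step), "%d", sample_period);
    snprintf(ds, sizeof(ds), "DS:sum:GAUGE:%d:U:U", 2 * sample_period);

    argv[argc++] = (char *) "create";        /* argv[0] skipped by getopt */
    argv[argc++] = (char *) file;
    argv[argc++] = (char *) "--step";
    argv[argc++] = step;
    argv[argc++] = ds;
    argv[argc++] = (char *) "RRA:AVERAGE:0.5:1:576";  /* example RRA only */

    rrd_clear_error();
    rrd_create(argc, argv);
    return rrd_test_error() ? -1 : 0;
}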

My next idea is a bit speculative (read: I am not sure), but for a
metric, if the TN goes up and the VAL does not change, don't call
RRDupdate. Actually, you don't need the time, but the code would need
to be mindful of writing a value before RRD decides the metric has gone
undefined. If a changed value comes along after a period of time with
no updating, then two values need to be written: the last old value at
the current time minus one second, and the new value at the current
time.
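
In code, that might look something like this (a sketch; the per-metric cache and the heartbeat margin are assumptions, and rrd_write stands in for the real update call):

#include <string.h>
#include <time.h>

#define HEARTBEAT 240            /* must match the DS heartbeat (assumed) */

struct metric_cache {
    char   last_val[64];         /* last value actually written           */
    time_t last_write;
    int    skipped;              /* we skipped at least one update        */
};

void maybe_update(struct metric_cache *c, const char *val, time_t now,
                  void (*rrd_write)(time_t when, const char *value))
{
    if (strcmp(val, c->last_val) == 0) {
        /* Unchanged: only write if the heartbeat is about to expire,
         * otherwise RRD would mark the metric as undefined. */
        if (now - c->last_write >= HEARTBEAT - 30) {
            rrd_write(now, val);
            c->last_write = now;
            c->skipped = 0;
        } else {
            c->skipped = 1;
        }
        return;
    }

    /* Changed after a quiet period: backfill the old value one second
     * before the new one so the change is not smeared across the gap. */
    if (c->skipped)
        rrd_write(now - 1, c->last_val);
    rrd_write(now, val);
    strncpy(c->last_val, val, sizeof(c->last_val) - 1);
    c->last_val[sizeof(c->last_val) - 1] = '\0';
    c->last_write = now;
    c->skipped = 0;
}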

My last disturbing idea is to only update a metric when its TN is less
than some value, say 2 x poll rate, or when a timeout is reached. Doing
it that way obviates the need to store the previous values of all
metrics on all hosts.

phew.....
regards,
richard
"


