New question #240674 on Graphite:
https://answers.launchpad.net/graphite/+question/240674
I am trying to set up a Graphite cluster capable of handling 500K metric
datapoints every 10 seconds, as a starting point. After going through some of
the answers on this site, blog posts and other documentation, I have set up the
following configuration:
- 2 machines with 8 cores, 32 GB of memory and 3 TB of storage each
- On each machine:
- 5 carbon-relays
- 9 carbon-aggregators
- 9 carbon-caches
In total, there are 10 relays, 18 aggregators and 18 caches in the cluster.
Each aggregator communicates with a single cache - it's 1-to-1. The webapps are
configured to talk to the caches on their own host. An HAProxy load balancer
receives all the metric traffic and distributes it among the 10 relays. Each
relay lists the 18 aggregators as its destinations in its configuration file,
and the relays use aggregated-consistent-hashing so that metrics covered by the
same aggregation rule are routed to the same aggregator.
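For reference, the relevant relay and aggregator settings in each machine's
carbon.conf look roughly like the sketch below (hostnames, ports and instance
names are placeholders rather than my exact values, and only a few of the 18
aggregator destinations are shown):

    [relay]
    RELAY_METHOD = aggregated-consistent-hashing
    REPLICATION_FACTOR = 1
    # All 18 aggregator instances (9 per machine) are listed as destinations;
    # the list is truncated here to keep the example short.
    DESTINATIONS = graphite01:2023:a, graphite01:2024:b, graphite02:2023:a, graphite02:2024:b

    [aggregator]
    # Each aggregator forwards to exactly one local cache instance (1-to-1),
    # via that cache's pickle receiver port.
    DESTINATIONS = 127.0.0.1:2004:a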
This setup behaves well. I have been able to run stress tests on the cluster,
publishing progressively larger sets of metrics and monitoring the cluster's
health at each step. However, I have noticed issues with the aggregated
metrics.
For example, in the screenshot linked below, the graph on the right shows the
raw values received. The graph on the left shows the aggregated values computed
from the raw values. In this case, this metric's aggregation is defined as a
sum in the aggregation rules configuration file.
http://bit.ly/1hO6bBQ
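The rule for this metric follows the usual aggregation-rules.conf format, along
these lines (the metric names and the 10-second frequency here are illustrative,
not my exact rule):

    # output_template (frequency) = method input_pattern
    <env>.applications.<app>.all.requests (10) = sum <env>.applications.<app>.*.requests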
If I do the sum by hand, the result is around 750 - clearly not what Graphite
is computing. This happens for *all* aggregated metrics in my cluster.
While investigating this issue, I also noticed something strange when comparing
the number of metrics received by the relays against the number of metrics sent
by the relays to the aggregators. In the screenshot linked below, the graph on
the right shows that the relays received around 280K metrics, yet only around
140K of those were sent on to the aggregators.
http://bit.ly/1f91Q85
If I enable whitelists and reduce the number of metrics processed by the
cluster, the aggregations start functioning properly again and the relays'
received vs. sent metric counts match again. See the screenshots linked below:
http://bit.ly/1aXJyST
http://bit.ly/18EP28u
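For that test I enabled carbon's built-in whitelist support in carbon.conf and
listed only a handful of patterns in whitelist.conf, roughly like this (the
patterns are illustrative):

    # carbon.conf
    USE_WHITELIST = True

    # whitelist.conf - one regular expression per line; only matching metrics
    # are accepted by the relays/aggregators/caches
    ^carbon\.
    ^prod\.applications\.myapp\.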
Questions:
- Any insight into why the aggregated metrics are "spiky" while the
corresponding raw values look correct?
- Is there a scenario in which a relay will send fewer metrics than it receives?
- Does my setup make sense? Is there a better way to scale a Graphite cluster?
I would greatly appreciate any help.
Thanks!