What kind of storage are you using? Are you using rrdcached?

Vladimir

On 03/16/2014 11:17 AM, Filipe Bonjour wrote:
Hello,

I am fairly new to Ganglia, and have a problem with Ganglia 3.6.0 / Ganglia Web 3.5.10: In one of my clusters, after I add approximately 55 hosts, the graphs go blank. I think it's very similar to http://www.mail-archive.com/ganglia-general%40lists.sourceforge.net/msg07852.html, but that thread is old and does not seem to have been resolved.

I have five clusters -- four are small (up to 40 hosts) and are working fine. The last one, "HPC Cluster", is where I am having trouble.
I start gmond on 3-5 servers at a time, and after approximately 55 hosts the web site starts showing blackouts. If I add more hosts the blackout becomes permanent. The blackouts present as follows:

* In the Grid report the number of hosts and cores for all clusters is correct and all hosts are marked "up".
* In the Cluster report, the number of cores is correct but some hosts are marked "down". Actually, most of the time they're all marked "down".
* In both the Grid report and the Cluster report the graphs for HPC Cluster are blank.
* None of this affects any of the other clusters.

I've read a number of threads and tried to anticipate the most common questions.

* All servers use NTP and running "date" on all of them shows they're synchronized to a second or so.
* I do not see the message "illegal attempt to update using time X when last update time is X".
* I moved gmetad to a bigger box (16 cores, 256 GB RAM, negligible prior usage). That did not increase the number of hosts I can add before the blackouts start.
* All data sources use the default polling interval, 15 seconds.
* I tried adding the servers in different orders.
* No errors in the logs, and no errors if I start gmond and gmetad with -d.
* I ran "netstat -su" on all boxes, and there were no packets dropped anywhere.
* I ran "netstat -au" on the gmetad box and "Recv-Q" and "Send-Q" were always 0.
* The server where gmetad runs has a UDP buffer (/proc/sys/net/core/rmem_max) of 4194304.
* I dumped the RRDs and in these blank areas the metrics are "NaN".
* During blackouts, I tried "telnet <gmetad servicing node> <port>" and always got an immediate and apparently full response.
* The cluster's gmonds are all multicast and all listen and send. I tried unicast and I tried a deaf/mute multicast setup, without any improvement.
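
The telnet check above can be made a little more quantitative. The following is a minimal Python sketch, not part of the original post, that reads the XML dump from a gmond TCP port, times the download, and counts hosts whose TN attribute exceeds four times TMAX. The factor of four is an assumption about how the web front end decides a host is "down", and the function names are hypothetical.

```python
import socket
import time
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8650, timeout=10.0):
    """Read the full XML dump that gmond writes on its TCP port.

    Returns (xml_bytes, seconds_taken). Port 8650 is the one from the post.
    """
    start = time.monotonic()
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # the peer closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks), time.monotonic() - start


def summarize_hosts(xml_bytes, stale_factor=4):
    """Count HOST elements and list those that look stale.

    A host is treated as stale when TN > stale_factor * TMAX.
    stale_factor=4 is an assumption, not a value taken from the post.
    """
    root = ET.fromstring(xml_bytes)
    total = 0
    stale = []
    for host in root.iter("HOST"):
        total += 1
        tn = float(host.get("TN", "0"))
        tmax = float(host.get("TMAX", "0"))
        if tmax > 0 and tn > stale_factor * tmax:
            stale.append(host.get("NAME"))
    return total, sorted(stale)
```

Calling `fetch_gmond_xml("n800")` and passing the returned bytes to `summarize_hosts` during a blackout would show whether the dump is complete, how long it takes to arrive, and whether gmond itself already considers the hosts stale.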

I guess that the fact that none of the other clusters is impacted means it's not a resource issue. I therefore assume it's a configuration or architecture issue. I can post the configuration files of gmetad and gmond, but this post is pretty long as it is. So, in short:

* I am using one gmetad for all clusters.
* The data source for "HPC Cluster" has 6 nodes servicing it (n800, n816, n832, n848, n864, n880) and uses port 8650.
  I originally only had two nodes. When the blackouts started, I added more.
* Gmond on all hosts uses six multicast channels, the same 6 nodes (n800, n816, n832, n848, n864, n880) on port 8650.
  I originally had a single multicast channel. When the blackouts started, I added more.
* Gmond on all hosts listens on UDP and TCP on port 8650.
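
Since gmetad may read from any of the six servicing nodes, it could be worth checking that all six report the same set of hosts. Below is a rough Python sketch of that comparison, not part of the original post. It assumes each node answers with the usual XML dump on TCP port 8650, and the function names are hypothetical.

```python
import socket
import xml.etree.ElementTree as ET


def host_names_from_xml(xml_bytes):
    """Return the set of HOST NAME attributes found in a gmond XML dump."""
    root = ET.fromstring(xml_bytes)
    return {h.get("NAME") for h in root.iter("HOST")}


def fetch_host_names(node, port=8650, timeout=10.0):
    """Connect to one servicing node and return the hosts it reports."""
    chunks = []
    with socket.create_connection((node, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # the peer closes the connection after the dump
                break
            chunks.append(data)
    return host_names_from_xml(b"".join(chunks))


def missing_per_node(views):
    """Given {node: set_of_hosts}, list what each node fails to report.

    A host counts as missing from a node when at least one other node
    reports it but this node does not.
    """
    everyone = set().union(*views.values())
    return {node: sorted(everyone - seen) for node, seen in views.items()}


# Example use against the six nodes named in the post (needs network access):
#   nodes = ["n800", "n816", "n832", "n848", "n864", "n880"]
#   views = {n: fetch_host_names(n) for n in nodes}
#   for node, missing in missing_per_node(views).items():
#       print(node, "is missing", len(missing), "hosts")
```

If the nodes disagree, the multicast channels may not be delivering every host's packets to every listener, which would point at the network or channel setup rather than at gmetad.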

Since none of the other clusters is impacted, I could probably split this cluster into smaller clusters and those would work, but this would make reporting the full cluster usage more painful.

Any suggestions or ideas would be welcome.


_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
