Re: [Ganglia-general] Problem scaling up Ganglia past 55 nodes

2014-03-18 Thread Filipe Bonjour
Dear Vladimir,

 We're not using rrdcached. I'll look into this tomorrow (we have a 
 maintenance today).

After 24 hours, rrdcached seems to have solved it.
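
For the archives, the setup is roughly the following (paths, sizes and
flush intervals below are illustrative, not our exact values):

    # Start rrdcached with a journal, flushing each RRD at most every 10 min
    rrdcached -l unix:/var/run/rrdcached.sock \
              -j /var/lib/rrdcached/journal -F \
              -b /var/lib/ganglia/rrds -B \
              -w 600 -z 300

    # gmetad writes through the daemon if librrd sees this variable
    export RRDCACHED_ADDRESS=unix:/var/run/rrdcached.sock
    gmetad

    # and ganglia-web reads the socket path from conf.php
    $conf['rrdcached_socket'] = "/var/run/rrdcached.sock";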

Thanks for your advice, and also to Mozammil for his.

Filipe

-- 
Filipe Bonjour

   In theory there is no difference between theory and practice.
   In practice there is. -- Yogi Berra




[Ganglia-general] Problem scaling up Ganglia past 55 nodes

2014-03-16 Thread Filipe Bonjour
Hello,

I am fairly new to Ganglia, and have a problem with Ganglia 3.6.0 / Ganglia
Web 3.5.10: In one of my clusters, after I add approximately 55 hosts, the
graphs go blank. I think it's very similar to
http://www.mail-archive.com/ganglia-general%40lists.sourceforge.net/msg07852.html,
but that thread is old and does not seem to have been resolved.

I have five clusters -- four are small (up to 40 hosts) and are working
fine. The last one, "HPC Cluster", is where I am having trouble. I start
gmond on 3-5 servers at a time, and after approximately 55 hosts the web
site starts showing blackouts. If I add more hosts the blackouts become
permanent. The blackouts present as follows:

* In the Grid report the number of hosts and cores for all clusters is
correct and all hosts are marked "up".
* In the Cluster report, the number of cores is correct but some hosts are
marked "down". Actually, most of the time they're all marked "down".
* In both the Grid report and the Cluster report the graphs for HPC Cluster
are blank.
* None of this affects any of the other clusters.

I've read a number of threads and tried to anticipate the most common
questions; the exact commands I ran are sketched after this list.

* All servers use NTP, and running "date" on all of them shows they're
synchronized to within a second or so.
* I do not see the message "illegal attempt to update using time X when
last update time is X".
* I moved gmetad to a bigger box (16 cores, 256 GB RAM, negligible prior
usage). That didn't even increase the number of hosts I can add before the
blackouts start.
* All data sources use the default polling interval, 15 seconds.
* Tried adding the servers in different orders.
* No errors in the logs, and no errors if I start gmond and gmetad with -d.
* I ran "netstat -su" on all boxes, and there were no packets dropped
anywhere.
* I ran "netstat -au" on the gmetad box, and "Recv-Q" and "Send-Q" were
always 0.
* The server where gmetad runs has a UDP buffer
(/proc/sys/net/core/rmem_max) of 4194304.
* I dumped the RRDs, and in these blank areas the metrics are "NaN".
* During blackouts, I tried "telnet <gmetad servicing node> <port>" and
always got an immediate and apparently full response.
* The cluster's gmonds are all multicast and all listen and send. I tried
unicast and I tried a deaf/mute multicast, without any improvement.
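
For completeness, the checks above map to roughly these commands (the RRD
path is just an example of our layout):

    ntpq -p                            # clock sync status on every host
    netstat -su                        # look for UDP "packet receive errors"
    netstat -au                        # Recv-Q / Send-Q of the UDP sockets
    cat /proc/sys/net/core/rmem_max    # kernel cap on UDP receive buffers
    rrdtool dump "/var/lib/ganglia/rrds/HPC Cluster/n800/load_one.rrd" | grep -c NaN
    telnet n800 8650                   # gmond replies with the cluster XML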

I guess that the fact that none of the other clusters is impacted means
it's not a resource issue. I therefore assume it's a configuration or
architecture issue. I can post the configuration files of gmetad and gmond,
but this post is pretty long as it is. So, in short (the relevant stanzas
are sketched after this list):

* I am using one gmetad for all clusters.
* The data source for "HPC Cluster" has 6 nodes servicing it (n800, n816,
n832, n848, n864, n880) and uses port 8650.
  I originally only had two nodes. When the blackouts started, I added more.
* Gmond on all hosts uses six multicast channels, the same 6 nodes (n800,
n816, n832, n848, n864, n880) on port 8650.
  I originally had a single multicast channel. When the blackouts started,
I added more.
* Gmond on all hosts listens on UDP and TCP on port 8650.
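
To make the shape of the configuration concrete without pasting whole files,
the relevant stanzas look roughly like this (the multicast address is the
Ganglia default; our real files repeat the channel blocks six times, once
per servicing node):

    # gmetad.conf
    data_source "HPC Cluster" n800:8650 n816:8650 n832:8650 n848:8650 n864:8650 n880:8650

    # gmond.conf on every host in HPC Cluster
    cluster { name = "HPC Cluster" }
    udp_send_channel { mcast_join = 239.2.11.71  port = 8650  ttl = 1 }
    udp_recv_channel { mcast_join = 239.2.11.71  port = 8650  bind = 239.2.11.71 }
    tcp_accept_channel { port = 8650 }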

Since none of the other clusters is impacted, I could probably split this
cluster into smaller clusters and those would work, but that would make
reporting the full cluster usage more painful.

Any suggestions or ideas would be welcome.

Thanks,

Filipe


Re: [Ganglia-general] Problem scaling up Ganglia past 55 nodes

2014-03-16 Thread Vladimir Vuksan

  
  
What kind of storage are you using? Are you using rrdcached?

Vladimir

On 03/16/2014 11:17 AM, Filipe Bonjour wrote:

 [...]


Re: [Ganglia-general] Problem scaling up Ganglia past 55 nodes

2014-03-16 Thread Filipe Bonjour
Dear Vladimir,

 What kind of storage are you using? Are you using rrdcached?

 Vladimir

We're using a parallel filesystem (GPFS). I agree that it's not very 
fast, and I was tempted to move the RRDs to a faster disk or tmpfs. But if 
the problem were slow storage or network, wouldn't the other clusters be 
affected?
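
For reference, the tmpfs experiment I have in mind would look something
like this (size and paths are placeholders, and the RRDs would need to be
synced back to persistent storage regularly and before reboots):

    # Serve the live RRDs from RAM, keeping a persistent copy elsewhere
    mount -t tmpfs -o size=2g tmpfs /var/lib/ganglia/rrds
    rsync -a /var/lib/ganglia/rrds.persistent/ /var/lib/ganglia/rrds/
    # ...and rsync back the other way on a timer and at shutdown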

We're not using rrdcached. I'll look into this tomorrow (we have a 
maintenance today).

Thanks,

Filipe

-- 
Filipe Bonjour

   In theory there is no difference between theory and practice.
   In practice there is. -- Yogi Berra




Re: [Ganglia-general] Problem scaling up Ganglia past 55 nodes

2014-03-16 Thread Mohd Mozammil khan
I have past experience with a Ganglia setup of more than 100 nodes. Since the 
setup was in Amazon AWS, I was bound to use unicast mode to communicate between 
gmond and gmetad, and I personally like unicast, which also saves some network 
bandwidth. 

The way I had organized the setup, every cluster was configured to have its own 
unicast master node. Say, in a cluster of 20 nodes, I used one of the hosts to 
act as the unicast master node. Similarly, I configured gmetad to fetch data 
from the unicast master node chosen for each cluster. I would suggest dividing 
your setup on a per-cluster basis; this way you will have the freedom to deal 
with one cluster at a time rather than fighting with the whole setup all 
together whenever an issue comes up.
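
A minimal sketch of that layout, with invented host and cluster names
(the port is the gmond default):

    # gmond.conf on the designated unicast master of each cluster
    cluster { name = "cluster-a" }
    udp_recv_channel { port = 8649 }    # every node of cluster-a sends here
    tcp_accept_channel { port = 8649 }  # gmetad polls this

    # gmond.conf on every other node of the cluster
    globals { deaf = yes }              # send only, never aggregate
    udp_send_channel { host = master-a  port = 8649 }

    # gmetad.conf: one data_source per cluster, pointing at its master
    data_source "cluster-a" master-a:8649
    data_source "cluster-b" master-b:8649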

Hope it helps you out.

Thanks,
Mozammil



On Sunday, 16 March 2014 9:03 PM, Vladimir Vuksan vli...@veus.hr wrote:
 
What kind of storage are you using? Are you using rrdcached?

Vladimir

On 03/16/2014 11:17 AM, Filipe Bonjour wrote:

 [...]
