[Ganglia-general] gmond too many open files

2012-02-24 Thread M. Leong Lists
On some of my boxes, I'm getting "Too many open files" errors in /var/log/messages:

Feb 24 13:08:09 server1 /usr/sbin/gmond[29766]: slurpfile() open() error 
on file /proc/loadavg: Too many open files
Feb 24 13:08:09 server1 /usr/sbin/gmond[29766]: update_file() got an 
error from slurpfile() reading /proc/loadavg

Feb 24 13:07:56 server1 /usr/sbin/gmond[29766]: slurpfile() open() error 
on file /proc/meminfo: Too many open files
Feb 24 13:07:56 server1 /usr/sbin/gmond[29766]: update_file() got an 
error from slurpfile() reading /proc/meminfo

Feb 24 13:08:09 server1 /usr/sbin/gmond[29766]: slurpfile() open() error 
on file /proc/stat: Too many open files
Feb 24 13:08:09 server1 /usr/sbin/gmond[29766]: update_file() got an 
error from slurpfile() reading /proc/stat

Feb 24 13:07:56 server1 /usr/sbin/gmond[29766]: slurpfile() open() error 
on file /proc/net/dev: Too many open files
Feb 24 13:07:56 server1 /usr/sbin/gmond[29766]: update_file() got an 
error from slurpfile() reading /proc/net/dev


Is this a bug in the app not closing those files?

thx
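
(Whether gmond is actually leaking descriptors can be checked directly. A
sketch, assuming Linux /proc and the gmond PID from the log lines above:

$ ls /proc/29766/fd | wc -l                # descriptors currently open
$ grep 'open files' /proc/29766/limits     # the per-process limit, often 1024
$ ls -l /proc/29766/fd | awk '{print $NF}' | sort | uniq -c | sort -rn | head
                                           # what the open descriptors point at

If the count keeps climbing toward the limit, raising it with ulimit only buys
time; the leak itself would be somewhere in gmond or one of its modules.)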




Re: [Ganglia-general] Ganglia for monitoring processes?

2012-02-24 Thread Aaron Urbain
You are asking for process-level CPU consumption for a box, I believe.
I would be interested in this also.
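
(Base gmond doesn't collect per-process metrics, but they can be pushed in
from outside. A minimal sketch using gmetric; the target process and metric
name here are hypothetical:

$ PID=$(pgrep -xo myserver)                  # hypothetical process to watch
$ gmetric --name="cpu_user.${PID}" \
          --value="$(ps -o %cpu= -p "${PID}")" \
          --type=float --units=% --dmax=300  # dmax so stale PIDs age out

Run from cron or a loop, this gives one graph per process; see the DMAX
discussion later in this digest for why the expiry matters.)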



On Feb 24, 2012, at 3:23 PM, vincent blanqué wrote:

> Hi everybody,
> 
> Is Ganglia suited to graphing the consumption of each process running on my server?
> 
> thx,
> 
> Vincent




[Ganglia-general] Ganglia for monitoring processes?

2012-02-24 Thread vincent blanqué
Hi everybody,

Is Ganglia suited to graphing the consumption of each process running on my server?

thx,

Vincent


Re: [Ganglia-general] Ganglia gmond memory leak?

2012-02-24 Thread Aidan Wong
I have the following config regarding metric cleanup:

  host_dmax =  259200 /*secs - 3 days*/
  cleanup_threshold = 300 /*secs */


From: Matt Massie <m...@massie.us>
Date: Thu, 23 Feb 2012 11:06:03 -0800
To: <svd.gang...@mylife.com>
Cc: <ganglia-general@lists.sourceforge.net>
Subject: Re: [Ganglia-general] Ganglia gmond memory leak?

Each unique metric (keyed on metric name) requires memory space in gmond.

A good test is to peek at the number of metrics in gmond over time, e.g.

$ telnet localhost 8649 | grep METRIC | wc -l

If the number of metrics increases over time, so will the memory use.
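
(A sketch of watching that count over time; nc is used here instead of telnet
so it can run unattended, and the log filename is arbitrary:

$ while true; do
    echo "$(date +%s) $(nc localhost 8649 | grep -c '<METRIC ')"
    sleep 60
  done >> gmond_metric_count.log

A steadily growing second column is the signature of metrics that never
expire.)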

Ganglia will release the metric and memory when the age of the metric is 
greater than DMAX.  A DMAX value of zero will cause ganglia to hold the metric 
indefinitely.  In order to make sure that ganglia is releasing old metrics, set 
the DMAX value to something like 5 minutes (300 secs).

For example, let's assume you are doing per-process monitoring and the metric 
name looks like

"cpu_user.%d" % (pid,)

Over time, you'll have lots of metrics (cpu_user.343493, cpu_user.343022, 
cpu_user.232323) that start accumulating and taking up memory space.
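
(A sketch of checking for exactly this kind of accumulation in a running
gmond; the metric-name prefix is the hypothetical one above:

$ nc localhost 8649 | grep -c 'METRIC NAME="cpu_user\.'

If that count only ever grows, the per-pid metrics are being held
indefinitely.)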

-Matt


On Thu, Feb 23, 2012 at 10:01 AM, <svd.gang...@mylife.com> wrote:
I observed this in the past as well.  Running valgrind for days did not
yield any clue.  I had a hunch that remote spoofed metrics were involved,
as the leak seemed to get better when I had coincidentally disabled the
sending of some of those spoof metrics.  But we never found anything
conclusive.  There was also some odd race such that sometimes after a
restart the leak was much faster, but after restarting a few times the
leak slowed (though it was always still fast enough to be a burden).

-scott
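
(For anyone repeating that experiment, a typical foreground valgrind run of
gmond looks something like this sketch; paths and debug level are assumptions,
and any nonzero -d keeps gmond in the foreground:

$ valgrind --leak-check=full --log-file=gmond-vg.log \
      /usr/sbin/gmond -c /etc/ganglia/gmond.conf -d 2
)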

>> From: Aidan Wong <aidanw...@attinteractive.com>
>> To: "Ave-Lallemant, Nathan P" <nathan.p.ave-lallem...@efleets.com>;
>>  ganglia-general <ganglia-general@lists.sourceforge.net>
>> Sent: Thursday, February 23, 2012 8:34 AM
>> Subject: Re: [Ganglia-general] Ganglia gmond memory leak?
>>
>>
>> I've restarted the gmond process, and memory usage drops until gmond hogs 
>> memory again over time.  Any Ganglia contributors who may want to chime in on 
>> this memory leak issue?  I'm on Ganglia 3.2.0.  Are there any improvements in 
>> version 3.3.1 addressing this issue?
>>
>>
>> Thanks



Re: [Ganglia-general] Ganglia gmond memory leak?

2012-02-24 Thread Aidan Wong
I'm not using any plugins, as far as I know; I'm basically running
Ganglia 3.2.0 right out of the box.  The extra metrics I'm sending
are from my Hadoop cluster nodes, where I defined the host and gmond port
of the destination gmond that collects the metrics.
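
(For reference, the Hadoop side of such a setup is usually a stanza in
hadoop-metrics.properties along these lines; the context class depends on the
Hadoop version, and the collector host is a placeholder:

# GangliaContext31 speaks the Ganglia 3.1+ wire format
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
dfs.period=10
dfs.servers=collector-host:8649
)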

On 2/23/12 6:27 PM, "Robin Humble"  wrote:

>On Thu, Feb 23, 2012 at 07:22:36PM +, Aidan Wong wrote:
>>That one node that recently had the runaway memory leak was sending
>>253 metrics.  I'm using unicast, sending all metrics to a specific host
>>where I have configured the "udp_send_channel" with the "host" and "port"
>>attributes defined.
>
>IIRC, plugins are loaded once and then run within gmond's address space,
>so I guess plugins could be causing memory leaks.
>Which plugins are you using?  Do they alloc/free as they should?
>
>We haven't noticed any leaks (certainly no serious leaks) across our
>~1800 gmonds using 3.2.0, but we aren't sending that many metrics
>either - just most of the standard stuff plus modified diskstat and
>cputemp python plugins, with a bunch of other metrics spoofed from
>chassis and switches (leaving more CPU cycles for HPC jobs this way).
>We are using multicast.
>All except a few gmonds (not included below) are senders only.
>
>        %CPU  %MEM    VSZ   RSS  COMMAND
>min      0.0   0.0  70972  1864  /usr/sbin/gmond
>median   0.0   0.0  70972  3392  /usr/sbin/gmond
>ave      0.0   0.0  70977  3240  /usr/sbin/gmond
>max      0.0   0.0  71104  4720  /usr/sbin/gmond
>
>Those with the larger RSS have been rebooted recently and haven't yet
>had unused pages pushed out by VM pressure.
>
>cheers,
>robin
>--
>Dr Robin Humble, HPC Systems Analyst, NCI National Facility
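
(A sketch of how per-node numbers like these can be collected; pdsh and the
"compute" group name are assumptions:

$ pdsh -g compute 'ps -o rss= -C gmond' 2>/dev/null | awk '{print $2}' |
      sort -n | awk '{v[NR]=$1} END {print "min", v[1];
      print "median", v[int((NR+1)/2)]; print "max", v[NR]}'
)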
>





[Ganglia-general] Metric name mapping --> Real Meaning?

2012-02-24 Thread Jeff Blaine
I've yet to find any document describing the metrics
provided with base Ganglia.  Does such a thing exist?

I mean, sure, I can guess that "cpu_wio" is % CPU time
spent waiting on I/O, but should I have to guess?

bytes_in / bytes_out = Disk?  Network?

Etc...
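
(One partial workaround: the gmond XML stream itself carries TYPE and UNITS
attributes for every metric, which settles some of these questions.  And
bytes_in/bytes_out are network counters, read from /proc/net/dev, as the
slurpfile errors in the first thread of this digest suggest.  A sketch:

$ nc localhost 8649 | tr '>' '\n' | grep 'NAME="bytes_'
)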



Re: [Ganglia-general] Failing multicast for 1 host

2012-02-24 Thread Jeff Blaine
On 2/23/2012 3:10 PM, Jeff Blaine wrote:
> Hi all,
>
> We've got an existing Ganglia setup that is working fine.
> One new host is giving us trouble, though.  Although its
> hostname resolves to 2 IP addresses (something I am looking
> into the history of to see if it's sane), I would like to get
> the problem out to the list in case someone sees something
> obviously wrong.

That part has been addressed -- the host oahu now
resolves to a single IP address.

I've made no progress with Ganglia though.

udp_send_channel {
 mcast_if = eth0
 mcast_join = 239.2.11.77
 port = 8649
}
udp_recv_channel {
 mcast_if = eth0
 mcast_join = 239.2.11.77
 port = 8649
 bind = 239.2.11.77
}

Our gmetad is on a host 1xx.xx.11.xx

The Netcat Test (see FAQ) passes when sending "hello"
from our gmetad server to port 8649 on oahu (or direct
to either IP address)
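
(Two more checks that may help; a sketch, with the interface and group taken
from the config above:

$ netstat -g                 # is eth0 actually joined to 239.2.11.77?
$ tcpdump -i eth0 -n host 239.2.11.77 and port 8649
                             # are metric packets arriving/leaving at all?
)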

Here's the first ~30 seconds of gmond -d 9:

udp_recv_channel mcast_join=239.2.11.77 mcast_if=eth0 port=8649 
bind=239.2.11.77
tcp_accept_channel bind=NULL port=8649
udp_send_channel mcast_join=239.2.11.77 mcast_if=eth0 host=NULL port=8649

 metric 'cpu_user' being collected now
 metric 'cpu_user' has value_threshold 1.00
 metric 'cpu_system' being collected now
 metric 'cpu_system' has value_threshold 1.00
 metric 'cpu_idle' being collected now
 metric 'cpu_idle' has value_threshold 5.00
 metric 'cpu_nice' being collected now
 metric 'cpu_nice' has value_threshold 1.00
 metric 'cpu_aidle' being collected now
 metric 'cpu_aidle' has value_threshold 5.00
 metric 'cpu_wio' being collected now
 metric 'cpu_wio' has value_threshold 1.00
 metric 'load_one' being collected now
 metric 'load_one' has value_threshold 1.00
 metric 'load_five' being collected now
 metric 'load_five' has value_threshold 1.00
 metric 'load_fifteen' being collected now
 metric 'load_fifteen' has value_threshold 1.00
 metric 'proc_run' being collected now
 metric 'proc_run' has value_threshold 1.00
 metric 'proc_total' being collected now
 metric 'proc_total' has value_threshold 1.00
 metric 'mem_free' being collected now
 metric 'mem_free' has value_threshold 1024.00
 metric 'mem_shared' being collected now
 metric 'mem_shared' has value_threshold 1024.00
 metric 'mem_buffers' being collected now
 metric 'mem_buffers' has value_threshold 1024.00
 metric 'mem_cached' being collected now
 metric 'mem_cached' has value_threshold 1024.00
 metric 'swap_free' being collected now
 metric 'swap_free' has value_threshold 1024.00
 metric 'bytes_out' being collected now
  **  BYTES_OUT RETURN:  0.392367
 metric 'bytes_out' has value_threshold 4096.00
 metric 'bytes_in' being collected now
  Last two chars: 

  Last two chars: 12

  Last two chars: 12

  Last two chars: 

  Last two chars: 

  Last two chars: 23

  Last two chars: 23

  Last two chars: 45

  Last two chars: 45

  Last two chars: 

  Last two chars: 

  **  BYTES_IN RETURN:  3.481530
 metric 'bytes_in' has value_threshold 4096.00
 metric 'pkts_in' being collected now
  **  PKTS_IN RETURN:  0.030713
 metric 'pkts_in' has value_threshold 256.00
 metric 'pkts_out' being collected now
  **  PKTS_OUT RETURN:  0.011354
 metric 'pkts_out' has value_threshold 256.00
 metric 'disk_total' being collected now
Counting device /dev/root (15.63 %)
Counting device /dev/sda1 (42.37 %)
Counting device /dev/sdb1 (6.47 %)
Counting device /dev/sdc1 (41.42 %)
For all disks: 252.434 GB total, 195.375 GB free for users.
 metric 'disk_total' has value_threshold 1.00
 metric 'disk_free' being collected now
Counting device /dev/root (15.63 %)
Counting device /dev/sda1 (42.37 %)
Counting device /dev/sdb1 (6.47 %)
Counting device /dev/sdc1 (41.42 %)
For all disks: 252.434 GB total, 195.375 GB free for users.
 metric 'disk_free' has value_threshold 1.00
 metric 'part_max_used' being collected now
Counting device /dev/root (15.63 %)
Counting device /dev/sda1 (42.37 %)
Counting device /dev/sdb1 (6.47 %)
Counting device /dev/sdc1 (41.42 %)
For all disks: 252.434 GB total, 195.375 GB free for users.
 metric 'part_max_used' has value_threshold 1.00
 sent message 'heartbeat' of length 8 with 0 errors
 sent message 'cpu_num' of length 8 with 0 errors
 sent message 'cpu_speed' of length 8 with 0 errors
 sent message 'mem_total' of length 8 with 0 errors
 sent message 'swap_total' of length 8 with 0 errors
 sent message 'boottime' of length 8 with 0 errors
 sent message 'machine_type' of length 12 with 0 errors
 sent message 'os_name' of length 16 with 0 errors
 sent message 'os_release' of length 28 with 0 errors
 sent message 'location' of length 16 with 0 errors
 sent message 'gexec' of lengt