Re: [Ganglia plugin] Next steps

2013-06-28 Thread Chris Mattmann
Hey Rajith!

Great breakdown of the problem. My thoughts/comments are below:


-Original Message-

From: Rajith Siriwardana rajithsiriward...@gmail.com
Date: Thursday, June 27, 2013 10:27 AM
To: OODT dev dev@oodt.apache.org, jpluser
chris.a.mattm...@jpl.nasa.gov, jpluser chris.mattm...@gmail.com
Subject: Re: [Ganglia plugin] Next steps

Hi all,
The attached/linked diagram [1] shows how the GangliaResourceMonitorFactory
will be integrated with AssignmentMonitor to calculate load.
AssignmentMonitor keeps each node's load in a static hashmap
(nodeId, load), so I guess the loadMap should be updated at regular
intervals (e.g., every 1 minute) by parsing the Ganglia XML, right?

Bingo, correct.
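
As a rough illustration of that refresh loop (not actual plugin code: the
parseGangliaXml() method, gmetadHost/gmetadPort fields, and loadMap reference
below are placeholders for whatever GangliaResourceMonitor ends up exposing),
something like a ScheduledExecutorService could drive the 1-minute update:

    import java.util.Map;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Sketch: re-parse the gmetad XML dump once a minute and push the
    // fresh (nodeId, load) values into AssignmentMonitor's static map.
    ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(new Runnable() {
      public void run() {
        Map<String, Integer> fresh = parseGangliaXml(gmetadHost, gmetadPort);
        loadMap.putAll(fresh);
      }
    }, 0, 1, TimeUnit.MINUTES);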

 

The load we need is not a traditional load value; it is a number that
says how many of these jobs can fit on a machine. As I understand it, the
load calculation should take the most relevant metrics into account,
apply weights to their values, and then normalize the resulting load
value into the range 0 to 1.

Yep, or more likely, I would normalize the load into a value between
0 and node.getCapacity(),
where that capacity value is read from the nodes.xml file.
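
For example (just a sketch of the idea, with normalizedLoad standing in for
the already-normalized 0-1 value):

    // Scale the normalized load so it is expressed in "capacity units"
    // and clamp it to the [0, capacity] range AssignmentMonitor expects.
    int capacity = node.getCapacity();            // per-node value from nodes.xml
    int load = (int) Math.round(normalizedLoad * capacity);
    load = Math.min(Math.max(load, 0), capacity);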

 
I guess the following are the most relevant of the default Ganglia
metrics for the calculation.

load_one = one minute load average
load_five = five minutes load average
load_fifteen = fifteen minutes load average

mem_free = amount of available memory
swap_free = amount of available swap memory

+1


The following are the models I currently have in mind.
(I). Weight the 1-min, 5-min, and 15-min load numbers and normalize the
value.

+1

(II). Add the mem_free and swap_free metrics to the calculation in
model I.

+1



More weight should go to either the 5- or 15-minute average, according to [3].
#1. But how can I rationalize the weights I give?

Use node.getCapacity() and allow the user to provide that rationalization,
e.g.,
allow them to easily tweak the different metric weights via
configuration,
while at the same time ensuring that, when the weights are multiplied
with the metric values, the result remains between 0 and node.getCapacity().
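
One way to picture that (the property names, the norm() helper, and the
loadOne/loadFive/loadFifteen variables below are invented for illustration,
not an agreed configuration format):

    // conf is a java.util.Properties-style object holding user-tunable weights.
    double w1  = Double.parseDouble(conf.getProperty("ganglia.weight.load_one", "0.2"));
    double w5  = Double.parseDouble(conf.getProperty("ganglia.weight.load_five", "0.5"));
    double w15 = Double.parseDouble(conf.getProperty("ganglia.weight.load_fifteen", "0.3"));
    double sum = w1 + w5 + w15;  // dividing by the sum keeps the result in [0, 1]
    double weighted = (w1 * norm(loadOne) + w5 * norm(loadFive)
                       + w15 * norm(loadFifteen)) / sum;
    // norm() maps each raw metric into [0, 1] (e.g., load average / cpu_num),
    // so the final value can never leave [0, node.getCapacity()]:
    int load = (int) Math.round(weighted * node.getCapacity());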

 
#2. Furthermore, what is the capacity of a Node? Since we are talking
about normalization, what is the role of this capacity, and how does it
affect the calculation? (When assigning load to a particular node, it
checks something like if (loadValue <= (loadCap - curLoad)), where
loadCap = node.getCapacity() and
curLoad = loadMap.get(node.getNodeId()).intValue().)

Allow the user to set capacity() in nodes.xml, and then read it from there
(as a start).
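
If memory serves, the stock resource manager policy file is along these
lines, with capacity carried as a per-node attribute (the exact element and
attribute names are worth double-checking against the shipped nodes.xml):

    <cas:resourcenodes xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
      <!-- capacity = how many job "slots" this node can hold -->
      <node nodeId="localhost" ip="http://localhost:2001" capacity="8"/>
    </cas:resourcenodes>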

 

Other considerations
#3. what should be the value if the node is offline?

Capacity should probably be set to 0 at that point. IOW, if it's offline,
ignore the user's pre-profiled
capacity, and then say it can't hold any jobs.

We can tell a particular Node is offline from its TN and TMAX values: in
gmetad, a host is considered offline and is ignored if TN > 4 * TMAX [2].
(TN: the number of seconds since the metric was last updated; TMAX: the
maximum time in seconds between gmetric calls.)

+1
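
For what it's worth, a sketch of that check while walking the HOST elements
of the gmetad XML (the hostElem, nodeId, and effectiveCapacity names are
placeholders):

    // TN and TMAX are attributes on each HOST element in the gmetad XML.
    long tn   = Long.parseLong(hostElem.getAttribute("TN"));
    long tmax = Long.parseLong(hostElem.getAttribute("TMAX"));
    if (tn > 4 * tmax) {
      // Host is considered offline: report zero capacity so the
      // scheduler never assigns jobs to it.
      effectiveCapacity.put(nodeId, 0);
    }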

Great work! Please proceed.

Cheers,
Chris



The default Ganglia metrics are listed here; your thoughts are welcome.
disk_free = Disk Space Available
machine_type = System architecture
bytes_out = Number of bytes out per second
gexec = DESC VAL = gexec available
proc_total = Total number of processes
cpu_nice = Percentage of CPU utilization that occurred while executing at
the user level with nice priority
pkts_in = Packets in per second
cpu_speed = CPU Speed in terms of MHz
boottime = The last time that the system was started
cpu_wio = Percentage of time that the CPU or CPUs were idle during which
the system had an outstanding disk I/O request
os_name = Operating system name
load_one = One minute load average
os_release = Operating system release date
disk_total = Total available disk space
cpu_user = Percentage of CPU utilization that occurred while executing at
the user level
cpu_idle = Percentage of time that the CPU or CPUs were idle and the
system did not have an outstanding disk I/O request
swap_free = Amount of available swap memory
mem_cached = Amount of cached memory
pkts_out = Packets out per second
load_five = Five minute load average
cpu_num = Total number of CPUs
load_fifteen  = Fifteen minute load average
mem_free = Amount of available memory
cpu_system = Percentage of CPU utilization that occurred while executing
at the system level
proc_run = Total number of running processes
mem_total = Total amount of memory displayed in KBs
cpu_aidle = Percent of time since boot idle CPU
bytes_in  = Number of bytes in per second
mem_buffers  = Amount of buffered memory
mem_shared = Amount of shared memory
swap_total = Total amount of swap space displayed in KBs
part_max_used = Maximum percent used for all partitions


[1] https://issues.apache.org/jira/secure/attachment/12589911/diagram1.png
[2] http://entropy.gforge.inria.fr/ganglia.html
[3] http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages

Re: [Ganglia plugin] Next steps

2013-06-27 Thread Rajith Siriwardana
Hi all,

The attached/linked diagram [1] shows how the GangliaResourceMonitorFactory
will be integrated with AssignmentMonitor to calculate load.
AssignmentMonitor keeps each node's load in a static hashmap
(nodeId, load), so I guess the *loadMap should be updated at regular
intervals* (e.g., every 1 minute) by parsing the Ganglia XML, right?

The load we need is not a traditional load value; it is a number that
says how many of these jobs can fit on a machine. As I understand it, the
load calculation should take the most relevant metrics into account, apply
weights to their values, and then normalize the resulting load value into
the range 0 to 1.
I guess the following are the most relevant of the default Ganglia
metrics for the calculation.

load_one = one minute load average
load_five = five minutes load average
load_fifteen = fifteen minutes load average

mem_free = amount of available memory
swap_free = amount of available swap memory

The following are the models I currently have in mind.
(I). Weight the 1-min, 5-min, and 15-min load numbers and normalize the
value.
(II). Add the mem_free and swap_free metrics to the calculation in
model I.

More weight should go to either the 5- or 15-minute average, according to [3].
#1. *But how can I rationalize the weights I give?*
#2. Furthermore, what is the capacity of a Node? Since we are talking
about *normalization,
what is the role of this capacity?* How does it affect the calculation?
(When assigning load to a particular node, it checks something like if
(loadValue <= (loadCap - curLoad)), where loadCap = node.getCapacity() and
curLoad = loadMap.get(node.getNodeId()).intValue().)

Other considerations
#3. what should be the value if the node is offline?

We can tell a particular Node is offline from its TN and TMAX values: in
gmetad, a host is considered offline and is ignored if TN > 4 * TMAX [2].

(TN: the number of seconds since the metric was last updated; TMAX: the
maximum time in seconds between gmetric calls.)

*The default Ganglia metrics are listed here; your thoughts are welcome.*
disk_free = Disk Space Available
machine_type = System architecture
bytes_out = Number of bytes out per second
gexec = DESC VAL = gexec available
proc_total = Total number of processes
cpu_nice = Percentage of CPU utilization that occurred while executing at
the user level with nice priority
pkts_in = Packets in per second
cpu_speed = CPU Speed in terms of MHz
boottime = The last time that the system was started
cpu_wio = Percentage of time that the CPU or CPUs were idle during which
the system had an outstanding disk I/O request
os_name = Operating system name
load_one = One minute load average
os_release = Operating system release date
disk_total = Total available disk space
cpu_user = Percentage of CPU utilization that occurred while executing at
the user level
cpu_idle = Percentage of time that the CPU or CPUs were idle and the system
did not have an outstanding disk I/O request
swap_free = Amount of available swap memory
mem_cached = Amount of cached memory
pkts_out = Packets out per second
load_five = Five minute load average
cpu_num = Total number of CPUs
load_fifteen  = Fifteen minute load average
mem_free = Amount of available memory
cpu_system = Percentage of CPU utilization that occurred while executing at
the system level
proc_run = Total number of running processes
mem_total = Total amount of memory displayed in KBs
cpu_aidle = Percent of time since boot idle CPU
bytes_in  = Number of bytes in per second
mem_buffers  = Amount of buffered memory
mem_shared = Amount of shared memory
swap_total = Total amount of swap space displayed in KBs
part_max_used = Maximum percent used for all partitions

[1] https://issues.apache.org/jira/secure/attachment/12589911/diagram1.png
[2] http://entropy.gforge.inria.fr/ganglia.html
[3] http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages


Cheers,
Rajith



On Fri, Jun 21, 2013 at 7:22 PM, Rajith Siriwardana 
rajithsiriward...@gmail.com wrote:


 moving the conversation to dev.

 Cheers,
 Rajith

 On Thu, Jun 20, 2013 at 11:10 AM, Chris Mattmann chris.mattm...@gmail.com
  wrote:

 Hi Rajith,

 RE: #1 yep that's the next step.

 RE: #2, I would create a pluggable function/class that allows
 different Besting algorithms to be plugged in. One simple one
 would be AverageLoad (the average of the 3 load values). Another
 simple one would be FiveMinuteLoad; another OneMinLoad; etc. I would
 also imagine allowing ArbitraryMetricWeightedAvgLoad, which takes
 in maybe a List<String> specifying the metric names, and then also
 maybe a HashMap<String, Double> that maps each metric name to
 the weight to apply in the weighted average, e.g., maybe
 {{1minload, 3.0}, {5minload, 10.0}, {15minload, 1.0}}

 indicating that the final load should be calculated as:

 3*[val of 1minLoad] + 10*[val of 5minLoad] + 1*[val of 15minLoad]
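
For illustration, a minimal sketch of what such a pluggable calculator could
look like (the interface and class shown here are made-up names for this
example, not an existing OODT API; each type would live in its own file):

    import java.util.Map;

    public interface LoadCalculator {
      double calculateLoad(Map<String, Double> nodeMetrics);
    }

    public class ArbitraryMetricWeightedAvgLoad implements LoadCalculator {
      // e.g. {1minload=3.0, 5minload=10.0, 15minload=1.0}
      private final Map<String, Double> weights;

      public ArbitraryMetricWeightedAvgLoad(Map<String, Double> weights) {
        this.weights = weights;
      }

      public double calculateLoad(Map<String, Double> nodeMetrics) {
        // With the example weights this yields
        // 3*[val of 1minLoad] + 10*[val of 5minLoad] + 1*[val of 15minLoad].
        double load = 0.0;
        for (Map.Entry<String, Double> w : weights.entrySet()) {
          load += w.getValue() * nodeMetrics.get(w.getKey());
        }
        return load;
      }
    }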