Re: [Ganglia plugin] Next steps
Hey Rajith! Great breakdown of the problem. My thoughts/comments are below:

-----Original Message-----
From: Rajith Siriwardana <rajithsiriward...@gmail.com>
Date: Thursday, June 27, 2013 10:27 AM
To: OODT dev <dev@oodt.apache.org>, jpluser <chris.a.mattm...@jpl.nasa.gov>, jpluser <chris.mattm...@gmail.com>
Subject: Re: [Ganglia plugin] Next steps

> Hi all,
>
> The attached/linked diagram [1] shows how the GangliaResourceMonitorFactory
> will be integrated with AssignmentMonitor to calculate load. AssignmentMonitor
> keeps each node's load in a static HashMap of (nodeId, load), so I guess the
> loadMap should be updated in a timely manner (e.g., at a 1 min interval) by
> parsing the Ganglia XML, right?

Bingo, correct.

> Since the load we need is not a traditional value, but rather a value that
> says how many of these jobs can fit on a machine, as I understand it the load
> calculation should take the most relevant metrics into account, apply weights
> to their values, and then normalize the load value into the range 0 to 1.

Yep, or more likely, I would make the load normalize into a value between 0 and node.getCapacity(), where that value is read from the nodes.xml file.

> I guess the following default Ganglia metrics are the most relevant ones for
> the calculation:
>
> load_one = one-minute load average
> load_five = five-minute load average
> load_fifteen = fifteen-minute load average
> mem_free = amount of available memory
> swap_free = amount of available swap memory

+1

> The following are the models I currently have in mind:
>
> (I). Weight the 1 min, 5 min, and 15 min load numbers and normalize the value.

+1

> (II). Add the mem_free and swap_free metrics to the calculation in model (I).

+1

> More weight should go to either the 5 min or the 15 min average, according
> to [3].
>
> #1. But how can I rationalize the weights I give?

Use node.getCapacity() and allow the user to provide that rationalization, e.g., allow them to easily tinker (via configuration) with the different weights on the metrics, while at the same time ensuring that those weights, when multiplied together with the metric values, keep the result between 0 and node.getCapacity().
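To make that concrete, here is a minimal sketch of one way such configurable weighting could look. The metric names are the Ganglia defaults from this thread, but everything else (the WeightedLoadCalculator class, its scaling and clamping behavior) is a hypothetical illustration, not existing OODT code:

import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: combine weighted Ganglia metrics into a single load
 * value scaled into [0, capacity], where capacity is node.getCapacity()
 * as read from nodes.xml.
 */
public class WeightedLoadCalculator {

  // User-configurable weights per metric name (would come from configuration).
  private final Map<String, Double> weights = new HashMap<String, Double>();

  public WeightedLoadCalculator() {
    weights.put("load_one", 3.0);
    weights.put("load_five", 10.0);
    weights.put("load_fifteen", 1.0);
  }

  public double calculateLoad(Map<String, Double> metrics, int capacity) {
    double weightedSum = 0.0;
    double weightTotal = 0.0;
    for (Map.Entry<String, Double> w : weights.entrySet()) {
      Double value = metrics.get(w.getKey());
      if (value != null) {
        weightedSum += w.getValue() * value;
        weightTotal += w.getValue();
      }
    }
    if (weightTotal == 0.0) {
      return 0.0; // no metrics available; report an empty node
    }
    // Weighted average of the load metrics; a load average of 1.0 is treated
    // as "fully busy", and the scaled result is clamped into [0, capacity].
    double avg = weightedSum / weightTotal;
    return Math.min(capacity, Math.max(0.0, avg * capacity));
  }
}

Tweaking a weight in configuration then changes how strongly each metric pulls the result toward "full", while the clamp guarantees the 0-to-capacity invariant described above.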
> #2. Furthermore, what is the capacity of a node? Since we are talking about
> normalization, what is the role of this capacity? How does it affect the
> calculation? (When assigning load to a particular node it calculates
> something like if (loadValue <= (loadCap - curLoad)), where
> loadCap = node.getCapacity() and
> curLoad = loadMap.get(node.getNodeId()).intValue().)

Allow the user to set capacity in nodes.xml, and then read it from there (as a start).

> Other considerations:
>
> #3. What should the value be if the node is offline?

Capacity should probably be set to 0 at that point. IOW, if it's offline, ignore the user's pre-profiled capacity, and then say it can't hold any jobs.

> We can tell that a particular node is offline from the TN and TMAX values:
> in gmetad, a host is considered offline and is ignored if TN > 4 * TMAX [2].
> (TN: the number of seconds since the metric was last updated; TMAX: the
> maximum time in seconds between gmetric calls.)

+1

Great work! Please proceed.

Cheers,
Chris

[1] https://issues.apache.org/jira/secure/attachment/12589911/diagram1.png
[2] http://entropy.gforge.inria.fr/ganglia.html
[3] http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages
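As an aside on the refresh question at the top of this exchange, the periodic loadMap update could be as small as a scheduled task. A rough sketch, in which GangliaXmlParser and parseLoad() are made-up placeholders for whatever fetching/parsing code the plugin ends up with:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical source of (nodeId -> load) values parsed from Ganglia XML. */
interface GangliaXmlParser {
  Map<String, Integer> parseLoad();
}

/**
 * Sketch of a periodic refresher for the (nodeId -> load) map that
 * AssignmentMonitor consults.
 */
public class LoadMapRefresher {

  private final Map<String, Integer> loadMap =
      new ConcurrentHashMap<String, Integer>();
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public void start(final GangliaXmlParser parser) {
    // Re-parse the Ganglia XML once a minute, as suggested in the thread.
    scheduler.scheduleAtFixedRate(new Runnable() {
      public void run() {
        loadMap.putAll(parser.parseLoad());
      }
    }, 0, 1, TimeUnit.MINUTES);
  }

  public int getLoad(String nodeId) {
    Integer load = loadMap.get(nodeId);
    return load == null ? 0 : load.intValue();
  }
}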
Re: [Ganglia plugin] Next steps
Hi all,

The attached/linked diagram [1] shows how the GangliaResourceMonitorFactory will be integrated with AssignmentMonitor to calculate load. AssignmentMonitor keeps each node's load in a static HashMap of (nodeId, load), so I guess the *loadMap should be updated in a timely manner* (e.g., at a 1 min interval) by parsing the Ganglia XML, right?

Since the load we need is not a traditional value, but rather a value that says how many of these jobs can fit on a machine, as I understood it the load calculation should take the most relevant metrics into account and apply weights to their values; the load value should then be normalized into the range 0 to 1.

I guess the following default Ganglia metrics are the most relevant ones for the calculation:

load_one = one-minute load average
load_five = five-minute load average
load_fifteen = fifteen-minute load average
mem_free = amount of available memory
swap_free = amount of available swap memory

The following are the models I currently have in mind:

(I). Weight the 1 min, 5 min, and 15 min load numbers and normalize the value.
(II). Add the mem_free and swap_free metrics to the calculation in model (I). More weight should go to either the 5 min or the 15 min average, according to [3].

#1. *But how can I rationalize the weights I give?*

#2. Furthermore, what is the capacity of a node? Since we are talking about *normalization, what is the role of this capacity?* How does it affect the calculation? (When assigning load to a particular node it calculates something like if (loadValue <= (loadCap - curLoad)), where loadCap = node.getCapacity() and curLoad = loadMap.get(node.getNodeId()).intValue().)

Other considerations:

#3. What should the value be if the node is offline? We can tell that a particular node is offline from the TN and TMAX values: in gmetad, a host is considered offline and is ignored if TN > 4 * TMAX [2]. (TN: the number of seconds since the metric was last updated; TMAX: the maximum time in seconds between gmetric calls.)
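A small sketch of that offline test, assuming the TN and TMAX attributes that gmond reports on its HOST elements; the surrounding parsing scaffolding is illustrative only, not the plugin's actual code:

import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/**
 * Illustrative check of the "TN > 4 * TMAX" offline rule [2] against
 * the HOST elements in Ganglia's XML dump.
 */
public class GangliaOfflineCheck {

  /** A host is considered offline if TN > 4 * TMAX. */
  public static boolean isOffline(int tn, int tmax) {
    return tn > 4 * tmax;
  }

  public static void printOfflineHosts(InputStream gangliaXml) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().parse(gangliaXml);
    NodeList hosts = doc.getElementsByTagName("HOST");
    for (int i = 0; i < hosts.getLength(); i++) {
      Element host = (Element) hosts.item(i);
      int tn = Integer.parseInt(host.getAttribute("TN"));
      int tmax = Integer.parseInt(host.getAttribute("TMAX"));
      if (isOffline(tn, tmax)) {
        // Offline nodes would get capacity 0, per the discussion above.
        System.out.println(host.getAttribute("NAME") + " is offline");
      }
    }
  }
}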
*The default Ganglia metrics are listed below, and your thoughts are welcome.*

disk_free = Disk space available
machine_type = System architecture
bytes_out = Number of bytes out per second
gexec = gexec available
proc_total = Total number of processes
cpu_nice = Percentage of CPU utilization that occurred while executing at the user level with nice priority
pkts_in = Packets in per second
cpu_speed = CPU speed in MHz
boottime = The last time that the system was started
cpu_wio = Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request
os_name = Operating system name
load_one = One-minute load average
os_release = Operating system release date
disk_total = Total available disk space
cpu_user = Percentage of CPU utilization that occurred while executing at the user level
cpu_idle = Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request
swap_free = Amount of available swap memory
mem_cached = Amount of cached memory
pkts_out = Packets out per second
load_five = Five-minute load average
cpu_num = Total number of CPUs
load_fifteen = Fifteen-minute load average
mem_free = Amount of available memory
cpu_system = Percentage of CPU utilization that occurred while executing at the system level
proc_run = Total number of running processes
mem_total = Total amount of memory, in KB
cpu_aidle = Percent of time since boot that the CPU has been idle
bytes_in = Number of bytes in per second
mem_buffers = Amount of buffered memory
mem_shared = Amount of shared memory
swap_total = Total amount of swap space, in KB
part_max_used = Maximum percent used for all partitions

[1] https://issues.apache.org/jira/secure/attachment/12589911/diagram1.png
[2] http://entropy.gforge.inria.fr/ganglia.html
[3] http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages

Cheers,
Rajith

On Fri, Jun 21, 2013 at 7:22 PM, Rajith Siriwardana <rajithsiriward...@gmail.com> wrote:

> Moving the conversation to dev.
>
> Cheers,
> Rajith
>
> On Thu, Jun 20, 2013 at 11:10 AM, Chris Mattmann <chris.mattm...@gmail.com> wrote:
>
>> Hi Rajith,
>>
>> RE: #1, yep, that's the next step.
>>
>> RE: #2, I would create a pluggable function/class that allows different
>> load-calculation algorithms to be plugged in. One simple one would be
>> AverageLoad (the average of the 3 load values); another simple one would
>> be FiveMinuteLoad; another, OneMinLoad; etc. I would also imagine allowing
>> an ArbitraryMetricWeightedAvgLoad that takes in maybe a List<String>
>> specifying the metric names, and then also maybe a HashMap<String, Double>
>> that maps each metric name to the weight to apply in the weighted average,
>> e.g., maybe {{1minload, 3.0}, {5minload, 10.0}, {15minload, 1.0}},
>> indicating that the final load should be calculated as:
>>
>> 3*[val of 1minLoad] + 10*[val of 5minLoad] + 1*[val of 15minLoad]
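For reference, a minimal sketch of the pluggable arrangement Chris describes above. AverageLoad and ArbitraryMetricWeightedAvgLoad follow the names suggested in his message; the LoadCalculator interface and the constructor shape are assumptions, not existing code:

import java.util.List;
import java.util.Map;

/** Pluggable load-calculation strategy, as suggested in the thread. */
interface LoadCalculator {
  double calculate(Map<String, Double> metrics);
}

/** Simple average of the three load values. */
class AverageLoad implements LoadCalculator {
  public double calculate(Map<String, Double> metrics) {
    return (metrics.get("load_one") + metrics.get("load_five")
        + metrics.get("load_fifteen")) / 3.0;
  }
}

/**
 * Weighted combination over an arbitrary set of metrics, e.g. weights of
 * {load_one=3.0, load_five=10.0, load_fifteen=1.0} yield
 * 3*[1 min] + 10*[5 min] + 1*[15 min], as in the example above.
 */
class ArbitraryMetricWeightedAvgLoad implements LoadCalculator {
  private final List<String> metricNames;
  private final Map<String, Double> weights;

  ArbitraryMetricWeightedAvgLoad(List<String> metricNames,
      Map<String, Double> weights) {
    this.metricNames = metricNames;
    this.weights = weights;
  }

  public double calculate(Map<String, Double> metrics) {
    double load = 0.0;
    for (String name : metricNames) {
      load += weights.get(name) * metrics.get(name);
    }
    return load;
  }
}

Under an arrangement like this, the Ganglia monitor would simply delegate to whichever LoadCalculator implementation it was configured with.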