2013/12/7 Adrian Sevcenco <adrian.sevce...@cern.ch>:
> On 12/06/2013 10:51 PM, Devon H. O'Dell wrote:
>> 2013/12/6 Vladimir Vuksan <vli...@veus.hr>:
>>> Hello everyone,
>
> Hi!
>
>>> For a few weeks now we have had performance issues due to growth of
>>> our monitoring setup. One of my colleagues, Devon O'Dell, volunteered
>>> to help, and below is an e-mail of his findings.
>>
>> Hi! I joined the ML, so I'm around to answer questions. Nice to
>> 'meet' you guys!
>
> Thank you for your work! I have some questions/ideas as well, but I am
> still struggling with the internal gmond structures, so it may take a
> while until I can contribute myself (plus I am not a programmer by
> profession).
No worries!

> So:
> You said that you are using a gmond to collect data from every machine.
> The problem with the current implementation of gmond is that:
> 1. it cannot be used for aggregation only (no metrics from localhost)
> 2. the cluster tagging is done at the XML reporting level, not at the
>    host level.
>
> It would be nice to have the possibility of gmond aggregators that just
> pass along a collection of metrics from multiple machines.
> Also, if the cluster tagging were done at the gmond reporting level, it
> would be possible to aggregate metrics from different clusters in one
> gmond, and gmetad would just write each metrics bundle into the
> corresponding cluster space.
>
> Moreover (it was discussed on the list without a clear conclusion), it
> would be great if a UUID could be introduced in gmond (regardless of
> the method of generation: derived from hardware or randomly generated)
> that would be the actual key for identifying a machine.
> It would be enough to have something like this in the host section of
> gmond.conf:
>   uuid = "some_uuid"
> and to move override_hostname from globals to host in the form of a
> list:
>   override_hostname_list = "list_of_names"
> that would be reported to gmetad as a list of aliases (alongside the
> reverse DNS result).
> This would have the effect that the host could be searched by any of
> its former or present hostnames (whether resolved by DNS or not).

Not sure how to speak to these two points. My practical familiarity with
Ganglia is limited to our usage of it at Fastly. Conceptually, I don't
think these things are hard to do, but I've only been looking at the
code for about a week (and spent almost zero time looking at gmond).
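To make the proposal concrete, though: I imagine the host section would end up looking something like the sketch below. (Purely hypothetical — neither uuid nor override_hostname_list exists in gmond.conf today; the values are made up.)

```
/* gmond.conf -- sketch of the proposed host section */
host {
  location = "rack-3"

  /* Hypothetical: a stable machine identity, independent of hostname,
     used as the key for identifying the machine. */
  uuid = "3f2504e0-4f89-41d3-9a0c-0305e82c3301"

  /* Hypothetical: replaces the global override_hostname; reported to
     gmetad as a list of aliases alongside the reverse-DNS result. */
  override_hostname_list = "web01,web01.old.example.com"
}
```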
>> Ganglia performance, but most of the low-hanging fruit is now gone; at
>> some point it will require:
>>
>> * writing a version of librrd (this probably also means changing the
>>   rrd file format),
>
> We (the ALICE experiment at CERN) use a tool named MonALISA

ALICE is a really neat experiment; thanks for your work on that :)

> (http://monalisa.caltech.edu), written in Java, that can take in many
> hundreds of thousands of metrics and write them into a Postgres
> database. One obvious advantage would be that there is no need to
> summarize at the recording stage, and also that you have access to the
> precise metrics without losing information to averaging.
>
> Wouldn't it be possible to adapt gmetad to write the data into a
> Postgres database? One side effect would be that gweb could easily
> live on another server (for security and load-separation purposes) and
> build reports from the database (with the averaging mechanism
> implemented at the reporting level).

Yeah, it's possible. The downside to doing this is that, while inserting
the data into some SQL server may be fast, reporting on large amounts of
time series data ends up being terribly slow. Of course, everyone has a
different workload, so this may be acceptable for some folks. RRD is a
great format for time series data.

>> * replacing the hash table in Ganglia with one that performs better,
>> * changing the data serialization format from XML to one that is
>>   easier/faster to parse,
>
> i could just be speaking nonsense, as i don't understand exactly where
> the hash table is used (at the metrics collection step, by gmond or
> gmetad?)

The hash tables in gmetad (and probably gmond) are used to represent
just about every piece of data in the system. From my understanding,
hash tables in gmond are iterated over to generate the XML sent to
gmetad. Gmetad stores the metrics in XML as it parses them. When it does
its summary stuff, it iterates over the hash tables at each level,
summing numbers.
In gmetad, this is problematic for a few reasons:

1. Hash lookups imply at least one allocation and data copy; inserts do
   two. A significant amount of execution time is spent in memcpy,
   because everything gmetad does is based on looking up values in hash
   tables, changing them, and adding new ones.

2. The data in the hash buckets are not in contiguous memory. This means
   that even if we fix the data sharing between threads so that we don't
   have to copy data on a lookup, iteration will remain slow, because
   every piece of data is a cache miss and has to be fetched from memory
   (and can't be prefetched).

3. The hash function is inefficient. If gmetad tends to run on 32- or
   64-bit Intel architectures, Murmur would be a better idea. But even
   then, it seems the normal case requires normalizing the key value to
   lowercase. This takes forever, largely because tolower(3) takes the
   locale into account.

I imagine gmond faces similar problems; I haven't taken a look yet.

> but couldn't the same XDR format be used for all communication (and
> maybe the communication could be improved by using ZeroMQ?)
> (also with some standalone CLI tool that would read and process the
> output of a gmond). This would remove the need for an XML output, and
> with the CLI tool there would also be the possibility of human
> inspection of the metrics as text. (eventually with the conversion to
> XML done by the CLI tool)

0MQ is a neat API and probably fast enough. But I think that it's more
the format of the data that's slow to parse than the communication
medium. And if there's an application-specific protocol for sharing
data, it'd probably end up being faster than 0MQ (implemented properly).

>> * using a different data structure than a hash table for metrics
>>   hierarchies (probably a tree with metrics stored at each level in
>>   contiguous memory and an index describing each metric at each level)
>
> postgres tables?
This wouldn't be fast enough; the hash tables I'm referring to are
indeed the in-memory data structures representing the hierarchy of
metrics.

>> * refactoring gmetad and gmond into a single process that shares
>>   memory
>
> i don't think this is a good idea, as they are processes with different
> functionality in mind. That would make a very heavy process, even if
> you don't start the gmetad part. (Basically, what Ganglia excels at is
> being a simple, lightweight, and robust agent-based monitoring tool.)
>
> I would want to help if it's possible, but i would also need some
> mentoring.

Happy to help where I can. I fear that I know (comparatively) very
little about the use cases for this tool, so it's possible that I'm
saying / suggesting some very silly things. I'm just good at making
things faster :)

--dho

> Thank you!
> Adrian
>
> ------------------------------------------------------------------------------
> Sponsored by Intel(R) XDK
> Develop, test and display web and hybrid apps with a single code base.
> Download it for free now!
> http://pubads.g.doubleclick.net/gampad/clk?id=111408631&iu=/4140/ostg.clktrk
> _______________________________________________
> Ganglia-developers mailing list
> Ganglia-developers@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/ganglia-developers