[ https://issues.apache.org/jira/browse/KAFKA-489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neha Narkhede updated KAFKA-489:
--------------------------------
    Attachment: kafka-489-v1.patch

1. Added a new metrics definition file in json format. The purpose of this file is to define the graphs that we want to plot at the end of each test case run. The file is organized as a list of dashboards, one per role (zookeeper/broker/producer/consumer). Each dashboard has all the graphs associated with that role. Each graph is associated with an mbean and has:
  1. Graph title
  2. Mbean name
  3. Attributes and respective y-axis labels. A separate graph is plotted for each attribute specified for an mbean.
2. Metrics are collected when an entity is started. The metrics definition json file is read and JmxTool is used to collect all the attributes for a particular mbean in a separate csv file. So, we have one csv file per mbean defined in metrics.json.
3. At the end of each test case run, the metrics scripts go through metrics.json, find the csv files for all entities associated with an mbean, and plot one graph per attribute. These graphs are placed under testcase/dashboards/[role]/.
4. Once the graphs are plotted, another script goes through metrics.json and creates html dashboards arranging the graphs in html files, one per role. So we have 4 dashboards: one each for zookeeper, broker, producer and consumer. To view the graphs, open testcase/dashboards/metrics.html.
5. Here are the new packages used to build this framework:
  5.1. The graphs are plotted using matplotlib, so matplotlib needs to be installed in order to get these graphs.
  5.2. I included a very lightweight python package "pyh" to create the html pages. This avoids writing boilerplate code to create simple html pages.
6. Code changes:
  6.1. metrics.py includes APIs to collect metrics, plot graphs and create dashboards.
  6.2. Currently, we use SnapshotStats to collect metrics. The problem is that it only supports collecting metrics at a fixed time window, which is not configurable.
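For illustration, here is a minimal Python sketch of how a metrics.json of the shape described above might be read to plan one graph per mbean attribute under testcase/dashboards/[role]/. The field names, the sample mbean, and the `planned_graphs` helper are assumptions for the sketch, not the actual schema or code in the patch:

```python
import json

# Hypothetical metrics.json contents: a list of dashboards, one per role,
# where each graph names an mbean and per-attribute y-axis labels.
# Field names and the sample mbean are illustrative only.
METRICS_JSON = """
{
  "dashboards": [
    {
      "role": "broker",
      "graphs": [
        {
          "graph_name": "Produce Request Rate",
          "mbean_name": "kafka:type=kafka.BrokerAllTopicStat",
          "attributes": ["ProduceRequestsPerSecond"],
          "y_labels": ["requests/sec"]
        }
      ]
    }
  ]
}
"""

def planned_graphs(metrics_json):
    """Return (role, mbean, attribute, y_label, output_path) tuples:
    one graph per attribute, placed under testcase/dashboards/<role>/."""
    plan = []
    for dash in json.loads(metrics_json)["dashboards"]:
        role = dash["role"]
        for graph in dash["graphs"]:
            for attr, ylab in zip(graph["attributes"], graph["y_labels"]):
                path = "testcase/dashboards/%s/%s.png" % (role, attr)
                plan.append((role, graph["mbean_name"], attr, ylab, path))
    return plan

for role, mbean, attr, ylab, path in planned_graphs(METRICS_JSON):
    print(role, mbean, attr, ylab, path)
```

In the real framework the csv data for each mbean would then be fed to matplotlib, one figure per attribute, using the y-label and title from the json.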
It was hardcoded to 1 minute, but for tests this turns out to be too large to get any meaningful metrics. Another issue is that we use static objects to register mbeans in most places, so I couldn't find a better way to pass the monitoring duration config parameter. For now, I've hardcoded it to 1s. I realize this is not ideal, but since we will be scrapping SnapshotStats as part of KAFKA-203, we can punt on this for now.
  6.3. The new codahale metrics for the socket server and request purgatory attach the broker id to the mbean name. This is inconvenient for most monitoring systems, since they now need to pick up the broker id from the application config in order to construct the right mbean name for metrics collection. Also, since each broker starts in a separate JVM, there isn't much value in identifying the mbeans by broker id. So, I removed the broker id from all mbeans exposed by the codahale metrics package.
  6.4. Right now, there is one csv file output per mbean. The other alternative is one csv file for all possible mbeans exposed by the kafka process. There are 2 concerns with including all metrics in one big csv file:
    1. This file might become too large for long-running tests.
    2. If some column gets wedged for some mbean, that can affect the graphs for all other mbeans.
  6.5. Arguably, we don't even need this metrics.json file if we decide to collect all mbeans exposed by a kafka process. The reason I have one is that each graph needs a meaningful x/y axis label and a meaningful title. But, if we think the fully qualified mbean name is not too bad for a graph title, and the mbean attribute name is workable as the y-axis label, we could remove it.
  6.6. Removed the single_host_multiple_brokers_test.

To see the graphs:
1. Run the test:
   cd system_test
   python -B system_test_runner.py
2.
Open the system_test/replication_testsuite/testcase_1/dashboards/metrics.html in a browser

> Add metrics collection and graphs to the system test framework
> --------------------------------------------------------------
>
>                 Key: KAFKA-489
>                 URL: https://issues.apache.org/jira/browse/KAFKA-489
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Neha Narkhede
>            Assignee: Neha Narkhede
>              Labels: replication-testing
>         Attachments: kafka-489-v1.patch
>
>
> We have a new system test framework that allows defining a test cluster, starting kafka processes in the cluster, running tests and collecting logs. In addition to this, it will be great to have the ability to do the following for each test case run:
> 1. collect metrics as exposed by mbeans
> 2. collect various system metrics exposed by sar/vmstat/jvm
> 3. graph the metrics
> The expected output of this work should be the ability to output a link to all the graphs for each test case.
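As a rough sketch of step 4 in the comment above (assembling the per-role html dashboards), the `dashboard_html` helper below is hypothetical: it uses plain string templating where the patch actually uses the pyh package, and just emits one <img> per generated graph:

```python
import os

def dashboard_html(role, image_paths):
    """Build a minimal per-role dashboard page: one <img> per graph image.
    Plain string templating stands in for the pyh package used by the patch;
    the page layout here is illustrative, not the patch's actual output."""
    body = "\n".join('<img src="%s" alt="%s"/>' % (p, os.path.basename(p))
                     for p in image_paths)
    return ("<html><head><title>%s dashboard</title></head>"
            "<body><h1>%s</h1>\n%s\n</body></html>") % (role, role, body)

# One such page per role (zookeeper/broker/producer/consumer) would then be
# linked from a top-level testcase/dashboards/metrics.html.
html = dashboard_html("producer",
                      ["testcase/dashboards/producer/ProduceRate.png"])
print(html)
```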