[ https://issues.apache.org/jira/browse/KAFKA-489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neha Narkhede updated KAFKA-489:
--------------------------------
    Attachment: kafka-489-v1.patch

1. Added a new metrics definition file in json format. The purpose of this file is to define the graphs that we want to plot at the end of each test case run. The file is organized as a list of dashboards, one per role (zookeeper/broker/producer/consumer). Each dashboard has all the graphs associated with that role. Each graph is associated with an mbean and has:
  1. Graph title
  2. Mbean name
  3. Attributes and respective y-axis labels. A separate graph is plotted for each attribute specified for an mbean.
2. Metrics are collected when an entity is started. The metrics definition json file is read and JmxTool is used to collect all the attributes for a particular mbean in a separate csv file. So, we have one csv file per mbean defined in metrics.json.
3. At the end of each test case run, the metrics scripts go through metrics.json, find the csv files for all entities associated with an mbean, and plot one graph per attribute. These graphs are placed under testcase/dashboards/[role]/.
4. Once the graphs are plotted, another script goes through metrics.json and creates html dashboards arranging the graphs in html files, one per role. So we have 4 dashboards: one each for zookeeper, broker, producer and consumer. To view the graphs, open testcase/dashboards/metrics.html.
5. Here are the new packages used to build this framework:
  5.1. The graphs are plotted using matplotlib, so matplotlib needs to be installed in order to get these graphs.
  5.2. I included a very lightweight python package "pyh" to create the html pages. This avoids writing boilerplate code to create simple html pages.
6. Code changes:
  6.1. metrics.py includes APIs to collect metrics, plot graphs and create dashboards.
  6.2. Currently, we use SnapshotStats to collect metrics. The problem is that it only supports collecting metrics at a fixed time window, which is not configurable.
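For illustration, here is a minimal Python sketch of how a metrics.json of the shape described above might be read to plan one graph per mbean attribute under testcase/dashboards/[role]/. The field names, the sample mbean, and the `planned_graphs` helper are assumptions for the sketch, not the actual schema or code in the patch:

```python
import json

# Hypothetical metrics.json contents: a list of dashboards, one per role,
# where each graph names an mbean and per-attribute y-axis labels.
# Field names and the sample mbean are illustrative only.
METRICS_JSON = """
{
  "dashboards": [
    {
      "role": "broker",
      "graphs": [
        {
          "graph_name": "Produce Request Rate",
          "mbean_name": "kafka:type=kafka.BrokerAllTopicStat",
          "attributes": ["ProduceRequestsPerSecond"],
          "y_labels": ["requests/sec"]
        }
      ]
    }
  ]
}
"""

def planned_graphs(metrics_json):
    """Return (role, mbean, attribute, y_label, output_path) tuples:
    one graph per attribute, placed under testcase/dashboards/<role>/."""
    plan = []
    for dash in json.loads(metrics_json)["dashboards"]:
        role = dash["role"]
        for graph in dash["graphs"]:
            for attr, ylab in zip(graph["attributes"], graph["y_labels"]):
                path = "testcase/dashboards/%s/%s.png" % (role, attr)
                plan.append((role, graph["mbean_name"], attr, ylab, path))
    return plan

for role, mbean, attr, ylab, path in planned_graphs(METRICS_JSON):
    print(role, mbean, attr, ylab, path)
```

In the real framework the csv data for each mbean would then be fed to matplotlib, one figure per attribute, using the y-label and title from the json.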
It was hardcoded to 1 minute, but for tests this turns out to be too large to get any meaningful metrics. Another issue is that we use static objects to register mbeans in most places, so I couldn't find a better way to pass the monitoring duration config parameter. For now, I've hardcoded it to 1s. I realize this is not ideal, but since we will be scrapping SnapshotStats as part of KAFKA-203, we can punt on this for now.
  6.3. The new codahale metrics for the socket server and request purgatory attach the broker id to the mbean name. This is inconvenient for most monitoring systems, since they now need to pick up the broker id from the application config in order to construct the right mbean name for metrics collection. Also, since each broker starts in a separate JVM, there isn't much value in identifying the mbeans by broker id. So, I removed the broker id from all mbeans exposed by the codahale metrics package.
  6.4. Right now, there is one csv file output per mbean. The other alternative is one csv file for all possible mbeans exposed by the kafka process. There are 2 concerns with including all metrics in one big csv file:
    1. This file might become too large for long-running tests.
    2. If some column gets wedged for some mbean, that can affect the graphs for all other mbeans.
  6.5. Arguably, we don't even need this metrics.json file if we decide to collect all mbeans exposed by a kafka process. The reason I have one is that each graph needs a meaningful x/y axis label and a meaningful title. But, if we think the fully qualified mbean name is not too bad for a graph title, and the mbean attribute name is workable as the y-axis label, we could remove it.
  6.6. Removed the single_host_multiple_brokers_test.

To see the graphs:
1. Run the test:
   cd system_test
   python -B system_test_runner.py
2.
Open the system_test/replication_testsuite/testcase_1/dashboards/metrics.html in a browser

> Add metrics collection and graphs to the system test framework
> --------------------------------------------------------------
>
>                 Key: KAFKA-489
>                 URL: https://issues.apache.org/jira/browse/KAFKA-489
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Neha Narkhede
>            Assignee: Neha Narkhede
>              Labels: replication-testing
>         Attachments: kafka-489-v1.patch
>
>
> We have a new system test framework that allows defining a test cluster, starting kafka processes in the cluster, running tests and collecting logs. In addition to this, it will be great to have the ability to do the following for each test case run:
> 1. collect metrics as exposed by mbeans
> 2. collect various system metrics exposed by sar/vmstat/jvm
> 3. graph the metrics
> The expected output of this work should be the ability to output a link to all the graphs for each test case.
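As a rough sketch of step 4 in the comment above (assembling the per-role html dashboards), the `dashboard_html` helper below is hypothetical: it uses plain string templating where the patch actually uses the pyh package, and just emits one <img> per generated graph:

```python
import os

def dashboard_html(role, image_paths):
    """Build a minimal per-role dashboard page: one <img> per graph image.
    Plain string templating stands in for the pyh package used by the patch;
    the page layout here is illustrative, not the patch's actual output."""
    body = "\n".join('<img src="%s" alt="%s"/>' % (p, os.path.basename(p))
                     for p in image_paths)
    return ("<html><head><title>%s dashboard</title></head>"
            "<body><h1>%s</h1>\n%s\n</body></html>") % (role, role, body)

# One such page per role (zookeeper/broker/producer/consumer) would then be
# linked from a top-level testcase/dashboards/metrics.html.
html = dashboard_html("producer",
                      ["testcase/dashboards/producer/ProduceRate.png"])
print(html)
```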