[
https://issues.apache.org/jira/browse/KAFKA-489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Neha Narkhede updated KAFKA-489:
--------------------------------
Attachment: kafka-489-v1.patch
1. Added a new metrics definition file in json format. The purpose of this
file is to define the graphs that we want to plot at the end of each test case
run. The file is organized as a list of dashboards, one per role
(zookeeper/broker/producer/consumer). Each dashboard has all the graphs
associated with that role. Each graph is associated with an mbean and has -
1. Graph title
2. Mbean name
3. Attributes and their respective y-axis labels.
A separate graph is plotted for each attribute specified for an mbean.
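To make the layout concrete, here is a rough sketch of what one entry in
metrics.json could look like; the field names and the mbean/attribute values
below are made up for illustration, not the exact schema in the patch:

import json

# Hypothetical metrics.json content; field names and values are placeholders.
example = """
{
  "dashboards": [
    {
      "role": "broker",
      "graphs": [
        {
          "graph_title": "Produce request rate",
          "mbean_name": "kafka:type=SomeBrokerStat",
          "attributes": {"RequestsPerSecond": "requests/sec"}
        }
      ]
    }
  ]
}
"""
metrics_defs = json.loads(example)
for dashboard in metrics_defs["dashboards"]:
    for graph in dashboard["graphs"]:
        # one graph is plotted per (mbean, attribute) pair
        print(dashboard["role"], graph["mbean_name"], list(graph["attributes"]))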
2. Metrics are collected when an entity is started. The metrics definition
json file is read and JmxTool is used to collect all the attributes for a
particular mbean in a separate csv file. So, we have one csv file per mbean
defined in metrics.json.
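Roughly, the collection step boils down to launching one JmxTool process per
mbean and redirecting its csv output to a file. A simplified sketch of that
step follows; the paths, port and option values are assumptions, not the exact
code in metrics.py:

import subprocess

# Hypothetical helper: start JmxTool for one mbean and write its output to a
# csv. The kafka-run-class.sh path, jmx port and reporting interval are
# placeholders.
def start_jmx_collection(kafka_home, jmx_port, object_name, csv_path):
    cmd = [
        kafka_home + "/bin/kafka-run-class.sh", "kafka.tools.JmxTool",
        "--jmx-url",
        "service:jmx:rmi:///jndi/rmi://localhost:%d/jmxrmi" % jmx_port,
        "--object-name", object_name,
        "--reporting-interval", "1000",   # poll every second
    ]
    out = open(csv_path, "w")
    # the returned Popen handle is terminated when the entity is stopped
    return subprocess.Popen(cmd, stdout=out, stderr=subprocess.PIPE)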
3. At the end of each test case run, the metrics scripts go through the
metrics.json file, find the csv files for all entities associated with an mbean,
and plot one graph per attribute. These graphs are placed under
testcase/dashboards/[role]/.
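As an illustration of this step, a minimal version of "one graph per
attribute" could look like the following; the csv column layout and the file
naming are assumptions, not the patch's actual code:

import csv
import matplotlib
matplotlib.use("Agg")          # render to files, no display needed
import matplotlib.pyplot as plt

# Hypothetical: the csv has a "time" column plus one column per mbean attribute.
def plot_attribute(csv_path, attribute, y_label, title, output_png):
    times, values = [], []
    with open(csv_path) as f:
        for row in csv.DictReader(f):
            times.append(float(row["time"]))
            values.append(float(row[attribute]))
    plt.figure()
    plt.plot(times, values)
    plt.xlabel("time (s)")
    plt.ylabel(y_label)
    plt.title(title)
    plt.savefig(output_png)    # e.g. testcase_1/dashboards/broker/<attribute>.png
    plt.close()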
4. Once the graphs are plotted, another script goes through metrics.json and
arranges the graphs into html dashboards, one html file per role. So we have 4
dashboards - one each for zookeeper, broker, producer and consumer. To view the
graphs, open testcase/dashboards/metrics.html
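The dashboard step is essentially embedding the saved images in an html page
per role. A minimal sketch using pyh, assuming its PyH/printOut API; the file
layout is illustrative:

from pyh import *   # lightweight html generation package included with the patch

# Hypothetical: build one dashboard page per role from the saved graph images.
def build_dashboard(role, image_paths, output_html):
    page = PyH("%s dashboard" % role)
    page << h1("Metrics for %s" % role)
    for path in image_paths:
        page << div() << img(src=path, width="600")
    page.printOut(output_html)   # e.g. testcase_1/dashboards/broker.html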
5. Here are the new packages used to build this framework -
5.1. The graphs are plotted using matplotlib, so matplotlib needs to be
installed in order to generate them.
5.2. I included a very lightweight python package "pyh" to create the html
pages. This avoids writing boilerplate code to create simple html pages.
6. Code changes -
6.1. metrics.py includes APIs to collect metrics, plot graphs and create
dashboards
6.2. Currently, we use SnapshotStats to collect metrics. The problem is that
it only supports collecting metrics over a fixed time window, which is not
configurable. It was hard coded to 1 minute, but for tests, this turns out to
be too large to get any meaningful metrics. Another issue is that we use
static objects to register mbeans in most places, so I couldn't find a better
way to pass in the monitoring duration config parameter. For now, I've
hardcoded it to 1s; I realize that this is not ideal. But since we will be
scrapping SnapshotStats as part of KAFKA-203, we can punt on this for now.
6.3. The new codahale metrics for the socket server and request purgatory
attach the broker id to the mbean name. This is inconvenient for most
monitoring systems, since they now need to pick up the broker id from the
application config in order to construct the right mbean name for metrics
collection. Also, since each broker starts in a separate JVM, there isn't much
value in identifying the mbeans by the broker id. So, I removed the broker id
from all mbeans exposed by the codahale metrics package.
6.4. Right now, there is one csv file output per mbean. The alternative is one
csv file for all the mbeans exposed by the kafka process. There are 2 concerns
with putting all metrics in one big csv file -
1. The file might become too large for long running tests.
2. If a column gets wedged for one mbean, it can affect the graphs for all the
other mbeans.
6.5. Arguably, we don't even need this metrics.json file if we decide to
collect all mbeans exposed by a kafka process. The reason I have one is that
each graph needs meaningful x/y axis labels and a meaningful title. But, if we
think the fully qualified mbean name is not too bad as a graph title, and the
mbean attribute name is workable as the y-axis label, we could remove the
file.
6.6. Removed the single_host_multiple_brokers_test
To see the graphs -
1. Run the test
cd system_test
python -B system_test_runner.py
2. Open the
system_test/replication_testsuite/testcase_1/dashboards/metrics.html in a
browser
> Add metrics collection and graphs to the system test framework
> --------------------------------------------------------------
>
> Key: KAFKA-489
> URL: https://issues.apache.org/jira/browse/KAFKA-489
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8
> Reporter: Neha Narkhede
> Assignee: Neha Narkhede
> Labels: replication-testing
> Attachments: kafka-489-v1.patch
>
>
> We have a new system test framework that allows defining a test cluster,
> starting kafka processes in the cluster, running tests and collecting logs.
> In addition to this, it will be great to have the ability to do the following
> for each test case run -
> 1. collect metrics as exposed by mbeans
> 2. collect various system metrics exposed by sar/vmstat/jvm
> 3. graph the metrics
> The expected output of this work should be the ability to output a link to
> all the graphs for each test case.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira