Redefine Chukwa time series storage
-----------------------------------
Key: CHUKWA-444
URL: https://issues.apache.org/jira/browse/CHUKWA-444
Project: Hadoop Chukwa
Issue Type: New Feature
Components: Data Processors
Environment: Redhat EL 5.1, Java 6
Reporter: Eric Yang
The current Chukwa Record format is not well suited to data visualization. It
is closer to an archive format: it combines data from multiple sources (hosts)
and groups them into a sorted, time-partitioned sequence file. Most people
collect data for two reasons: archiving and data analysis. The current Chukwa
Record format is fine for archiving, but not for data analysis.
Data analysis can be further broken down into two types: 1) data that can be
aggregated and summarized, such as metrics, and 2) data that cannot be
summarized, such as job history. Type 1 data is useful for visualization as
graphs; type 2 data is useful for plain-text viewing or for searching for a
particular event.
Given the above rationale, it probably makes sense to restructure Chukwa
Records for data analysis. Outside the Hadoop world, RRDtool is great for time
series data storage, but it is optimized for metrics from a single source,
i.e. a host. RRD data files fragment badly when there are hundreds of
thousands of sources. Chukwa time series data storage should be able to
combine multiple data sources into one Chukwa file to combat this file
fragmentation problem.
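To make the idea concrete, here is a minimal sketch of one store holding
summarized metrics from many sources, instead of one file per source. It is
only an illustration under stated assumptions, not a proposed Chukwa format:
the class name CombinedSeriesStore, its methods, and the 60-second bucket
size are all hypothetical, and an in-memory dict stands in for the file.

```python
from collections import defaultdict


class CombinedSeriesStore:
    """Hypothetical in-memory stand-in for a single multi-source series file."""

    def __init__(self, bucket_seconds=60):
        # Width of each time partition, in seconds (assumed value).
        self.bucket_seconds = bucket_seconds
        # (bucket_start, metric) -> {source: [sum, count]}
        self._rows = defaultdict(dict)

    def add(self, source, metric, timestamp, value):
        """Fold one sample into the running summary for its time bucket."""
        bucket = timestamp - (timestamp % self.bucket_seconds)
        cell = self._rows[(bucket, metric)].setdefault(source, [0.0, 0])
        cell[0] += value
        cell[1] += 1

    def series(self, source, metric):
        """Return [(bucket_start, mean)] for one source, sorted by time."""
        points = []
        for key in sorted(self._rows):
            bucket, row_metric = key
            if row_metric != metric:
                continue
            cell = self._rows[key].get(source)
            if cell is not None:
                points.append((bucket, cell[0] / cell[1]))
        return points


store = CombinedSeriesStore(bucket_seconds=60)
store.add("host1", "cpu", 0, 10.0)
store.add("host1", "cpu", 30, 20.0)
store.add("host2", "cpu", 10, 50.0)
store.add("host1", "cpu", 60, 30.0)
print(store.series("host1", "cpu"))
```

Because every source shares the same time-partitioned rows, adding a host adds
a column to existing rows rather than a new file, which is the property the
description asks for. Only type 1 (summarizable) data fits this layout.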