According to Metrics, we have about 1TB of telemetry data in hadoop, which is almost a year's worth. Our telemetry pings keep growing as we add more probes, and as the hadoop database gets bigger, query times get worse. We need to decide what data we can throw away, and when.

Backend Optimization:
Data is stored as json packets and compressed in gzip chunks of 128MB (or something similarly large). Metrics will investigate the savings from moving to a more efficient format such as protobuf (http://code.google.com/p/protobuf/).
Is the backend hbase or raw hadoop?

Is the entire ping stored as a single record, or are separate fields stored separately? If they are stored together, is there a reason we couldn't store each field separately?

I've been spending a lot of time with the hbase/pig backend for crash-stats recently and I'm beginning to get a feel for it, so I'm really interested in the backend data storage model.

I'm a little surprised by the assertion that more data in general produces slower query times. In general it seems that larger date ranges or more reports-per-day would hurt query times, but the overall disk size of the cluster isn't that interesting.
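To illustrate that point, here is a minimal sketch that assumes reports are partitioned by submission date. That layout is an assumption for the sake of the example, not a known property of the telemetry store, and `count_reports` and the sample data are hypothetical. Under that assumption, the work a query does scales with the number of days it covers, not with the total amount of data stored.

```python
from datetime import date, timedelta


def days_in_range(start, end):
    """Yield every date from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def count_reports(partitions, start, end):
    """Count reports in a date range.

    partitions is a hypothetical mapping of date -> list of reports.
    Returns (total reports, number of partitions touched).
    """
    total = 0
    touched = 0
    for day in days_in_range(start, end):
        reports = partitions.get(day)
        if reports is not None:
            touched += 1
            total += len(reports)
    return total, touched


# Hypothetical store: 365 daily partitions with 10 reports each.
first_day = date(2012, 1, 1)
partitions = {first_day + timedelta(days=i): [{}] * 10 for i in range(365)}

# A one-week query reads 7 partitions, however many days are stored.
week = count_reports(partitions, date(2012, 3, 1), date(2012, 3, 7))
```

If the real store is not keyed by date, a query would have to scan everything and total size would matter after all, which is why the storage model question above is worth answering first.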

Important data:
For example, CYCLE_COLLECTOR, MEMORY_RESIDENT, and GC_SLICE_MS are histograms that we would like to see moving in the right direction over time. We should store them for as long as possible to show how much we improved over a release, a year, or two years.
A lot of what we are doing (or should or could be doing) with the crash-stats data is creating daily aggregate snapshot reports of the interesting data and using those to produce the generic UI, querying or loading individual reports only when necessary to slice or aggregate on a piece of data that isn't in the aggregate reports. This reduces server load and improves response times significantly for the things that we know we want to track over long periods.

We obviously still keep the raw data for a while, at least long enough that we feel we aren't going to want to slice it a different way. Have we considered doing something similar with the telemetry data?
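The daily-aggregate idea could be sketched roughly as follows. This is only an illustration: the ping layout (a `date` field plus a `histograms` map of bucket counts), the `aggregate_daily` helper, and the sample values are all hypothetical and are not the actual telemetry or crash-stats schema.

```python
from collections import defaultdict


def aggregate_daily(pings):
    """Sum histogram bucket counts per (day, histogram name).

    Each ping is assumed (hypothetically) to look like:
    {"date": "YYYY-MM-DD", "histograms": {name: {bucket: count}}}
    """
    daily = defaultdict(lambda: defaultdict(int))
    for ping in pings:
        day = ping["date"]
        for name, buckets in ping["histograms"].items():
            totals = daily[(day, name)]
            for bucket, count in buckets.items():
                totals[bucket] += count
    # Convert nested defaultdicts to plain dicts for storage or display.
    return {key: dict(totals) for key, totals in daily.items()}


# Made-up sample pings, for illustration only.
pings = [
    {"date": "2012-06-01", "histograms": {"GC_SLICE_MS": {"0": 3, "10": 1}}},
    {"date": "2012-06-01", "histograms": {"GC_SLICE_MS": {"10": 2, "50": 1}}},
    {"date": "2012-06-02", "histograms": {"GC_SLICE_MS": {"0": 5}}},
]
snapshots = aggregate_daily(pings)
```

Each snapshot is tiny compared with the raw pings behind it, so snapshots for the important histograms could be kept indefinitely while the raw data expires on a shorter schedule.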


