Hi,
According to Metrics, we have about 1TB of telemetry data in Hadoop, which is almost a year's worth. Our telemetry ping packets keep growing as we add more probes, and as the Hadoop dataset gets bigger, query times get worse. We need to decide what data we can throw away and when.

Backend Optimization:
Data is stored as JSON packets and compressed into 128MB (or similarly large) gzip chunks. Metrics will investigate the savings from moving to a more efficient format such as protobuf (http://code.google.com/p/protobuf/).
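
To make that comparison concrete, here is a minimal sketch (Python, using a made-up ping shape rather than the real telemetry schema) of how one could measure the baseline size of gzipped newline-delimited JSON that a binary encoding would have to beat:

# Back-of-envelope sketch of the current storage scheme: newline-delimited
# JSON pings packed into one big gzip chunk. The ping shape below is
# hypothetical and only illustrates how to measure the baseline.
import gzip
import io
import json
import random

def fake_ping(i):
    # Hypothetical ping: a couple of histograms with bucketed counts.
    return {
        "clientID": "client-%d" % i,
        "histograms": {
            "CYCLE_COLLECTOR": {str(b): random.randint(0, 50) for b in range(20)},
            "MEMORY_RESIDENT": {str(b): random.randint(0, 50) for b in range(20)},
        },
    }

raw = "\n".join(json.dumps(fake_ping(i)) for i in range(10000))
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(raw.encode("utf-8"))

print("raw JSON: %.1f MB" % (len(raw) / 1e6))
print("gzipped : %.1f MB" % (len(buf.getvalue()) / 1e6))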

Data Retention:
I think the way forward is to expire data as soon as it is no longer useful, while keeping important data for as long as possible.

Important data:
For example, CYCLE_COLLECTOR, MEMORY_RESIDENT, and GC_SLICE_MS are histograms that we would like to see moving in the right direction over time. We should store them for as long as possible to show how much we have improved over a release, a year, or two years.

Data that's only relevant while fresh:
The bulk of the data, such as the chromehangs, slowsql, and sqlite-vfs histograms, is mostly useful for prioritization within a single release cycle. It may be useful to track that the number of chromehangs/slowsqls goes down over time, but I do not see a reason to keep the raw data for a long time.

Partitioning telemetry data into critical vs ok-to-expire:
We should be able to derive this split by manually examining a ranked list of how frequently each histogram is accessed in the telemetry frontend at http://arewesnappyyet.com
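
As a rough sketch of what that ranking could look like, assuming we can pull a frontend access log with one histogram name per viewed line (a hypothetical format, not something the frontend emits today):

# Rank histograms by how often they are looked at in the frontend, so we can
# eyeball a cutoff for "critical" vs "ok-to-expire". The log format here is
# assumed: one histogram name per line.
from collections import Counter

def rank_histograms(log_path, top_n=50):
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            name = line.strip()
            if name:
                counts[name] += 1
    return counts.most_common(top_n)

if __name__ == "__main__":
    for name, hits in rank_histograms("frontend_access.log"):
        print("%8d  %s" % (hits, name))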

I propose that we expire non-critical data after 6 months.
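
A sketch of what that retention rule might look like, assuming each stored ping carries a submission date and that we have agreed on a whitelist of critical histograms (the names and the 180-day cutoff below are illustrative, not a final policy):

# Strip non-critical histograms from pings older than the cutoff, keeping the
# critical whitelist around indefinitely.
from datetime import datetime, timedelta

CRITICAL = {"CYCLE_COLLECTOR", "MEMORY_RESIDENT", "GC_SLICE_MS"}
EXPIRE_AFTER = timedelta(days=180)  # roughly 6 months

def prune_ping(ping, submission_date, now=None):
    now = now or datetime.utcnow()
    if now - submission_date <= EXPIRE_AFTER:
        return ping  # still fresh, keep everything
    pruned = dict(ping)
    pruned["histograms"] = {
        name: hist
        for name, hist in ping.get("histograms", {}).items()
        if name in CRITICAL
    }
    return pruned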

Based on my personal telemetry experience, we should also restrict which probes get uplifted by defining certain histograms as Nightly-only.


Thoughts?

Taras
