Hi,
According to metrics, we have about 1 TB of telemetry data in Hadoop, which
is almost a year's worth. Our telemetry ping packets keep growing as we
add more probes, and as the Hadoop database gets bigger, query times get
worse. We need to decide what data we can throw away, and when.
Backend Optimization:
Data is stored as JSON packets compressed into 128 MB (or similarly
large) gzip chunks. Metrics will investigate the savings from moving to
a more efficient format such as protobuf
(http://code.google.com/p/protobuf/).
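As a rough sketch of the kind of savings at stake: the record layout and field names below are made up for illustration (not our actual ping format), and a fixed-width binary packing stands in for protobuf's wire format, which is denser still for small integers.

```python
import gzip
import json
import struct

# Hypothetical sample: 10,000 histogram records (histogram id, bucket, count).
records = [(i % 50, i % 20, i * 7 % 1000) for i in range(10000)]

# Current approach: JSON text, gzip-compressed.
json_blob = gzip.compress(json.dumps(
    [{"hist": h, "bucket": b, "count": c} for h, b, c in records]
).encode())

# Compact binary encoding (a stand-in for protobuf): three
# little-endian uint32s per record, then gzip.
bin_blob = gzip.compress(b"".join(struct.pack("<III", h, b, c)
                                  for h, b, c in records))

print("json+gzip:", len(json_blob), "bytes")
print("binary+gzip:", len(bin_blob), "bytes")
```

The binary blob comes out smaller even after gzip, because the repeated JSON key names and ASCII-encoded numbers cost space that a binary format avoids up front.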
Data Retention:
I think the way forward is to expire data as soon as it is no longer
useful, while keeping important data for as long as possible.
Important data:
For example, CYCLE_COLLECTOR, MEMORY_RESIDENT, and GC_SLICE_MS are
histograms that we would like to see moving in the right direction over
time. We should store them for as long as possible to show how much we
improved over a release, a year, or two years.
Data that's only relevant while fresh:
The bulk of the data, such as the chromehangs, slowsql, and sqlite-vfs
histograms, is mostly useful for prioritization within a single release
cycle. It may be useful to track that the number of chromehangs/slowsqls
goes down over time, but I do not see a case for keeping the raw data
for a long time.
Partitioning telemetry data into critical vs ok-to-expire:
We should be able to derive that split by manually examining a ranked
list of how frequently histograms are accessed in the telemetry frontend
at http://arewesnappyyet.com
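A minimal sketch of that ranking, assuming we can get a log of histogram names queried through the frontend (the sample access list here is invented):

```python
from collections import Counter

# Hypothetical access log from the telemetry frontend: one histogram
# name per query. In practice this would come from frontend analytics.
accesses = ["MEMORY_RESIDENT", "CHROMEHANGS", "MEMORY_RESIDENT",
            "GC_SLICE_MS", "MEMORY_RESIDENT", "GC_SLICE_MS"]

# Rank histograms by how often they are looked at; treat the top-N
# as critical and let the rest expire.
ranked = Counter(accesses).most_common()
critical = {name for name, _ in ranked[:2]}
print(ranked)
print(critical)
```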
I propose that we expire non-critical data after 6 months.
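The expiry rule itself could be as simple as the sketch below; the CRITICAL set, the helper name, and the 6-month cutoff expressed as 183 days are all placeholders for whatever we settle on.

```python
from datetime import date, timedelta

# Hypothetical policy: critical histograms are kept indefinitely,
# everything else expires after roughly 6 months.
CRITICAL = {"CYCLE_COLLECTOR", "MEMORY_RESIDENT", "GC_SLICE_MS"}
NON_CRITICAL_TTL = timedelta(days=183)  # ~6 months

def should_expire(histogram, recorded_on, today):
    """Return True if a record is past its retention window."""
    if histogram in CRITICAL:
        return False
    return today - recorded_on > NON_CRITICAL_TTL

print(should_expire("CHROMEHANGS", date(2012, 1, 1), date(2012, 9, 1)))
print(should_expire("GC_SLICE_MS", date(2012, 1, 1), date(2012, 9, 1)))
```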
Based on my personal telemetry experience, we should also restrict
whether certain probes get uplifted, by marking certain histograms as
nightly-only.
Thoughts?
Taras
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform