Hi,
According to metrics, we have about 1 TB of telemetry data in Hadoop, which
is almost a year's worth. Our telemetry ping packets keep growing as we
add more probes, and as the Hadoop database gets bigger, query times get
worse. We need to decide what data we can throw away, and when.
Backend Optimization:
Data is stored as JSON packets compressed into 128 MB (or similarly
large) gzip chunks. Metrics will investigate the savings from moving to
a more efficient format such as protobuf
(http://code.google.com/p/protobuf/).
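As a rough sketch of the kind of savings at stake: the record layout and field names below are made up for illustration (not our actual ping format), and a fixed-width binary packing stands in for protobuf's wire format, which is denser still for small integers.

```python
import gzip
import json
import struct

# Hypothetical sample: 10,000 histogram records (histogram id, bucket, count).
records = [(i % 50, i % 20, i * 7 % 1000) for i in range(10000)]

# Current approach: JSON text, gzip-compressed.
json_blob = gzip.compress(json.dumps(
    [{"hist": h, "bucket": b, "count": c} for h, b, c in records]
).encode())

# Compact binary encoding (a stand-in for protobuf): three
# little-endian uint32s per record, then gzip.
bin_blob = gzip.compress(b"".join(struct.pack("<III", h, b, c)
                                  for h, b, c in records))

print("json+gzip:", len(json_blob), "bytes")
print("binary+gzip:", len(bin_blob), "bytes")
```

The binary blob comes out smaller even after gzip, because the repeated JSON key names and ASCII-encoded numbers cost space that a binary format avoids up front.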
Data Retention:
I think the way forward is to expire data as soon as it is no longer
useful, while keeping important data for as long as possible.
Important data:
For example, CYCLE_COLLECTOR, MEMORY_RESIDENT, and GC_SLICE_MS are
histograms that we would like to see moving in the right direction over
time. We should store them for as long as possible to show how much we
improved over a release, a year, or two years.
Data that's only relevant while fresh:
The bulk of the data, such as the chromehangs, slowsql, and sqlite-vfs
histograms, is mostly useful for prioritization within a single release
cycle. It may be useful to track that the number of chromehangs/slowsqls
goes down over time, but I do not see a case for keeping the raw data
for a long time.
Partitioning telemetry data into critical vs ok-to-expire:
We should be able to derive that split by manually examining a ranked
list of how frequently histograms are accessed in the telemetry frontend
at http://arewesnappyyet.com
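A minimal sketch of that ranking, assuming we can get a log of histogram names queried through the frontend (the sample access list here is invented):

```python
from collections import Counter

# Hypothetical access log from the telemetry frontend: one histogram
# name per query. In practice this would come from frontend analytics.
accesses = ["MEMORY_RESIDENT", "CHROMEHANGS", "MEMORY_RESIDENT",
            "GC_SLICE_MS", "MEMORY_RESIDENT", "GC_SLICE_MS"]

# Rank histograms by how often they are looked at; treat the top-N
# as critical and let the rest expire.
ranked = Counter(accesses).most_common()
critical = {name for name, _ in ranked[:2]}
print(ranked)
print(critical)
```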
I propose that we expire non-critical data after 6 months.
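The expiry rule itself could be as simple as the sketch below; the CRITICAL set, the helper name, and the 6-month cutoff expressed as 183 days are all placeholders for whatever we settle on.

```python
from datetime import date, timedelta

# Hypothetical policy: critical histograms are kept indefinitely,
# everything else expires after roughly 6 months.
CRITICAL = {"CYCLE_COLLECTOR", "MEMORY_RESIDENT", "GC_SLICE_MS"}
NON_CRITICAL_TTL = timedelta(days=183)  # ~6 months

def should_expire(histogram, recorded_on, today):
    """Return True if a record is past its retention window."""
    if histogram in CRITICAL:
        return False
    return today - recorded_on > NON_CRITICAL_TTL

print(should_expire("CHROMEHANGS", date(2012, 1, 1), date(2012, 9, 1)))
print(should_expire("GC_SLICE_MS", date(2012, 1, 1), date(2012, 9, 1)))
```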
Based on my personal telemetry experience, we should also restrict
whether certain probes get uplifted, by marking certain histograms as
nightly-only.
Thoughts?
Taras
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform