On 8/15/2012 2:42 PM, Benjamin Smedberg wrote:
On 8/15/2012 5:03 PM, Taras Glek wrote:
Hi,
According to Metrics, we have about 1 TB of telemetry data in Hadoop.
This is almost a year's worth of telemetry data. Our telemetry ping
packets keep growing as we add more probes. As the Hadoop database
gets bigger, query times get worse, etc. We need to decide on what
data we can throw away and when.


Backend Optimization:
Data is stored as JSON packets, compressed into 128 MB (or similarly
large) gzip chunks. Metrics will investigate the savings from moving
to a more efficient format such as protobuf
(http://code.google.com/p/protobuf/).
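
As a very rough sketch of the kind of measurement Metrics could start with: how much of a gzipped-JSON chunk is spent on repeated field names and JSON syntax that a schema-based format like protobuf would not store per record. The ping fields and sizes below are made up; only the approach matters.

    # Made-up pings; the point is comparing gzipped JSON against a
    # schema-based stand-in where field names live in the schema.
    import gzip
    import json
    import random
    import struct

    def fake_ping(i):
        return {
            "clientID": "client-%06d" % i,          # hypothetical field names
            "MEMORY_RESIDENT": random.randint(100, 2000),
            "CYCLE_COLLECTOR": random.randint(0, 500),
            "GC_SLICE_MS": random.randint(0, 100),
        }

    pings = [fake_ping(i) for i in range(10000)]

    # Today: newline-delimited JSON, gzipped in large chunks.
    json_blob = "\n".join(json.dumps(p) for p in pings).encode("utf-8")

    # Stand-in for a schema-based format: only values are stored per record.
    # (Real protobuf uses tags and varints; fixed-width packing is close
    # enough to illustrate the point.)
    packed_blob = b"".join(
        p["clientID"].encode("utf-8")
        + struct.pack("<iii", p["MEMORY_RESIDENT"], p["CYCLE_COLLECTOR"],
                      p["GC_SLICE_MS"])
        for p in pings
    )

    for name, blob in (("json", json_blob), ("packed", packed_blob)):
        print(name, "raw:", len(blob), "gzipped:", len(gzip.compress(blob)))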
Is the backend HBase or raw Hadoop?

I'll let Xavier answer this. It would also be nice if Xavier/Daniel could explain the problems directly instead of me channeling them.


Is the entire ping stored as a single record, or are separate fields
stored separately? If they are stored together, is there a reason we
couldn't store each field separately?

I've been spending a lot of time with the HBase/Pig backend for
crash-stats recently and I'm beginning to get a feel for it, so I'm
really interested in the backend data storage model.
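
To make the question concrete, here is roughly what I mean by storing fields separately; this is only a sketch, and the row key, column family, and histogram names are invented, so they may not match what Metrics actually uses.

    import json

    ping = {
        "clientID": "abc123",
        "date": "2012-08-15",
        "histograms": {
            "CYCLE_COLLECTOR": [0, 3, 17, 42],
            "MEMORY_RESIDENT": [5, 9, 120, 80],
            "GC_SLICE_MS": [1, 2, 3, 4],
        },
    }

    # Whole ping as a single opaque cell (what I assume we do today).
    single_cell = ("2012-08-15:abc123", "data:json", json.dumps(ping))

    # One cell per histogram, so a query for GC_SLICE_MS only has to read
    # that column instead of deserializing the whole ping.
    per_field_cells = [
        ("2012-08-15:abc123", "hist:%s" % name, json.dumps(values))
        for name, values in ping["histograms"].items()
    ]

    print(single_cell)
    for cell in per_field_cells:
        print(cell)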

I'm a little surprised by the assertion that more data in general
produces slower query times. It seems that larger date ranges or more
reports per day would hurt query times, but the overall disk size of
the cluster isn't that interesting.

More data does not produce slower query times if you query with a date range. However, it does result in higher storage costs.
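
A toy sketch of why that's the case, assuming a date-prefixed row key (the key format here is invented): a date-bounded query only scans the key range for the requested days, no matter how many older rows sit in the cluster.

    # In-memory stand-in for a sorted HBase-style scan.
    import bisect

    rows = sorted(
        (("%s:client-%03d" % (day, i), {"MEMORY_RESIDENT": i})
         for day in ("2012-06-01", "2012-07-01", "2012-08-01", "2012-08-15")
         for i in range(100)),
        key=lambda row: row[0],
    )
    keys = [key for key, _ in rows]

    def scan(start_day, end_day):
        # Touch only the key range for the requested days; cost grows with
        # the size of the range, not with the total number of rows stored.
        lo = bisect.bisect_left(keys, start_day)
        hi = bisect.bisect_right(keys, end_day + ":~")  # '~' sorts after ids
        return rows[lo:hi]

    print(len(scan("2012-08-01", "2012-08-31")), "rows touched out of", len(rows))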


Important data:
For example: CYCLE_COLLECTOR, MEMORY_RESIDENT, and GC_SLICE_MS are
histograms that we would like to see moving in the right direction
over time. We should store them for as long as possible to show how
much we improved over a release, a year, or two years.
A lot of what we are/should/could be doing with the crash-stats data
is creating (daily) aggregate snapshot reports of the interesting data
and using those to produce the generic UI, querying/loading individual
reports only when necessary to slice or aggregate on a piece of data
that isn't in the aggregate reports. This reduces server load and
improves response times significantly for the things that we know we
want to track over long periods.
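
As a sketch of what a daily telemetry aggregate could look like (histogram names and bucket layouts are made up): sum each histogram's buckets across all pings for a day, so trend queries never need to touch individual pings.

    def aggregate_day(pings):
        # Sum each histogram's buckets across all pings for one day.
        totals = {}
        for ping in pings:
            for name, buckets in ping["histograms"].items():
                if name not in totals:
                    totals[name] = list(buckets)
                else:
                    totals[name] = [a + b for a, b in zip(totals[name], buckets)]
        return totals

    day_pings = [
        {"histograms": {"GC_SLICE_MS": [10, 5, 1], "MEMORY_RESIDENT": [0, 2, 7]}},
        {"histograms": {"GC_SLICE_MS": [8, 4, 0], "MEMORY_RESIDENT": [1, 3, 6]}},
    ]

    # One small record per day replaces thousands of raw pings for the
    # "is this metric moving in the right direction" queries.
    print(aggregate_day(day_pings))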

We obviously still keep the raw data for a while, at least long enough
that we feel we aren't going to want to slice it a different way. Have
we considered doing something similar with the telemetry data?

The telemetry frontend operates from an aggregated Elasticsearch database which gets updated (or regenerated?) daily. Daniel, is there a doc describing the telemetry server-side architecture somewhere?

However, raw telemetry data is still useful for certain analyses (such as the field trials in bug 765850 that we plan to automate). It's also useful for slow-startup (and similar) analysis when you are groping for evidence of a problem.
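
For illustration, this is the kind of ad-hoc pass over raw pings that pre-built aggregates can't answer in advance; the field names, file layout, and threshold are invented.

    # Scan raw gzipped, newline-delimited JSON pings for slow startups.
    import gzip
    import json

    def slow_startup_pings(path, threshold_ms=60000):
        # Assumes one JSON ping per line in a gzipped file.
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                ping = json.loads(line)
                if ping.get("startupTimeMS", 0) > threshold_ms:
                    yield ping

    # for ping in slow_startup_pings("telemetry-2012-08-15.json.gz"):
    #     print(ping.get("clientID"), ping.get("startupTimeMS"), ping.get("addons"))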

Taras