On 8/15/2012 5:03 PM, Taras Glek wrote:
> Hi,
> According to metrics we have about 1TB of telemetry data in hadoop. This is almost a year's worth of telemetry data. Our telemetry ping packets keep growing as we add more probes. As the hadoop database gets bigger, query times get worse, etc. We need to decide on what data we can throw away and when.


> Backend Optimization:
> Data is stored as json packets and compressed in 128mb (or something similarly large) gzip chunks. Metrics will investigate savings from moving to a more efficient format such as protobuf (http://code.google.com/p/protobuf/).
Is the backend hbase or raw hadoop?
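Just to make the protobuf question concrete, here's a back-of-the-envelope Python sketch; the ping contents and the fixed field layout are invented for illustration, not the real telemetry packet format. The point is only that a schema-based encoding avoids repeating key strings in every record:

# Back-of-the-envelope only: the ping contents and the fixed field layout
# below are made up for illustration, not the real telemetry packet format.
import gzip
import json
import struct

# Hypothetical ping: histogram name -> list of bucket counts.
ping = {
    "CYCLE_COLLECTOR": [0, 12, 340, 97, 3],
    "MEMORY_RESIDENT": [1, 5, 220, 410, 88, 2],
    "GC_SLICE_MS": [4, 19, 860, 120, 7, 0, 1],
}

json_bytes = json.dumps(ping).encode("utf-8")

# If a schema fixes the field order (as protobuf would), only the bucket
# counts need to be stored, here naively packed as unsigned 32-bit ints.
packed = b"".join(
    struct.pack("<%dI" % len(counts), *counts)
    for _, counts in sorted(ping.items())
)

print("json:        %5d bytes" % len(json_bytes))
print("json+gzip:   %5d bytes" % len(gzip.compress(json_bytes)))
print("packed:      %5d bytes" % len(packed))
print("packed+gzip: %5d bytes" % len(gzip.compress(packed)))

Even after gzip, dropping the repeated histogram names from every record tends to help, and a real schema also buys cheaper parsing on the query side.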

Is the entire ping stored as a single record, or are separate fields stored separately? If they are stored together, is there a reason we couldn't store each field separately?
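To make that concrete, here is roughly what I mean by storing fields separately, sketched with the Python happybase client; the table name, column families, and row-key layout are all invented for illustration:

# Invented table name, column families and row key; purely illustrative.
import json
import happybase

connection = happybase.Connection("hbase-thrift-host")  # hypothetical host
table = connection.table("telemetry_pings")

ping_id = "20120815/some-client-uuid"
ping = {"CYCLE_COLLECTOR": [0, 12, 340], "MEMORY_RESIDENT": [1, 5, 220]}

# Option A: the whole ping as one opaque blob in a single cell.
table.put(ping_id, {b"raw:json": json.dumps(ping).encode("utf-8")})

# Option B: one column qualifier per histogram, so a query that only cares
# about MEMORY_RESIDENT never deserializes the rest of the ping.
table.put(ping_id, {
    b"hist:" + name.encode("utf-8"): json.dumps(counts).encode("utf-8")
    for name, counts in ping.items()
})

# Reading back a single field then touches only that column.
row = table.row(ping_id, columns=[b"hist:MEMORY_RESIDENT"])

Option B costs more cells and some per-cell overhead, but it lets column-scoped scans, and per-column-family TTLs, do a lot of the expiration work for us.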

I've been spending a lot of time with the hbase/pig backend for crash-stats recently and I'm beginning to get a feel for it, so I'm really interested in the backend data storage model.

I'm a little surprised by the assertion that more data in general produces slower query times. It seems that larger date ranges or more reports per day would hurt query times, but the overall on-disk size of the cluster shouldn't matter much by itself.

> Important data:
> For example: CYCLE_COLLECTOR, MEMORY_RESIDENT, GC_SLICE_MS are histograms that we would like to see moving in the right direction over time. We should store them for as long as possible to show how much we improved over a release, a year, 2 years.
A lot of what we are/should/could be doing with the crash-stats data is creating (daily) aggregate snapshot reports of the interesting data and using those to drive the generic UI, querying/loading individual reports only when we need to slice or aggregate on a piece of data that isn't in the aggregate reports. This reduces server load and improves response times significantly for the things that we know we want to track over long periods.

We obviously still keep the raw data for a while, at least long enough that we feel we aren't going to want to slice it a different way. Have we considered doing something similar with the telemetry data?
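Concretely, by a daily aggregate snapshot I mean something like the sketch below (plain Python over a day's worth of decoded pings, using the histogram names Taras mentioned; in practice this would be a Pig or MapReduce job):

# Sketch only: collapse one day's decoded pings into per-histogram summed
# bucket counts plus a ping count, so long-term trend queries never have
# to touch the raw records again.
from collections import defaultdict

TRACKED = ("CYCLE_COLLECTOR", "MEMORY_RESIDENT", "GC_SLICE_MS")

def daily_snapshot(pings):
    """pings: iterable of dicts mapping histogram name -> list of bucket counts."""
    totals = {name: defaultdict(int) for name in TRACKED}
    ping_count = 0
    for ping in pings:
        ping_count += 1
        for name in TRACKED:
            for bucket, count in enumerate(ping.get(name, [])):
                totals[name][bucket] += count
    return {
        "ping_count": ping_count,
        "histograms": {name: dict(buckets) for name, buckets in totals.items()},
    }

A year of these snapshots is tiny next to 1TB of raw pings, and it's enough to plot how CYCLE_COLLECTOR or GC_SLICE_MS moves across releases even after the raw records have been expired.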

--BDS
