Averages are easy to roll up as well.

Rank statistics like median, min, max and quartiles are not much harder.

Total uniques are more difficult. If you have decent distributional
information, they can be estimated reasonably well.

Mahout has code for the first two.
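
For concreteness, a minimal sketch of why averages roll up so easily:
keep a (sum, count) pair per rollup bucket instead of the average
itself, and buckets combine exactly. This is plain illustrative Java,
not Mahout code, and the class name is made up.

public class AverageRollup {
    private double sum;
    private long count;

    // Record one observation in this bucket.
    public void add(double value) {
        sum += value;
        count += 1;
    }

    // Combining two buckets is just adding sums and counts, which is
    // what makes averages trivial to roll up across time windows.
    public void merge(AverageRollup other) {
        sum += other.sum;
        count += other.count;
    }

    // The average itself is only derived at read time.
    public double average() {
        return count == 0 ? 0.0 : sum / count;
    }
}

Medians and exact unique counts do not merge this way, which is why
they get progressively harder.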

On Sun, Jul 17, 2011 at 9:30 AM, Arvind Jayaprakash <w...@anomalizer.net> wrote:

> On Jul 14, Andre Reiter wrote:
> >Now we are running mapreduce jobs to generate a report: for example,
> >we want to know how many impressions were made by all users in the
> >last x days. Therefore the scan of the MR job runs over all data in
> >our hbase table for the particular family. This currently takes about
> >70 seconds, which is actually a bit too long, and as the data grows,
> >the time will increase unless we add new workers to the cluster. We
> >have 22 regions right now.
>
> Are you looking for the average number of impressions per user in the
> last 'x' days or the total number of impressions across all users in
> the last 'x' days? I assume it is the latter.
>
> The only reasonable way is to do frequent rollups (think counts for every
> minute/hour) and store them for future use. The cost of performing these
> rollups will always be a function of your traffic/data. However, the cost
> of retrieving your answer should be fixed for a given 'x' and the size
> of the rollup window regardless of how much traffic you see. This way,
> your online application (I'm guessing from your latency needs) is
> de-linked from raw data volumes.
>
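
To make that rollup concrete, here is a minimal sketch: bump one
counter cell per hour at write time, and answer "last x days" by
reading a fixed 24 * x cells instead of scanning the raw table. The
table name "impression_rollups", family "c", and qualifier "n" are
made up for the example; the calls are the plain HTable client API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ImpressionRollup {
    private static final byte[] FAMILY = Bytes.toBytes("c");
    private static final byte[] QUALIFIER = Bytes.toBytes("n");
    private static final long HOUR_MS = 3600L * 1000L;

    private final HTable rollups;

    public ImpressionRollup(Configuration conf) throws Exception {
        this.rollups = new HTable(conf, "impression_rollups");
    }

    // Write path: one atomic increment per impression (or per
    // pre-aggregated batch), keyed by the hour the impression falls in.
    public void recordImpression(long timestampMs) throws Exception {
        long hourBucket = timestampMs / HOUR_MS;
        rollups.incrementColumnValue(Bytes.toBytes(hourBucket), FAMILY,
                                     QUALIFIER, 1L);
    }

    // Read path: the cost is fixed by 'days' and the bucket size, no
    // matter how much raw traffic was recorded.
    public long impressionsInLastDays(int days, long nowMs) throws Exception {
        long total = 0;
        long currentHour = nowMs / HOUR_MS;
        for (long h = currentHour - 24L * days + 1; h <= currentHour; h++) {
            Result r = rollups.get(new Get(Bytes.toBytes(h)));
            byte[] value = r.getValue(FAMILY, QUALIFIER);
            if (value != null) {
                total += Bytes.toLong(value);
            }
        }
        return total;
    }
}

In practice you would probably pre-aggregate in the write path (per
mapper or per minute) to cut down counter contention, and replace the
per-hour Gets with a single Scan over the hour range.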
