Hi there-

A few high-level suggestions...

re:  "to generate a report: for example we want to know how many
impressions were done by all users in last x days"

Can you create a summary table by day (via an MR job), and then have your
ad-hoc report hit the summary table instead?
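
Something along these lines might work -- just a rough sketch, and the
"users" / "impression_summary" table names, the "stats" family and keying
the summary rows by yyyyMMdd are all assumptions, so adjust to your actual
schema:

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class DailyImpressionSummary {

  // Mapper: for every cell in the "impressions" family of a user row,
  // emit (yyyyMMdd of the cell timestamp, 1).
  static class ImpressionMapper extends TableMapper<Text, LongWritable> {
    private static final byte[] IMPRESSIONS = Bytes.toBytes("impressions");
    private static final LongWritable ONE = new LongWritable(1);
    private final SimpleDateFormat day = new SimpleDateFormat("yyyyMMdd");

    @Override
    protected void map(ImmutableBytesWritable row, Result result, Context ctx)
        throws IOException, InterruptedException {
      for (KeyValue kv : result.raw()) {
        if (Bytes.equals(kv.getFamily(), IMPRESSIONS)) {
          ctx.write(new Text(day.format(new Date(kv.getTimestamp()))), ONE);
        }
      }
    }
  }

  // Reducer: sum the counts and write one summary row per day
  // (row key = yyyyMMdd, column stats:impressions).
  static class SummaryReducer
      extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {
    @Override
    protected void reduce(Text day, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable c : counts) {
        total += c.get();
      }
      Put put = new Put(Bytes.toBytes(day.toString()));
      put.add(Bytes.toBytes("stats"), Bytes.toBytes("impressions"),
              Bytes.toBytes(total));
      ctx.write(null, put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "daily-impression-summary");
    job.setJarByClass(DailyImpressionSummary.class);

    // Only read the family the report needs, and keep the full scan
    // from churning the block cache.
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("impressions"));
    scan.setCaching(500);
    scan.setCacheBlocks(false);

    TableMapReduceUtil.initTableMapperJob("users", scan, ImpressionMapper.class,
        Text.class, LongWritable.class, job);
    TableMapReduceUtil.initTableReducerJob("impression_summary",
        SummaryReducer.class, job);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Run that periodically (nightly, say); the ad-hoc report then only has to
read one small summary row per day instead of scanning every user row.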

Re:  "and with the data growing, the time will increase"


Yes. As you add more and more data, processing times will slow. That's
why you should plan to expand your cluster periodically.
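
The nice thing about the summary table is that the report cost stays flat
as the raw data grows: "last x days" becomes a scan over x tiny rows.
Again only a sketch, assuming the impression_summary / stats:impressions
layout from above:

import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class LastXDaysReport {
  public static void main(String[] args) throws Exception {
    int x = Integer.parseInt(args[0]);   // "last x days"
    SimpleDateFormat day = new SimpleDateFormat("yyyyMMdd");
    long now = System.currentTimeMillis();

    // Summary row keys are yyyyMMdd, so the last x days are just a short
    // contiguous range: [today - x days, tomorrow).
    byte[] start = Bytes.toBytes(day.format(new Date(now - x * 86400000L)));
    byte[] stop  = Bytes.toBytes(day.format(new Date(now + 86400000L)));

    Configuration conf = HBaseConfiguration.create();
    HTable summary = new HTable(conf, "impression_summary");
    Scan scan = new Scan(start, stop);
    scan.addColumn(Bytes.toBytes("stats"), Bytes.toBytes("impressions"));

    long total = 0;
    ResultScanner scanner = summary.getScanner(scan);
    try {
      for (Result r : scanner) {
        byte[] v = r.getValue(Bytes.toBytes("stats"), Bytes.toBytes("impressions"));
        if (v != null) {
          total += Bytes.toLong(v);
        }
      }
    } finally {
      scanner.close();
      summary.close();
    }
    System.out.println("impressions in last " + x + " days: " + total);
  }
}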



On 7/14/11 3:52 PM, "Andre Reiter" <a.rei...@web.de> wrote:

>Hi everybody,
>
>we have our Hadoop + HBase cluster running at the moment with 6 servers
>
>everything is working just fine. We have a web application where data is
>stored with the row key = user id (a meaningless UUID). So our users have
>a cookie, which is the row key; behind this key are families with items,
>e.g. the family "impressions", where every impression is stored with its
>timestamp etc.
>
>the row key is defined as the user id to make real-time requests
>possible, so we can retrieve all of a user's data very fast
>
>now we are running MapReduce jobs to generate a report: for example, we
>want to know how many impressions were made by all users in the last x
>days. the scan of the MR job therefore runs over all data in our HBase
>table for the particular family. this currently takes about 70 seconds,
>which is actually a bit too long, and as the data grows, the time will
>increase unless we add new workers to the cluster. right now we have 22
>regions
>
>the problem i see is that we cannot define a filter for the scan: the
>row key (user id) is just a UUID, with nothing meaningful in it
>
>what can we do to improve (accelerate) the scan process? would it perhaps
>be advisable to store the data more redundantly? for example, we could
>create a second table and store every impression twice: once with the
>user id as the row key in the first table, and once with a timestamp as
>the row key in the second table.
>the data volume would grow twice as fast, but our scans on the second
>table would be many times faster than they are now
>
>comments would be very much appreciated
>
>andre
>
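
(On the second-table idea in your mail: yes, a time-leading row key would
let a report scan only the relevant time range instead of the whole table.
A very rough, untested sketch with made-up table and family names is below;
one caveat is that a purely time-ordered key writes to one region at a
time, so people often prefix a salt/bucket byte.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class DualWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable byUser = new HTable(conf, "users");               // existing: row = user UUID
    HTable byTime = new HTable(conf, "impressions_by_time"); // new: row = timestamp + UUID

    String userId = "some-user-uuid";                  // the cookie UUID
    long ts = System.currentTimeMillis();
    byte[] payload = Bytes.toBytes("impression details");

    // Write once keyed by user (fast per-user lookups, as today) ...
    Put p1 = new Put(Bytes.toBytes(userId));
    p1.add(Bytes.toBytes("impressions"), Bytes.toBytes(ts), payload);
    byUser.put(p1);

    // ... and once keyed by time, so reports can bound their scan.
    byte[] timeKey = Bytes.add(Bytes.toBytes(ts), Bytes.toBytes(userId));
    Put p2 = new Put(timeKey);
    p2.add(Bytes.toBytes("i"), Bytes.toBytes("data"), payload);
    byTime.put(p2);

    // "Last x days" on the time-keyed table is then a bounded scan
    // (epoch-millis timestamps are positive, so their big-endian bytes
    // sort correctly as row keys):
    int x = 30;
    long now = System.currentTimeMillis();
    Scan lastXDays = new Scan(Bytes.toBytes(now - x * 86400000L), Bytes.toBytes(now));
    lastXDays.addFamily(Bytes.toBytes("i"));

    byUser.close();
    byTime.close();
  }
}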
