Re: HBase schema question

Invisible.Trust Fri, 20 Jan 2012 23:48:55 -0800

I think you need to design your schema with as many tables as the number
of indexes you want.

For example: tbl1 {user_id_timestamp}
tbl2 {md5(email)} [user_id_timestamp]
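
A minimal sketch of the two-table layout above, in plain Python with dicts standing in for the HBase tables (no HBase client is used here). The function names, the zero-padding widths, and the email normalization are assumptions made for illustration; only the key shapes {user_id_timestamp} and {md5(email)} come from the example.

```python
import hashlib

def data_row_key(user_id: int, timestamp: int) -> bytes:
    """Row key for tbl1: zero-padded user id, then zero-padded timestamp."""
    # Fixed-width fields keep byte order equal to numeric order (assumed widths).
    return f"{user_id:010d}_{timestamp:010d}".encode("ascii")

def email_index_row_key(email: str) -> bytes:
    """Row key for tbl2: md5 hex digest of the email."""
    # Lower-casing and stripping is an assumed normalization, not from the thread.
    normalized = email.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest().encode("ascii")

# In-memory stand-ins for the two tables.
tbl1 = {}  # user_id_timestamp -> columns
tbl2 = {}  # md5(email)        -> user_id_timestamp

def put_user_record(user_id, timestamp, email, columns):
    """Write the data row, then the index row that points back at it."""
    key = data_row_key(user_id, timestamp)
    tbl1[key] = dict(columns)
    tbl2[email_index_row_key(email)] = key
    return key

def lookup_by_email(email):
    """Two lookups: index table first, then the data table."""
    key = tbl2.get(email_index_row_key(email))
    return None if key is None else tbl1[key]
```

Each extra access path costs one more table and one more write per record, which is the trade-off the "one table per index" rule implies.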


You may also want to google "design patterns hbase".
There are also some examples in "Oreilly.HBase.The.Definitive.Guide.Aug.2011".


21.01.12 11:32, Amit Gupta wrote:

I am not sure how I can do joins using HBase, which is essentially what I
am trying to do. Based on what I have read, it looks like HBase is really
good for scans or row key lookups. Please correct me if I am wrong.

I can have an HBase table for users with {userid + timestamp} as the
rowkey. With this key, a lookup for a single user over a given time range
will be fast. However, I need to do lookups for millions of users, each
with a different time range. Will that also be fast?
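
To make the {userid + timestamp} idea concrete, here is a small sketch in plain Python of why a per-user time range maps to one contiguous slice of a sorted key space. A sorted list stands in for the table, and the start-inclusive, stop-exclusive bounds mirror how an HBase scan with a start row and stop row behaves. The key format, padding widths, and helper names are assumptions for illustration only.

```python
from bisect import bisect_left

def row_key(user_id: int, day: int) -> str:
    # Zero-padding (assumed widths) makes string order match numeric order,
    # so user 2 sorts before user 10.
    return f"{user_id:010d}_{day:05d}"

def scan(sorted_keys, start_row, stop_row):
    """Return keys in [start_row, stop_row): start inclusive, stop exclusive."""
    lo = bisect_left(sorted_keys, start_row)
    hi = bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]

# Stand-in table: users 1, 2 and 10, with days 1..5 each.
keys = sorted(row_key(u, d) for u in (1, 2, 10) for d in range(1, 6))

def user_range(user_id, start_day, end_day):
    # end_day is inclusive, so the stop row is the key for end_day + 1.
    return scan(keys, row_key(user_id, start_day), row_key(user_id, end_day + 1))
```

With this layout, a query over millions of users becomes one short scan per user rather than a join; whether that is fast enough in practice depends on the cluster and is not something this sketch can show.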

Also, lookups are not the only thing that I am trying to do. I need to
compute statistics like sum, min, max, etc. for each data point for a
user. How can I do that efficiently using HBase?


On Fri, Jan 20, 2012 at 2:20 PM, T Vinod Gupta <tvi...@readypulse.com> wrote:

from the little i have used hbase for, it is really good for the below use
case you mentioned. hbase takes care of scale and you can use map reduce to
do the kind of task you mentioned below.
but please remember that it is super important how you design the schema.
the schema should allow for your use case and allow for an efficient map
reduce.
if you decide on hbase, read the hbase book before deployment or schema
design/implementation.
thanks

On Fri, Jan 20, 2012 at 2:10 PM, Amit Gupta <dlgami...@gmail.com> wrote:

Hi,



I am trying to figure out if HBase is the right candidate for my use case,
which is as follows:



I have a users table containing millions of users, and for each user I
have a bunch of data points for each day in the past 2 years. Some of
these data points are the number of clicks in different parts of a web
page, total # of clicks, total searches, # of unique searches, etc. So the
data is in this form:



User Id   Date     X1 (Total Clicks)   X2 (Total Searches)   X3   …..   Xn
1         D1-730   4                   0.8                              90
1         D1-729   2                   0.5                              50
…
1         D1       30                  0.9                              20
2         D1-730   23                  1.2                              85
2         D1-729   56                  2.3                              56
….

My application has the following predominant query pattern: for a subset
of users (the subset being quite large, on the order of 1-5 million), I
want to compute the sum, min, max, mean, and standard deviation of the
data points over a different date range for each user. For example, user1
may have a start and end date of {sd1, ed1}, user2 may have {sd2, ed2},
and so on. I want to compute sum, min, max, etc. for data points X1, X2,
… Xn over the date ranges {sd1, ed1}, {sd2, ed2}, {sd3, ed3} for each
user in the subset.
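
The query pattern above can be sketched in plain Python as a single pass over the rows, keeping only the values that fall inside each user's own date range. This is only an illustration of the computation, not of any particular storage engine. The function name, the row shape (user_id, day, metrics), and the choice of population standard deviation are assumptions, since the message does not say which variant is wanted.

```python
import statistics
from datetime import date

def aggregate(rows, ranges):
    """Per-user, per-metric statistics over each user's own date range.

    rows:   iterable of (user_id, day, {metric_name: value})
    ranges: {user_id: (start_day, end_day)}, both ends inclusive
    """
    collected = {}
    for user_id, day, metrics in rows:
        window = ranges.get(user_id)
        # Skip users outside the subset and days outside the user's range.
        if window is None or not (window[0] <= day <= window[1]):
            continue
        per_user = collected.setdefault(user_id, {})
        for name, value in metrics.items():
            per_user.setdefault(name, []).append(value)

    result = {}
    for user_id, per_user in collected.items():
        result[user_id] = {
            name: {
                "sum": sum(values),
                "min": min(values),
                "max": max(values),
                "mean": statistics.fmean(values),
                # Population standard deviation (an assumption).
                "stddev": statistics.pstdev(values),
            }
            for name, values in per_user.items()
        }
    return result
```

The `ranges` dict plays the role of the subset table with start and end days; the range check inside the loop replaces the join condition.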



Currently we do this in a database by creating a table for the subset of
users with their start and end days and joining it against the users
table. The query, however, is extremely slow and takes hours to execute.



I am trying to figure out the following:

   1. Can I do the above query efficiently (I want to reduce the query
   time; space is not that big of a concern for me) using HBase?

   2. Can someone please suggest alternative solutions if HBase is not the
   right solution for such a use case?



Thanks,

dlg
