>> Not sure exactly what you mean here, but it doesn't seem you would really 
>> need a secondary index to do what you want.  When using versioning you can 
>> always ask for "give me 10 latest versions" or "give me the 100 latest 
>> versions that occur after date X".

How can I do this in the HBase shell as well as through the API? Say I want the 
latest version before a certain date.

-Avani

-----Original Message-----
From: Jonathan Gray [mailto:[email protected]] 
Sent: Wednesday, June 16, 2010 11:40 AM
To: [email protected]
Subject: RE: Hbase schema design question for time based data

> Hi,
> 
> I am trying to design a schema for some data to be moved from HDFS into
> HBase for real-time access.
> Questions -
> 
> 1. Is the use of the new API for bulk upload recommended over the old API?
> If yes, is the new API stable and is there sample executable code around?

Not sure if there is much sample code in branch but Todd Lipcon has done some 
great work in trunk that includes some example code I believe.

There's going to be a short presentation on HFileOutputFormat and bulk loading 
at the HUG on June 30th if you're interested in attending 
(http://meetup.com/hbaseusergroup).

In general it can make lots of sense for particular use cases, so sometimes it 
is recommended and sometimes not.  It depends on the requirements.


> 2. The data is over time. I need to be able to retrieve the latest
> records before a particular date. Note that I do not know what
> timestamp that would be.
>    I could need a user's profile data from a month or year earlier. How
> can this be achieved using Hbase in terms of schema?
> 
>                 a. If the column values are small in size, can I use
> versioning for up to 100 values?

Versioning can be used for thousands or possibly millions of versions of a 
single column.  There are some performance TODOs related to making TimeRange 
queries more efficient that I am working on that are in the pipeline for the 
next couple months.

If you're generally reading the more recent versions then performance should be 
acceptable.  Reading back into some of the older ones will work but is 
currently not nearly as efficient as it can be.


>                 b. Should I maintain a secondary index for each date
> and the latest date/timestamp when profile data is generated/applicable
> to that date? Use this information to come up with user and timestamp
> key in the main table which would have user_ts as row_key and data in
> the columns?

Not sure exactly what you mean here, but it doesn't seem you would really need 
a secondary index to do what you want.  When using versioning you can always 
ask for "give me 10 latest versions" or "give me the 100 latest versions that 
occur after date X".
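
To make the two kinds of request above concrete, here is a rough sketch in 
plain Java.  It is only an in-memory stand-in for a single versioned cell, not 
the HBase client: the class VersionedCell and its methods are hypothetical 
names made up for illustration.  In the real Java client the equivalent should 
be a Get with a time range and a maximum number of versions (setTimeRange and 
setMaxVersions, with the upper bound of the range exclusive as far as I know), 
but check the javadoc for your release before relying on that.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical in-memory model of one versioned cell; NOT the HBase client API. */
final class VersionedCell {
    // timestamp (ms) -> value, kept sorted so time-range lookups are cheap
    private final NavigableMap<Long, String> versions = new TreeMap<>();

    /** Store a value under an explicit timestamp, like a put with a version. */
    void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    /**
     * "Give me the N latest versions in a time range":
     * up to maxVersions values with minStamp <= ts < maxStamp, newest first.
     */
    List<String> get(long minStamp, long maxStamp, int maxVersions) {
        List<String> out = new ArrayList<>();
        if (minStamp >= maxStamp) {
            return out; // empty range
        }
        NavigableMap<Long, String> inRange =
                versions.subMap(minStamp, true, maxStamp, false);
        for (String value : inRange.descendingMap().values()) {
            if (out.size() >= maxVersions) {
                break;
            }
            out.add(value);
        }
        return out;
    }

    /** Latest value written strictly before cutoff, or null if there is none. */
    String latestBefore(long cutoff) {
        Map.Entry<Long, String> entry = versions.lowerEntry(cutoff);
        return entry == null ? null : entry.getValue();
    }
}
```

With this model, "the latest version before date X" is latestBefore(X), which 
is the same as get(0, X, 1), so you do not need to know the exact timestamp of 
the version you are after.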

> 
>                 c. for the columns, how do I decide between using
> multiple columns within a column family or multiple column families?

This depends on the read/write patterns.  Do the different families have 
different access patterns?  Do you often read from just one family and not the 
others, or write to just one family and not the others?  This would be a good 
reason to split up into families.  If the data all has a similar access pattern 
then you should probably put it in a single family.  Each family is basically 
like a table; each is stored separately on disk.

I think an in-person discussion would help a lot.  Since you are local (I am 
guessing), see if you can come by the Hackathon or HUG in two weeks and we can 
talk more about it.  We can then post back to the list once we figure out a 
decent solution to your use case.

JG 
