RE: Hbase schema design question for time based data

Sharma, Avani Thu, 17 Jun 2010 17:30:22 -0700

Ignore... found it in the API doc. Thanks !


-----Original Message-----
From: Sharma, Avani [mailto:[email protected]] 
Sent: Thursday, June 17, 2010 5:28 PM
To: [email protected]
Subject: RE: Hbase schema design question for time based data


Is timeT a timestamp or a date?
I am guessing the YYYYMMDD date needs to be converted to a ts and stored; then
timeT should be the old date (YYYYMMDD) converted to a ts.
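
For the date-to-timestamp conversion asked about above, here is a minimal sketch in plain Java. It assumes the cells carry epoch-millisecond timestamps (the HBase default when the client does not set one explicitly) and that a YYYYMMDD date should map to midnight UTC at the start of that day. The class and method names are hypothetical, and the time zone is an assumption to adjust for your data.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

class DateToTimestamp {
    // Parses a YYYYMMDD string and returns the epoch-millisecond timestamp
    // of midnight UTC at the start of that day (UTC is an assumption).
    static long toEpochMillis(String yyyymmdd) {
        LocalDate date = LocalDate.parse(yyyymmdd, DateTimeFormatter.BASIC_ISO_DATE);
        return date.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        // The result could serve as timeT, the upper bound of a time-range query.
        long timeT = toEpochMillis("20100617");
        System.out.println(timeT); // prints 1276732800000
    }
}
```

If the stored timestamps were written in a different unit or zone, the same conversion has to be used on both the write and the read side, otherwise the range bound will not line up with the data.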

-----Original Message-----
From: Jonathan Gray [mailto:[email protected]] 
Sent: Thursday, June 17, 2010 12:11 PM
To: [email protected]
Subject: RE: Hbase schema design question for time based data

I'm not terribly familiar with the shell API and it does not fully cover the 
Java API (I don't think).

Let's say I want the 3 latest versions of rowX, columnY that occur before timeT.

With the Java API you can do something like:

new Get(somerow).setTimeRange(0, timeT).setMaxVersions(3)

That means, I want versions in the range from 0 to timeT (before timeT), and I 
only want the 3 latest versions.

JG
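
To make the semantics of the query above concrete without a running cluster, here is a small plain-Java model of one versioned cell. It is not HBase client code: VersionedCell and its methods are hypothetical names, and the sketch only imitates what setTimeRange(0, timeT) combined with setMaxVersions(3) asks for, assuming the range is inclusive at the lower bound and exclusive at the upper bound.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

class VersionedCell {
    // timestamp -> value, kept sorted by timestamp
    private final NavigableMap<Long, String> versions = new TreeMap<>();

    void put(long ts, String value) {
        versions.put(ts, value);
    }

    // Returns up to maxVersions values with minTs <= ts < maxTs, newest first.
    List<String> get(long minTs, long maxTs, int maxVersions) {
        List<String> out = new ArrayList<>();
        if (minTs >= maxTs) {
            return out; // empty range
        }
        NavigableMap<Long, String> inRange =
                versions.subMap(minTs, true, maxTs, false).descendingMap();
        for (String value : inRange.values()) {
            if (out.size() == maxVersions) {
                break;
            }
            out.add(value);
        }
        return out;
    }

    public static void main(String[] args) {
        VersionedCell cell = new VersionedCell();
        for (long ts = 10; ts <= 50; ts += 10) {
            cell.put(ts, "v" + ts);
        }
        // Three latest versions strictly before timeT = 40.
        System.out.println(cell.get(0, 40, 3)); // prints [v30, v20, v10]
    }
}
```

Note that in this model a version written exactly at timeT is excluded; to include it, the upper bound would need to be timeT + 1.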

> -----Original Message-----
> From: Sharma, Avani [mailto:[email protected]]
> Sent: Wednesday, June 16, 2010 4:22 PM
> To: [email protected]
> Subject: RE: Hbase schema design question for time based data
> 
> >> Not sure exactly what you mean here but doesn't seem you would
> really need a secondary index to do what you want.  When using
> versioning you can always ask for "give me 10 latest versions" or "give
> me the 100 latest versions that occur after date X".
> 
> How can I do this on hbase shell as well as API ? Say I want the latest
> version before a certain date?
> 
> -Avani
> 
> -----Original Message-----
> From: Jonathan Gray [mailto:[email protected]]
> Sent: Wednesday, June 16, 2010 11:40 AM
> To: [email protected]
> Subject: RE: Hbase schema design question for time based data
> 
> > Hi,
> >
> > I am trying to design a schema for some data to be moved from HDFS into
> > HBase for real-time access.
> > Questions -
> >
> > 1. Is the use of new API for bulk upload recommended over old API? If
> > yes, is the new API stable and is there sample executable code around?
> 
> Not sure if there is much sample code in branch but Todd Lipcon has
> done some great work in trunk that includes some example code I
> believe.
> 
> There's going to be a short presentation on HFileOutputFormat and bulk
> loading at the HUG on June 30th if you're interested in attending
> (http://meetup.com/hbaseusergroup).
> 
> In general it can make lots of sense for particular use cases, so
> sometimes it is recommended and sometimes not.  Depends on the
> requirements.
> 
> 
> > 2. The data is over time. I need to be able to retrieve the latest
> > records before a particular date. Note that I do not know what
> > timestamp that would be.
> >    I could need a user's profile data from a month or year earlier.
> > How can this be achieved using Hbase in terms of schema?
> >
> >                 a. If the column values are small in size, can I use
> > versioning for up to 100 values?
> 
> Versioning can be used for thousands or possibly millions of versions
> of a single column.  There are some performance TODOs related to making
> TimeRange queries more efficient that I am working on that are in the
> pipeline for the next couple months.
> 
> If you're generally reading the more recent versions then performance
> should be acceptable.  Reading back into some of the older ones will
> work but is currently not nearly as efficient as it can be.
> 
> 
> >                 b. Should I maintain a secondary index for each date
> > and the latest date/timestamp when profile data is generated/applicable
> > to that date? Use this information to come up with user and timestamp
> > key in the main table which would have user_ts as row_key and data in
> > the columns?
> 
> Not sure exactly what you mean here but doesn't seem you would really
> need a secondary index to do what you want.  When using versioning you
> can always ask for "give me 10 latest versions" or "give me the 100
> latest versions that occur after date X".
> 
> >
> >                 c. for the columns, how do I decide between using
> > multiple columns within a column family or multiple column families?
> 
> This depends on the read/write patterns.  Do the different families
> have different access patterns?  Do you often read from just one family
> and not the others, or write to just one family and not the others?
> This would be a good reason to split up into families.  If the data all
> has a similar access pattern then you should probably put them in a single
> family.  Each family is basically like a table, each is stored
> separately on disk.
> 
> I think an in-person discussion would help a lot, since you are local
> (I am guessing), see if you can come by the Hackathon or HUG in two
> weeks and we can talk more on it.  Can then post back to the list once
> we figure a decent solution to your use case.
> 
> JG
