Ignore... found it in the API doc. Thanks !
-----Original Message-----
From: Sharma, Avani [mailto:[email protected]]
Sent: Thursday, June 17, 2010 5:28 PM
To: [email protected]
Subject: RE: Hbase schema design question for time based data

Is timeT a timestamp or a date? I am guessing a YYYYMMDD date needs to be
converted to a ts and stored; then timeT should be the old date (YYYYMMDD)
converted to a ts.

-----Original Message-----
From: Jonathan Gray [mailto:[email protected]]
Sent: Thursday, June 17, 2010 12:11 PM
To: [email protected]
Subject: RE: Hbase schema design question for time based data

I'm not terribly familiar with the shell API, and I don't think it fully
covers the Java API.

Let's say I want the 3 latest versions of rowX, columnY that occur before
timeT.

With the Java API you can do something like:

    new Get(somerow).setTimeRange(0, timeT).setMaxVersions(3)

That means: I want versions in the range from 0 to timeT (before timeT), and
I only want the 3 latest versions.

JG

> -----Original Message-----
> From: Sharma, Avani [mailto:[email protected]]
> Sent: Wednesday, June 16, 2010 4:22 PM
> To: [email protected]
> Subject: RE: Hbase schema design question for time based data
>
> >> Not sure exactly what you mean here but it doesn't seem you would
> >> really need a secondary index to do what you want. When using
> >> versioning you can always ask for "give me 10 latest versions" or
> >> "give me the 100 latest versions that occur after date X".
>
> How can I do this in the hbase shell as well as the API? Say I want the
> latest version before a certain date?
>
> -Avani
>
> -----Original Message-----
> From: Jonathan Gray [mailto:[email protected]]
> Sent: Wednesday, June 16, 2010 11:40 AM
> To: [email protected]
> Subject: RE: Hbase schema design question for time based data
>
> > Hi,
> >
> > I am trying to design a schema for some data to be moved from HDFS into
> > HBase for real-time access.
> > Questions -
> >
> > 1. Is the use of the new API for bulk upload recommended over the old
> > API? If yes, is the new API stable and is there sample executable code
> > around?
>
> Not sure if there is much sample code in branch, but Todd Lipcon has
> done some great work in trunk that includes some example code, I
> believe.
>
> There's going to be a short presentation on HFileOutputFormat and bulk
> loading at the HUG on June 30th if you're interested in attending
> (http://meetup.com/hbaseusergroup).
>
> In general it can make a lot of sense for particular use cases, so
> sometimes it is recommended and sometimes not. It depends on the
> requirements.
>
> > 2. The data is over time. I need to be able to retrieve the latest
> > records before a particular date. Note that I do not know what
> > timestamp that would be.
> > I could need a user's profile data from a month or year earlier. How
> > can this be achieved using HBase in terms of schema?
> >
> > a. If the column values are small in size, can I use versioning for
> > up to 100 values?
>
> Versioning can be used for thousands or possibly millions of versions
> of a single column. There are some performance TODOs related to making
> TimeRange queries more efficient that I am working on; they are in the
> pipeline for the next couple of months.
>
> If you're generally reading the more recent versions then performance
> should be acceptable. Reading back into some of the older ones will
> work but is currently not nearly as efficient as it can be.
>
> > b. Should I maintain a secondary index for each date and the latest
> > date/timestamp when profile data is generated/applicable to that
> > date? Use this information to come up with a user and timestamp key
> > in the main table, which would have user_ts as row_key and data in
> > the columns?
>
> Not sure exactly what you mean here but it doesn't seem you would
> really need a secondary index to do what you want. When using
> versioning you can always ask for "give me 10 latest versions" or "give
> me the 100 latest versions that occur after date X".
>
> > c. For the columns, how do I decide between using multiple columns
> > within a column family or multiple column families?
>
> This depends on the read/write patterns. Do the different families
> have different access patterns? Do you often read from just one family
> and not the others, or write to just one family and not the others?
> That would be a good reason to split up into families. If the data all
> has a similar access pattern then you should probably put it in a
> single family. Each family is basically like a table; each is stored
> separately on disk.
>
> I think an in-person discussion would help a lot. Since you are local
> (I am guessing), see if you can come by the Hackathon or HUG in two
> weeks and we can talk more about it. We can then post back to the list
> once we figure out a decent solution to your use case.
>
> JG

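For anyone following the timeT question in this thread, here is a minimal sketch of the two steps discussed: turning a YYYYMMDD date into a millisecond timestamp, and picking the latest N versions that fall before it. It uses only JDK classes and models one cell's versions as a sorted map; it does not call HBase. The class and method names (TimeRangeSketch, dateToMillis, latestBefore) are made up for illustration, and midnight UTC is an assumption about how the dates would be stored. The range is treated as half-open, [minTs, maxTs), which is how I understand Get.setTimeRange to behave.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical helper class; a plain-JDK model, not HBase client code.
class TimeRangeSketch {

    // Convert a YYYYMMDD date to epoch milliseconds at midnight UTC.
    // (UTC is an assumption; use whatever zone the data was written with.)
    static long dateToMillis(String yyyymmdd) {
        return LocalDate.parse(yyyymmdd, DateTimeFormatter.BASIC_ISO_DATE)
                .atStartOfDay(ZoneOffset.UTC)
                .toInstant()
                .toEpochMilli();
    }

    // Return up to maxVersions values whose timestamp lies in [minTs, maxTs),
    // newest first. Models setTimeRange(minTs, maxTs) plus setMaxVersions(n).
    static List<String> latestBefore(NavigableMap<Long, String> versions,
                                     long minTs, long maxTs, int maxVersions) {
        List<String> out = new ArrayList<>();
        if (minTs >= maxTs) {
            return out; // empty range, nothing can match
        }
        NavigableMap<Long, String> inRange =
                versions.subMap(minTs, true, maxTs, false).descendingMap();
        for (Map.Entry<Long, String> e : inRange.entrySet()) {
            if (out.size() == maxVersions) {
                break;
            }
            out.add(e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // One cell's versions, keyed by timestamp (sample data for illustration).
        NavigableMap<Long, String> versions = new TreeMap<>();
        for (String d : new String[] {"20100601", "20100610", "20100615",
                                      "20100616", "20100620"}) {
            versions.put(dateToMillis(d), "v-" + d);
        }
        long timeT = dateToMillis("20100617");
        // The 3 latest versions written strictly before 2010-06-17.
        System.out.println(latestBefore(versions, 0L, timeT, 3));
    }
}
```

With the sample data above, the version dated 20100620 is excluded because it falls after timeT, and the three newest remaining versions come back newest first.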