Hi,
After thinking about these two questions, and doing more searching on
the HBase site and the web, it occurs to me that the question
underlying the other questions and comments I found is this:
If one has inherently "dense" data --- data that would in fact make a
sensible table in an RDBMS --- what considerations govern whether it
would be useful to "sparsify" that data in HBase by mapping an RDBMS
column to an HBase column family and using the values in that RDBMS
column as qualifiers in the column keys of that family?
Is it the case that if the data is dense, it is best to keep it dense
in HBase? Beyond the general principle that it depends on the
specifics of the intended access patterns, are there structural
metrics of the table contents and access patterns that predict which
layout will perform best?
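To make the mapping concrete, here is a minimal sketch in plain Java,
using nested maps to stand in for an HBase table (the "tags" family
name and the sample data are hypothetical):

import java.util.Map;
import java.util.TreeMap;

public class SparsifyExample {
    public static void main(String[] args) {
        // Dense RDBMS-style rows: one (itemId, tag) pair per row.
        String[][] rdbmsRows = {
            {"item1", "red"}, {"item1", "sale"}, {"item2", "red"}
        };
        // HBase-style layout: row key -> sparse map of column key -> value.
        // The RDBMS column's *values* become qualifiers in the "tags:" family.
        Map<String, Map<String, String>> hbaseTable = new TreeMap<>();
        for (String[] r : rdbmsRows) {
            hbaseTable.computeIfAbsent(r[0], k -> new TreeMap<>())
                      .put("tags:" + r[1], "");   // presence is the datum
        }
        System.out.println(hbaseTable);
        // Prints: {item1={tags:red=, tags:sale=}, item2={tags:red=}}
    }
}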
At a minimum, one consideration might be whether the set of values
that can appear in the RDBMS column is in practice finite --- more
like an enumeration than arbitrary strings or integers, which in
theory could result in trillions of columns (disk space questions
aside). However, even the example of Fig. 1 in the Google BigTable
paper uses arbitrary URLs as column qualifiers. In that case, though,
the presumption is that the size of the set of possible values for
any one row, even considering versions, is reasonably bounded, even
though the collection over all rows is not.
Another consideration would be whether one intends to use MapReduce
rather than, or in addition to, other programmatic manipulation of
the data in HBase.
Also, the issue presumably turns on practical details of how regions
are stored, the performance of the different access methods (get(),
getRow(), getScanner()), and basic configuration parameters for the
cluster on which the HBase instance is deployed.
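For reference, here is a hedged sketch of those three access paths
against the 0.2-style client API; I am writing the class and method
names from memory, so please treat the exact signatures as
assumptions rather than gospel (the table and column names are made
up):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scanner;
import org.apache.hadoop.hbase.io.Cell;
import org.apache.hadoop.hbase.io.RowResult;

public class AccessPaths {
    public static void main(String[] args) throws IOException {
        HTable table = new HTable(new HBaseConfiguration(), "items");
        byte[] row = "user42::west::cart".getBytes();

        // get(): a single cell, addressed by row + "family:qualifier".
        Cell cell = table.get(row, "attrs:K2".getBytes());

        // getRow(): all cells in the row in one call.
        RowResult whole = table.getRow(row);

        // getScanner(): a range scan over the lexicographically ordered
        // row keys, starting at the "user42::" prefix.
        Scanner scanner = table.getScanner(
            new byte[][] { "attrs:".getBytes() }, "user42::".getBytes());
        RowResult r;
        while ((r = scanner.next()) != null) {
            // ... examine r ...
        }
        scanner.close();
    }
}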
Of course, I know everyone is busy doing development, and so few would
have much time for the kind of formal and empirical investigation that
is the subject of tech papers (and maybe a "forthcoming" book by the
HBase team?). I also presume Google researchers have studied this for
BigTable, but I couldn't turn up much. Still, has anybody jotted down
some summary notes somewhere that could perhaps find their way onto
the HBase wiki?
Thanks,
Rick
On Jul 24, 2008, at 6:56 PM, Rick Hangartner wrote:
Hi,
I did a quick search of the HBase 0.1.3 and HBase 0.2.0 code and
couldn't find any constants that would provide an answer to the first
of two questions. In a Google search I did find, in the HBase
archive, a comment that there is (was) no limit on row-key length in
HBase 0.1.x, and that no limit on cell size was enforced in HBase
0.1.x other than that a cell cannot be larger than the maximum region
size.
So perhaps the first question I was going to ask, about limits on
row-key length, is better recast as: does anyone have empirical
knowledge to share about an upper bound on row-key length that
ensures best query performance?
This question actually arises out of a second question about best
practices in table design. We have a table in which each item of
interest could have a key with one primary component "K1" and two
secondary components "K2" and "K3". We will be keeping "many"
versions of each item, differentiated by timestamp.
Is there any knowledge out there about which of these four options
one would expect to be the highest-performance design, in the absence
of any empirical results or odd data patterns?
(A small key-construction sketch follows the list.)
1) The row-key is the concatenation "K1::K2::K3" of the three keys,
with inter-key separators chosen to make regular expression matching
on the components easy. The multiple copies of the item with a
specific set of values for the three key components, but different
timestamps, are stored as versions of the item in the row.
2) The row-key "R" is sufficient to be unique for each timestamped
version of an item, and the keys "K1", "K2" and "K3" for the item are
columns in a single column family. In this case, each version of the
item is stored in its own row, and "R" would be generated from "K1"
and the timestamp in a way that gives good grouping to items with the
same value of "K1" under lexicographic ordering.
3) Combining 1) and 2), where the row key is "K1::K2::K3::T" and "T"
is an externally generated timestamp so that each version of each
item is stored in a separate row.
4) Combining 1) and 2) differently, where the row-key "R" is
generated as a 1-1 alias for "K1::K2::K3", and the keys "K1", "K2"
and "K3" for the item are columns in a single column family. The
multiple copies of the item with a specific set of values for the
three key components, but different timestamps, are stored as
versions of the item in the row.
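To pin down what 1) and 3) mean for row-key construction, here is a
self-contained Java sketch (the component values and separator are
illustrative; the zero-padding of the timestamp is the only
substantive point):

public class RowKeys {
    private static final String SEP = "::";

    // Option 1: K1::K2::K3; versions live inside the row as HBase
    // timestamps.
    static String option1Key(String k1, String k2, String k3) {
        return k1 + SEP + k2 + SEP + k3;
    }

    // Option 3: K1::K2::K3::T, one row per version. The timestamp is
    // zero-padded to a fixed width so that lexicographic row-key
    // ordering matches chronological ordering; storing the inverted
    // value (Long.MAX_VALUE - ts) instead would make a scan return
    // the newest version first.
    static String option3Key(String k1, String k2, String k3, long ts) {
        return option1Key(k1, k2, k3) + SEP + String.format("%019d", ts);
    }

    public static void main(String[] args) {
        System.out.println(option3Key("user42", "west", "cart",
                                      System.currentTimeMillis()));
        // e.g. user42::west::cart::0000001216950975123
    }
}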
Right now we are using 4). This is conceptually simple and works
well, but before we lock this down we thought we ought to consider
the other options. 1) has some attraction because it should take less
space for key storage. But disk is cheap, right? 2) would seem to
have some advantages for queries. Option 3), the combination, would
seem to be the best of both, but not necessarily so if 2) actually
performs significantly worse than 1) for some reason related to
storing only a single version per row.
Thanks,
Rick