Rick,

I think the lack of answers is the answer to your questions. Performance
has not, to date, been the emphasis of development, although it is for
HBase 0.3.0.

If you are interested in doing such a paper, I would love to participate,
as it would fit nicely with my graduate studies. Email me if that is the
case.

Thx,

J-D

On Fri, Jul 25, 2008 at 12:07 PM, Rick Hangartner <[EMAIL PROTECTED]>
wrote:

> Hi,
>
> After thinking about these two questions, and doing more searching on the
> HBase site and the web, it occurs to me that the question underlying other
> questions and comments I found is this:
>
> If one has inherently "dense" data --- data that would in fact make a
> sensible table in an RDBMS --- what are the considerations that govern
> whether it would be useful to "sparsify" that data in HBase by mapping an
> RDBMS column to an HBase column family and using the values in that RDBMS
> column as qualifiers in column keys in that HBase column family?
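>
> To make the question concrete, here is a purely illustrative sketch of
> the two layouts.  The names ("ratings", the "rating:" family, users and
> movies) are all invented, and plain Java maps stand in for HBase storage:
>
>     import java.util.HashMap;
>     import java.util.Map;
>
>     public class SparsifyDemo {
>         public static void main(String[] args) {
>             // Dense layout: one entry per (user, movie) pair, like an
>             // RDBMS row whose value lives in a fixed "stars" column.
>             Map<String, Integer> dense = new HashMap<String, Integer>();
>             dense.put("alice|matrix", 5);
>             dense.put("alice|memento", 4);
>
>             // Sparse layout: one HBase row per user; the values of the
>             // former "movie" column become qualifiers in the "rating:"
>             // column family.
>             Map<String, Map<String, Integer>> sparse =
>                 new HashMap<String, Map<String, Integer>>();
>             Map<String, Integer> aliceRow = new HashMap<String, Integer>();
>             aliceRow.put("rating:matrix", 5);  // qualifier = movie value
>             aliceRow.put("rating:memento", 4);
>             sparse.put("alice", aliceRow);
>
>             // One getRow()-style lookup now returns everything the dense
>             // layout would need a range query to collect.
>             System.out.println(sparse.get("alice"));
>         }
>     }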
>
> Is it the case that if the data is dense, it is best to keep it dense in
> HBase?  Beyond the general principle that it depends on the specifics of the
> intended access patterns, are there structural metrics for the table
> contents and for the access patterns that can be applied to achieve the best
> performance?
>
> At a minimum, one consideration might be whether the set of values that
> can appear in the RDBMS column is in practice finite.  That is, more like
> an enumeration than arbitrary strings or integers, which in theory could
> result in trillions of columns (disk space questions aside).  However,
> even the example of Fig. 1 in the Google BigTable paper has arbitrary URLs
> as column family qualifiers.  In that case, though, it is presumed that
> the size of the set of possible values for any one row, even considering
> versions, is reasonably bounded, although the collection over all the rows
> is not.
>
> Another consideration would be whether one intends to use MapReduce rather
> than, or in addition to, other programmatic manipulation of the data in
> HBase.
>
> Also, the issue presumably turns on practical details of how regions are
> stored, the performance of the different access methods (get(), getRow(),
> getScanner()), and basic configuration parameters for the cluster on which
> the HBase instance is deployed.
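>
> For reference, this is roughly how I understand those three access methods
> in the 0.2 client API.  I am writing it from memory (imports omitted), so
> the class names and signatures below may well be off; please read it as a
> sketch rather than as the actual API:
>
>     HTable table = new HTable(new HBaseConfiguration(), "mytable");
>
>     // get(): a single cell, addressed by row and "family:qualifier".
>     Cell cell = table.get("row1", "rating:matrix");
>
>     // getRow(): every cell of one row, in a single round trip.
>     RowResult row = table.getRow("row1");
>
>     // getScanner(): rows in lexicographic key order, from a start row on.
>     Scanner scanner = table.getScanner(new String[] { "rating:" }, "row1");
>     try {
>         RowResult r;
>         while ((r = scanner.next()) != null) {
>             // each RowResult maps column names to cell values
>         }
>     } finally {
>         scanner.close();
>     }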
>
> Of course, I know everyone is busy doing development, and so few would have
> much time to do the kind of formal and empirical investigation that is the
> subject of tech papers (and maybe a "forthcoming" book by the HBase team?).
>  I also would presume Google researchers have studied this for BigTable, but
> I couldn't turn up much.  But has anybody essentially jotted down some
> summary notes somewhere that perhaps could find their way onto the HBase
> wiki?
>
> Thanks,
> Rick
>
>
>
>
> On Jul 24, 2008, at 6:56 PM, Rick Hangartner wrote:
>
>> Hi,
>>
>> I did a quick search of the HBase 0.1.3 and HBase 0.2.0 code and couldn't
>> find any constants that would provide an answer to the first of my two
>> questions.  A Google search did turn up a comment in the HBase archive
>> that there is (was) no limit on row-key length in HBase 0.1.x, and that no
>> limit on cell size was enforced in HBase 0.1.x other than that a cell
>> cannot be larger than the maximum region size.
>>
>> So perhaps the first question I was going to ask, about limits on row-key
>> length, is better changed to ask whether anyone has any empirical
>> knowledge to share about an upper bound on row-key length that ensures the
>> best query performance.
>>
>> This question actually arises out of a second question about best
>> practices in table design.  We have a table in which each item of interest
>> could have a key with one primary component "K1" and two additional
>> secondary components "K2", "K3".  We will be keeping "many" versions of this
>> item, differentiated by timestamp.
>>
>> Is there any knowledge out there about which of these four options would
>> generally be hypothesized to be the highest-performance design, in the
>> absence of any empirical results or odd data patterns?  (A small sketch of
>> the four row-key schemes follows the list.)
>>
>> 1) The row-key is the concatenation "K1::K2::K3" of the three keys with
>> inter-key separators chosen to make regular expression matching on the
>> components easy.  The multiple copies of the item with a specific set of
>> values for the three key components, but different time stamps, are stored
>> as versions of the item in the row.
>>
>> 2) The row-key "R" is sufficient to be unique for each timestamped version
>> of an item, and the keys "K1", "K2" and "K3" for the item are columns in a
>> single column family.  In this case, each version of the item is stored in
>> a single row, and "R" would be generated from "K1" and the timestamp in a
>> way that gives good grouping to items with the same value of "K1" under
>> lexicographic ordering.
>>
>> 3) Combining 1) and 2), where the row-key is "K1::K2::K3::T" and "T" is an
>> externally generated timestamp, so that each version of each item is
>> stored in a separate row.
>>
>> 4) Combining 1) and 2) differently, where the row-key "R" is generated as
>> a 1-1 alias for "K1::K2::K3" and the keys "K1", "K2" and "K3" for the item
>> are columns in a single column family.  The multiple copies of the item with
>> a specific set of values for the three key components, but different time
>> stamps, are stored as versions of the item in the row.
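>>
>> To pin the four options down, here is a small, purely illustrative sketch
>> of how the row-keys would be built in each case.  The "::" separator, the
>> zero-padded timestamps, and the hash used as the alias in 4) are just
>> stand-ins for whatever encoding we actually settle on:
>>
>>     public class RowKeyOptions {
>>         private static final String SEP = "::";
>>
>>         // Option 1: concatenated natural key; versions live in the row.
>>         static String option1(String k1, String k2, String k3) {
>>             return k1 + SEP + k2 + SEP + k3;
>>         }
>>
>>         // Option 2: a generated key leading with K1 plus the timestamp,
>>         // so rows sharing K1 stay adjacent in lexicographic order; K1,
>>         // K2 and K3 are repeated as columns in one family.
>>         static String option2(String k1, long timestamp) {
>>             // zero-pad so that string order matches numeric order
>>             return k1 + SEP + String.format("%019d", timestamp);
>>         }
>>
>>         // Option 3: natural key plus timestamp; one row per version.
>>         static String option3(String k1, String k2, String k3, long ts) {
>>             return option1(k1, k2, k3) + SEP + String.format("%019d", ts);
>>         }
>>
>>         // Option 4 (what we use now): an opaque alias of the natural
>>         // key; a hash is shown only for illustration, since a real
>>         // alias would have to be genuinely 1-1, which hashCode() is not.
>>         static String option4(String k1, String k2, String k3) {
>>             return Integer.toHexString(option1(k1, k2, k3).hashCode());
>>         }
>>
>>         public static void main(String[] args) {
>>             long t = System.currentTimeMillis();
>>             System.out.println(option1("K1", "K2", "K3"));
>>             System.out.println(option2("K1", t));
>>             System.out.println(option3("K1", "K2", "K3", t));
>>             System.out.println(option4("K1", "K2", "K3"));
>>         }
>>     }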
>>
>> Right now we are using 4).  This is conceptually simple and works well,
>> but before we lock this down we thought we ought to consider the other
>> options.  1) has some attraction because it should take less space for key
>> storage.  But disk is cheap, right?  2) would seem to have some advantages
>> for queries.  The combination would seem to be the best of both, but at
>> the same time is not necessarily the best if 2) actually performs
>> significantly worse than 1) for some reason related to storing only a
>> single version per row.
>>
>> Thanks,
>> Rick
>>
>>
>>
>
