I was afraid of this answer and suspected it ;). I knew that the answer would depend on the actual setting, but I hoped that there would be a little hint.
Thanks a lot for your time and the answer. I will try it out with test data (and a simple table design) and will share my experiments when they are done.

Thanks for HBase to all, and best wishes

Wilm

On 17.12.2013 14:03, yonghu wrote:
> In my opinion, it really depends on your queries.
>
> The first one achieves data locality. There is no additional data
> transfer between different nodes. But this strategy sacrifices
> parallelism, and the node which stores A will become a hot node if too
> many applications try to access A.
>
> The second approach gives you parallelism, but you somehow need to merge
> the data together to generate the final results. So, you can see there is
> a trade-off between data locality and parallelism, and the performance of
> a query will be influenced by the following factors:
>
> 1. data size;
> 2. data access frequency;
> 3. data access pattern, full scan or index scan;
> 4. network bandwidth.
>
> So the best solution for one situation may not fit others.
>
>
> On Tue, Dec 17, 2013 at 3:56 AM, Tao Xiao <xiaotao.cs....@gmail.com> wrote:
>
>> Sometimes row key design is a trade-off between load balancing and
>> querying: if you design the row key such that querying it is fast and
>> convenient, the records may not be spread evenly across the nodes; if
>> you design the row key such that the records are spread evenly across
>> the nodes, it may be inconvenient or impossible to get a record through
>> its row key directly (say you have a random number as the row key's
>> prefix).
>>
>> You can have a look at secondary indexes. A secondary index is very
>> helpful.
>>
>>
>> 2013/12/16 Wilm Schumacher <wilm.schumac...@cawoom.com>
>>
>>> Hi,
>>>
>>> I'm a newbie to HBase and have a question on rowkey design which I
>>> hope isn't too newbie-like for this list. It is a question that cannot
>>> be answered by knowledge of the code, but only by experience with
>>> large databases, hence this mail.
>>>
>>> For the sake of explanation I'll construct a small example. Suppose
>>> you want to design a small "blogging" platform. You just want to store
>>> the name of the user and a small text, and of course you want to be
>>> able to get all postings of one user.
>>>
>>> Furthermore we have 4 users, let's call them A, B, C and D (and you
>>> can trust that the length of the username is fixed). Now let's say A,
>>> B and C each have N postings, and D has 7*N postings. BUT: the data of
>>> A is fetched 3 times more often than the data of each of the other
>>> users!
>>>
>>> If you create an HBase cluster with 10 nodes, every node holds N
>>> postings (of course I know that the data is held redundantly, but this
>>> is not so important for the question).
>>>
>>> Rowkey design #1:
>>> the i-th posting of user X would have the rowkey "$X$i", e.g. "A003".
>>> The table would simply be: "create 'postings' , 'text'"
>>>
>>> With this rowkey design the first node would hold the data of A, the
>>> second that of B, the third that of C, and the fourth to the tenth
>>> node the data of D.
>>>
>>> Fetching the data would be very easy, but half of the traffic would
>>> hit the first node.
>>>
>>> Rowkey design #2:
>>> the rowkey would be random, e.g. a UUID. The table design would now
>>> be: "create 'postings' , 'user' , 'text'"
>>>
>>> Fetching the data would be a "real" map-reduce job: checking for the
>>> user, emitting, etc.
>>>
>>> So, when a fetch takes place, I have to spend more computation cycles
>>> and do more IO. But in this scenario all traffic would hit all 10
>>> servers.
>>>
>>> If N (the number of postings) is large enough that disk space becomes
>>> critical, I'm also not able to adjust the key regions in such a way
>>> that e.g. the data of D sits only on the last server while the key
>>> space of A spans the first 5 nodes, or to make the replication very
>>> broad (e.g. 10 times in this case).
>>>
>>> So basically the question is: what's the better plan? Trying to avoid
>>> the computation cycles of map-reducing and getting the key design
>>> straight, or trying to scale out the computation but doing more IO?
>>>
>>> I hope the small example helped to make the question more vivid.
>>>
>>> Best wishes
>>>
>>> Wilm
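To make rowkey design #1 from the quoted mail concrete: because the usernames have a fixed length and the rowkeys are "$X$i", all postings of one user form a single contiguous key range, so fetching them is one range scan, typically served by a single region. A minimal sketch against the 0.94-era Java client API; the column family 'text' is from the thread, while the qualifier 'body' is made up for illustration:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FetchPostingsOfUser {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "postings");
            try {
                // Rowkeys are "$X$i", e.g. "A003". With fixed-length
                // usernames, all postings of user "A" lie in the
                // contiguous key range ["A", "B").
                Scan scan = new Scan();
                scan.setStartRow(Bytes.toBytes("A"));
                scan.setStopRow(Bytes.toBytes("B")); // stop row is exclusive
                ResultScanner scanner = table.getScanner(scan);
                try {
                    for (Result r : scanner) {
                        System.out.println(Bytes.toString(r.getRow()) + ": "
                            + Bytes.toString(r.getValue(Bytes.toBytes("text"),
                                                        Bytes.toBytes("body"))));
                    }
                } finally {
                    scanner.close();
                }
            } finally {
                table.close();
            }
        }
    }

The scan touches exactly one key range, which is why this design is so convenient to query, and also why every read for A lands on whichever node hosts that range.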
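For rowkey design #2, the lightest-weight alternative to a full map-reduce job is a scan with a server-side filter: the filter logic runs inside every region server, so all ten nodes share the work, but each of them still reads through its whole share of the table. A sketch under the same assumptions as above (the qualifier 'name' for the user column is likewise made up):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.CompareFilter;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FetchPostingsByFilter {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "postings");
            try {
                // Rowkeys are random UUIDs; the owner is stored in the
                // column 'user:name'. The filter is evaluated server-side
                // on each region, not in the client.
                SingleColumnValueFilter byUser = new SingleColumnValueFilter(
                    Bytes.toBytes("user"), Bytes.toBytes("name"),
                    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("A"));
                byUser.setFilterIfMissing(true); // drop rows lacking the column
                Scan scan = new Scan();
                scan.setFilter(byUser);
                ResultScanner scanner = table.getScanner(scan);
                try {
                    for (Result r : scanner) {
                        System.out.println(Bytes.toString(r.getValue(
                            Bytes.toBytes("text"), Bytes.toBytes("body"))));
                    }
                } finally {
                    scanner.close();
                }
            } finally {
                table.close();
            }
        }
    }

This keeps the "all traffic hits all 10 servers" property of design #2, at the cost of a full table read per fetch.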
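Tao Xiao's remark about a random number as the row key's prefix points at a middle ground between the two designs, commonly called salting: a bucket computed deterministically from the logical key keeps the key reconstructable (no map-reduce needed), while spreading one user's rows, and hence the read traffic for A, over all regions. The price is one scan per bucket instead of a single scan. A sketch; the bucket count and key layout are made up:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedRowkeys {
        static final int BUCKETS = 10; // made up: one bucket per node

        // Write side: "07-A003" instead of "A003". The bucket is derived
        // from the full logical key, so it is reproducible, but
        // consecutive postings of one user spread over all buckets
        // (and thus over all regions).
        static String saltedKey(String user, String postingId) {
            int bucket = ((user + postingId).hashCode() & 0x7fffffff) % BUCKETS;
            return String.format("%02d-%s%s", bucket, user, postingId);
        }

        // Read side: one prefix scan per bucket instead of a single scan.
        // The scans are independent and could be issued in parallel.
        static List<Result> fetchUser(HTable table, String user) throws IOException {
            List<Result> all = new ArrayList<Result>();
            for (int b = 0; b < BUCKETS; b++) {
                String prefix = String.format("%02d-%s", b, user);
                Scan scan = new Scan();
                scan.setStartRow(Bytes.toBytes(prefix));
                // '~' (0x7E) sorts after all digits and letters, so this
                // stop row ends the range just past the prefixed keys.
                scan.setStopRow(Bytes.toBytes(prefix + "~"));
                ResultScanner scanner = table.getScanner(scan);
                try {
                    for (Result r : scanner) {
                        all.add(r);
                    }
                } finally {
                    scanner.close();
                }
            }
            return all;
        }
    }

Whether the extra scans pay off depends on exactly the factors yonghu lists above: data size, access frequency, access pattern and network bandwidth.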