I was afraid of this answer and suspected it ;). I knew that the answer would depend on the actual setting, but I hoped that there would be a little hint.
Thanks a lot for your time and the answer. I will try it out with test data (and a simple table design) and will share my experiments when they are done.

Thanks for HBase to all, and best wishes

Wilm

On 17.12.2013 14:03, yonghu wrote:
> In my opinion, it really depends on your queries.
>
> The first one achieves data locality. There is no additional data
> transfer between different nodes. But this strategy sacrifices
> parallelism, and the node which stores A will become a hot node if too
> many applications try to access A.
>
> The second approach gives you parallelism, but you somehow need to merge
> the data together to generate the final results. So, you can see there is
> a trade-off between data locality and parallelism, and the performance of
> a query will be influenced by the following factors:
>
> 1. data size;
> 2. data access frequency;
> 3. data access pattern, full scan or index scan;
> 4. network bandwidth.
>
> So the best solution for one situation may not fit others.
>
>
> On Tue, Dec 17, 2013 at 3:56 AM, Tao Xiao <xiaotao.cs....@gmail.com> wrote:
>
>> Sometimes row key design is a trade-off between load balancing and
>> querying: if you design the row key such that querying it is fast and
>> convenient, the records may not be spread evenly across the nodes; if
>> you design the row key such that the records are spread evenly across
>> the nodes, it may be inconvenient or impossible to get a record through
>> its row key directly (say you have a random number as the row key's
>> prefix).
>>
>> You can have a look at secondary indexes. A secondary index is very
>> helpful.
>>
>>
>> 2013/12/16 Wilm Schumacher <wilm.schumac...@cawoom.com>
>>
>>> Hi,
>>>
>>> I'm a newbie to HBase and have a question on rowkey design which I
>>> hope isn't too newbie-like for this list. It is a question that cannot
>>> be answered by knowledge of the code, but only by experience with
>>> large databases, hence this mail.
>>>
>>> For the sake of explanation I'll construct a small example. Suppose
>>> you want to design a small "blogging" platform. You just want to store
>>> the name of the user and a small text, and of course you want to be
>>> able to get all postings of one user.
>>>
>>> Furthermore we have 4 users, let's call them A, B, C and D (and you
>>> can trust that the length of the username is fixed). Now let's say A,
>>> B and C each have N postings, and D has 7*N postings. BUT: the data of
>>> A is fetched 3 times more often than the data of each of the other
>>> users!
>>>
>>> If you create an HBase cluster with 10 nodes, every node holds N
>>> postings (of course I know that the data is held redundantly, but this
>>> is not so important for the question).
>>>
>>> Rowkey design #1:
>>> the i-th posting of user X would have the rowkey "$X$i", e.g. "A003".
>>> The table would simply be: "create 'postings' , 'text'"
>>>
>>> With this rowkey design the first node would hold the data of A, the
>>> second that of B, the third that of C, and the fourth to the tenth
>>> node the data of D.
>>>
>>> Fetching the data would be very easy, but half of the traffic would
>>> hit the first node.
>>>
>>> Rowkey design #2:
>>> the rowkey would be random, e.g. a UUID. The table design would now
>>> be: "create 'postings' , 'user' , 'text'"
>>>
>>> Fetching the data would be a "real" map-reduce job: checking for the
>>> user, emitting, etc.
>>>
>>> So, when a fetch takes place, I have to spend more computation cycles
>>> and do more IO. But in this scenario all traffic would hit all 10
>>> servers.
>>>
>>> If N (the number of postings) is large enough that disk space becomes
>>> critical, I'm also not able to adjust the key regions in such a way
>>> that e.g. the data of D sits only on the last server while the key
>>> space of A spans the first 5 nodes, or to make the replication very
>>> broad (e.g. 10 times in this case).
>>>
>>> So basically the question is: what's the better plan? Trying to avoid
>>> the computation cycles of map-reducing and getting the key design
>>> straight, or trying to scale out the computation but doing more IO?
>>>
>>> I hope the small example helped to make the question more vivid.
>>>
>>> Best wishes
>>>
>>> Wilm
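To make rowkey design #1 from the quoted mail concrete: because the usernames have a fixed length and the rowkeys are "$X$i", all postings of one user form a single contiguous key range, so fetching them is one range scan, typically served by a single region. A minimal sketch against the 0.94-era Java client API; the column family 'text' is from the thread, while the qualifier 'body' is made up for illustration:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FetchPostingsOfUser {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "postings");
            try {
                // Rowkeys are "$X$i", e.g. "A003". With fixed-length
                // usernames, all postings of user "A" lie in the
                // contiguous key range ["A", "B").
                Scan scan = new Scan();
                scan.setStartRow(Bytes.toBytes("A"));
                scan.setStopRow(Bytes.toBytes("B")); // stop row is exclusive
                ResultScanner scanner = table.getScanner(scan);
                try {
                    for (Result r : scanner) {
                        System.out.println(Bytes.toString(r.getRow()) + ": "
                            + Bytes.toString(r.getValue(Bytes.toBytes("text"),
                                                        Bytes.toBytes("body"))));
                    }
                } finally {
                    scanner.close();
                }
            } finally {
                table.close();
            }
        }
    }

The scan touches exactly one key range, which is why this design is so convenient to query, and also why every read for A lands on whichever node hosts that range.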
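For rowkey design #2, the lightest-weight alternative to a full map-reduce job is a scan with a server-side filter: the filter logic runs inside every region server, so all ten nodes share the work, but each of them still reads through its whole share of the table. A sketch under the same assumptions as above (the qualifier 'name' for the user column is likewise made up):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.CompareFilter;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FetchPostingsByFilter {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "postings");
            try {
                // Rowkeys are random UUIDs; the owner is stored in the
                // column 'user:name'. The filter is evaluated server-side
                // on each region, not in the client.
                SingleColumnValueFilter byUser = new SingleColumnValueFilter(
                    Bytes.toBytes("user"), Bytes.toBytes("name"),
                    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("A"));
                byUser.setFilterIfMissing(true); // drop rows lacking the column
                Scan scan = new Scan();
                scan.setFilter(byUser);
                ResultScanner scanner = table.getScanner(scan);
                try {
                    for (Result r : scanner) {
                        System.out.println(Bytes.toString(r.getValue(
                            Bytes.toBytes("text"), Bytes.toBytes("body"))));
                    }
                } finally {
                    scanner.close();
                }
            } finally {
                table.close();
            }
        }
    }

This keeps the "all traffic hits all 10 servers" property of design #2, at the cost of a full table read per fetch.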
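Tao Xiao's remark about a random number as the row key's prefix points at a middle ground between the two designs, commonly called salting: a bucket computed deterministically from the logical key keeps the key reconstructable (no map-reduce needed), while spreading one user's rows, and hence the read traffic for A, over all regions. The price is one scan per bucket instead of a single scan. A sketch; the bucket count and key layout are made up:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedRowkeys {
        static final int BUCKETS = 10; // made up: one bucket per node

        // Write side: "07-A003" instead of "A003". The bucket is derived
        // from the full logical key, so it is reproducible, but
        // consecutive postings of one user spread over all buckets
        // (and thus over all regions).
        static String saltedKey(String user, String postingId) {
            int bucket = ((user + postingId).hashCode() & 0x7fffffff) % BUCKETS;
            return String.format("%02d-%s%s", bucket, user, postingId);
        }

        // Read side: one prefix scan per bucket instead of a single scan.
        // The scans are independent and could be issued in parallel.
        static List<Result> fetchUser(HTable table, String user) throws IOException {
            List<Result> all = new ArrayList<Result>();
            for (int b = 0; b < BUCKETS; b++) {
                String prefix = String.format("%02d-%s", b, user);
                Scan scan = new Scan();
                scan.setStartRow(Bytes.toBytes(prefix));
                // '~' (0x7E) sorts after all digits and letters, so this
                // stop row ends the range just past the prefixed keys.
                scan.setStopRow(Bytes.toBytes(prefix + "~"));
                ResultScanner scanner = table.getScanner(scan);
                try {
                    for (Result r : scanner) {
                        all.add(r);
                    }
                } finally {
                    scanner.close();
                }
            }
            return all;
        }
    }

Whether the extra scans pay off depends on exactly the factors yonghu lists above: data size, access frequency, access pattern and network bandwidth.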