Hi Tom,

Isn't the bucket we use the same thing?

So if I understand correctly, you are not using automatic splitting, but instead split through a manual process or a process running alongside HBase?

Regarding the recommendation to merge empty regions: how did you merge regions? I thought this capability exists only in 0.96?
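To make the comparison concrete, here is a minimal sketch of the single-byte prefix scheme Tom describes in the quoted message below. It is only an illustration under assumptions: the class name `SaltedKeys`, the helper names, and the particular 8-bit hash are all hypothetical (Tom does not say which hash he uses), and no HBase client calls are made. It only shows how the salted key is built, which 256 start/stop pairs a logical range scan would need, and how the per-salt results could be merged back into logical key order.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of a one-byte salted row key. All names here are hypothetical. */
final class SaltedKeys {

    /** Simple 8-bit hash of the original key (an assumed choice of hash). */
    static byte saltOf(byte[] key) {
        int h = 0;
        for (byte b : key) {
            h = 31 * h + (b & 0xff);
        }
        return (byte) h; // keep only the low 8 bits
    }

    /** Returns a new array: [salt byte] followed by the given key bytes. */
    static byte[] prefix(byte salt, byte[] key) {
        byte[] out = new byte[key.length + 1];
        out[0] = salt;
        System.arraycopy(key, 0, out, 1, key.length);
        return out;
    }

    /** Builds the stored ("true") key for a logical key. */
    static byte[] salt(byte[] key) {
        return prefix(saltOf(key), key);
    }

    /** Strips the salt byte to recover the logical key. */
    static byte[] unsalt(byte[] salted) {
        return Arrays.copyOfRange(salted, 1, salted.length);
    }

    /** One {start, stop} pair per salt value for a logical [start, stop) scan. */
    static List<byte[][]> scanRanges(byte[] start, byte[] stop) {
        List<byte[][]> ranges = new ArrayList<>(256);
        for (int s = 0; s < 256; s++) {
            ranges.add(new byte[][] {prefix((byte) s, start), prefix((byte) s, stop)});
        }
        return ranges;
    }

    /** Unsigned lexicographic comparison, the order HBase uses for row keys. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    /** Combines the per-salt scan results into one list sorted by logical key. */
    static List<byte[]> mergeByLogicalKey(List<List<byte[]>> perSaltResults) {
        List<byte[]> all = new ArrayList<>();
        for (List<byte[]> rows : perSaltResults) {
            for (byte[] salted : rows) {
                all.add(unsalt(salted));
            }
        }
        all.sort(SaltedKeys::compareUnsigned);
        return all;
    }

    public static void main(String[] args) {
        byte[] key = "customer42|1384560000000".getBytes(StandardCharsets.UTF_8);
        byte[] salted = salt(key);
        System.out.println("salt byte = " + (salted[0] & 0xff));
        System.out.println("scans needed = " + scanRanges(key, key).size());
    }
}
```

A point lookup only needs `salt(key)`, since the salt can be recomputed from the key itself. A range scan pays for the even write distribution with 256 separate scans plus a merge step, which is the trade-off Tom mentions.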

On Saturday, November 16, 2013, Tom Brown wrote:

> We have solved this by prefixing each key with a single byte. The byte is
> based on a very simple 8-bit hash of the record. If you know exactly which
> row you are looking for, you can rehash your row to create the true key.
>
> Scans are a little more complex because you have to issue 256 scans instead
> of 1 scan, and interpolate the results.
>
> Another thing we did was write a utility to compute all the region sizes in
> a list, and recommend merges of now-empty regions, and splits of hot
> regions.
>
> Together, those two items solve the problem quite nicely for us. We haven't
> quite got to your scale yet, so YMMV.
>
> --Tom
>
> On Friday, November 15, 2013, Ted Yu wrote:
>
> > bq. you must have your customerId, timestamp in the rowkey since you query
> > on it
> >
> > Have you looked at this API in Scan?
> >
> > public Scan setTimeRange(long minStamp, long maxStamp)
> >
> > Cheers
> >
> > On Fri, Nov 15, 2013 at 1:28 PM, Asaf Mesika <asaf.mes...@gmail.com>
> > wrote:
> >
> > > The problem is that I do know my rowkey design, and it follows people's
> > > best practice, but generates a really bad use case which I can't seem to
> > > know how to solve yet.
> > >
> > > The rowkey as I said earlier is:
> > > <customerId><bucket><timestampInMs><uniqueId>
> > > So when, for example, you have 1000 customers, and bucket ranges from 1 to
> > > 16, you eventually end up with:
> > > * 30k regions - What happens, as I presume: you start with one region
> > > hosting ALL customers, which is just one. As you pour in more customers and
> > > more data, the region splitting kicks in. So, after a while, you get to a
> > > situation in which most regions host a specific customerId, bucket and
> > > time duration. For example: customer #10001, bucket 6, 01/07/2013 00:00 -
> > > 02/07/2013 17:00.
> > > * Empty regions - the first really bad consequence of what I told before is
> > > that when the time duration is over, no data will ever be written to this
> > > region. And worst - when the TTL you set (let's say 1 month) is over and
> > > it's 03/08/2013, this region gets empty!
> > >
> > > The thing is that you must have your customerId, timestamp in the rowkey
> > > since you query on it, but when you do, you will essentially get regions
> > > which will not get any more writes to them, and after TTL become zombie
> > > regions :)
> > >
> > > The second bad part of this rowkey design is that some customers will have
> > > significantly less traffic than other customers, thus in essence their
> > > regions will get written at a very slow rate compared with the high-traffic
> > > customers. When this happens on the same RS - bam: the slow region Puts are
> > > causing the WAL Queue to get bigger over time, since its region never gets
> > > to Max Region Size (256MB in our case) thus never gets flushed, thus stays
> > > in the 1st WAL file. Until when? Until we hit max logs file permitted (32)
> > > and then regions are flushed forcibly. When this happens, we get about 100
> > > regions with 3k-3mb store files. You can imagine what happens next.
> > >
> > > The weirdest thing here is that this rowkey design is very common - nothing
> > > fancy here, so in essence this phenomenon should have happened to a lot of
> > > people - but for some reason, I don't see that much writing about it.
> > >
> > > Thanks!
> > >
> > > Asaf
> > >
> > > On Fri, Nov 15, 2013 at 3:51 AM, Jia Wang <ra...@appannie.com> wrote:
> > >
> > > > Then the case is simple, as I said "check your row key design, you can find
> > > > the start and end row key for each region, from which you can know why your
> > > > request with a specific row key doesn't hit a specified region"
> > > >
> > > > Cheers
> > > > Ramon
> > > >
> > > > On Thu, Nov 14, 2013 at 8:47 PM, Asaf Mesika <asaf.mes...@gmail.com>
> > > > wrote:
> > > >
> > > > > It's from the same table.
> > > > > The thing is that some <customerId> simply have less data saved in HBase,
> > > > > while others have x50 (max) data.
> > > > > I'm trying to check how people designed their rowkey around it, or had
> > > > > other out-of-the-box solution for it.
> > > > >
> > > > > On Thu, Nov 14, 2013 at 12:06 PM, Jia Wang <ra...@appannie.com> wrote:
> > > > >
> > > > > > Hi