Not with the amount of data we have. We store roughly 50TB of data for this
table across 30 RS. Since we use the default max region size (10GB) and the
default split policy, we end up with roughly 10k regions containing data and
20k empty regions (regions whose time range in the rowkey has passed, as
explained in previous replies).
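
To put rough numbers on that, here is a back-of-the-envelope sketch in Java.
It assumes 1TB = 1024GB and an even spread of regions across the region
servers, neither of which is stated anywhere in this thread; the class and
method names are made up for illustration.

```java
public class RegionMath {
    // Average size of a non-empty region, in GB (assumes 1TB = 1024GB).
    public static double avgRegionSizeGb(double totalTb, long regionsWithData) {
        return totalTb * 1024.0 / regionsWithData;
    }

    // Regions hosted per region server, assuming an even spread (an assumption).
    public static long regionsPerServer(long totalRegions, int regionServers) {
        return totalRegions / regionServers;
    }

    public static void main(String[] args) {
        // Figures from this mail: ~50TB, ~10k regions with data, ~20k empty, 30 RS.
        System.out.println("avg GB per non-empty region: " + avgRegionSizeGb(50, 10_000));
        System.out.println("regions per RS: " + regionsPerServer(10_000 + 20_000, 30));
    }
}
```

With these inputs the average non-empty region comes out at about half of the
10GB max, which is what you would expect from regions that were created by
splits.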
So I guess that when we started ingesting data, we first reached a state of
1 region per customer, but because of the sheer volume we quickly reached a
state where each region held a specific customer id and bucket (out of 16
buckets), and after a while a specific date range within that bucket.
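
A minimal sketch of why the splits end up that way, in plain Java without the
HBase client. The field widths (4-byte customerId, 1-byte bucket, 8-byte
timestamp, 8-byte uniqueId) and the names RowKeySketch, rowKey and compare are
assumptions for illustration only; our actual encoding is not shown in this
thread. HBase orders rowkeys as unsigned bytes, so for one customer and bucket
the rows sort by time, and each split boundary falls on a point in time.

```java
import java.nio.ByteBuffer;

public class RowKeySketch {
    // <customerId><bucket><timestampInMs><uniqueId>, big-endian.
    // Field widths are hypothetical; the thread does not specify them.
    public static byte[] rowKey(int customerId, int bucket, long timestampMs, long uniqueId) {
        return ByteBuffer.allocate(4 + 1 + 8 + 8)
                .putInt(customerId)
                .put((byte) bucket)
                .putLong(timestampMs)
                .putLong(uniqueId)
                .array();
    }

    // Unsigned lexicographic byte comparison, the ordering HBase uses for rowkeys.
    public static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // Same customer and bucket, two different timestamps in ms.
        byte[] earlier = rowKey(10001, 6, 1372636800000L, 42L);
        byte[] later = rowKey(10001, 6, 1372784400000L, 7L);
        // Same customer, next bucket, smallest possible timestamp.
        byte[] nextBucket = rowKey(10001, 7, 0L, 0L);

        System.out.println("earlier sorts before later: " + (compare(earlier, later) < 0));
        System.out.println("bucket 6 sorts before bucket 7: " + (compare(later, nextBucket) < 0));
    }
}
```

Because new writes for a given customer and bucket always carry a later
timestamp, they always land in the last region of that key range, and the
earlier regions stop receiving writes.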



On Sat, Nov 16, 2013 at 8:16 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> bq. all regions of that customer
>
> Since the rowkey starts with <customerId>, any single customer would only
> span few regions (normally 1 region), right ?
>
>
> On Fri, Nov 15, 2013 at 9:56 PM, Asaf Mesika <asaf.mes...@gmail.com>
> wrote:
>
> > But when you read, you have to approach all regions of that customer,
> > instead of pinpointing just one which contains that hour you want for
> > example.
> >
> > On Friday, November 15, 2013, Ted Yu wrote:
> >
> > > > bq. you must have your customerId, timestamp in the rowkey since you
> > > > query on it
> > >
> > > Have you looked at this API in Scan ?
> > >
> > >   public Scan setTimeRange(long minStamp, long maxStamp)
> > >
> > >
> > > Cheers
> > >
> > >
> > > On Fri, Nov 15, 2013 at 1:28 PM, Asaf Mesika <asaf.mes...@gmail.com>
> > > wrote:
> > >
> > > > > The problem is that I do know my rowkey design, and it follows
> > > > > people's best practice, but it generates a really bad use case
> > > > > which I don't yet know how to solve.
> > > >
> > > > The rowkey as I said earlier is:
> > > > <customerId><bucket><timestampInMs><uniqueId>
> > > > > So when, for example, you have 1000 customers, and bucket ranges
> > > > > from 1 to 16, you eventually end up with:
> > > > > * 30k regions - What happens, as I presume: you start with one region
> > > > > hosting ALL customers, which is just one. As you pour in more
> > > > > customers and more data, the region splitting kicks in. So, after a
> > > > > while, you get to a situation in which most regions host a specific
> > > > > customerId, bucket and time duration. For example: customer #10001,
> > > > > bucket 6, 01/07/2013 00:00 - 02/07/2013 17:00.
> > > > > * Empty regions - the first really bad consequence of what I
> > > > > described above is that when the time duration is over, no data will
> > > > > ever be written to this region again. And worse: when the TTL you set
> > > > > (let's say 1 month) is over and it's 03/08/2013, this region becomes
> > > > > empty!
> > > >
> > > > > The thing is that you must have your customerId, timestamp in the
> > > > > rowkey since you query on it, but when you do, you will essentially
> > > > > get regions which will not get any more writes to them, and after TTL
> > > > > become zombie regions :)
> > > >
> > > > > The second bad part of this rowkey design is that some customers will
> > > > > have significantly less traffic than other customers, so in essence
> > > > > their regions will get written at a very slow rate compared with the
> > > > > high-traffic customers. When this happens on the same RS - bam: the
> > > > > slow region's Puts cause the WAL queue to get bigger over time, since
> > > > > its region never gets to Max Region Size (256MB in our case), thus
> > > > > never gets flushed, thus stays in the 1st WAL file. Until when? Until
> > > > > we hit the max number of log files permitted (32), and then the
> > > > > regions are forcibly flushed. When this happens, we get about 100
> > > > > regions with 3KB-3MB store files. You can imagine what happens next.
> > > >
> > > > > The weirdest thing here is that this rowkey design is very common -
> > > > > nothing fancy here, so in essence this phenomenon should have
> > > > > happened to a lot of people - but for some reason, I don't see much
> > > > > written about it.
> > > >
> > > > Thanks!
> > > >
> > > > Asaf
> > > >
> > > >
> > > >
> > > > > On Fri, Nov 15, 2013 at 3:51 AM, Jia Wang <ra...@appannie.com>
> > > > > wrote:
> > > >
> > > > > > Then the case is simple, as I said: "check your row key design,
> > > > > > you can find the start and end row key for each region, from which
> > > > > > you can know why your request with a specific row key doesn't hit
> > > > > > a specified region"
> > > > >
> > > > > Cheers
> > > > > Ramon
> > > > >
> > > > >
> > > > > > On Thu, Nov 14, 2013 at 8:47 PM, Asaf Mesika <asaf.mes...@gmail.com>
> > > > > > wrote:
> > > > >
> > > > > > It's from the same table.
> > > > > > > The thing is that some <customerId> simply have less data saved
> > > > > > > in HBase, while others have x50 (max) data.
> > > > > > > I'm trying to check how people designed their rowkey around it,
> > > > > > > or had other out-of-the-box solutions for it.
> > > > > >
> > > > > >
> > > > > >
> > > > > > > On Thu, Nov 14, 2013 at 12:06 PM, Jia Wang <ra...@appannie.com>
> > > > > > > wrote:
> > > > > >
> > > > > > > Hi
> > > > > > >
> > > > > > > > Are the regions from the same table? If they are, check your
> > > > > > > > row key design: you can find the start and end row key for
> > > > > > > > each region, from which you can know why your request with a
> > > > > > > > specific row key doesn't hit a specified region.
> > > > > > > >
> > > > > > > > If the regions are for different tables, you may consider
> > > > > > > > combining some cold regions for some tables.
> > > > > > >
> > > > > > > Thanks
> > > > > > > Ramon
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Nov 14, 2013 at 4:59 PM, Asaf Mesika <
> >
>
