You can also look at pre-splitting the regions for timeseries-type data.

On Mon, Sep 3, 2012 at 1:11 PM, Jean-Marc Spaggiari
<jean-m...@spaggiari.org> wrote:
> Initially your table will contain only one region.
>
> When you reach its maximum size, it will split into 2 regions which
> are going to be distributed over the cluster.
>
> The 2 regions are going to be ordered by keys. So all entries starting
> with 1 will be on the first region, and the middle key (let's say
> 25......) will start the 2nd region.
>
> So region 1 will contain 1 to 24999, and the 2nd region will contain
> keys from 25 onwards.
>
> And so on.
>
> Since keys are ordered, all keys starting with a 1 are going to be
> close by on the same region, except if the region is big enough to be
> split and served by more region servers.
>
> So when you load all your entries starting with 1, or 3, they will go
> to one unique region. Only entries starting with 2 are going to be
> sometimes on region 1, sometimes on region 2.
>
> Of course, the more data you load, the more regions you will have, and
> the less hotspotting you will have. But at the beginning, it might be
> difficult for some of your servers.
>
>
> 2012/9/3, Eric Czech <e...@nextbigsound.com>:
> > With regards to:
> >
> >> If you have 3 region servers and your data is evenly distributed, that
> >> means all the data starting with a 1 will be on server 1, and so on.
> >
> > Assuming there are multiple regions in existence for each prefix, why
> > would they not be distributed across all the machines?
> >
> > In other words, if there are many regions with keys that generally
> > start with 1, why would they ALL be on server 1 like you said? It's
> > my understanding that regions aren't placed around the cluster
> > according to the range of information they contain, so I'm not quite
> > following that explanation.
> >
> > Putting the higher-cardinality values in front of the key isn't
> > entirely out of the question, but I'd like to use the low-cardinality
> > key out front for the sake of selecting rows for MapReduce jobs.
> > Otherwise, I always have to scan the full table for each job.
> >
> > On Mon, Sep 3, 2012 at 3:20 PM, Jean-Marc Spaggiari
> > <jean-m...@spaggiari.org> wrote:
> >> Yes, you're right, but again, it will depend on the number of
> >> region servers and the distribution of your data.
> >>
> >> If you have 3 region servers and your data is evenly distributed, that
> >> means all the data starting with a 1 will be on server 1, and so on.
> >>
> >> So if you write a million lines starting with a 1, they will all
> >> land on the same server.
> >>
> >> Of course, you can pre-split your table, like 1a to 1z, and assign each
> >> region to one of your 3 servers. That way you will avoid hotspotting
> >> even if you write millions of lines starting with a 1.
> >>
> >> If you have one hundred regions, you will face the same issue at the
> >> beginning, but the more data you add, the more your table will be
> >> split across all the servers and the less hotspotting you will have.
> >>
> >> Can't you just reverse your fields and put the 1 to 30 at the end of
> >> the key?
> >>
> >> 2012/9/3, Eric Czech <e...@nextbigsound.com>:
> >>> Thanks for the response Jean-Marc!
> >>>
> >>> I understand what you're saying, but in a more extreme case, let's say
> >>> I'm choosing the leading number in the range 1 - 3 instead of 1 - 30.
> >>> In that case, it seems like all of the data for any one prefix would
> >>> already be split well across the cluster, and as long as the second
> >>> value isn't written sequentially, there wouldn't be an issue.
> >>>
> >>> Is my reasoning there flawed at all?
> >>>
> >>> On Mon, Sep 3, 2012 at 2:31 PM, Jean-Marc Spaggiari
> >>> <jean-m...@spaggiari.org> wrote:
> >>>> Hi Eric,
> >>>>
> >>>> In HBase, data is stored sequentially based on the key's alphabetical
> >>>> order.
> >>>>
> >>>> It will depend on the number of regions and region servers you
> >>>> have, but if you write data from 23AAAAAA to 23ZZZZZZ, it will most
> >>>> probably go to the same region even if the cardinality of the 2nd
> >>>> part of the key is high.
> >>>>
> >>>> If the first number is always changing between 1 and 30 for each
> >>>> write, then you will reach multiple regions/servers if you have
> >>>> them; otherwise, you might have some hot-spotting.
> >>>>
> >>>> JM
> >>>>
> >>>> 2012/9/3, Eric Czech <e...@nextbigsound.com>:
> >>>>> Hi everyone,
> >>>>>
> >>>>> I was curious whether or not I should expect any write hot spots
> >>>>> if I structured my composite keys in a way such that the first
> >>>>> field is a low-cardinality value (maybe 30 distinct values) and
> >>>>> the next field contains a very high-cardinality value that would
> >>>>> not be written sequentially.
> >>>>>
> >>>>> More concisely, I want to do this:
> >>>>>
> >>>>> Given one number between 1 and 30, write many millions of rows with
> >>>>> keys like <number chosen> : <some generally distinct, non-sequential
> >>>>> value>
> >>>>>
> >>>>> Would there be any problem with the millions of writes happening
> >>>>> with the same first field key prefix even if the second field is
> >>>>> largely unique?
> >>>>>
> >>>>> Thank you!
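As a concrete follow-up to the pre-splitting suggested in this thread, here is a minimal sketch of how split points for the 30 prefixes could be generated. Everything here is an assumption for illustration: it presumes row keys of the form "<NN>:<distinct value>" with the prefix zero-padded to two digits, since unpadded decimal strings sort the wrong way lexicographically ("3" sorts after "25"); the method name `prefixSplits` is made up. The resulting `byte[][]` is the shape the 2012-era HBase client expects in `HBaseAdmin.createTable(HTableDescriptor, byte[][] splitKeys)`.

```java
public class SplitKeys {
    // Split points for a table keyed by "<NN>:<...>" with NN in 01..30.
    // N prefixes need N-1 split points ("02" .. "30"): HBase creates one
    // region per gap, so each prefix gets its own region from the start.
    static byte[][] prefixSplits(int numPrefixes) {
        byte[][] splits = new byte[numPrefixes - 1][];
        for (int i = 2; i <= numPrefixes; i++) {
            // Zero-pad so lexicographic order matches numeric order.
            splits[i - 2] = String.format("%02d", i).getBytes();
        }
        return splits;
    }

    public static void main(String[] args) {
        byte[][] splits = prefixSplits(30);
        System.out.println(splits.length);         // 29
        System.out.println(new String(splits[0]));  // 02
        System.out.println(new String(splits[28])); // 30
        // These would then be handed to the admin API, e.g. (not run here):
        //   new HBaseAdmin(conf).createTable(tableDescriptor, splits);
    }
}
```

With one region per prefix pre-created and assigned across the region servers, the initial "all writes for prefix 1 land on one server" problem from earlier in the thread goes away, even before any organic splits happen.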
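Finally, a small sketch of the composite key Eric proposes. The `rowKey` helper, the ":" separator, and the zero-padding are assumptions, not anything specified in the thread. Sorting a few keys the way HBase stores them shows both sides of the trade-off: rows for one series are contiguous (so a MapReduce job can scan just that series), but writes for one series all target the same region until it splits.

```java
import java.util.Arrays;

public class CompositeKeys {
    // Builds the proposed key "<NN>:<distinct value>", zero-padding the
    // low-cardinality field so string order matches numeric order.
    static String rowKey(int series, String distinct) {
        return String.format("%02d:%s", series, distinct);
    }

    public static void main(String[] args) {
        String[] keys = {
            rowKey(7, "f81d4fae"),
            rowKey(23, "a3c2"),
            rowKey(7, "0b1d"),
            rowKey(3, "zzz"),
        };
        Arrays.sort(keys); // HBase stores rows in this lexicographic order
        System.out.println(Arrays.toString(keys));
        // -> [03:zzz, 07:0b1d, 07:f81d4fae, 23:a3c2]
        // All "07:" rows are contiguous, so a scan from "07:" (inclusive)
        // to "08:" (exclusive) reads exactly series 7 -- but every write
        // for series 7 also lands in the same region until it splits.
    }
}
```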