You're the man, Jean-Marc .. info is much appreciated.

On Tue, Sep 4, 2012 at 1:22 PM, Jean-Marc Spaggiari <jean-m...@spaggiari.org> wrote:
> Hi Eric,
>
> Yes, you can split an existing region. You can do that easily with the
> web interface. After the split, at some point, one of the 2 regions
> will be moved to another server to balance the load. You can also
> move it manually.
>
> JM
>
> 2012/9/4, Eric Czech <e...@nextbigsound.com>:
> > Thanks again, both of you.
> >
> > I'll look at pre-splitting the regions so that there isn't so much initial
> > contention. The issue I'll have, though, is that I won't know all the prefix
> > values at first and will have to be able to add them later.
> >
> > Is it possible to split regions on an existing table? Or is that
> > inadvisable in favor of doing the splits when the table is created?
> >
> > On Mon, Sep 3, 2012 at 5:19 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
> >
> >> You can also look at pre-splitting the regions for time-series type data.
> >>
> >> On Mon, Sep 3, 2012 at 1:11 PM, Jean-Marc Spaggiari <jean-m...@spaggiari.org> wrote:
> >>
> >> > Initially your table will contain only one region.
> >> >
> >> > When you reach its maximum size, it will split into 2 regions,
> >> > which are going to be distributed over the cluster.
> >> >
> >> > The 2 regions are going to be ordered by keys. So all entries starting
> >> > with 1 will be on the first region, and the middle key (let's say
> >> > 25......) will start the 2nd region.
> >> >
> >> > So region 1 will contain 1 to 24999..., and the 2nd region will contain
> >> > keys from 25 on.
> >> >
> >> > And so on.
> >> >
> >> > Since keys are ordered, all keys starting with a 1 are going to be
> >> > close by on the same region, except if the region is big enough to be
> >> > split and served by more region servers.
> >> >
> >> > So when you load all your entries starting with 1, or 3, they will
> >> > go to one unique region. Only entries starting with 2 are going to be
> >> > sometimes on region 1, sometimes on region 2.
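JM's description above boils down to this: a region is a contiguous key range, and a row key lands in the region whose range contains it. A minimal sketch of that idea (the split point, sample keys, and the `region_for` helper are all hypothetical illustrations, not HBase API):

```python
import bisect

# One split point at "25", mirroring JM's example: two regions, ordered by key.
# Region 0 covers [start, "25"); region 1 covers ["25", end).
split_points = ["25"]

def region_for(key, splits):
    """Return the index of the region whose key range contains `key`."""
    return bisect.bisect_right(splits, key)

print(region_for("1AAAA", split_points))   # keys starting with "1" -> region 0
print(region_for("3XYZ", split_points))    # keys starting with "3" -> region 1
print(region_for("24ZZZ", split_points))   # "2..." keys straddle the split
print(region_for("25AAA", split_points))
```

This is why a burst of writes sharing one prefix goes to a single region: lexicographic order keeps them contiguous, so one key range (and one server) absorbs them all.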
> >> >
> >> > Of course, the more data you load, the more regions you will have,
> >> > and the less hotspotting you will have. But at the beginning, it
> >> > might be difficult for some of your servers.
> >> >
> >> > 2012/9/3, Eric Czech <e...@nextbigsound.com>:
> >> > > With regards to:
> >> > >
> >> > >> If you have 3 region servers and your data is evenly distributed,
> >> > >> that means all the data starting with a 1 will be on server 1, and so on.
> >> > >
> >> > > Assuming there are multiple regions in existence for each prefix, why
> >> > > would they not be distributed across all the machines?
> >> > >
> >> > > In other words, if there are many regions with keys that generally
> >> > > start with 1, why would they ALL be on server 1 like you said? It's
> >> > > my understanding that the regions aren't placed around the cluster
> >> > > according to the range of information they contain, so I'm not quite
> >> > > following that explanation.
> >> > >
> >> > > Putting the higher-cardinality values in front of the key isn't
> >> > > entirely out of the question, but I'd like to use the low-cardinality
> >> > > key out front for the sake of selecting rows for MapReduce jobs.
> >> > > Otherwise, I always have to scan the full table for each job.
> >> > >
> >> > > On Mon, Sep 3, 2012 at 3:20 PM, Jean-Marc Spaggiari
> >> > > <jean-m...@spaggiari.org> wrote:
> >> > >> Yes, you're right, but again, it will depend on the number of
> >> > >> region servers and the distribution of your data.
> >> > >>
> >> > >> If you have 3 region servers and your data is evenly distributed,
> >> > >> that means all the data starting with a 1 will be on server 1, and so on.
> >> > >>
> >> > >> So if you write a million lines starting with a 1, they will all
> >> > >> land on the same server.
> >> > >>
> >> > >> Of course, you can pre-split your table, like 1a to 1z, and assign
> >> > >> each region to one of your 3 servers.
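The pre-splitting idea discussed above amounts to choosing the prefix boundaries as split points, so each known prefix starts its own region. A minimal sketch, assuming the prefixes are the strings "1" through "30" and keys sort lexicographically as they do in HBase (variable names are illustrative):

```python
# Known low-cardinality prefixes, sorted the way HBase sorts keys:
# lexicographically, so "10" comes before "2".
prefixes = sorted(str(n) for n in range(1, 31))

# N prefixes need N-1 split points to produce N regions; the first
# prefix is covered by the table's implicit open-ended start key.
split_points = prefixes[1:]

print(len(split_points))   # 29 splits -> 30 regions
print(split_points[:3])    # lexicographic order: "10", "11", "12"
```

If this matches your schema, the resulting strings could then be handed to the `SPLITS` option of the HBase shell's `create` command at table-creation time; new prefixes added later would need a manual split, as discussed earlier in the thread.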
> >> > >> That way you will avoid hotspotting
> >> > >> even if you write a million lines starting with a 1.
> >> > >>
> >> > >> If you have one hundred regions, you will face the same issue at the
> >> > >> beginning, but the more data you add, the more your table will be
> >> > >> split across all the servers and the less hotspotting you will have.
> >> > >>
> >> > >> Can't you just reverse your fields and put the 1 to 30 at the end
> >> > >> of the key?
> >> > >>
> >> > >> 2012/9/3, Eric Czech <e...@nextbigsound.com>:
> >> > >>> Thanks for the response, Jean-Marc!
> >> > >>>
> >> > >>> I understand what you're saying, but in a more extreme case, let's
> >> > >>> say I'm choosing the leading number on the range 1 - 3 instead of
> >> > >>> 1 - 30. In that case, it seems like all of the data for any one
> >> > >>> prefix would already be split well across the cluster, and as long
> >> > >>> as the second value isn't written sequentially, there wouldn't be
> >> > >>> an issue.
> >> > >>>
> >> > >>> Is my reasoning there flawed at all?
> >> > >>>
> >> > >>> On Mon, Sep 3, 2012 at 2:31 PM, Jean-Marc Spaggiari
> >> > >>> <jean-m...@spaggiari.org> wrote:
> >> > >>>> Hi Eric,
> >> > >>>>
> >> > >>>> In HBase, data is stored sequentially based on the key's
> >> > >>>> alphabetical order.
> >> > >>>>
> >> > >>>> It will depend on the number of regions and region servers you
> >> > >>>> have, but if you write data from 23AAAAAA to 23ZZZZZZ, they will
> >> > >>>> most probably go to the same region even if the cardinality of
> >> > >>>> the 2nd part of the key is high.
> >> > >>>>
> >> > >>>> If the first number is always changing between 1 and 30 for each
> >> > >>>> write, then you will reach multiple regions/servers; else, you
> >> > >>>> might have some hot-spotting.
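The field-reversal suggestion above trades write distribution for scan locality, and a small sketch makes the trade-off visible. Both key builders and the sample IDs below are hypothetical, just to show how the two layouts sort:

```python
# Layout discussed in the thread: low-cardinality bucket first.
# Good for prefix scans (e.g. MapReduce over one bucket), but all
# writes for one bucket are contiguous -> potential hotspot.
def key_prefix_first(bucket, item_id):
    return f"{bucket}:{item_id}"

# Reversed layout: high-cardinality field first. Writes spread out,
# but one bucket's rows interleave with every other bucket's rows.
def key_reversed(bucket, item_id):
    return f"{item_id}:{bucket}"

ids = ["zq81", "ab03", "mm47"]  # stand-ins for non-sequential IDs
prefix_first = sorted(key_prefix_first(b, i) for b in ("1", "2", "3") for i in ids)
print(prefix_first[:3])  # every bucket-1 key is adjacent in sort order
```

With `key_prefix_first`, selecting one bucket is a cheap contiguous range scan, which is exactly the property Eric wants to keep; with `key_reversed`, that selection becomes a full-table scan with a filter.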
> >> > >>>>
> >> > >>>> JM
> >> > >>>>
> >> > >>>> 2012/9/3, Eric Czech <e...@nextbigsound.com>:
> >> > >>>>> Hi everyone,
> >> > >>>>>
> >> > >>>>> I was curious whether or not I should expect any write hot spots
> >> > >>>>> if I structured my composite keys in a way such that the first
> >> > >>>>> field is a low-cardinality value (maybe 30 distinct values) and
> >> > >>>>> the next field contains a very high-cardinality value that would
> >> > >>>>> not be written sequentially.
> >> > >>>>>
> >> > >>>>> More concisely, I want to do this:
> >> > >>>>>
> >> > >>>>> Given one number between 1 and 30, write many millions of rows
> >> > >>>>> with keys like <number chosen> : <some generally distinct,
> >> > >>>>> non-sequential value>
> >> > >>>>>
> >> > >>>>> Would there be any problem with the millions of writes happening
> >> > >>>>> with the same first field key prefix even if the second field is
> >> > >>>>> largely unique?
> >> > >>>>>
> >> > >>>>> Thank you!
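The scenario in the original question can be simulated to confirm the thread's conclusion: with a fixed low-cardinality prefix, a unique second field does not spread writes at all. A sketch assuming a hypothetical table with three regions split at "2" and "3":

```python
import bisect
import uuid

# Hypothetical pre-split table: three regions with boundaries at "2" and "3".
splits = ["2", "3"]

def region_for(key):
    """Index of the region whose key range contains `key`."""
    return bisect.bisect_right(splits, key)

# Many writes sharing the prefix "2", each with a unique second field.
regions_hit = {region_for(f"2:{uuid.uuid4().hex}") for _ in range(10_000)}
print(regions_hit)  # a single region absorbs every write for prefix "2"
```

The uniqueness of the second field only distributes keys *within* the prefix's key range; until that range is split into enough regions (and those regions are balanced across servers), every write for one prefix hits one region server.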