You can also look at pre-splitting the regions for timeseries-type data.

On Mon, Sep 3, 2012 at 1:11 PM, Jean-Marc Spaggiari
<jean-m...@spaggiari.org> wrote:
> Initially your table will contain only one region.
>
> When you reach its maximum size, it will split into 2 regions which
> are going to be distributed over the cluster.
>
> The 2 regions are going to be ordered by keys. So all entries starting
> with 1 will be on the first region, and the middle key (let's say
> 25......) will start the 2nd region.
>
> So region 1 will contain 1 to 24999, and the 2nd region will contain
> keys from 25 onwards.
>
> And so on.
>
> Since keys are ordered, all keys starting with a 1 are going to be
> close by on the same region, except if the region is big enough to be
> split and served by more region servers.
>
> So when you load all your entries starting with 1, or 3, they will go
> to one unique region. Only entries starting with 2 are going to be
> sometimes on region 1, sometimes on region 2.
>
> Of course, the more data you load, the more regions you will have, and
> the less hotspotting you will have. But at the beginning, it might be
> difficult for some of your servers.
>
>
> 2012/9/3, Eric Czech <e...@nextbigsound.com>:
> > With regards to:
> >
> >> If you have 3 region servers and your data is evenly distributed, that
> >> means all the data starting with a 1 will be on server 1, and so on.
> >
> > Assuming there are multiple regions in existence for each prefix, why
> > would they not be distributed across all the machines?
> >
> > In other words, if there are many regions with keys that generally
> > start with 1, why would they ALL be on server 1 like you said? It's
> > my understanding that regions aren't placed around the cluster
> > according to the range of information they contain, so I'm not quite
> > following that explanation.
> >
> > Putting the higher-cardinality values in front of the key isn't
> > entirely out of the question, but I'd like to use the low-cardinality
> > key out front for the sake of selecting rows for MapReduce jobs.
> > Otherwise, I always have to scan the full table for each job.
> >
> > On Mon, Sep 3, 2012 at 3:20 PM, Jean-Marc Spaggiari
> > <jean-m...@spaggiari.org> wrote:
> >> Yes, you're right, but again, it will depend on the number of
> >> region servers and the distribution of your data.
> >>
> >> If you have 3 region servers and your data is evenly distributed, that
> >> means all the data starting with a 1 will be on server 1, and so on.
> >>
> >> So if you write a million lines starting with a 1, they will all
> >> land on the same server.
> >>
> >> Of course, you can pre-split your table, like 1a to 1z, and assign each
> >> region to one of your 3 servers. That way you will avoid hotspotting
> >> even if you write millions of lines starting with a 1.
> >>
> >> If you have one hundred regions, you will face the same issue at the
> >> beginning, but the more data you add, the more your table will be
> >> split across all the servers and the less hotspotting you will have.
> >>
> >> Can't you just reverse your fields and put the 1 to 30 at the end of
> >> the key?
> >>
> >> 2012/9/3, Eric Czech <e...@nextbigsound.com>:
> >>> Thanks for the response Jean-Marc!
> >>>
> >>> I understand what you're saying, but in a more extreme case, let's say
> >>> I'm choosing the leading number in the range 1 - 3 instead of 1 - 30.
> >>> In that case, it seems like all of the data for any one prefix would
> >>> already be split well across the cluster, and as long as the second
> >>> value isn't written sequentially, there wouldn't be an issue.
> >>>
> >>> Is my reasoning there flawed at all?
> >>>
> >>> On Mon, Sep 3, 2012 at 2:31 PM, Jean-Marc Spaggiari
> >>> <jean-m...@spaggiari.org> wrote:
> >>>> Hi Eric,
> >>>>
> >>>> In HBase, data is stored sequentially based on the key's alphabetical
> >>>> order.
> >>>>
> >>>> It will depend on the number of regions and region servers you
> >>>> have, but if you write data from 23AAAAAA to 23ZZZZZZ, it will most
> >>>> probably go to the same region even if the cardinality of the 2nd
> >>>> part of the key is high.
> >>>>
> >>>> If the first number is always changing between 1 and 30 for each
> >>>> write, then you will reach multiple regions/servers if you have
> >>>> them; otherwise, you might have some hot-spotting.
> >>>>
> >>>> JM
> >>>>
> >>>> 2012/9/3, Eric Czech <e...@nextbigsound.com>:
> >>>>> Hi everyone,
> >>>>>
> >>>>> I was curious whether or not I should expect any write hot spots
> >>>>> if I structured my composite keys in a way such that the first
> >>>>> field is a low-cardinality value (maybe 30 distinct values) and
> >>>>> the next field contains a very high-cardinality value that would
> >>>>> not be written sequentially.
> >>>>>
> >>>>> More concisely, I want to do this:
> >>>>>
> >>>>> Given one number between 1 and 30, write many millions of rows with
> >>>>> keys like <number chosen> : <some generally distinct, non-sequential
> >>>>> value>
> >>>>>
> >>>>> Would there be any problem with the millions of writes happening
> >>>>> with the same first field key prefix even if the second field is
> >>>>> largely unique?
> >>>>>
> >>>>> Thank you!
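As a concrete follow-up to the pre-splitting suggested in this thread, here is a minimal sketch of how split points for the 30 prefixes could be generated. Everything here is an assumption for illustration: it presumes row keys of the form "<NN>:<distinct value>" with the prefix zero-padded to two digits, since unpadded decimal strings sort the wrong way lexicographically ("3" sorts after "25"); the method name `prefixSplits` is made up. The resulting `byte[][]` is the shape the 2012-era HBase client expects in `HBaseAdmin.createTable(HTableDescriptor, byte[][] splitKeys)`.

```java
public class SplitKeys {
    // Split points for a table keyed by "<NN>:<...>" with NN in 01..30.
    // N prefixes need N-1 split points ("02" .. "30"): HBase creates one
    // region per gap, so each prefix gets its own region from the start.
    static byte[][] prefixSplits(int numPrefixes) {
        byte[][] splits = new byte[numPrefixes - 1][];
        for (int i = 2; i <= numPrefixes; i++) {
            // Zero-pad so lexicographic order matches numeric order.
            splits[i - 2] = String.format("%02d", i).getBytes();
        }
        return splits;
    }

    public static void main(String[] args) {
        byte[][] splits = prefixSplits(30);
        System.out.println(splits.length);         // 29
        System.out.println(new String(splits[0]));  // 02
        System.out.println(new String(splits[28])); // 30
        // These would then be handed to the admin API, e.g. (not run here):
        //   new HBaseAdmin(conf).createTable(tableDescriptor, splits);
    }
}
```

With one region per prefix pre-created and assigned across the region servers, the initial "all writes for prefix 1 land on one server" problem from earlier in the thread goes away, even before any organic splits happen.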
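Finally, a small sketch of the composite key Eric proposes. The `rowKey` helper, the ":" separator, and the zero-padding are assumptions, not anything specified in the thread. Sorting a few keys the way HBase stores them shows both sides of the trade-off: rows for one series are contiguous (so a MapReduce job can scan just that series), but writes for one series all target the same region until it splits.

```java
import java.util.Arrays;

public class CompositeKeys {
    // Builds the proposed key "<NN>:<distinct value>", zero-padding the
    // low-cardinality field so string order matches numeric order.
    static String rowKey(int series, String distinct) {
        return String.format("%02d:%s", series, distinct);
    }

    public static void main(String[] args) {
        String[] keys = {
            rowKey(7, "f81d4fae"),
            rowKey(23, "a3c2"),
            rowKey(7, "0b1d"),
            rowKey(3, "zzz"),
        };
        Arrays.sort(keys); // HBase stores rows in this lexicographic order
        System.out.println(Arrays.toString(keys));
        // -> [03:zzz, 07:0b1d, 07:f81d4fae, 23:a3c2]
        // All "07:" rows are contiguous, so a scan from "07:" (inclusive)
        // to "08:" (exclusive) reads exactly series 7 -- but every write
        // for series 7 also lands in the same region until it splits.
    }
}
```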