You're the man, Jean-Marc .. info is much appreciated.

On Tue, Sep 4, 2012 at 1:22 PM, Jean-Marc Spaggiari <jean-m...@spaggiari.org> wrote:
> Hi Eric,
>
> Yes, you can split an existing region. You can do that easily with the
> web interface. After the split, at some point, one of the 2 regions
> will be moved to another server to balance the load. You can also
> move it manually.
>
> JM
>
> 2012/9/4, Eric Czech <e...@nextbigsound.com>:
> > Thanks again, both of you.
> >
> > I'll look at pre-splitting the regions so that there isn't so much initial
> > contention. The issue I'll have, though, is that I won't know all the prefix
> > values at first and will have to be able to add them later.
> >
> > Is it possible to split regions on an existing table? Or is that
> > inadvisable in favor of doing the splits when the table is created?
> >
> > On Mon, Sep 3, 2012 at 5:19 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
> >
> >> You can also look at pre-splitting the regions for time-series type data.
> >>
> >> On Mon, Sep 3, 2012 at 1:11 PM, Jean-Marc Spaggiari <jean-m...@spaggiari.org> wrote:
> >>
> >> > Initially your table will contain only one region.
> >> >
> >> > When you reach its maximum size, it will split into 2 regions,
> >> > which are going to be distributed over the cluster.
> >> >
> >> > The 2 regions are going to be ordered by keys. So all entries starting
> >> > with 1 will be on the first region, and the middle key (let's say
> >> > 25......) will start the 2nd region.
> >> >
> >> > So region 1 will contain 1 to 24999..., and the 2nd region will contain
> >> > keys from 25 on.
> >> >
> >> > And so on.
> >> >
> >> > Since keys are ordered, all keys starting with a 1 are going to be
> >> > close by on the same region, except if the region is big enough to be
> >> > split and served by more region servers.
> >> >
> >> > So when you load all your entries starting with 1, or 3, they will
> >> > go to one unique region. Only entries starting with 2 are going to be
> >> > sometimes on region 1, sometimes on region 2.
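JM's description above boils down to this: a region is a contiguous key range, and a row key lands in the region whose range contains it. A minimal sketch of that idea (the split point, sample keys, and the `region_for` helper are all hypothetical illustrations, not HBase API):

```python
import bisect

# One split point at "25", mirroring JM's example: two regions, ordered by key.
# Region 0 covers [start, "25"); region 1 covers ["25", end).
split_points = ["25"]

def region_for(key, splits):
    """Return the index of the region whose key range contains `key`."""
    return bisect.bisect_right(splits, key)

print(region_for("1AAAA", split_points))   # keys starting with "1" -> region 0
print(region_for("3XYZ", split_points))    # keys starting with "3" -> region 1
print(region_for("24ZZZ", split_points))   # "2..." keys straddle the split
print(region_for("25AAA", split_points))
```

This is why a burst of writes sharing one prefix goes to a single region: lexicographic order keeps them contiguous, so one key range (and one server) absorbs them all.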
> >> >
> >> > Of course, the more data you load, the more regions you will have,
> >> > and the less hotspotting you will have. But at the beginning, it
> >> > might be difficult for some of your servers.
> >> >
> >> > 2012/9/3, Eric Czech <e...@nextbigsound.com>:
> >> > > With regards to:
> >> > >
> >> > >> If you have 3 region servers and your data is evenly distributed,
> >> > >> that means all the data starting with a 1 will be on server 1, and so on.
> >> > >
> >> > > Assuming there are multiple regions in existence for each prefix, why
> >> > > would they not be distributed across all the machines?
> >> > >
> >> > > In other words, if there are many regions with keys that generally
> >> > > start with 1, why would they ALL be on server 1 like you said? It's
> >> > > my understanding that the regions aren't placed around the cluster
> >> > > according to the range of information they contain, so I'm not quite
> >> > > following that explanation.
> >> > >
> >> > > Putting the higher-cardinality values in front of the key isn't
> >> > > entirely out of the question, but I'd like to use the low-cardinality
> >> > > key out front for the sake of selecting rows for MapReduce jobs.
> >> > > Otherwise, I always have to scan the full table for each job.
> >> > >
> >> > > On Mon, Sep 3, 2012 at 3:20 PM, Jean-Marc Spaggiari
> >> > > <jean-m...@spaggiari.org> wrote:
> >> > >> Yes, you're right, but again, it will depend on the number of
> >> > >> region servers and the distribution of your data.
> >> > >>
> >> > >> If you have 3 region servers and your data is evenly distributed,
> >> > >> that means all the data starting with a 1 will be on server 1, and so on.
> >> > >>
> >> > >> So if you write a million lines starting with a 1, they will all
> >> > >> land on the same server.
> >> > >>
> >> > >> Of course, you can pre-split your table, like 1a to 1z, and assign
> >> > >> each region to one of your 3 servers.
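The pre-splitting idea discussed above amounts to choosing the prefix boundaries as split points, so each known prefix starts its own region. A minimal sketch, assuming the prefixes are the strings "1" through "30" and keys sort lexicographically as they do in HBase (variable names are illustrative):

```python
# Known low-cardinality prefixes, sorted the way HBase sorts keys:
# lexicographically, so "10" comes before "2".
prefixes = sorted(str(n) for n in range(1, 31))

# N prefixes need N-1 split points to produce N regions; the first
# prefix is covered by the table's implicit open-ended start key.
split_points = prefixes[1:]

print(len(split_points))   # 29 splits -> 30 regions
print(split_points[:3])    # lexicographic order: "10", "11", "12"
```

If this matches your schema, the resulting strings could then be handed to the `SPLITS` option of the HBase shell's `create` command at table-creation time; new prefixes added later would need a manual split, as discussed earlier in the thread.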
> >> > >> That way you will avoid hotspotting
> >> > >> even if you write a million lines starting with a 1.
> >> > >>
> >> > >> If you have one hundred regions, you will face the same issue at the
> >> > >> beginning, but the more data you add, the more your table will be
> >> > >> split across all the servers and the less hotspotting you will have.
> >> > >>
> >> > >> Can't you just reverse your fields and put the 1 to 30 at the end
> >> > >> of the key?
> >> > >>
> >> > >> 2012/9/3, Eric Czech <e...@nextbigsound.com>:
> >> > >>> Thanks for the response, Jean-Marc!
> >> > >>>
> >> > >>> I understand what you're saying, but in a more extreme case, let's
> >> > >>> say I'm choosing the leading number on the range 1 - 3 instead of
> >> > >>> 1 - 30. In that case, it seems like all of the data for any one
> >> > >>> prefix would already be split well across the cluster, and as long
> >> > >>> as the second value isn't written sequentially, there wouldn't be
> >> > >>> an issue.
> >> > >>>
> >> > >>> Is my reasoning there flawed at all?
> >> > >>>
> >> > >>> On Mon, Sep 3, 2012 at 2:31 PM, Jean-Marc Spaggiari
> >> > >>> <jean-m...@spaggiari.org> wrote:
> >> > >>>> Hi Eric,
> >> > >>>>
> >> > >>>> In HBase, data is stored sequentially based on the key's
> >> > >>>> alphabetical order.
> >> > >>>>
> >> > >>>> It will depend on the number of regions and region servers you
> >> > >>>> have, but if you write data from 23AAAAAA to 23ZZZZZZ, they will
> >> > >>>> most probably go to the same region even if the cardinality of
> >> > >>>> the 2nd part of the key is high.
> >> > >>>>
> >> > >>>> If the first number is always changing between 1 and 30 for each
> >> > >>>> write, then you will reach multiple regions/servers; else, you
> >> > >>>> might have some hot-spotting.
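The field-reversal suggestion above trades write distribution for scan locality, and a small sketch makes the trade-off visible. Both key builders and the sample IDs below are hypothetical, just to show how the two layouts sort:

```python
# Layout discussed in the thread: low-cardinality bucket first.
# Good for prefix scans (e.g. MapReduce over one bucket), but all
# writes for one bucket are contiguous -> potential hotspot.
def key_prefix_first(bucket, item_id):
    return f"{bucket}:{item_id}"

# Reversed layout: high-cardinality field first. Writes spread out,
# but one bucket's rows interleave with every other bucket's rows.
def key_reversed(bucket, item_id):
    return f"{item_id}:{bucket}"

ids = ["zq81", "ab03", "mm47"]  # stand-ins for non-sequential IDs
prefix_first = sorted(key_prefix_first(b, i) for b in ("1", "2", "3") for i in ids)
print(prefix_first[:3])  # every bucket-1 key is adjacent in sort order
```

With `key_prefix_first`, selecting one bucket is a cheap contiguous range scan, which is exactly the property Eric wants to keep; with `key_reversed`, that selection becomes a full-table scan with a filter.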
> >> > >>>>
> >> > >>>> JM
> >> > >>>>
> >> > >>>> 2012/9/3, Eric Czech <e...@nextbigsound.com>:
> >> > >>>>> Hi everyone,
> >> > >>>>>
> >> > >>>>> I was curious whether or not I should expect any write hot spots
> >> > >>>>> if I structured my composite keys in a way such that the first
> >> > >>>>> field is a low-cardinality value (maybe 30 distinct values) and
> >> > >>>>> the next field contains a very high-cardinality value that would
> >> > >>>>> not be written sequentially.
> >> > >>>>>
> >> > >>>>> More concisely, I want to do this:
> >> > >>>>>
> >> > >>>>> Given one number between 1 and 30, write many millions of rows
> >> > >>>>> with keys like <number chosen> : <some generally distinct,
> >> > >>>>> non-sequential value>
> >> > >>>>>
> >> > >>>>> Would there be any problem with the millions of writes happening
> >> > >>>>> with the same first field key prefix even if the second field is
> >> > >>>>> largely unique?
> >> > >>>>>
> >> > >>>>> Thank you!
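The scenario in the original question can be simulated to confirm the thread's conclusion: with a fixed low-cardinality prefix, a unique second field does not spread writes at all. A sketch assuming a hypothetical table with three regions split at "2" and "3":

```python
import bisect
import uuid

# Hypothetical pre-split table: three regions with boundaries at "2" and "3".
splits = ["2", "3"]

def region_for(key):
    """Index of the region whose key range contains `key`."""
    return bisect.bisect_right(splits, key)

# Many writes sharing the prefix "2", each with a unique second field.
regions_hit = {region_for(f"2:{uuid.uuid4().hex}") for _ in range(10_000)}
print(regions_hit)  # a single region absorbs every write for prefix "2"
```

The uniqueness of the second field only distributes keys *within* the prefix's key range; until that range is split into enough regions (and those regions are balanced across servers), every write for one prefix hits one region server.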