What is the impact when a compaction runs on a large 20 GB region? Assuming the FS writes at 30 MB/s (over a single 1 GigE link), it will take about 1500 seconds to read and rewrite the region. Is the region out of service for those 25 minutes (= 1500 seconds)?
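As a rough sanity check on that estimate, here is a minimal back-of-the-envelope sketch in Python. It assumes the region's data crosses the same 30 MB/s link twice (once to read, once to write) with no overlap between the two; the function name and the `passes` parameter are made up for illustration and are not part of HBase. It only estimates data-transfer time and says nothing about whether the region is actually unavailable during the compaction.

```python
def compaction_seconds(region_gb, throughput_mb_s, passes=2):
    """Estimate seconds needed to push a region's data through a link.

    passes=2 is an assumption: the data is read once and written once
    over the same link, one after the other.
    """
    region_mb = region_gb * 1024  # GB -> MB
    return region_mb * passes / throughput_mb_s


if __name__ == "__main__":
    est = compaction_seconds(region_gb=20, throughput_mb_s=30)
    print(f"{est:.0f} s (~{est / 60:.0f} min)")
```

Under these assumptions the result is a little under 1400 seconds, which is in the same ballpark as the 1500-second figure above.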
On Fri, Feb 17, 2012 at 11:25 PM, Pan, Thomas <[email protected]> wrote:

>
> Jacques, thanks for the details on region size. We've observed that
> regions per region server could skew big time at the table level. We do
> have tool to balance regions. Still, it is sort of annoying to maintain
> the balance.
>
> $0.02,
> -Thomas
>
> On 2/17/12 2:46 PM, "Jacques" <[email protected]> wrote:
>
> >You should be fine having multiple tables with high region counts. I
> >would avoid making thousands of tables. However, if you have three
> >separate business needs, make three different tables.
> >
> >You seem to be starting with a perspective that there would be some
> >kind of issues with multiple tables. Why do you think this exists? You
> >said "Otherwise, runtime tuning seems to add quite amount of operational
> >cost." I'm not sure what you are thinking here and where your thoughts
> >are coming from. Additionally, if you have separate tables, then you can
> >modify them differently (e.g. setting them to different region sizes if
> >it makes sense-- for example, some of our tables have smaller region
> >sizes so we'll have more maps rather than fewer when we run map reduce
> >jobs).
> >
> >Regarding region size: the HTable v1 format in 0.90 and below suffered
> >from taking a long time to transition as individual regions got too big.
> >With 0.92 and HTablev2 that isn't as much of a problem as I understand
> >it. If I recall correctly, there are numerous organizations using 10gb
> >regions with sucess-- (among others, I believe this what Yahoo reported
> >they were using for their web crawl tables on their thousand node
> >cluster). While I haven't run any stats, I believe that there is
> >negligible scan performance impact as region size grows. There is
> >definitely no exponential negative performance impact.
> >
> >On Fri, Feb 17, 2012 at 10:55 AM, Pan, Thomas <[email protected]> wrote:
> >
> >> Vladimire and Jacques, Thanks for the information!
> >> Unless Hbase well handles multiple big sized tables (relatively high
> >> region count) in one cluster, it seems to me that one big table is the
> >> way to go. Otherwise, runtime tuning seems to add quite amount of
> >> operational cost. That leads to another question. Do we see big region
> >> size as an issue? If so, what's the pivot point as region size grows
> >> further, the scan performance starts to degrade exponentially?
> >>
> >> On 2/15/12 4:11 PM, "Vladimir Rodionov" <[email protected]> wrote:
> >>
> >> >10 tables are fine. 1000 are not, especially when one does table
> >> >pre-splitting to increase write perf.
> >> >
> >> >Too many regions kill HBase.
> >> >
> >> >Best regards,
> >> >Vladimir Rodionov
> >> >Principal Platform Engineer
> >> >Carrier IQ, www.carrieriq.com
> >> >e-mail: [email protected]
> >> >
> >> >________________________________________
> >> >From: Jacques [[email protected]]
> >> >Sent: Wednesday, February 15, 2012 3:45 PM
> >> >To: [email protected]
> >> >Subject: Re: Scan performance on a big table as combination of
> >> >multiple logic tables
> >> >
> >> >Out of curiosity, what do you perceive as the benefit to having only
> >> >one table? Are there reasons that you think one table would perform
> >> >better than a few?
> >> >
> >> >If you're splitting data within a table because you'd otherwise have
> >> >millions of tables, I understand that and would concur with Vladimir's
> >> >approach below. However, if you're really looking at 10 tables versus
> >> >one table, it seems like HBase is built exactly to make that work well
> >> >(rather than having to make all sorts of application level code to do
> >> >what HBase already does).
> >> >
> >> >thanks,
> >> >Jacques
> >> >
> >> >On Wed, Feb 15, 2012 at 1:57 PM, Pan, Thomas <[email protected]> wrote:
> >> >
> >> >> Since Hbase is tailored to handle one table very well, we are
> >> >> thinking to put multiple tables into one big table but on different
> >> >> column family sets. Our use case is full table scan against single
> >> >> column value filters. As records from different "logical tables" are
> >> >> at different column families, could we speed up the scan performance
> >> >> by simply checking the column family referenced by these single
> >> >> column value filters first before really going through all the
> >> >> underlying K-V pairs? It would be great if the Hbase code is already
> >> >> coded that way.
> >> >>
> >> >> $0.02,
> >> >> Thomas
> >> >
> >> >Confidentiality Notice: The information contained in this message,
> >> >including any attachments hereto, may be confidential and is intended
> >> >to be read only by the individual or entity to whom this message is
> >> >addressed. If the reader of this message is not the intended recipient
> >> >or an agent or designee of the intended recipient, please note that any
> >> >review, use, disclosure or distribution of this message or its
> >> >attachments, in any form, is strictly prohibited. If you have received
> >> >this message in error, please immediately notify the sender and/or
> >> >[email protected] and delete or destroy any copy of this
> >> >message and its attachments.
