Re: Scan performance on a big table as combination of multiple logic tables

M. C. Srivas Sun, 19 Feb 2012 08:38:39 -0800

What is the impact when a compaction happens on a large 20 GB region? Given
that the FS will write at about 30 MB/s (over a single 1 GigE link), it will
take about 1500 seconds to read and rewrite the region. Is the region out of
service for 25 minutes (= 1500 seconds) while that happens?
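
As a rough sanity check on the figures above, here is a minimal sketch of the
arithmetic. It assumes the compaction reads the whole region and then writes
it back at the same 30 MB/s, and that 1 GB = 1024 MB; the function name is
only illustrative. It says nothing about whether the region is actually
unavailable for that long.

```python
def compaction_seconds(region_gb, mb_per_s, passes=2):
    """Estimate the time to stream a region through the FS `passes` times.

    Assumption: throughput is a flat `mb_per_s` for both reads and writes.
    """
    region_mb = region_gb * 1024  # assumes 1 GB = 1024 MB
    return passes * region_mb / mb_per_s


one_pass = compaction_seconds(20, 30, passes=1)    # write only
read_write = compaction_seconds(20, 30, passes=2)  # read, then write

print(f"one pass:     {one_pass:.0f} s ({one_pass / 60:.1f} min)")
print(f"read + write: {read_write:.0f} s ({read_write / 60:.1f} min)")
```

Under these assumptions a single pass takes a bit over 11 minutes and a full
read plus write a bit under 23 minutes, which is in the same ballpark as the
"about 1500 seconds" estimate.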



On Fri, Feb 17, 2012 at 11:25 PM, Pan, Thomas <[email protected]> wrote:

>
> Jacques, thanks for the details on region size. We've observed that
> regions per region server could skew big time at the table level. We do
> have a tool to balance regions. Still, it is sort of annoying to maintain
> the balance. $0.02, -Thomas
>
> On 2/17/12 2:46 PM, "Jacques" <[email protected]> wrote:
>
> >You should be fine having multiple tables with high region counts.  I would
> >avoid making thousands of tables.  However, if you have three separate
> >business needs, make three different tables.
> >
> >You seem to be starting from the perspective that there would be some kind of
> >issue with multiple tables.  Why do you think this exists?  You said
> >"Otherwise, runtime tuning seems to add quite an amount of operational cost."
> >I'm not sure what you are thinking here or where your thoughts are coming
> >from.  Additionally, if you have separate tables, then you can tune them
> >differently (e.g. setting them to different region sizes if it makes
> >sense; for example, some of our tables have smaller region sizes so we'll
> >have more maps rather than fewer when we run map reduce jobs).
> >
> >Regarding region size: the HFile v1 format in 0.90 and below suffered from
> >taking a long time to transition as individual regions got too big.  With
> >0.92 and HFile v2 that isn't as much of a problem, as I understand it.  If I
> >recall correctly, there are numerous organizations using 10 GB regions with
> >success (among others, I believe this is what Yahoo reported they were using
> >for their web crawl tables on their thousand-node cluster).  While I
> >haven't run any stats, I believe that there is negligible scan performance
> >impact as region size grows.  There is definitely no exponential negative
> >performance impact.
> >
> >
> >
> >On Fri, Feb 17, 2012 at 10:55 AM, Pan, Thomas <[email protected]> wrote:
> >
> >>
> >> Vladimir and Jacques, thanks for the information! Unless HBase handles
> >> multiple big tables (with relatively high region counts) well in one
> >> cluster, it seems to me that one big table is the way to go. Otherwise,
> >> runtime tuning seems to add quite an amount of operational cost. That leads
> >> to another question. Do we see big region size as an issue? If so, what's
> >> the pivot point beyond which, as region size grows further, scan
> >> performance starts to degrade exponentially?
> >>
> >> On 2/15/12 4:11 PM, "Vladimir Rodionov" <[email protected]> wrote:
> >>
> >> >10 tables are fine. 1000 are not, especially when one does table
> >> >pre-splitting to increase write perf.
> >> >
> >> >Too many regions kill HBase.
> >> >
> >> >Best regards,
> >> >Vladimir Rodionov
> >> >Principal Platform Engineer
> >> >Carrier IQ, www.carrieriq.com
> >> >e-mail: [email protected]
> >> >
> >> >________________________________________
> >> >From: Jacques [[email protected]]
> >> >Sent: Wednesday, February 15, 2012 3:45 PM
> >> >To: [email protected]
> >> >Subject: Re: Scan performance on a big table as combination of multiple
> >> >logic tables
> >> >
> >> >Out of curiosity, what do you perceive as the benefit to having only one
> >> >table?  Are there reasons that you think one table would perform better
> >> >than a few?
> >> >
> >> >If you're splitting data within a table because you'd otherwise have
> >> >millions of tables, I understand that and would concur with Vladimir's
> >> >approach below.  However, if you're really looking at 10 tables versus one
> >> >table, it seems like HBase is built exactly to make that work well (rather
> >> >than having to write all sorts of application-level code to do what HBase
> >> >already does).
> >> >
> >> >thanks,
> >> >Jacques
> >> >
> >> >On Wed, Feb 15, 2012 at 1:57 PM, Pan, Thomas <[email protected]> wrote:
> >> >
> >> >>
> >> >> Since HBase is tailored to handle one table very well, we are thinking
> >> >> of putting multiple tables into one big table but in different column
> >> >> family sets. Our use case is full table scans with single column value
> >> >> filters. As records from different "logical tables" are in different
> >> >> column families, could we speed up scan performance by first checking
> >> >> the column family referenced by these single column value filters
> >> >> before really going through all the underlying K-V pairs? It would be
> >> >> great if the HBase code already works that way.
> >> >>
> >> >>
> >> >> $0.02,
> >> >> Thomas
> >> >>
> >> >>
> >> >
> >>
> >>
>
>
