Re: Smaller Region Size?

Jean-Daniel Cryans Thu, 24 Dec 2009 10:45:42 -0800

It will save you a lot of trouble, since by default the version of a
cell is set to System.currentTimeMillis() by the region server. Let's
say you delete a value, the region gets reassigned minutes later to
another region server whose clock is running 60 minutes in the past, and
then you do a Get on that cell with the default ts. This will translate to
a time before the previous delete, so you will get a deleted cell back
(unless a major compaction was run).
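
To make the scenario above concrete, here is a minimal sketch in plain Java.
It is a toy model and not HBase's actual read path: the class ToyCell and its
methods are hypothetical names, and the only assumption is that a read "as of"
a timestamp ignores any put or delete marker stamped later than that timestamp.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

public class ClockSkewSketch {

    // Hypothetical toy model of one cell: versions and a delete marker,
    // all stamped with millisecond timestamps. Not an HBase class.
    static final class ToyCell {
        private final TreeMap<Long, String> puts = new TreeMap<>();
        private Long deleteTs = null; // newest delete marker, if any

        void put(long ts, String value) {
            puts.put(ts, value);
        }

        void delete(long ts) {
            deleteTs = (deleteTs == null) ? ts : Math.max(deleteTs, ts);
        }

        // Read "as of" readTs: puts and delete markers stamped after
        // readTs are invisible to this read (assumption of the sketch).
        Optional<String> get(long readTs) {
            Map.Entry<Long, String> newest = puts.floorEntry(readTs);
            if (newest == null) {
                return Optional.empty();
            }
            boolean deleteVisible = deleteTs != null && deleteTs <= readTs;
            if (deleteVisible && newest.getKey() <= deleteTs) {
                return Optional.empty(); // masked by the delete marker
            }
            return Optional.of(newest.getValue());
        }
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        long hour = 60 * minute;
        long t0 = 1_261_680_000_000L; // arbitrary wall-clock instant

        ToyCell cell = new ToyCell();
        cell.put(t0 - 2 * hour, "v1"); // value written two hours earlier
        cell.delete(t0);               // delete stamped by the first server

        // Five minutes later, a server with a correct clock serves the Get.
        long correctNow = t0 + 5 * minute;
        System.out.println(cell.get(correctNow)); // prints Optional.empty

        // Same instant, but the new server's clock is 60 minutes behind.
        long skewedNow = correctNow - hour;
        System.out.println(cell.get(skewedNow));  // prints Optional[v1]
    }
}
```

The second read sees the old value again because, from the skewed server's
point of view, the delete has not happened yet.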


So a minor clock skew is OK, but more than 20 minutes is asking for
trouble. This requirement is documented in the Getting Started guide.
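
As a rough illustration of that rule of thumb, the check below flags a skew
larger than 20 minutes between two clock readings. The class and method names
are hypothetical, the threshold is just the figure from this thread and not an
HBase setting, and how the remote reading is obtained is left out.

```java
import java.time.Duration;

public class SkewCheck {

    // Threshold taken from the rule of thumb above; not an HBase config value.
    static final Duration MAX_SKEW = Duration.ofMinutes(20);

    // True if the two clock readings (epoch millis) differ by more than MAX_SKEW.
    static boolean skewTooLarge(long localMillis, long remoteMillis) {
        return Math.abs(localMillis - remoteMillis) > MAX_SKEW.toMillis();
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long sixtyMinutesBehind = now - Duration.ofMinutes(60).toMillis();
        long twoMinutesAhead = now + Duration.ofMinutes(2).toMillis();

        System.out.println(skewTooLarge(now, sixtyMinutesBehind)); // prints true
        System.out.println(skewTooLarge(now, twoMinutesAhead));    // prints false
    }
}
```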

J-D

On Thu, Dec 24, 2009 at 8:17 AM, Dhruba Borthakur <dhr...@gmail.com> wrote:
> Hi folks,
>
> Is it necessary to keep the clocks synchronized on all HBase region
> servers/master? I would appreciate it a lot if somebody can please explain
> if the HBase architecture depends on this fact.
>
> thanks,
> dhruba
>
>
> On Wed, Dec 23, 2009 at 9:57 AM, Mark Vigeant
> <mark.vige...@riskmetrics.com> wrote:
>
>> The clocks are all running in sync, though, shamefully, I am not using NTP. I
>> should.
>>
>> And no, I listed the errors backwards, that's not how they showed up in the
>> log, sorry, heh. I don't think they run backwards.
>>
>> -----Original Message-----
>> From: Andrew Purtell [mailto:apurt...@apache.org]
>> Sent: Wednesday, December 23, 2009 12:47 PM
>> To: hbase-user@hadoop.apache.org
>> Subject: Re: Smaller Region Size?
>>
>> How do you have clocks set up on your systems Mark? Are you using NTP to keep
>> them sane? Am I correct that they are sometimes running backward?
>>
>>
>>   - Andy
>>
>>
>>
>> ----- Original Message ----
>> > From: Mark Vigeant <mark.vige...@riskmetrics.com>
>> > To: "hbase-user@hadoop.apache.org" <hbase-user@hadoop.apache.org>
>> > Sent: Wed, December 23, 2009 9:09:04 AM
>> > Subject: RE: Smaller Region Size?
>> >
>> > > The biggest legitimate reason to run smaller region size is if your
>> > > data set is small (let's say 400 MB) but highly accessed, so you want a
>> > > good spread of regions across your cluster.
>> >
>> > That's exactly it, my input dataset was 500MB total (~1,000,000 rows) and it was
>> > getting stored as just one region on one regionserver.
>> >
>> > In response to St. Ack, I don't think my regions are performing too many splits:
>> > the regionserver logs mainly consist of the occasional ZooKeeper Connection
>> > error and these two repeatedly:
>> >
>> > 2009-12-22 15:21:50,415 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
>> > Cache Stats: Sizes: Total=6.556961MB (6875472), Free=792.61804MB (831120240),
>> > Max=799.175MB (837995712), Counts: Blocks=0, Access=25755, Hit=0, Miss=25755,
>> > Evictions=0, Evicted=0, Ratios: Hit Ratio=0.0%, Miss Ratio=100.0%,
>> > Evicted/Run=NaN
>> >
>> > 2009-12-22 15:20:35,073 DEBUG org.apache.hadoop.hbase.regionserver.Store:
>> > Skipping major compaction of Message because one (major) compacted file only and
>> > elapsedTime 339624149ms is < ttl=9223372036854775807
>> >
>> > You're suggesting the performance would be improved if the dataset was larger?
>> > What are other parameters that can be fine-tuned to optimize based off data
>> > size?
>> >
>> > Thanks
>> > -Mark
>> > -----Original Message-----
>> > From: Ryan Rawson [mailto:ryano...@gmail.com]
>> > Sent: Tuesday, December 22, 2009 11:28 PM
>> > To: hbase-user@hadoop.apache.org
>> > Subject: Re: Smaller Region Size?
>> >
>> > The biggest legitimate reason to run smaller region size is if your
>> > data set is small (let's say 400 MB) but highly accessed, so you want a
>> > good spread of regions across your cluster.
>> >
>> > Conversely, you might run a larger region size if you have a huge table
>> > and want to keep the absolute region count low. I am not 100% sold on this
>> > yet.
>> >
>> > I have a patch that keeps performance high on a heavily split
>> > table by using parallel puts. This has been proven to keep aggregate
>> > performance really high, and I hope it will make 0.20.3.
>> >
>> > On Tue, Dec 22, 2009 at 2:31 PM, stack wrote:
>> > > On Tue, Dec 22, 2009 at 8:57 AM, Mark Vigeant
>> > > wrote:
>> > >
>> > >> J-D,
>> > >>
>> > >> I noticed that performance for uploading data into tables got a lot better
>> > >> as I lowered the max file size -- but only up to a certain point, after
>> > >> which performance began slowing down again.
>> > >>
>> > >>
>> > > Tell us more.  What kinda size changes did you make?  How many regions were
>> > > created?  Is the slowdown because the table is splitting all the time?  If you
>> > > study regionserver logs, can you make out what the regionservers are
>> > > spending their time doing?
>> > >
>> > >
>> > >
>> > >> Is there a rule of thumb/formula/notion to rely on when setting this
>> > >> parameter for optimal performance? Thanks!
>> > >>
>> > >>
>> > > We have most experience running defaults.  Generally folks go up from the
>> > > default size because they want to host more data in about the same number of
>> > > regions.  Going down from the default I've not seen much of.
>> > >
>> > > St.Ack
>> > >
>> >
>> > This email message and any attachments are for the sole use of the intended
>> > recipients and may contain proprietary and/or confidential information which may
>> > be privileged or otherwise protected from disclosure. Any unauthorized review,
>> > use, disclosure or distribution is prohibited. If you are not an intended
>> > recipient, please contact the sender by reply email and destroy the original
>> > message and any copies of the message as well as any attachments to the original
>> > message.
>>
>>
>
>
>
> --
> Connect to me at http://www.facebook.com/dhruba
>
