Ok, got it. Thank you.

2015-05-25 7:58 GMT+03:00 lars hofhansl <la...@apache.org>:
> Re: blockingStoreFiles
> With LSM stores you do not get smooth behavior when you continuously try
> to pump more data into the cluster than the system can absorb.
> For a while the memstores can absorb the writes in RAM, then they need to
> flush. If compactions cannot keep up with the influx of new HFiles, you
> have two choices: (1) you allow the number of HFiles to grow at the
> expense of read performance, or (2) you tell the clients to slow down
> (there are various levels of sophistication in how you do that, but
> that's beside the point).
> blockingStoreFiles is the maximum number of files (per store, i.e. per
> column family) that HBase will allow to accumulate before it stops
> accepting writes from the clients. In 0.94 it would simply block for a
> while. In 0.98 it throws an exception back to the client to tell it to
> back off.
> -- Lars
>
> From: Serega Sheypak <serega.shey...@gmail.com>
> To: user <user@hbase.apache.org>; lars hofhansl <la...@apache.org>
> Sent: Sunday, May 24, 2015 12:59 PM
> Subject: Re: Optimizing compactions on super-low-cost HW
>
> Hi, thanks!
>
> > hbase.hstore.blockingStoreFiles
> Don't understand the idea of this setting, can I find an explanation for
> "dummies"?
>
> > hbase.hregion.majorcompaction
> done already
>
> > DATA_BLOCK_ENCODING, SNAPPY
> I always use it by default, CPU OK
>
> > memstore flush size
> done
>
> > I assume only the 300g partitions are mirrored, right? (not the entire
> > 2t drive)
> Aha
>
> > Can you add more machines?
> Will do it when we earn money.
> Thank you :)
>
> 2015-05-24 21:42 GMT+03:00 lars hofhansl <la...@apache.org>:
>
> > Yeah, all you can do is drive your write amplification down.
> >
> > As Stack said:
> > - Increase hbase.hstore.compactionThreshold, and
> > hbase.hstore.blockingStoreFiles. It'll hurt reads, but in your case
> > read is already significantly hurt when compactions happen.
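[The two settings discussed above live in hbase-site.xml. A minimal sketch; the values shown here are illustrative examples for this kind of cluster, not recommendations from the thread:]

```xml
<!-- hbase-site.xml: illustrative values only.
     hbase.hstore.compactionThreshold: minor compactions kick in once a
     store has this many storefiles.
     hbase.hstore.blockingStoreFiles: past this many storefiles per store,
     HBase blocks writes (0.94) or throws an exception back (0.98). -->
<property>
  <name>hbase.hstore.compactionThreshold</name>
  <value>8</value>
</property>
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>20</value>
</property>
```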
> > - Absolutely set hbase.hregion.majorcompaction to 1 week (with a jitter
> > of 1/2 week, that's the default in 0.98 and later). Minor compactions
> > will still happen, based on the compactionThreshold setting. Right now
> > you're rewriting _all_ your data _every_ day.
> >
> > - Turning off WAL writing will save you IO, but I doubt it'll help
> > much. I do not expect async WAL to help a lot as the aggregate IO is
> > still the same.
> >
> > - See if you can enable DATA_BLOCK_ENCODING on your column families
> > (FAST_DIFF or PREFIX are good). You can also try SNAPPY compression.
> > That would reduce your overall IO. (Since your CPUs are also weak you'd
> > have to test the CPU/IO tradeoff.)
> >
> > - If you have RAM to spare, increase the memstore flush size (will lead
> > to initially larger and fewer files).
> >
> > - Or (again if you have spare RAM) make your regions smaller, to curb
> > write amplification.
> >
> > - I assume only the 300g partitions are mirrored, right? (not the
> > entire 2t drive)
> >
> > I have some suggestions compiled here (if you don't mind the plug):
> > http://hadoop-hbase.blogspot.com/2015/05/my-hbasecon-talk-about-hbase.html
> >
> > Other than that, I'll repeat what others said: you have 14 extremely
> > weak machines, you can't expect the world from this.
> > Your aggregate IOPS are less than 3000, your aggregate IO bandwidth
> > ~3GB/s. Can you add more machines?
> >
> > -- Lars
> >
> > ________________________________
> > From: Serega Sheypak <serega.shey...@gmail.com>
> > To: user <user@hbase.apache.org>
> > Sent: Friday, May 22, 2015 3:45 AM
> > Subject: Re: Optimizing compactions on super-low-cost HW
> >
> > We don't have money, these nodes are the cheapest. I totally agree that
> > we need 4-6 HDDs, but there is no chance to get them, unfortunately.
> > Okay, I'll try to apply Stack's suggestions.
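[Lars's first bullets map onto hbase-site.xml settings. A sketch with illustrative values; the jitter property exists in 0.98+, and the 256 MB flush size is just an example of "increase if you have RAM to spare":]

```xml
<!-- hbase-site.xml: Lars's suggestions expressed as settings.
     Values are illustrative. -->
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>604800000</value> <!-- 1 week, in milliseconds -->
</property>
<property>
  <name>hbase.hregion.majorcompaction.jitter</name>
  <value>0.5</value> <!-- 0.98+: spreads majors over +/- half the period -->
</property>
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>268435456</value> <!-- 256 MB, up from the 128 MB default -->
</property>
```

[DATA_BLOCK_ENCODING and COMPRESSION are per-column-family table attributes rather than site-wide settings, so they are set via the shell's alter command instead.]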
> >
> > 2015-05-22 13:00 GMT+03:00 Michael Segel <michael_se...@hotmail.com>:
> >
> > > Look, to be blunt, you’re screwed.
> > >
> > > If I read your cluster spec, it sounds like you have a single i7
> > > (quad core) CPU. That’s 4 cores or 8 threads.
> > >
> > > Mirroring the OS is common practice.
> > > Using the same drives for Hadoop… not so good, but once the server
> > > boots up… not so much I/O.
> > > It’s not good, but you could live with it…
> > >
> > > Your best bet is to add a couple more spindles. Ideally you’d want to
> > > have 6 drives: the 2 OS drives mirrored and separate (use the extra
> > > space to stash / write logs), then 4 drives / spindles in JBOD for
> > > Hadoop. This brings you to 1:1 on physical cores. If your box can
> > > handle more spindles, then going to a total of 10 drives would
> > > improve performance further.
> > >
> > > However, you need to level-set your expectations… you can only go so
> > > far. If you have 4 drives spinning, you could start to saturate a
> > > 1GbE network, so that will hurt performance.
> > >
> > > That’s pretty much your only option in terms of fixing the hardware,
> > > and then you have to start tuning.
> > >
> > > > On May 21, 2015, at 4:04 PM, Stack <st...@duboce.net> wrote:
> > > >
> > > > On Thu, May 21, 2015 at 1:04 AM, Serega Sheypak
> > > > <serega.shey...@gmail.com> wrote:
> > > >
> > > >>> Do you have the system sharing
> > > >> There are 2 HDD 7200, 2TB each. There is a 300GB OS partition on
> > > >> each drive with mirroring enabled. I can't persuade devops that
> > > >> mirroring could cause IO issues. What arguments can I bring? They
> > > >> use OS partition mirroring so that when a disk fails, we can use
> > > >> the other partition to boot the OS and continue to work...
> > > >>
> > > > You are already compromised i/o-wise having two disks only.
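[Michael's point that 4 spinning drives can saturate a 1GbE link checks out with rough numbers. A sketch; the ~120 MB/s per-drive sequential throughput is an assumed ballpark for 7200rpm SATA, not a figure from the thread:]

```python
# Sanity check: aggregate sequential disk bandwidth vs. a 1GbE NIC.
DRIVE_MBPS = 120   # assumed sequential throughput of one 7200rpm drive
GIGE_MBPS = 125    # 1 Gbit/s ~= 125 MB/s

def disk_vs_network(num_drives, drive_mbps=DRIVE_MBPS, nic_mbps=GIGE_MBPS):
    """Return (aggregate disk MB/s, ratio of disk bandwidth to NIC bandwidth)."""
    disk = num_drives * drive_mbps
    return disk, disk / nic_mbps

disk, ratio = disk_vs_network(4)
print(disk, round(ratio, 1))  # ~480 MB/s of spindles behind a ~125 MB/s pipe
```

[With 4 drives the spindles can move roughly 3-4x what the NIC can, so replication and compaction traffic hits the network wall first.]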
> > > > I have not the experience to say for sure, but basic physics would
> > > > seem to dictate that having your two disks (partially) mirrored
> > > > compromises your i/o even more.
> > > >
> > > > You are in a bit of a hard place. Your operators want the machine
> > > > to boot even after it loses 50% of its disk.
> > > >
> > > >>> Do you have to compact? In other words, do you have read SLAs?
> > > >> Unfortunately, I have a mixed workload from web applications. I
> > > >> need to write and read, and the SLA is < 50ms.
> > > >>
> > > > Ok. You get the bit that seeks are about 10ms each, so with two
> > > > disks you can do 2x100 seeks a second presuming no one else is
> > > > using the disks.
> > > >
> > > >>> How are your read times currently?
> > > >> Cloudera Manager says it's 4K reads per second and 500 writes per
> > > >> second
> > > >>
> > > >>> Does your working dataset fit in RAM or do reads have to go to
> > > >>> disk?
> > > >> I have several tables of 500GB each and many small tables of 10-20
> > > >> GB. Small tables are loaded hourly/daily using bulkload (prepare
> > > >> HFiles using MR and move them to HBase using the utility). Big
> > > >> tables are used by webapps; they read and write them.
> > > >>
> > > > These hfiles are created on the same cluster with MR? (i.e. they
> > > > are using up i/os)
> > > >
> > > >>> It looks like you are running at about three storefiles per
> > > >>> column family
> > > >> is it hbase.hstore.compactionThreshold=3?
> > > >>
> > > >>> What if you upped the threshold at which minors run?
> > > >> you mean bump hbase.hstore.compactionThreshold to 8 or 10?
> > > >>
> > > > Yes.
> > > >
> > > > Downside is that your reads may require more seeks to find a
> > > > keyvalue.
> > > >
> > > > Can you cache more?
> > > >
> > > > Can you make it so files are bigger before you flush?
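[Stack's seek arithmetic, taken together with the 4K reads/s figure from Cloudera Manager, implies a minimum cache-hit ratio. The 10 ms/seek and 4000 reads/s numbers are from the thread; the algebra is my illustration:]

```python
# ~10 ms per seek => ~100 random reads/s per spindle (Stack's figure).
SEEK_MS = 10
DISKS = 2

disk_reads_per_sec = DISKS * (1000 // SEEK_MS)  # 2 x 100 = 200

# Cloudera Manager reports ~4000 reads/s per node; anything the disks
# cannot absorb must be served from block cache / OS page cache.
demand = 4000
min_cache_hit_ratio = 1 - disk_reads_per_sec / demand
print(disk_reads_per_sec, min_cache_hit_ratio)  # only 200 reads/s can hit disk
```

[So at least ~95% of reads must already be coming from cache, which is why "can you cache more?" and the working-set-in-RAM question matter so much here.]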
> > > >
> > > >>> Do you have a downtime during which you could schedule
> > > >>> compactions?
> > > >> Unfortunately no. It should work 24/7, and sometimes it doesn't.
> > > >>
> > > > So, it is running at full bore 24/7? There is no 'downtime'... a
> > > > time when the traffic is not so heavy?
> > > >
> > > >>> Are you managing the major compactions yourself or are you having
> > > >>> hbase do it for you?
> > > >> HBase, once a day, hbase.hregion.majorcompaction=1day
> > > >>
> > > > Have you studied your compactions? You realize that a major
> > > > compaction will do a full rewrite of your dataset? When they run,
> > > > how many storefiles are there?
> > > >
> > > > Do you have to run once a day? Can you not run once a week? Can you
> > > > manage the compactions yourself... and run them a region at a time
> > > > in a rolling manner across the cluster rather than have them just
> > > > run whenever it suits them once a day?
> > > >
> > > >> I can disable WAL. It's ok to lose some data in case of RS
> > > >> failure. I'm not doing banking transactions.
> > > >> If I disable WAL, could it help?
> > > >>
> > > > It could, but don't. Enable deferred sync'ing first if you can
> > > > 'lose' some data.
> > > >
> > > > Work on your flushing and compactions before you mess w/ the WAL.
> > > >
> > > > What version of hbase are you on? You say CDH, but the newer your
> > > > hbase, the better it does generally.
> > > >
> > > > St.Ack
> > > >
> > > >> 2015-05-20 18:04 GMT+03:00 Stack <st...@duboce.net>:
> > > >>
> > > >>> On Mon, May 18, 2015 at 4:26 PM, Serega Sheypak
> > > >>> <serega.shey...@gmail.com> wrote:
> > > >>>
> > > >>>> Hi, we are using extremely cheap HW:
> > > >>>> 2 HDD 7200
> > > >>>> 4*2 core (Hyperthreading)
> > > >>>> 32GB RAM
> > > >>>>
> > > >>>> We met serious IO performance issues.
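[Stack's "manage the compactions yourself" alternative above is usually implemented by turning off time-based majors and triggering them externally. A sketch; the scheduling side (e.g. cron driving the shell's major_compact command, table by table or region by region, during the least-busy hours) is assumed, not shown:]

```xml
<!-- hbase-site.xml: disable time-based major compactions; an external
     script then issues major_compact calls in a rolling manner. -->
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value> <!-- 0 = never trigger a major compaction on a timer -->
</property>
```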
> > > >>>> We have a more or less even distribution of read/write
> > > >>>> requests. The same for data size.
> > > >>>>
> > > >>>> ServerName                             Req/s  Read Request Count  Write Request Count
> > > >>>> node01.domain.com,60020,1430172017193  195    171871826           16761699
> > > >>>> node02.domain.com,60020,1426925053570  24     34314930            16006603
> > > >>>> node03.domain.com,60020,1430860939797  22     32054801            16913299
> > > >>>> node04.domain.com,60020,1431975656065  33     1765121             253405
> > > >>>> node05.domain.com,60020,1430484646409  27     42248883            16406280
> > > >>>> node07.domain.com,60020,1426776403757  27     36324492            16299432
> > > >>>> node08.domain.com,60020,1426775898757  26     38507165            13582109
> > > >>>> node09.domain.com,60020,1430440612531  27     34360873            15080194
> > > >>>> node11.domain.com,60020,1431989669340  28     44307               13466
> > > >>>> node12.domain.com,60020,1431927604238  30     5318096             2020855
> > > >>>> node13.domain.com,60020,1431372874221  29     31764957            15843688
> > > >>>> node14.domain.com,60020,1429640630771  41     36300097            13049801
> > > >>>>
> > > >>>> ServerName                             Stores  Storefiles  Storefile Size Uncompressed  Storefile Size  Index Size  Bloom Size
> > > >>>> node01.domain.com,60020,1430172017193  82      186         1052080m                     76496mb         641849k     310111k
> > > >>>> node02.domain.com,60020,1426925053570  82      179         1062730m                     79713mb         649610k     318854k
> > > >>>> node03.domain.com,60020,1430860939797  82      179         1036597m                     76199mb         627346k     307136k
> > > >>>> node04.domain.com,60020,1431975656065  82      400         1034624m                     76405mb         655954k     289316k
> > > >>>> node05.domain.com,60020,1430484646409  82      185         1111807m                     81474mb         688136k     334127k
> > > >>>> node07.domain.com,60020,1426776403757  82      164         1023217m                     74830mb         631774k     296169k
> > > >>>> node08.domain.com,60020,1426775898757  81      171         1086446m                     79933mb         681486k     312325k
> > > >>>> node09.domain.com,60020,1430440612531  81      160         1073852m                     77874mb         658924k     309734k
> > > >>>> node11.domain.com,60020,1431989669340  81      166         1006322m                     75652mb         664753k     264081k
> > > >>>> node12.domain.com,60020,1431927604238  82      188         1050229m                     75140mb         652970k     304137k
> > > >>>> node13.domain.com,60020,1431372874221  82      178         937557m                      70042mb         601684k     257607k
> > > >>>> node14.domain.com,60020,1429640630771  82      145         949090m                      69749mb         592812k     266677k
> > > >>>>
> > > >>>> When compaction starts, a random node goes to 100% I/O, with io
> > > >>>> waits of seconds, even tens of seconds.
> > > >>>>
> > > >>>> What are the approaches to optimizing minor and major
> > > >>>> compactions when you are I/O bound..?
> > > >>>
> > > >>> Yeah, with two disks, you will be crimped. Do you have the system
> > > >>> sharing with hbase/hdfs or is hdfs running on one disk only?
> > > >>>
> > > >>> Do you have to compact? In other words, do you have read SLAs?
> > > >>> How are your read times currently?
> > > >>> Does your working dataset fit in RAM or do reads have to go to
> > > >>> disk? It looks like you are running at about three storefiles per
> > > >>> column family. What if you upped the threshold at which minors
> > > >>> run? Do you have a downtime during which you could schedule
> > > >>> compactions? Are you managing the major compactions yourself or
> > > >>> are you having hbase do it for you?
> > > >>>
> > > >>> St.Ack
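[Lars's remark that the cluster is "rewriting _all_ your data _every_ day" follows directly from the status table (roughly 1 TB of uncompressed storefile data per node) and the daily major-compaction period. The arithmetic below is my illustration of his point:]

```python
# A major compaction rewrites every storefile in the region/store. With a
# daily period, a week of operation rewrites the full dataset 7 times;
# with a weekly period, once. Using the ~1 TB-per-node figure from the
# status table above:
storefile_gb = 1000  # approx. uncompressed storefile data per node

rewritten_per_week_daily = storefile_gb * 7   # daily majors
rewritten_per_week_weekly = storefile_gb * 1  # weekly majors
print(rewritten_per_week_daily // rewritten_per_week_weekly)
```

[Moving from daily to weekly majors cuts that particular IO load by 7x, which is the single biggest lever the thread identifies for an IO-bound cluster.]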