There were several curious things we observed. On the region servers, there were abnormally more reads than writes:

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             608.00      6552.00         0.00       6552          0
sdb             345.00      2692.00     78868.00       2692      78868
sdc             406.00     14548.00     63960.00      14548      63960
sdd               2.00         0.00        32.00          0         32
sde              62.00      8764.00         0.00       8764          0
sdf             498.00     11100.00        32.00      11100         32
sdg            2080.00     11712.00         0.00      11712          0
sdh             109.00      5072.00         0.00       5072          0
sdi             158.00         4.00     32228.00          4      32228
sdj              43.00      5648.00        32.00       5648         32
sdk             255.00      3784.00         0.00       3784          0
sdl              86.00      1412.00      9176.00       1412       9176
In the CDH region server dashboard, the Average Disk IOPS for writes was stable at 735/s, while reads spiked from 900/s to 5000/s every 5 minutes. iotop showed the following processes were eating the most I/O:

 6447 be/4 hdfs     2.70 M/s   0.00 B/s   0.00 %  94.54 %  du -sk /data/12/dfs/dn/curre~632-10.1.1.100-1457937043486
 6023 be/4 hdfs     2.54 M/s   0.00 B/s   0.00 %  92.14 %  du -sk /data/9/dfs/dn/curren~632-10.1.1.100-1457937043486
 6186 be/4 hdfs  1379.58 K/s   0.00 B/s   0.00 %  90.78 %  du -sk /data/11/dfs/dn/curre~632-10.1.1.100-1457937043486

What was all this reading for? And what are those du -sk processes? Could this be what's slowing down the write throughput?

On Tue, Mar 21, 2017 at 7:48 PM, Hef <hef.onl...@gmail.com> wrote:

> Hi guys,
> Thanks for all your hints.
> Let me summarize the tuning I have done these days.
> Initially, before tuning, the HBase cluster worked at an average of 400k
> write tps (600k tps at max). The total network TX throughput from the
> clients (aggregated over multiple servers) to the RegionServers showed
> 300Mb/s on average.
>
> I adopted the following steps for tuning:
> 1. Optimized the HBase schema for our table, reducing the cell size by
> 40%.
> Result:
> failed, no obvious tps increase
>
> 2. Recreated the table with a more even distribution of the pre-split
> keyspace.
> Result:
> failed, no obvious tps increase
>
> 3. Adjusted the RS GC strategy:
> Before:
> -XX:+UseParNewGC
> -XX:+UseConcMarkSweepGC
> -XX:CMSInitiatingOccupancyFraction=70
> -XX:+CMSParallelRemarkEnabled
> -Xmx100g
> -Xms100g
> -Xmn20g
>
> After:
> -XX:+UseG1GC
> -XX:+UnlockExperimentalVMOptions
> -XX:MaxGCPauseMillis=50
> -XX:-OmitStackTraceInFastThrow
> -XX:ParallelGCThreads=18
> -XX:+ParallelRefProcEnabled
> -XX:+PerfDisableSharedMem
> -XX:-ResizePLAB
> -XX:G1NewSizePercent=8
> -Xms100G -Xmx100G
> -XX:MaxTenuringThreshold=1
> -XX:G1HeapWastePercent=10
> -XX:G1MixedGCCountTarget=16
> -XX:G1HeapRegionSize=32M
>
> Result:
> Success. GC pause time reduced, tps increased by at least 10%
>
> 4. Upgraded to CDH 5.9.1 HBase 1.2, and also updated the client lib to
> HBase 1.2.
> Success:
> 1. total client TX throughput raised to 700Mb/s
> 2. HBase write tps raised to 600k/s on average and 800k/s at max
>
> 5. Other configurations (the WAL settings are spelled out in the sketch
> below this message):
> hbase.hstore.compactionThreshold = 10
> hbase.hstore.blockingStoreFiles = 300
> hbase.hstore.compaction.max = 20
> hbase.regionserver.thread.compaction.small = 30
>
> hbase.hregion.memstore.flush.size = 128
> hbase.regionserver.global.memstore.lowerLimit = 0.3
> hbase.regionserver.global.memstore.upperLimit = 0.7
>
> hbase.regionserver.maxlogs = 100
> hbase.wal.regiongrouping.numgroups = 5
> hbase.wal.provider = Multiple HDFS WAL
>
>
> Summary:
> 1. HBase 1.2 does have better performance than 1.0.
> 2. 300k/s tps per RegionServer still doesn't look satisfactory, as I can
> see the CPU/network/IO/memory still have a lot of idle resources.
> Per RS:
> 1. CPU 50% used (not sure why CPU usage is so high for only 300K write
> requests)
> 2. JVM heap, 40% used
> 3. total disk throughput over 12 HDDs: 91MB/s on write and 40MB/s
> on read
> 4. Network in/out 560Mb/s on a 1G NIC
>
>
> Further questions:
> Has anyone dealt with a similar heavy-write scenario like this?
> How many concurrent writes can a RegionServer handle? Can anyone share
> the maximum tps your RS can reach?
>
> Thanks
> Hef
>
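For concreteness, here is a minimal sketch of the multi-WAL settings Hef lists under item 5, written out as explicit key/value pairs. It assumes CDH's "Multiple HDFS WAL" option corresponds to the stock multiwal provider; these are RegionServer-side hbase-site.xml properties, shown in Java only to make the exact keys unambiguous:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class MultiWalSettings {
        public static void main(String[] args) {
            // Server-side properties; in a real deployment these live in
            // hbase-site.xml, not in client code.
            Configuration conf = HBaseConfiguration.create();
            conf.set("hbase.wal.provider", "multiwal");           // CDH's "Multiple HDFS WAL"
            conf.setInt("hbase.wal.regiongrouping.numgroups", 5); // 5 WAL pipelines per RS
            conf.setInt("hbase.regionserver.maxlogs", 100);
            // Print the WAL-related keys we just set, for inspection.
            for (java.util.Map.Entry<String, String> e : conf) {
                if (e.getKey().startsWith("hbase.wal.")) {
                    System.out.println(e.getKey() + " = " + e.getValue());
                }
            }
        }
    }

With the multiwal provider, each group of regions appends to its own WAL file, so WAL syncs can proceed in parallel across disks instead of serializing on a single writer.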
> On Sat, Mar 18, 2017 at 1:11 PM, Yu Li <car...@gmail.com> wrote:
>
>> First, please try out Stack's suggestions, all good ones.
>>
>> And some supplements: since all the disks in use are HDDs with ordinary
>> IO capability, it's important to control big-IO operations like flush
>> and compaction. Try these features out:
>> 1. HBASE-8329 <https://issues.apache.org/jira/browse/HBASE-8329>: Limit
>> compaction speed (available in 1.1.0+)
>> 2. HBASE-14969 <https://issues.apache.org/jira/browse/HBASE-14969>: Add
>> throughput controller for flush (available in 1.3.0)
>> 3. HBASE-10201 <https://issues.apache.org/jira/browse/HBASE-10201>: Per
>> column family flush (available in 1.1.0+)
>> * HBASE-14906 <https://issues.apache.org/jira/browse/HBASE-14906>:
>> Improvements on FlushLargeStoresPolicy (only available in 2.0, not
>> released yet)
>>
>> Also try out multiple WAL; we observed a ~20% write perf boost in prod.
>> See more details in the doc attached to the JIRA below:
>> - HBASE-14457 <https://issues.apache.org/jira/browse/HBASE-14457>:
>> Umbrella: Improve Multiple WAL for production usage
>>
>> And please note that if you decide to pick up a branch-1.1 release, make
>> sure to use 1.1.3+, or you may hit a perf regression on writes; see
>> HBASE-14460 <https://issues.apache.org/jira/browse/HBASE-14460> for
>> more details.
>>
>> Hope this information helps.
>>
>> Best Regards,
>> Yu
>>
>> On 18 March 2017 at 05:51, Vladimir Rodionov <vladrodio...@gmail.com>
>> wrote:
>>
>> > >> In my opinion, 1M/s input data will result in only 70MByte/s write
>> >
>> > Times 3 (default HDFS replication factor). Plus ...
>> >
>> > Do not forget about compaction read/write amplification. If you flush
>> > 10 MB and your max region size is 10 GB, with the default min files to
>> > compact (3), your amplification is 6-7. That gives us 70 x 3 x 6 =
>> > 1260 MB/s of read/write traffic, or 210 MB/s of reads and 210 MB/s of
>> > writes per RS.
>> >
>> > This IO load is way above sustainable.
>> >
>> > -Vlad
>> >
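Spelling Vlad's arithmetic out, as a sketch taking the figures in his message at face value (~70 MB/s of flushed data cluster-wide, HDFS replication factor 3, and each cell rewritten roughly log base 3 of region size over flush size times, since every compaction merges at least 3 files):

\[
70\ \text{MB/s} \times 3 \times \log_3\!\frac{10\ \text{GB}}{10\ \text{MB}}
\approx 70 \times 3 \times 6 = 1260\ \text{MB/s},
\qquad
\frac{1260\ \text{MB/s}}{6\ \text{RS}} \approx 210\ \text{MB/s per RS}
\]

with compaction generating a matching stream of reads and writes, which is how Vlad arrives at ~210 MB/s in each direction per RegionServer.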
>> > On Fri, Mar 17, 2017 at 2:14 PM, Kevin O'Dell <ke...@rocana.com> wrote:
>> >
>> > > Hey Hef,
>> > >
>> > > What is the memstore size setting (how much heap is it allowed) that
>> > > you have on that cluster? What is your region count per node? Are you
>> > > writing evenly across all those regions, or are only a few regions
>> > > active per region server at a time? Can you paste the GC settings you
>> > > are currently using?
>> > >
>> > > On Fri, Mar 17, 2017 at 3:30 PM, Stack <st...@duboce.net> wrote:
>> > >
>> > > > On Fri, Mar 17, 2017 at 9:31 AM, Hef <hef.onl...@gmail.com> wrote:
>> > > >
>> > > > > Hi group,
>> > > > > I'm using HBase to store a large amount of time series data, and
>> > > > > the use case is heavy on writes rather than reads. My application
>> > > > > tops out at 600k write requests per second and I can't tune it up
>> > > > > for better tps.
>> > > > >
>> > > > > Hardware:
>> > > > > I have 6 Region Servers, each with 128G memory, 12 HDDs, and
>> > > > > 2 cores with 24 threads.
>> > > > >
>> > > > > Schema:
>> > > > > The schema for these time series data is similar to OpenTSDB: the
>> > > > > data points of the same metric within an hour are stored in one
>> > > > > row, so there can be at most 3600 columns per row.
>> > > > > A cell is about 70 bytes in size, including the rowkey, column
>> > > > > qualifier, column family and value.
>> > > > >
>> > > > > HBase config:
>> > > > > CDH 5.6 HBase 1.0.0
>> > > > >
>> > > >
>> > > > Can you upgrade? There's a big diff between 1.2 and 1.0.
>> > > >
>> > > >
>> > > > > 100G memory for each RegionServer
>> > > > > hbase.hstore.compactionThreshold = 50
>> > > > > hbase.hstore.blockingStoreFiles = 100
>> > > > > hbase.hregion.majorcompaction disabled
>> > > > > hbase.client.write.buffer = 20MB
>> > > > > hbase.regionserver.handler.count = 100
>> > > > >
>> > > >
>> > > > Could try halving the handler count.
>> > > >
>> > > >
>> > > > > hbase.hregion.memstore.flush.size = 128MB
>> > > > >
>> > > >
>> > > > Why are you flushing? If it is because you are hitting this flush
>> > > > limit, can you try upping it?
>> > > >
>> > > >
>> > > > > HBase Client:
>> > > > > write via BufferedMutator with 100000/batch (a sketch of this
>> > > > > write path appears at the end of the thread)
>> > > > >
>> > > > > Input volumes:
>> > > > > The input data throughput is more than 2 million/sec from Kafka.
>> > > > >
>> > > >
>> > > > How is the distribution? Evenly over the keyspace?
>> > > >
>> > > >
>> > > > > My writer applications are distributed; however I scaled them up,
>> > > > > the total write throughput won't get larger than 600K/sec.
>> > > > >
>> > > >
>> > > > Tell us more about this scaling up? How many writers?
>> > > >
>> > > >
>> > > > > The servers have 20% CPU usage and 5.6 wa.
>> > > > >
>> > > >
>> > > > 5.6 is high enough. Is the I/O spread over the disks?
>> > > >
>> > > >
>> > > > > GC doesn't look good though; it shows a lot of 10s+ pauses.
>> > > > >
>> > > >
>> > > > What settings do you have?
>> > > >
>> > > >
>> > > > > In my opinion, 1M/s input data will result in only 70MByte/s write
>> > > > > throughput to the cluster, which is quite a small amount compared
>> > > > > to the 6 region servers. The performance should not be this bad.
>> > > > >
>> > > > > Does anybody have an idea why the performance stops at 600K/s?
>> > > > > Is there anything I have to tune to increase the HBase write
>> > > > > throughput?
>> > > > >
>> > > >
>> > > > If you double the clients writing, do you see an increase in
>> > > > throughput?
>> > > >
>> > > > If you thread dump the servers, can you tell where they are held
>> > > > up? Or whether they are doing any work at all?
>> > > >
>> > > > St.Ack
>> > > >
>> > >
>> > >
>> > > --
>> > > Kevin O'Dell
>> > > Field Engineer
>> > > 850-496-1298 | ke...@rocana.com
>> > > @kevinrodell
>> > > <http://www.rocana.com>
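For reference, a minimal sketch of a BufferedMutator-based writer like the one Hef describes above. The table name "metrics", column family "t", and the rowkey/qualifier encoding are illustrative assumptions in the OpenTSDB style, not Hef's actual schema:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.BufferedMutator;
    import org.apache.hadoop.hbase.client.BufferedMutatorParams;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MetricWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            BufferedMutatorParams params =
                new BufferedMutatorParams(TableName.valueOf("metrics"))
                    .writeBufferSize(20L * 1024 * 1024); // hbase.client.write.buffer = 20MB
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 BufferedMutator mutator = conn.getBufferedMutator(params)) {
                // One row per metric-hour, one column per data point,
                // roughly 70 bytes per cell as in the original question.
                long hourBase = 1489766400L; // epoch seconds, top of the hour (example value)
                byte[] row = Bytes.add(Bytes.toBytes("metric.foo"), Bytes.toBytes(hourBase));
                Put put = new Put(row);
                put.addColumn(Bytes.toBytes("t"),          // column family
                              Bytes.toBytes((short) 1234), // seconds offset within the hour
                              Bytes.toBytes(42.0d));       // data point value
                mutator.mutate(put); // buffered client-side; flushed when the buffer fills
            } // close() flushes any remaining buffered mutations
        }
    }

mutate() returns as soon as the Put is buffered, so client throughput is governed by how fast flushes drain to the RegionServers; watching how long mutate() blocks once the buffer is full is one way to tell whether the bottleneck is client-side or server-side.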