I repeated this test with pe --filterAll and the results were revealing, at
least for this case. I also patched in a thread-local hash map of atomic
counters that I could update from code paths in SQM, StoreScanner,
HFileReader*, and HFileBlock. Because an RPC is processed by a single
handler thread, I could update counters and accumulate micro-timings via
System#nanoTime() per RPC, then dump them from CallRunner in some new trace
logging. I spent a couple of days making sure the instrumentation was
placed equivalently in both the 1.6 and 2.2 code bases and was producing
consistent results. I can provide these patches upon request.
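
For the curious, the counter plumbing looked roughly like the sketch below.
This is a minimal reconstruction rather than the actual patch: the class and
method names here are mine, and the real hooks live in SQM, StoreScanner,
HFileReader*, HFileBlock, and CallRunner.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.atomic.AtomicLong;

    public final class ScanMetrics {
        // One map per handler thread. This is safe because a single handler
        // thread processes a given RPC from start to finish.
        private static final ThreadLocal<Map<String, AtomicLong>> COUNTERS =
            ThreadLocal.withInitial(HashMap::new);

        public static void increment(String name, long delta) {
            COUNTERS.get().computeIfAbsent(name, k -> new AtomicLong())
                .addAndGet(delta);
        }

        // Call sites bracket a code path with System#nanoTime() and report
        // the elapsed nanos, e.g. around a reseek in StoreScanner:
        //   long t0 = System.nanoTime();
        //   ... do the reseek ...
        //   ScanMetrics.increment("store_reseek", 1);
        //   ScanMetrics.increment("store_reseek_ns", System.nanoTime() - t0);

        // Called once per RPC (from CallRunner in the real patch) to emit
        // the accumulated counters as trace logging and reset for the next
        // call on this handler thread.
        public static Map<String, AtomicLong> dumpAndReset() {
            Map<String, AtomicLong> snapshot = COUNTERS.get();
            COUNTERS.remove();
            return snapshot;
        }
    }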

Again, the test tables have one family and 1, 5, 10, 20, 50, or 100 distinct
column qualifiers per row. After loading each table I made a snapshot and
cloned the snapshot for testing, for both 1.6 and 2.2, so both versions were
tested against the exact same data files on HDFS. I also used the 1.6
version of PE for both, so the only change is on the server side (1.6 vs 2.2
masters and regionservers).
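
The snapshot/clone step can be done from the shell or through the Admin API.
A minimal sketch of the latter, with illustrative table and snapshot names:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class CloneForTest {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {
                // Snapshot the loaded table once, then clone the snapshot,
                // so every run scans exactly the same HFiles on HDFS.
                admin.snapshot("TestTable_f1_c20_snap",
                    TableName.valueOf("TestTable_f1_c20"));
                admin.cloneSnapshot("TestTable_f1_c20_snap",
                    TableName.valueOf("TestTable_f1_c20_clone"));
            }
        }
    }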

It appears a refactor of ScanQueryMatcher and friends has disabled the
ability of filters to provide SKIP hints. That prevents us from bypassing
version checking (so there is some extra cost in SQM itself), and it also
appears to defeat an optimization that avoids reseeking, leading to a
serious regression, proportional to the number of columns, in reseek
activity and in time spent in that code path. So for queries that use
filters, there can be a substantial regression.
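
For context, pe --filterAll installs a filter that rejects everything. The
essential shape is sketched below (modeled on PE's FilterAllFilter, minus
serialization): the filter answers SKIP for every cell, which is the
filter_hint_skip stream in the counts further down. In 1.6 that SKIP flows
through SQM unchanged; in 2.2 it appears to be translated into seek hints
instead, which is where the reseeks come from.

    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.filter.FilterBase;

    // Hint SKIP for every cell. In 1.6 this surfaces as sqm_hint_skip; in
    // 2.2 the matcher appears to run the version check anyway and emits
    // INCLUDE_AND_SEEK_NEXT_COL / _ROW hints, driving the extra reseeks.
    public class SkipEverythingFilter extends FilterBase {
        @Override
        public ReturnCode filterKeyValue(Cell cell) {
            return ReturnCode.SKIP;
        }
    }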

Other test cases that did not use filters did not show a regression.

A test case where I used ROW_INDEX_V1 encoding showed an expected, modest,
proportional regression in seek time, because that encoding is optimized
for point queries rather than for the full table scan case.
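
For reference, the encoding is selected per column family; a sketch using
the 2.x descriptor builder API, with illustrative table and family names:

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.TableDescriptor;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RowIndexTable {
        // Same table shape as the other tests, but with ROW_INDEX_V1 in
        // place of FAST_DIFF. Names here are illustrative.
        static TableDescriptor rowIndexDescriptor() {
            return TableDescriptorBuilder
                .newBuilder(TableName.valueOf("TestTable_rowindex"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("info"))
                    .setDataBlockEncoding(DataBlockEncoding.ROW_INDEX_V1)
                    .build())
                .build();
        }
    }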

I will come back here when I understand this better.

Here are the results for the pe --filterAll case:


Counts (cells are 1.6.0 / 2.2.5 / 2.2.5 as % of 1.6.0, for c1 | c5 | c10 |
c20 | c50 | c100; ">!" flags a blow-up where the 1.6.0 baseline is near
zero, so a percentage is not meaningful)

rpcs: 1/2/200% | 2/6/300% | 2/10/500% | 3/17/567% | 4/37/925% | 8/72/900%
  (the rpc increase is expected in 2.x: better heartbeating)
block_reads: 11507/11508/100% | 57255/57257/100% | 114471/114474/100% | 230372/230377/100% | 578292/578298/100% | 1157955/1157963/100%
block_unpacks: 11507/11508/100% | 57255/57257/100% | 114471/114474/100% | 230372/230377/100% | 578292/578298/100% | 1157955/1157963/100%
seeker_next: 10000000/10000000/100% | 50000000/50000000/100% | 100000000/100000000/100% | 200000000/200000000/100% | 500000000/500000000/100% | 1000000000/1000000000/100%
store_next: 10000000/9988268/100% | 50000000/49940082/100% | 100000000/99879401/100% | 200000000/199766539/100% | 500000000/499414653/100% | 1000000000/998836518/100%
store_reseek: 1/11733/>! | 2/59924/>! | 8/120607/>! | 6/233467/>! | 10/585357/>! | 8/1163490/>!
cells_matched: 20000000/20000000/100% | 60000000/60000000/100% | 110000000/110000000/100% | 210000000/210000000/100% | 510000000/510000000/100% | 1010000000/1010000000/100%
column_hint_include: 10000000/10000000/100% | 50000000/50000000/100% | 100000000/100000000/100% | 200000000/200000000/100% | 500000000/500000000/100% | 1000000000/1000000000/100%
filter_hint_skip: 10000000/10000000/100% | 50000000/50000000/100% | 100000000/100000000/100% | 200000000/200000000/100% | 500000000/500000000/100% | 1000000000/1000000000/100%
sqm_hint_done: 9999999/9999999/100% | 9999998/9999998/100% | 9999998/9999998/100% | 9999997/9999997/100% | 9999996/9999996/100% | 9999992/9999992/100%
sqm_hint_seek_next_col: 1/0/0% | 2/40000002/>! | 2/90000002/>! | 3/190000003/>! | 4/490000004/>! | 8/990000008/>!
sqm_hint_seek_next_row: 0/10000000/>! | 2/10000000/>! | 0/10000000/>! | 0/10000000/>! | 0/10000000/>! | 0/10000000/>!
sqm_hint_skip: 10000000/0/0% | 50000000/0/0% | 100000000/0/0% | 200000000/0/0% | 500000000/0/0% | 1000000000/0/0%
versions_hint_include_and_seek_next_col: 0/0/0% | 0/40000000/>! | 0/90000000/>! | 0/190000000/>! | 0/490000000/>! | 0/990000000/>!
versions_hint_include_and_seek_next_row: 0/0/0% | 0/10000000/>! | 0/10000000/>! | 0/10000000/>! | 0/10000000/>! | 0/10000000/>!

Times (in ms; same layout: c1 | c5 | c10 | c20 | c50 | c100, each cell
1.6.0 / 2.2.5 / %)

block_read_ms: 174.12/215.96/124% | 683.78/883.02/129% | 1724.66/1937.87/112% | 3199.93/3470.92/108% | 7582.96/7941.45/105% | 14701.52/16022.57/109%
block_unpack_ms: 0.63/0.62/99% | 2.35/2.35/100% | 4.36/4.56/105% | 9.10/8.30/91% | 21.59/22.01/102% | 40.97/40.93/100%
seeker_next_ms: 1073.45/1053.51/98% | 4452.52/4728.96/106% | 9359.61/9706.06/104% | 18879.28/18330.93/97% | 47664.95/46404.26/97% | 96511.83/92121.30/95%
store_next_ms: 2651.11/2423.27/91% | 12048.64/11398.63/95% | 24833.68/23586.20/95% | 48493.26/46574.38/96% | 120786.00/115603.00/96% | 240627.00/227945.00/95%
store_reseek_ms: 2.71/62.17/2297% | 10.93/233.67/2139% | 10.88/401.88/3693% | 13.98/783.18/5601% | 22.26/1939.28/8714% | 24.32/4065.25/16714%

Test parameters (all cases): Rows=10000000, ValueSize=20,
BlockEncoding=FAST_DIFF, Compress=NONE, Filter=ALL

On Thu, Jun 11, 2020 at 5:08 PM Andrew Purtell <apurt...@apache.org> wrote:

> I used PE to generate 10M row tables with one family with either 1, 10,
> 20, 50, or 100 values per row (unique column-qualifiers). An increase in
> wall clock time was noticeable, for example:
>
> 1.6.0
>
> time ./bin/hbase pe --rows=5000000 --table=TestTable_f1_c20 --columns=20
> --nomapred scan 2
> real 1m20.179s
> user 0m45.470s
> sys 0m16.572s
>
> 2.2.5
>
> time ./bin/hbase pe --rows=5000000 --table=TestTable_f1_c20 --columns=20
> --nomapred scan 2
> real 1m31.016s
> user 0m48.658s
> sys 0m16.965s
>
> It didn't really make a difference whether I used 1 thread or 4 or 10; the delta
> was about the same, proportionally. I picked two threads in the end so I'd
> have enough time to launch async-profiler twice in another shell, to
> capture flamegraph and call tree output, respectively. async-profiler
> captured 10 seconds at steady state per test case. Upon first inspection
> what jumps out is an increasing proportion of CPU time spent in GC in 2.2.5
> vs 1.6.0. The difference increases as the number of columns increases.
> There is little apparent difference at 1 column, but a 2x or more
> difference at 20 columns, and a 10x or more difference at 100 columns,
> eyeballing the charts, flipping back and forth between browser
> windows. This seems more than coincidental but obviously calls for capture
> and analysis of GC trace, with JFR. Will do that next.
>
> JVM:
> openjdk version "1.8.0_232"
> OpenJDK Runtime Environment (Zulu 8.42.0.21-CA-macosx) (build 1.8.0_232-b18)
> OpenJDK 64-Bit Server VM (Zulu 8.42.0.21-CA-macosx) (build 25.232-b18, mixed mode)
>
> Regionserver JVM flags: -Xms10g -Xmx10g -XX:+UseG1GC -XX:+AlwaysPreTouch
> -XX:+UseNUMA -XX:-UseBiasedLocking -XX:+ParallelRefProcEnabled
>
>
> On Thu, Jun 11, 2020 at 7:06 AM Jan Van Besien <ja...@ngdata.com> wrote:
>
>> This is promising, thanks a lot. Testing with hbase 2.2.5 shows an
>> improvement, but we're not there yet.
>>
>> As reported earlier, hbase 2.1.0 was about 60% slower than hbase 1.2.0
>> in a test that simply scans all the regions in parallel without any
>> filter. A test with hbase 2.2.5 shows it to be about 40% slower than
>> 1.2.0. So that is better than 2.1.0, but still substantially slower
>> than what hbase 1.2.0 was.
>>
>> As before, I tested this both on a 3 node cluster as well as with a
>> unit test using HBaseTestingUtility. Both tests show very similar
>> relative differences.
>>
>> Jan
>>
>> On Thu, Jun 11, 2020 at 2:16 PM Anoop John <anoop.hb...@gmail.com> wrote:
>> >
>> > In another mail thread Zheng Hu brought up an important Jira fix:
>> > https://issues.apache.org/jira/browse/HBASE-21657
>> > Can you please check with this once?
>> >
>> > Anoop
>> >
>> >
>> > On Tue, Jun 9, 2020 at 8:08 PM Jan Van Besien <ja...@ngdata.com> wrote:
>> >
>> > > On Sun, Jun 7, 2020 at 7:49 AM Anoop John <anoop.hb...@gmail.com> wrote:
>> > > > As per the above configs, it looks like Bucket Cache is not being used.
>> > > > Only on heap LRU cache in use.
>> > >
>> > > True (but it is large enough to hold everything, so I don't think it
>> > > matters).
>> > >
>> > > > @Jan - Is it possible for you to test with an off-heap Bucket Cache?
>> > > > Configure bucket cache in off-heap mode with size ~7.5 GB.
>> > >
>> > > I did a quick test but it seems not to make a measurable difference.
>> > > If anything, it is actually even slightly slower. I see a 100% hit
>> > > ratio in the L1 LruBlockCache and effectively also 100% in the L2
>> > > BucketCache (the hit ratio is not yet at 100%, but hits increase with
>> > > every test and misses do not).
>> > >
>> > > Given that the LruBlockCache was already large enough to cache all the
>> > > data anyway, I did not expect this to help either, to be honest.
>> > >
>> > > > Do you have any DataBlockEncoding enabled on the CF?
>> > >
>> > > Yes, FAST_DIFF. But this is of course true in both the tests with
>> > > hbase2 and hbase1.
>> > >
>> > > Jan
>> > >
>>
>
>
> --
> Best regards,
> Andrew
>
> Words like orphans lost among the crosstalk, meaning torn from truth's
> decrepit hands
>    - A23, Crosstalk
>


-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk
