Related, I think I found a bug in branch-1 where we don’t heartbeat in the filter-all case until we switch store files, so scanning a very large store file might time out with client defaults. Remarking on this here so I don’t forget to follow up.
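To make the failure mode concrete, here is a toy model of the timing. All names and numbers below are illustrative, not HBase code (the real time-limit check lives in ScannerContext): if the server only checks the scan time limit when it switches store files, a filter-all scan over one huge store file never sends a heartbeat before the client's timeout fires.

```java
// Toy model of scan heartbeat timing. Each scanned cell costs one "tick";
// the client gives up after timeoutTicks. A heartbeat (partial response)
// only goes out when the server's time-limit check actually runs.
public class HeartbeatSketch {

    // Returns true if a heartbeat reaches the client before it times out.
    // ticksPerCheck models how often the server checks the time limit:
    // small = check every batch of cells; huge = check only at store file switch.
    static boolean heartbeatBeforeTimeout(long totalCells, long ticksPerCheck,
                                          long timeoutTicks) {
        for (long cell = 1; cell <= totalCells; cell++) {
            if (cell >= timeoutTicks) {
                return false; // client timed out before any check fired
            }
            if (cell % ticksPerCheck == 0) {
                return true; // time-limit check ran; heartbeat sent in time
            }
        }
        return true; // scan finished before the timeout
    }

    public static void main(String[] args) {
        // Checking every 1000 cells: heartbeat lands well before a 100k-tick timeout.
        System.out.println(heartbeatBeforeTimeout(1_000_000, 1_000, 100_000));
        // Checking only at a 500k-cell store file boundary: client times out first.
        System.out.println(heartbeatBeforeTimeout(1_000_000, 500_000, 100_000));
    }
}
```

The numbers are arbitrary; the point is only that the heartbeat interval is bounded by how often the check runs, not by how fast cells are filtered.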
> On Jun 25, 2020, at 12:27 PM, Andrew Purtell <apurt...@apache.org> wrote:
>
> I repeated this test with pe --filterAll and the results were revealing, at least for this case. I also patched in a thread-local hash map of atomic counters that I could update from code paths in SQM, StoreScanner, HFileReader*, and HFileBlock. Because an RPC is processed by a single handler thread, I could update counters and accumulate micro-timings via System#nanoTime() per RPC and dump them out of CallRunner in some new trace logging. I spent a couple of days making sure the instrumentation was placed equivalently in both the 1.6 and 2.2 code bases and was producing consistent results. I can provide these patches upon request.
>
> Again, the test tables have one family and 1, 5, 10, 20, 50, and 100 distinct column-qualifiers per row. After loading each table I made a snapshot and cloned the snapshot for testing, for both 1.6 and 2.2, so both versions were tested using the exact same data files on HDFS. I also used the 1.6 version of PE for both, so the only change is on the server (1.6 vs 2.2 masters and regionservers).
>
> It appears a refactor to ScanQueryMatcher and friends has disabled the ability of filters to provide SKIP hints, which prevents us from bypassing version checking (so some extra cost in SQM), and appears to disable an optimization that avoids reseeking, leading to a serious and proportional regression in reseek activity and time spent in that code path. So for queries that use filters, there can be a substantial regression.
>
> Other test cases that did not use filters did not show a regression.
>
> A test case where I used ROW_INDEX_V1 encoding showed an expected modest proportional regression in seeking time, due to the fact that it is optimized for point queries and not for the full table scan case.
>
> I will come back here when I understand this better.
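A minimal sketch of the instrumentation idea described above, relying on the same one-RPC-per-handler-thread property. The class and method names (ScanCounters, inc, snapshotAndReset) are mine for illustration, not HBase's:

```java
import java.util.HashMap;
import java.util.Map;

public class ScanCounters {
    // One map per handler thread: no synchronization needed, because a single
    // RPC is processed start-to-finish by one handler thread.
    private static final ThreadLocal<Map<String, Long>> COUNTERS =
            ThreadLocal.withInitial(HashMap::new);

    // Called from instrumented code paths (SQM, StoreScanner, HFileReader, ...).
    static void inc(String name, long delta) {
        COUNTERS.get().merge(name, delta, Long::sum);
    }

    // Called from the equivalent of CallRunner when the RPC completes:
    // dump this thread's counters into trace logging and reset them.
    static Map<String, Long> snapshotAndReset() {
        Map<String, Long> snap = new HashMap<>(COUNTERS.get());
        COUNTERS.get().clear();
        return snap;
    }

    public static void main(String[] args) {
        // Simulate one "RPC": count iterations and accumulate nanoTime spent.
        long t0 = System.nanoTime();
        long cells = 0;
        for (int i = 0; i < 1_000_000; i++) {
            cells++; // stand-in for a StoreScanner.next() call
        }
        inc("store_next", cells);
        inc("store_next_ns", System.nanoTime() - t0);
        System.out.println("per-RPC counters: " + snapshotAndReset());
    }
}
```

Because counters are thread-confined, the hot paths pay only a ThreadLocal lookup and a map update, which keeps the instrumentation cheap enough to compare micro-timings between code bases.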
> Here are the results for the pe --filterAll case. Each cell is "1.6.0 value -> 2.2.5 value (2.2.5 as % of 1.6.0)"; "!" marks an increase too large for a percentage to be meaningful. The note "(better heartbeating)" against the rpcs row explains the higher RPC counts in 2.2.5.
>
> Counts:
> rpcs: c1: 1 -> 2 (200%), c5: 2 -> 6 (300%), c10: 2 -> 10 (500%), c20: 3 -> 17 (567%), c50: 4 -> 37 (925%), c100: 8 -> 72 (900%) (better heartbeating)
> block_reads: c1: 11507 -> 11508 (100%), c5: 57255 -> 57257 (100%), c10: 114471 -> 114474 (100%), c20: 230372 -> 230377 (100%), c50: 578292 -> 578298 (100%), c100: 1157955 -> 1157963 (100%)
> block_unpacks: c1: 11507 -> 11508 (100%), c5: 57255 -> 57257 (100%), c10: 114471 -> 114474 (100%), c20: 230372 -> 230377 (100%), c50: 578292 -> 578298 (100%), c100: 1157955 -> 1157963 (100%)
> seeker_next: c1: 10000000 -> 10000000 (100%), c5: 50000000 -> 50000000 (100%), c10: 100000000 -> 100000000 (100%), c20: 200000000 -> 200000000 (100%), c50: 500000000 -> 500000000 (100%), c100: 1000000000 -> 1000000000 (100%)
> store_next: c1: 10000000 -> 9988268 (100%), c5: 50000000 -> 49940082 (100%), c10: 100000000 -> 99879401 (100%), c20: 200000000 -> 199766539 (100%), c50: 500000000 -> 499414653 (100%), c100: 1000000000 -> 998836518 (100%)
> store_reseek: c1: 1 -> 11733 (!), c5: 2 -> 59924 (!), c10: 8 -> 120607 (!), c20: 6 -> 233467 (!), c50: 10 -> 585357 (!), c100: 8 -> 1163490 (!)
> cells_matched: c1: 20000000 -> 20000000 (100%), c5: 60000000 -> 60000000 (100%), c10: 110000000 -> 110000000 (100%), c20: 210000000 -> 210000000 (100%), c50: 510000000 -> 510000000 (100%), c100: 1010000000 -> 1010000000 (100%)
> column_hint_include: c1: 10000000 -> 10000000 (100%), c5: 50000000 -> 50000000 (100%), c10: 100000000 -> 100000000 (100%), c20: 200000000 -> 200000000 (100%), c50: 500000000 -> 500000000 (100%), c100: 1000000000 -> 1000000000 (100%)
> filter_hint_skip: c1: 10000000 -> 10000000 (100%), c5: 50000000 -> 50000000 (100%), c10: 100000000 -> 100000000 (100%), c20: 200000000 -> 200000000 (100%), c50: 500000000 -> 500000000 (100%), c100: 1000000000 -> 1000000000 (100%)
> sqm_hint_done: c1: 9999999 -> 9999999 (100%), c5: 9999998 -> 9999998 (100%), c10: 9999998 -> 9999998 (100%), c20: 9999997 -> 9999997 (100%), c50: 9999996 -> 9999996 (100%), c100: 9999992 -> 9999992 (100%)
> sqm_hint_seek_next_col: c1: 1 -> 0 (0%), c5: 2 -> 40000002 (!), c10: 2 -> 90000002 (!), c20: 3 -> 190000003 (!), c50: 4 -> 490000004 (!), c100: 8 -> 990000008 (!)
> sqm_hint_seek_next_row: c1: 0 -> 10000000 (!), c5: 2 -> 10000000 (!), c10: 0 -> 10000000 (!), c20: 0 -> 10000000 (!), c50: 0 -> 10000000 (!), c100: 0 -> 10000000 (!)
> sqm_hint_skip: c1: 10000000 -> 0 (0%), c5: 50000000 -> 0 (0%), c10: 100000000 -> 0 (0%), c20: 200000000 -> 0 (0%), c50: 500000000 -> 0 (0%), c100: 1000000000 -> 0 (0%)
> versions_hint_include_and_seek_next_col: c1: 0 -> 0 (0%), c5: 0 -> 40000000 (!), c10: 0 -> 90000000 (!), c20: 0 -> 190000000 (!), c50: 0 -> 490000000 (!), c100: 0 -> 990000000 (!)
> versions_hint_include_and_seek_next_row: c1: 0 -> 0 (0%), c5: 0 -> 10000000 (!), c10: 0 -> 10000000 (!), c20: 0 -> 10000000 (!), c50: 0 -> 10000000 (!), c100: 0 -> 10000000 (!)
>
> Times:
> block_read_ms: c1: 174.12 -> 215.96 (124%), c5: 683.78 -> 883.02 (129%), c10: 1724.66 -> 1937.87 (112%), c20: 3199.93 -> 3470.92 (108%), c50: 7582.96 -> 7941.45 (105%), c100: 14701.52 -> 16022.57 (109%)
> block_unpack_ms: c1: 0.63 -> 0.62 (99%), c5: 2.35 -> 2.35 (100%), c10: 4.36 -> 4.56 (105%), c20: 9.10 -> 8.30 (91%), c50: 21.59 -> 22.01 (102%), c100: 40.97 -> 40.93 (100%)
> seeker_next_ms: c1: 1073.45 -> 1053.51 (98%), c5: 4452.52 -> 4728.96 (106%), c10: 9359.61 -> 9706.06 (104%), c20: 18879.28 -> 18330.93 (97%), c50: 47664.95 -> 46404.26 (97%), c100: 96511.83 -> 92121.30 (95%)
> store_next_ms: c1: 2651.11 -> 2423.27 (91%), c5: 12048.64 -> 11398.63 (95%), c10: 24833.68 -> 23586.20 (95%), c20: 48493.26 -> 46574.38 (96%), c50: 120786.00 -> 115603.00 (96%), c100: 240627.00 -> 227945.00 (95%)
> store_reseek_ms: c1: 2.71 -> 62.17 (2297%), c5: 10.93 -> 233.67 (2139%), c10: 10.88 -> 401.88 (3693%), c20: 13.98 -> 783.18 (5601%), c50: 22.26 -> 1939.28 (8714%), c100: 24.32 -> 4065.25 (16714%)
>
> Test parameters: Rows=10000000, ValueSize=20, BlockEncoding=FAST_DIFF, Compress=NONE, Filter=ALL
>
> On Thu, Jun 11, 2020 at 5:08 PM Andrew Purtell <apurt...@apache.org> wrote:
>> I used PE to generate 10M row tables with one family with either 1, 10, 20, 50, or 100 values per row (unique column-qualifiers).
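The sqm_hint_* and versions_hint_* counts in the table above follow a simple pattern: when SQM honors the filter's SKIP, every cell yields one skip hint; when it instead falls through to the version check, each row yields (columns - 1) include-and-seek-next-col hints plus one include-and-seek-next-row hint. A toy simulation (my own code and naming, not HBase's) reproduces the counts above, modulo a handful of store-file-boundary cells:

```java
public class HintModel {
    // Simulates a filter-all scan over `rows` rows with `cols` qualifiers each.
    // Returns {skipHints, includeAndSeekNextColHints, includeAndSeekNextRowHints}.
    static long[] simulate(long rows, int cols, boolean filterSkipHonored) {
        long skip = 0, seekCol = 0, seekRow = 0;
        for (long r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                if (filterSkipHonored) {
                    skip++;            // SQM passes the filter's SKIP straight through
                } else if (c < cols - 1) {
                    seekCol++;         // version check: seek to the next column
                } else {
                    seekRow++;         // last qualifier: seek to the next row
                }
            }
        }
        return new long[] {skip, seekCol, seekRow};
    }

    public static void main(String[] args) {
        long[] v16 = simulate(10_000_000, 5, true);   // 1.6-style behavior
        long[] v22 = simulate(10_000_000, 5, false);  // 2.2-style behavior
        System.out.printf("1.6-style: skip=%d seekCol=%d seekRow=%d%n",
                v16[0], v16[1], v16[2]);
        System.out.printf("2.2-style: skip=%d seekCol=%d seekRow=%d%n",
                v22[0], v22[1], v22[2]);
    }
}
```

For the c5 case this gives 50,000,000 skip hints in the 1.6-style path, versus 40,000,000 seek-next-col and 10,000,000 seek-next-row hints in the 2.2-style path, matching the measured counters to within a few boundary cells. It also shows why the regression is proportional to the number of qualifiers per row: every seek hint is a candidate reseek.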
>> An increase in wall clock time was noticeable, for example:
>>
>> 1.6.0
>>
>> time ./bin/hbase pe --rows=5000000 --table=TestTable_f1_c20 --columns=20 --nomapred scan 2
>> real 1m20.179s
>> user 0m45.470s
>> sys 0m16.572s
>>
>> 2.2.5
>>
>> time ./bin/hbase pe --rows=5000000 --table=TestTable_f1_c20 --columns=20 --nomapred scan 2
>> real 1m31.016s
>> user 0m48.658s
>> sys 0m16.965s
>>
>> It didn't really make a difference whether I used 1 thread or 4 or 10; the delta was about the same, proportionally. I picked two threads in the end so I'd have enough time to launch async-profiler twice in another shell, to capture flamegraph and call tree output, respectively. async-profiler captured 10 seconds at steady state per test case. Upon first inspection, what jumps out is an increasing proportion of CPU time spent in GC in 2.2.5 vs 1.6.0. The difference grows as the number of columns per row increases: there is little apparent difference at 1 column, but a 2x or more difference at 20 columns, and a 10x or more difference at 100 columns, eyeballing the charts and flipping back and forth between browser windows. This seems more than coincidental but obviously calls for capture and analysis of a GC trace, with JFR. Will do that next.
>>
>> JVM: openjdk version "1.8.0_232" OpenJDK Runtime Environment (Zulu 8.42.0.21-CA-macosx) (build 1.8.0_232-b18) OpenJDK 64-Bit Server VM (Zulu 8.42.0.21-CA-macosx) (build 25.232-b18, mixed mode)
>>
>> Regionserver JVM flags: -Xms10g -Xmx10g -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+UseNUMA -XX:-UseBiasedLocking -XX:+ParallelRefProcEnabled
>>
>> On Thu, Jun 11, 2020 at 7:06 AM Jan Van Besien <ja...@ngdata.com> wrote:
>>> This is promising, thanks a lot. Testing with hbase 2.2.5 shows an improvement, but we're not there yet.
>>>
>>> As reported earlier, hbase 2.1.0 was about 60% slower than hbase 1.2.0 in a test that simply scans all the regions in parallel without any filter.
>>> A test with hbase 2.2.5 shows it to be about 40% slower than 1.2.0. So that is better than 2.1.0, but still substantially slower than hbase 1.2.0 was.
>>>
>>> As before, I tested this both on a 3-node cluster and with a unit test using HBaseTestingUtility. Both tests show very similar relative differences.
>>>
>>> Jan
>>>
>>> On Thu, Jun 11, 2020 at 2:16 PM Anoop John <anoop.hb...@gmail.com> wrote:
>>> >
>>> > In another mail thread Zheng Hu brought up an important Jira fix:
>>> > https://issues.apache.org/jira/browse/HBASE-21657
>>> > Can you please check with this once?
>>> >
>>> > Anoop
>>> >
>>> > On Tue, Jun 9, 2020 at 8:08 PM Jan Van Besien <ja...@ngdata.com> wrote:
>>> >
>>> > > On Sun, Jun 7, 2020 at 7:49 AM Anoop John <anoop.hb...@gmail.com> wrote:
>>> > > > As per the above configs, it looks like the Bucket Cache is not being used; only the on-heap LRU cache is in use.
>>> > >
>>> > > True (but it is large enough to hold everything, so I don't think it matters).
>>> > >
>>> > > > @Jan - Is it possible for you to test with an off-heap Bucket Cache? Configure bucket cache in off-heap mode with size ~7.5 GB.
>>> > >
>>> > > I did a quick test but it seems not to make a measurable difference. If anything, it is actually slightly slower. I see a 100% hit ratio in the L1 LruBlockCache and effectively also 100% in the L2 BucketCache (the hit ratio is not yet at 100%, but hits increase with every test and misses do not).
>>> > >
>>> > > Given that the LruBlockCache was already large enough to cache all the data anyway, I did not expect this to help either, to be honest.
>>> > >
>>> > > > Do you have any DataBlockEncoding enabled on the CF?
>>> > >
>>> > > Yes, FAST_DIFF. But this is of course true in both the tests with hbase2 and hbase1.
>>> > >
>>> > > Jan
>>
>> --
>> Best regards,
>> Andrew
>>
>> Words like orphans lost among the crosstalk, meaning torn from truth's decrepit hands
>> - A23, Crosstalk