Related, I think I found a bug in branch-1 where we don’t heartbeat in the filter-all case until we switch store files, so scanning a very large store file might time out with client defaults. Remarking on this here so I don’t forget to follow up.
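To make the failure mode concrete, here is a toy model of the timing. All names and numbers below are illustrative, not HBase code (the real time-limit check lives in ScannerContext): if the server only checks the scan time limit when it switches store files, a filter-all scan over one huge store file never sends a heartbeat before the client's timeout fires.

```java
// Toy model of scan heartbeat timing. Each scanned cell costs one "tick";
// the client gives up after timeoutTicks. A heartbeat (partial response)
// only goes out when the server's time-limit check actually runs.
public class HeartbeatSketch {

    // Returns true if a heartbeat reaches the client before it times out.
    // ticksPerCheck models how often the server checks the time limit:
    // small = check every batch of cells; huge = check only at store file switch.
    static boolean heartbeatBeforeTimeout(long totalCells, long ticksPerCheck,
                                          long timeoutTicks) {
        for (long cell = 1; cell <= totalCells; cell++) {
            if (cell >= timeoutTicks) {
                return false; // client timed out before any check fired
            }
            if (cell % ticksPerCheck == 0) {
                return true; // time-limit check ran; heartbeat sent in time
            }
        }
        return true; // scan finished before the timeout
    }

    public static void main(String[] args) {
        // Checking every 1000 cells: heartbeat lands well before a 100k-tick timeout.
        System.out.println(heartbeatBeforeTimeout(1_000_000, 1_000, 100_000));
        // Checking only at a 500k-cell store file boundary: client times out first.
        System.out.println(heartbeatBeforeTimeout(1_000_000, 500_000, 100_000));
    }
}
```

The numbers are arbitrary; the point is only that the heartbeat interval is bounded by how often the check runs, not by how fast cells are filtered.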
> On Jun 25, 2020, at 12:27 PM, Andrew Purtell <apurt...@apache.org> wrote:
>
> I repeated this test with pe --filterAll and the results were revealing, at least for this case. I also patched in a thread-local hash map of atomic counters that I could update from code paths in SQM, StoreScanner, HFileReader*, and HFileBlock. Because an RPC is processed by a single handler thread, I could update counters and accumulate micro-timings via System#nanoTime() per RPC and dump them out of CallRunner in some new trace logging. I spent a couple of days making sure the instrumentation was placed equivalently in both the 1.6 and 2.2 code bases and was producing consistent results. I can provide these patches upon request.
>
> Again, the test tables have one family and 1, 5, 10, 20, 50, and 100 distinct column-qualifiers per row. After loading each table I made a snapshot and cloned the snapshot for testing, for both 1.6 and 2.2, so both versions were tested using the exact same data files on HDFS. I also used the 1.6 version of PE for both, so the only change is on the server (1.6 vs 2.2 masters and regionservers).
>
> It appears a refactor to ScanQueryMatcher and friends has disabled the ability of filters to provide SKIP hints, which prevents us from bypassing version checking (so some extra cost in SQM), and appears to disable an optimization that avoids reseeking, leading to a serious and proportional regression in reseek activity and time spent in that code path. So for queries that use filters, there can be a substantial regression.
>
> Other test cases that did not use filters did not show a regression.
>
> A test case where I used ROW_INDEX_V1 encoding showed an expected modest proportional regression in seeking time, due to the fact that it is optimized for point queries and not for the full table scan case.
>
> I will come back here when I understand this better.
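A minimal sketch of the instrumentation idea described above, relying on the same one-RPC-per-handler-thread property. The class and method names (ScanCounters, inc, snapshotAndReset) are mine for illustration, not HBase's:

```java
import java.util.HashMap;
import java.util.Map;

public class ScanCounters {
    // One map per handler thread: no synchronization needed, because a single
    // RPC is processed start-to-finish by one handler thread.
    private static final ThreadLocal<Map<String, Long>> COUNTERS =
            ThreadLocal.withInitial(HashMap::new);

    // Called from instrumented code paths (SQM, StoreScanner, HFileReader, ...).
    static void inc(String name, long delta) {
        COUNTERS.get().merge(name, delta, Long::sum);
    }

    // Called from the equivalent of CallRunner when the RPC completes:
    // dump this thread's counters into trace logging and reset them.
    static Map<String, Long> snapshotAndReset() {
        Map<String, Long> snap = new HashMap<>(COUNTERS.get());
        COUNTERS.get().clear();
        return snap;
    }

    public static void main(String[] args) {
        // Simulate one "RPC": count iterations and accumulate nanoTime spent.
        long t0 = System.nanoTime();
        long cells = 0;
        for (int i = 0; i < 1_000_000; i++) {
            cells++; // stand-in for a StoreScanner.next() call
        }
        inc("store_next", cells);
        inc("store_next_ns", System.nanoTime() - t0);
        System.out.println("per-RPC counters: " + snapshotAndReset());
    }
}
```

Because counters are thread-confined, the hot paths pay only a ThreadLocal lookup and a map update, which keeps the instrumentation cheap enough to compare micro-timings between code bases.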
> Here are the results for the pe --filterAll case. Each cell is "1.6.0 value -> 2.2.5 value (2.2.5 as % of 1.6.0)"; "!" marks an increase too large for a percentage to be meaningful. The note "(better heartbeating)" against the rpcs row explains the higher RPC counts in 2.2.5.
>
> Counts:
> rpcs: c1: 1 -> 2 (200%), c5: 2 -> 6 (300%), c10: 2 -> 10 (500%), c20: 3 -> 17 (567%), c50: 4 -> 37 (925%), c100: 8 -> 72 (900%) (better heartbeating)
> block_reads: c1: 11507 -> 11508 (100%), c5: 57255 -> 57257 (100%), c10: 114471 -> 114474 (100%), c20: 230372 -> 230377 (100%), c50: 578292 -> 578298 (100%), c100: 1157955 -> 1157963 (100%)
> block_unpacks: c1: 11507 -> 11508 (100%), c5: 57255 -> 57257 (100%), c10: 114471 -> 114474 (100%), c20: 230372 -> 230377 (100%), c50: 578292 -> 578298 (100%), c100: 1157955 -> 1157963 (100%)
> seeker_next: c1: 10000000 -> 10000000 (100%), c5: 50000000 -> 50000000 (100%), c10: 100000000 -> 100000000 (100%), c20: 200000000 -> 200000000 (100%), c50: 500000000 -> 500000000 (100%), c100: 1000000000 -> 1000000000 (100%)
> store_next: c1: 10000000 -> 9988268 (100%), c5: 50000000 -> 49940082 (100%), c10: 100000000 -> 99879401 (100%), c20: 200000000 -> 199766539 (100%), c50: 500000000 -> 499414653 (100%), c100: 1000000000 -> 998836518 (100%)
> store_reseek: c1: 1 -> 11733 (!), c5: 2 -> 59924 (!), c10: 8 -> 120607 (!), c20: 6 -> 233467 (!), c50: 10 -> 585357 (!), c100: 8 -> 1163490 (!)
> cells_matched: c1: 20000000 -> 20000000 (100%), c5: 60000000 -> 60000000 (100%), c10: 110000000 -> 110000000 (100%), c20: 210000000 -> 210000000 (100%), c50: 510000000 -> 510000000 (100%), c100: 1010000000 -> 1010000000 (100%)
> column_hint_include: c1: 10000000 -> 10000000 (100%), c5: 50000000 -> 50000000 (100%), c10: 100000000 -> 100000000 (100%), c20: 200000000 -> 200000000 (100%), c50: 500000000 -> 500000000 (100%), c100: 1000000000 -> 1000000000 (100%)
> filter_hint_skip: c1: 10000000 -> 10000000 (100%), c5: 50000000 -> 50000000 (100%), c10: 100000000 -> 100000000 (100%), c20: 200000000 -> 200000000 (100%), c50: 500000000 -> 500000000 (100%), c100: 1000000000 -> 1000000000 (100%)
> sqm_hint_done: c1: 9999999 -> 9999999 (100%), c5: 9999998 -> 9999998 (100%), c10: 9999998 -> 9999998 (100%), c20: 9999997 -> 9999997 (100%), c50: 9999996 -> 9999996 (100%), c100: 9999992 -> 9999992 (100%)
> sqm_hint_seek_next_col: c1: 1 -> 0 (0%), c5: 2 -> 40000002 (!), c10: 2 -> 90000002 (!), c20: 3 -> 190000003 (!), c50: 4 -> 490000004 (!), c100: 8 -> 990000008 (!)
> sqm_hint_seek_next_row: c1: 0 -> 10000000 (!), c5: 2 -> 10000000 (!), c10: 0 -> 10000000 (!), c20: 0 -> 10000000 (!), c50: 0 -> 10000000 (!), c100: 0 -> 10000000 (!)
> sqm_hint_skip: c1: 10000000 -> 0 (0%), c5: 50000000 -> 0 (0%), c10: 100000000 -> 0 (0%), c20: 200000000 -> 0 (0%), c50: 500000000 -> 0 (0%), c100: 1000000000 -> 0 (0%)
> versions_hint_include_and_seek_next_col: c1: 0 -> 0 (0%), c5: 0 -> 40000000 (!), c10: 0 -> 90000000 (!), c20: 0 -> 190000000 (!), c50: 0 -> 490000000 (!), c100: 0 -> 990000000 (!)
> versions_hint_include_and_seek_next_row: c1: 0 -> 0 (0%), c5: 0 -> 10000000 (!), c10: 0 -> 10000000 (!), c20: 0 -> 10000000 (!), c50: 0 -> 10000000 (!), c100: 0 -> 10000000 (!)
>
> Times:
> block_read_ms: c1: 174.12 -> 215.96 (124%), c5: 683.78 -> 883.02 (129%), c10: 1724.66 -> 1937.87 (112%), c20: 3199.93 -> 3470.92 (108%), c50: 7582.96 -> 7941.45 (105%), c100: 14701.52 -> 16022.57 (109%)
> block_unpack_ms: c1: 0.63 -> 0.62 (99%), c5: 2.35 -> 2.35 (100%), c10: 4.36 -> 4.56 (105%), c20: 9.10 -> 8.30 (91%), c50: 21.59 -> 22.01 (102%), c100: 40.97 -> 40.93 (100%)
> seeker_next_ms: c1: 1073.45 -> 1053.51 (98%), c5: 4452.52 -> 4728.96 (106%), c10: 9359.61 -> 9706.06 (104%), c20: 18879.28 -> 18330.93 (97%), c50: 47664.95 -> 46404.26 (97%), c100: 96511.83 -> 92121.30 (95%)
> store_next_ms: c1: 2651.11 -> 2423.27 (91%), c5: 12048.64 -> 11398.63 (95%), c10: 24833.68 -> 23586.20 (95%), c20: 48493.26 -> 46574.38 (96%), c50: 120786.00 -> 115603.00 (96%), c100: 240627.00 -> 227945.00 (95%)
> store_reseek_ms: c1: 2.71 -> 62.17 (2297%), c5: 10.93 -> 233.67 (2139%), c10: 10.88 -> 401.88 (3693%), c20: 13.98 -> 783.18 (5601%), c50: 22.26 -> 1939.28 (8714%), c100: 24.32 -> 4065.25 (16714%)
>
> Test parameters: Rows=10000000, ValueSize=20, BlockEncoding=FAST_DIFF, Compress=NONE, Filter=ALL
>
> On Thu, Jun 11, 2020 at 5:08 PM Andrew Purtell <apurt...@apache.org> wrote:
>> I used PE to generate 10M row tables with one family with either 1, 10, 20, 50, or 100 values per row (unique column-qualifiers).
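The sqm_hint_* and versions_hint_* counts in the table above follow a simple pattern: when SQM honors the filter's SKIP, every cell yields one skip hint; when it instead falls through to the version check, each row yields (columns - 1) include-and-seek-next-col hints plus one include-and-seek-next-row hint. A toy simulation (my own code and naming, not HBase's) reproduces the counts above, modulo a handful of store-file-boundary cells:

```java
public class HintModel {
    // Simulates a filter-all scan over `rows` rows with `cols` qualifiers each.
    // Returns {skipHints, includeAndSeekNextColHints, includeAndSeekNextRowHints}.
    static long[] simulate(long rows, int cols, boolean filterSkipHonored) {
        long skip = 0, seekCol = 0, seekRow = 0;
        for (long r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                if (filterSkipHonored) {
                    skip++;            // SQM passes the filter's SKIP straight through
                } else if (c < cols - 1) {
                    seekCol++;         // version check: seek to the next column
                } else {
                    seekRow++;         // last qualifier: seek to the next row
                }
            }
        }
        return new long[] {skip, seekCol, seekRow};
    }

    public static void main(String[] args) {
        long[] v16 = simulate(10_000_000, 5, true);   // 1.6-style behavior
        long[] v22 = simulate(10_000_000, 5, false);  // 2.2-style behavior
        System.out.printf("1.6-style: skip=%d seekCol=%d seekRow=%d%n",
                v16[0], v16[1], v16[2]);
        System.out.printf("2.2-style: skip=%d seekCol=%d seekRow=%d%n",
                v22[0], v22[1], v22[2]);
    }
}
```

For the c5 case this gives 50,000,000 skip hints in the 1.6-style path, versus 40,000,000 seek-next-col and 10,000,000 seek-next-row hints in the 2.2-style path, matching the measured counters to within a few boundary cells. It also shows why the regression is proportional to the number of qualifiers per row: every seek hint is a candidate reseek.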
>> An increase in wall clock time was noticeable, for example:
>>
>> 1.6.0
>>
>> time ./bin/hbase pe --rows=5000000 --table=TestTable_f1_c20 --columns=20 --nomapred scan 2
>> real 1m20.179s
>> user 0m45.470s
>> sys 0m16.572s
>>
>> 2.2.5
>>
>> time ./bin/hbase pe --rows=5000000 --table=TestTable_f1_c20 --columns=20 --nomapred scan 2
>> real 1m31.016s
>> user 0m48.658s
>> sys 0m16.965s
>>
>> It didn't really make a difference whether I used 1 thread or 4 or 10; the delta was about the same, proportionally. I picked two threads in the end so I'd have enough time to launch async-profiler twice in another shell, to capture flamegraph and call tree output, respectively. async-profiler captured 10 seconds at steady state per test case. Upon first inspection, what jumps out is an increasing proportion of CPU time spent in GC in 2.2.5 vs 1.6.0. The difference grows as the number of columns per row increases: there is little apparent difference at 1 column, but a 2x or more difference at 20 columns, and a 10x or more difference at 100 columns, eyeballing the charts and flipping back and forth between browser windows. This seems more than coincidental but obviously calls for capture and analysis of a GC trace, with JFR. Will do that next.
>>
>> JVM: openjdk version "1.8.0_232" OpenJDK Runtime Environment (Zulu 8.42.0.21-CA-macosx) (build 1.8.0_232-b18) OpenJDK 64-Bit Server VM (Zulu 8.42.0.21-CA-macosx) (build 25.232-b18, mixed mode)
>>
>> Regionserver JVM flags: -Xms10g -Xmx10g -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+UseNUMA -XX:-UseBiasedLocking -XX:+ParallelRefProcEnabled
>>
>> On Thu, Jun 11, 2020 at 7:06 AM Jan Van Besien <ja...@ngdata.com> wrote:
>>> This is promising, thanks a lot. Testing with hbase 2.2.5 shows an improvement, but we're not there yet.
>>>
>>> As reported earlier, hbase 2.1.0 was about 60% slower than hbase 1.2.0 in a test that simply scans all the regions in parallel without any filter.
>>> A test with hbase 2.2.5 shows it to be about 40% slower than 1.2.0. So that is better than 2.1.0, but still substantially slower than hbase 1.2.0 was.
>>>
>>> As before, I tested this both on a 3-node cluster and with a unit test using HBaseTestingUtility. Both tests show very similar relative differences.
>>>
>>> Jan
>>>
>>> On Thu, Jun 11, 2020 at 2:16 PM Anoop John <anoop.hb...@gmail.com> wrote:
>>> >
>>> > In another mail thread Zheng Hu brought up an important Jira fix:
>>> > https://issues.apache.org/jira/browse/HBASE-21657
>>> > Can you please check with this once?
>>> >
>>> > Anoop
>>> >
>>> > On Tue, Jun 9, 2020 at 8:08 PM Jan Van Besien <ja...@ngdata.com> wrote:
>>> >
>>> > > On Sun, Jun 7, 2020 at 7:49 AM Anoop John <anoop.hb...@gmail.com> wrote:
>>> > > > As per the above configs, it looks like the Bucket Cache is not being used; only the on-heap LRU cache is in use.
>>> > >
>>> > > True (but it is large enough to hold everything, so I don't think it matters).
>>> > >
>>> > > > @Jan - Is it possible for you to test with an off-heap Bucket Cache? Configure bucket cache in off-heap mode with size ~7.5 GB.
>>> > >
>>> > > I did a quick test but it seems not to make a measurable difference. If anything, it is actually slightly slower. I see a 100% hit ratio in the L1 LruBlockCache and effectively also 100% in the L2 BucketCache (the hit ratio is not yet at 100%, but hits increase with every test and misses do not).
>>> > >
>>> > > Given that the LruBlockCache was already large enough to cache all the data anyway, I did not expect this to help either, to be honest.
>>> > >
>>> > > > Do you have any DataBlockEncoding enabled on the CF?
>>> > >
>>> > > Yes, FAST_DIFF. But this is of course true in both the tests with hbase2 and hbase1.
>>> > >
>>> > > Jan
>>
>> --
>> Best regards,
>> Andrew
>>
>> Words like orphans lost among the crosstalk, meaning torn from truth's decrepit hands
>> - A23, Crosstalk