Great investigation, Andy. Do you know of any Jiras that made changes in SQM?
It would be great if you could attach the patch you used to track the scan flow.
If we have a Jira for this issue, can you please attach it there?
Anoop
On Fri, Jun 26, 2020 at 1:56 AM Andrew Purtell <andrew.purt...@gmail.com>
wrote:
Related, I think I found a bug in branch-1 where we don’t heartbeat in the
filter-all case until we switch store files, so scanning a very large store
file might time out with client defaults. Remarking on this here so I don’t
forget to follow up.
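(For anyone hitting this before it is fixed, a possible client-side stopgap is
simply to raise the scanner timeout above the 60 second default. A minimal
sketch; the timeout values are picked arbitrarily, not a recommendation:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ScannerTimeoutStopgap {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Default is 60000 ms; with no heartbeats arriving during a long
    // filter-all pass over a large store file, that window can be exceeded.
    conf.setInt("hbase.client.scanner.timeout.period", 300000);
    // Scan RPCs are also bounded by the general RPC timeout.
    conf.setInt("hbase.rpc.timeout", 300000);
    try (Connection connection = ConnectionFactory.createConnection(conf)) {
      // ... run the long-running scan with this connection ...
    }
  }
}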
On Jun 25, 2020, at 12:27 PM, Andrew Purtell <apurt...@apache.org>
wrote:
I repeated this test with pe --filterAll and the results were revealing, at
least for this case. I also patched in a thread-local hash map of atomic
counters that I could update from code paths in SQM, StoreScanner,
HFileReader*, and HFileBlock. Because an RPC is processed by a single handler
thread, I could update counters and accumulate micro-timings via
System#nanoTime() per RPC and dump them out of CallRunner in some new trace
logging. I spent a couple of days making sure the instrumentation was placed
equivalently in both the 1.6 and 2.2 code bases and was producing consistent
results. I can provide these patches upon request.
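(For anyone who wants to reproduce this without the patches, the general shape
of the instrumentation was roughly the following; a sketch with illustrative
names, not the actual diff:)

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/**
 * Per-handler-thread counters and nanosecond timings. Because a scan RPC is
 * processed end to end by one handler thread, probes in SQM, StoreScanner,
 * HFileReader*, HFileBlock, etc. can bump counters here without contention,
 * and CallRunner can dump and reset the map when the call completes.
 */
public final class PerRpcMetrics {
  private static final ThreadLocal<Map<String, LongAdder>> METRICS =
      ThreadLocal.withInitial(ConcurrentHashMap::new);

  private PerRpcMetrics() {}

  public static void increment(String name) {
    add(name, 1L);
  }

  public static void add(String name, long delta) {
    METRICS.get().computeIfAbsent(name, k -> new LongAdder()).add(delta);
  }

  // Wrap a code path with startTimer()/stopTimer() to accumulate
  // System.nanoTime() deltas under the given name.
  public static long startTimer() {
    return System.nanoTime();
  }

  public static void stopTimer(String name, long startNanos) {
    add(name, System.nanoTime() - startNanos);
  }

  // Called at the end of the RPC (e.g. from CallRunner) to emit trace
  // logging: returns the per-call snapshot and clears the thread state.
  public static Map<String, LongAdder> dumpAndReset() {
    Map<String, LongAdder> snapshot = METRICS.get();
    METRICS.remove();
    return snapshot;
  }
}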
Again, test tables with one family and 1, 5, 10, 20, 50, and 100 distinct
column-qualifiers per row. After loading the table I made a snapshot and
cloned the snapshot for testing, for both 1.6 and 2.2, so both versions were
tested using the exact same data files on HDFS. I also used the 1.6 version of
PE for both, so the only change is on the server (1.6 vs 2.2 masters and
regionservers).
It appears a refactor to ScanQueryMatcher and friends has disabled the ability
of filters to provide SKIP hints, which prevents us from bypassing version
checking (so some extra cost in SQM), and appears to disable an optimization
that avoids reseeking, leading to a serious and proportional regression in
reseek activity and time spent in that code path. So for queries that use
filters, there can be a substantial regression.
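(For context, the --filterAll case drives a filter that answers SKIP for every
cell. A minimal sketch of such a filter, using the 2.x hook name filterCell
(1.x uses filterKeyValue); this is illustrative, not necessarily how PE
implements it:)

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.filter.FilterBase;

/**
 * Answers SKIP for every cell. The SKIP return code is the "filter hint"
 * referred to above: when honored, the scanner can discard the cell without
 * running version checking or being pushed into a reseek. (A real server-side
 * filter would also need toByteArray()/parseFrom() for serialization.)
 */
public class SkipEverythingFilter extends FilterBase {
  @Override
  public ReturnCode filterCell(Cell c) {
    return ReturnCode.SKIP;
  }
}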
Other test cases that did not use filters did not show a regression.
A test case where I used ROW_INDEX_V1 encoding showed an expected, modest,
proportional regression in seeking time, because that encoding is optimized
for point queries and not for the full table scan case.
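(For anyone reproducing that case, the encoding is just set on the column
family at table creation; a sketch with the 2.x admin API, table and family
names made up:)

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateRowIndexV1Table {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      // One family, ROW_INDEX_V1 data block encoding.
      admin.createTable(TableDescriptorBuilder.newBuilder(TableName.valueOf("TestTable_rowindex"))
          .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("f"))
              .setDataBlockEncoding(DataBlockEncoding.ROW_INDEX_V1)
              .build())
          .build());
    }
  }
}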
I will come back here when I understand this better.
Here are the results for the pe --filterAll case:
(1.6.0 vs 2.2.5 for each distinct-column count, c1 through c100. For each
metric the first row is 1.6.0, the second row is 2.2.5, and the third row is
2.2.5 as a percentage of 1.6.0; "> !" flags cases where the 2.2.5 value blew
up relative to 1.6.0.)

Counts

             c1            c5            c10           c20           c50           c100
rpcs (better heartbeating in 2.2.5)
  1.6.0      1             2             2             3             4             8
  2.2.5      2             6             10            17            37            72
  ratio      200%          300%          500%          567%          925%          900%
block_reads
  1.6.0      11507         57255         114471        230372        578292        1157955
  2.2.5      11508         57257         114474        230377        578298        1157963
  ratio      100%          100%          100%          100%          100%          100%
block_unpacks
  1.6.0      11507         57255         114471        230372        578292        1157955
  2.2.5      11508         57257         114474        230377        578298        1157963
  ratio      100%          100%          100%          100%          100%          100%
seeker_next
  1.6.0      10000000      50000000      100000000     200000000     500000000     1000000000
  2.2.5      10000000      50000000      100000000     200000000     500000000     1000000000
  ratio      100%          100%          100%          100%          100%          100%
store_next
  1.6.0      10000000      50000000      100000000     200000000     500000000     1000000000
  2.2.5      9988268       49940082      99879401      199766539     499414653     998836518
  ratio      100%          100%          100%          100%          100%          100%
store_reseek
  1.6.0      1             2             8             6             10            8
  2.2.5      11733         59924         120607        233467        585357        1163490
  ratio      > !           > !           > !           > !           > !           > !
cells_matched
  1.6.0      20000000      60000000      110000000     210000000     510000000     1010000000
  2.2.5      20000000      60000000      110000000     210000000     510000000     1010000000
  ratio      100%          100%          100%          100%          100%          100%
column_hint_include
  1.6.0      10000000      50000000      100000000     200000000     500000000     1000000000
  2.2.5      10000000      50000000      100000000     200000000     500000000     1000000000
  ratio      100%          100%          100%          100%          100%          100%
filter_hint_skip
  1.6.0      10000000      50000000      100000000     200000000     500000000     1000000000
  2.2.5      10000000      50000000      100000000     200000000     500000000     1000000000
  ratio      100%          100%          100%          100%          100%          100%
sqm_hint_done
  1.6.0      9999999       9999998       9999998       9999997       9999996       9999992
  2.2.5      9999999       9999998       9999998       9999997       9999996       9999992
  ratio      100%          100%          100%          100%          100%          100%
sqm_hint_seek_next_col
  1.6.0      1             2             2             3             4             8
  2.2.5      0             40000002      90000002      190000003     490000004     990000008
  ratio      0%            > !           > !           > !           > !           > !
sqm_hint_seek_next_row
  1.6.0      0             2             0             0             0             0
  2.2.5      10000000      10000000      10000000      10000000      10000000      10000000
  ratio      > !           > !           > !           > !           > !           > !
sqm_hint_skip
  1.6.0      10000000      50000000      100000000     200000000     500000000     1000000000
  2.2.5      0             0             0             0             0             0
  ratio      0%            0%            0%            0%            0%            0%
versions_hint_include_and_seek_next_col
  1.6.0      0             0             0             0             0             0
  2.2.5      0             40000000      90000000      190000000     490000000     990000000
  ratio      0%            > !           > !           > !           > !           > !
versions_hint_include_and_seek_next_row
  1.6.0      0             0             0             0             0             0
  2.2.5      0             10000000      10000000      10000000      10000000      10000000
  ratio      0%            > !           > !           > !           > !           > !

Times (ms)

             c1            c5            c10           c20           c50           c100
block_read_ms
  1.6.0      174.12        683.78        1724.66       3199.93       7582.96       14701.52
  2.2.5      215.96        883.02        1937.87       3470.92       7941.45       16022.57
  ratio      124%          129%          112%          108%          105%          109%
block_unpack_ms
  1.6.0      0.63          2.35          4.36          9.10          21.59         40.97
  2.2.5      0.62          2.35          4.56          8.30          22.01         40.93
  ratio      99%           100%          105%          91%           102%          100%
seeker_next_ms
  1.6.0      1073.45       4452.52       9359.61       18879.28      47664.95      96511.83
  2.2.5      1053.51       4728.96       9706.06       18330.93      46404.26      92121.30
  ratio      98%           106%          104%          97%           97%           95%
store_next_ms
  1.6.0      2651.11       12048.64      24833.68      48493.26      120786.00     240627.00
  2.2.5      2423.27       11398.63      23586.20      46574.38      115603.00     227945.00
  ratio      91%           95%           95%           96%           96%           95%
store_reseek_ms
  1.6.0      2.71          10.93         10.88         13.98         22.26         24.32
  2.2.5      62.17         233.67        401.88        783.18        1939.28       4065.25
  ratio      2297%         2139%         3693%         5601%         8714%         16714%
Rows=10000000
ValueSize=20
BlockEncoding=FAST_DIFF
Compress=NONE
Filter=ALL
On Thu, Jun 11, 2020 at 5:08 PM Andrew Purtell <apurt...@apache.org>
wrote:
I used PE to generate 10M row tables with one family and either 1, 10, 20, 50,
or 100 values per row (unique column-qualifiers). An increase in wall clock
time was noticeable, for example:
1.6.0
time ./bin/hbase pe --rows=5000000 --table=TestTable_f1_c20 --columns=20 --nomapred scan 2
real 1m20.179s
user 0m45.470s
sys 0m16.572s

2.2.5
time ./bin/hbase pe --rows=5000000 --table=TestTable_f1_c20 --columns=20 --nomapred scan 2
real 1m31.016s
user 0m48.658s
sys 0m16.965s
It didn't really make a difference whether I used 1 thread or 4 or 10; the
delta was about the same, proportionally. I picked two threads in the end so
I'd have enough time to launch async-profiler twice in another shell, to
capture flamegraph and call tree output, respectively. async-profiler captured
10 seconds at steady state per test case. Upon first inspection what jumps out
is an increasing proportion of CPU time spent in GC in 2.2.5 vs 1.6.0. The
difference increases as the number of columns increases. There is little
apparent difference at 1 column, but a 2x or more difference at 20 columns,
and a 10x or more difference at 100 columns, eyeballing the charts, flipping
back and forth between browser windows. This seems more than coincidental but
obviously calls for capture and analysis of a GC trace, with JFR. Will do that
next.
JVM: openjdk version "1.8.0_232", OpenJDK Runtime Environment (Zulu
8.42.0.21-CA-macosx) (build 1.8.0_232-b18), OpenJDK 64-Bit Server VM (Zulu
8.42.0.21-CA-macosx) (build 25.232-b18, mixed mode)
Regionserver JVM flags: -Xms10g -Xmx10g -XX:+UseG1GC -XX:+AlwaysPreTouch
-XX:+UseNUMA -XX:-UseBiasedLocking -XX:+ParallelRefProcEnabled
On Thu, Jun 11, 2020 at 7:06 AM Jan Van Besien <ja...@ngdata.com>
wrote:
This is promising, thanks a lot. Testing with hbase 2.2.5 shows an
improvement, but we're not there yet.
As reported earlier, hbase 2.1.0 was about 60% slower than hbase 1.2.0 in a
test that simply scans all the regions in parallel without any filter. A test
with hbase 2.2.5 shows it to be about 40% slower than 1.2.0. So that is better
than 2.1.0, but still substantially slower than hbase 1.2.0 was.
As before, I tested this both on a 3-node cluster as well as with a unit test
using HBaseTestingUtility. Both tests show very similar relative differences.
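(For concreteness, the shape of that test is roughly the following; a sketch
rather than the actual code, with the table name and thread handling made up:)

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Pair;

public class ParallelRegionScan {
  public static void main(String[] args) throws Exception {
    TableName tableName = TableName.valueOf("TestTable"); // illustrative
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         RegionLocator locator = conn.getRegionLocator(tableName)) {
      Pair<byte[][], byte[][]> keys = locator.getStartEndKeys();
      int regions = keys.getFirst().length;
      ExecutorService pool = Executors.newFixedThreadPool(regions);
      List<Future<Long>> futures = new ArrayList<>();
      for (int i = 0; i < regions; i++) {
        final byte[] start = keys.getFirst()[i];
        final byte[] stop = keys.getSecond()[i];
        // One plain scan per region, no filter; just drain and count rows.
        futures.add(pool.submit((Callable<Long>) () -> {
          long rows = 0;
          try (Table table = conn.getTable(tableName);
               ResultScanner scanner =
                   table.getScanner(new Scan().withStartRow(start).withStopRow(stop))) {
            for (Result r : scanner) {
              rows++;
            }
          }
          return rows;
        }));
      }
      long total = 0;
      for (Future<Long> f : futures) {
        total += f.get();
      }
      pool.shutdown();
      System.out.println("rows=" + total);
    }
  }
}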
Jan
On Thu, Jun 11, 2020 at 2:16 PM Anoop John <anoop.hb...@gmail.com>
wrote:
In another mail thread Zheng Hu brought up an important Jira fix:
https://issues.apache.org/jira/browse/HBASE-21657
Can you please check with this once?
Anoop
On Tue, Jun 9, 2020 at 8:08 PM Jan Van Besien <ja...@ngdata.com>
wrote:
On Sun, Jun 7, 2020 at 7:49 AM Anoop John <anoop.hb...@gmail.com>
wrote:
As per the above configs, it looks like the Bucket Cache is not being used.
Only the on-heap LRU cache is in use.
True (but it is large enough to hold everything, so I don't think it matters).
@Jan - Is it possible for you to test with an off-heap Bucket Cache?
Configure the bucket cache in off-heap mode with a size of ~7.5 GB.
I did a quick test but it seems not to make a measurable difference. If
anything, it is actually even slightly slower. I see a 100% hit ratio in the
L1 LruBlockCache and effectively also 100% in the L2 BucketCache (the hit
ratio is not yet at 100%, but hits increase with every test and misses do
not).
Given that the LruBlockCache was already large enough to cache all the data
anyway, I did not expect this to help either, to be honest.
Do you have any DataBlockEncoding enabled on the CF?
Yes, FAST_DIFF. But this is of course true in both the tests, with hbase2 and
with hbase1.
Jan
--
Best regards,
Andrew
Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
- A23, Crosstalk