apurtell opened a new pull request, #8075:
URL: https://github.com/apache/hbase/pull/8075
PreviousBlockCompressionRatePredicator has three algorithmic deficiencies
that cause compressed blocks to systematically undershoot the configured block
size target: integer division truncation, single-sample estimation, and no
smoothing of the estimated compression ratio.
EWMABlockSizePredicator addresses these issues by computing the compression
ratio in double-precision arithmetic and smoothing it with an exponentially
weighted moving average (EWMA). This produces compressed HFile blocks that are
closer to the configured target block size.
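To illustrate the truncation issue, here is a minimal sketch, not the actual
HBase code. The class and method names are hypothetical, and it assumes the
uncompressed size limit is derived as target block size times the estimated
compression ratio. The sample sizes are taken from the gz/NONE rows in the
results below.

```java
// Hypothetical sketch: integer vs. double compression ratio estimation.
final class RatioTruncationSketch {

    // Integer division drops the fractional part of the ratio.
    static int limitWithIntRatio(int targetSize, int uncompressed, int compressed) {
        int ratio = uncompressed / compressed; // 65536 / 22354 -> 2, not ~2.93
        return targetSize * ratio;
    }

    // Double-precision division keeps the fractional part.
    static long limitWithDoubleRatio(int targetSize, int uncompressed, int compressed) {
        double ratio = (double) uncompressed / compressed;
        return Math.round(targetSize * ratio);
    }

    public static void main(String[] args) {
        int target = 65536;       // configured block size
        int uncompressed = 65536; // previous block, before compression
        int compressed = 22354;   // previous block, on disk

        System.out.println(limitWithIntRatio(target, uncompressed, compressed));    // 131072
        System.out.println(limitWithDoubleRatio(target, uncompressed, compressed)); // 192134
    }
}
```

Under these assumptions, the truncated ratio allows only about two thirds of
the uncompressed bytes that the fractional ratio would, so the block closes
early and lands well below the target on disk.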
The ratio is smoothed using a default alpha of 0.5. This adapts quickly to
changing data while dampening single-block variance. After 3 blocks, the
initial estimate retains only 0.5^3 = 12.5% of its weight, so the EWMA reflects
87.5% of the observed ratio. Alpha = 0.5 is chosen because HFile blocks
within a single file tend to have similar compression ratios (same column
family, similar data distribution), and fast adaptation matters more than
long-term smoothing since predicator state is per-file.
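The smoothing step can be sketched as follows. This is an illustration under
assumptions, not the code in this PR: the class name, the method names, and
the seeding with an initial ratio of 1.0 are all hypothetical. Only the update
rule `alpha * sample + (1 - alpha) * previous` and alpha = 0.5 come from the
description above.

```java
// Hypothetical sketch of EWMA smoothing of a compression ratio.
final class EwmaRatioSketch {
    private final double alpha;
    private double ratio;

    EwmaRatioSketch(double alpha, double initialRatio) {
        this.alpha = alpha;
        this.ratio = initialRatio; // assumed seed; the real predicator may seed differently
    }

    // Fold one finished block into the running estimate and return it.
    double update(int uncompressedBytes, int compressedBytes) {
        double sample = (double) uncompressedBytes / compressedBytes;
        ratio = alpha * sample + (1.0 - alpha) * ratio;
        return ratio;
    }

    double ratio() {
        return ratio;
    }

    public static void main(String[] args) {
        // Seed at 1.0, then observe three blocks that each compress 3:1.
        EwmaRatioSketch ewma = new EwmaRatioSketch(0.5, 1.0);
        for (int i = 0; i < 3; i++) {
            System.out.println(ewma.update(3000, 1000)); // 2.0, 2.5, 2.75
        }
    }
}
```

With a seed of 1.0 and a steady observed ratio of 3.0, the estimate reaches
2.75 after three blocks, which closes 87.5% of the gap between the seed and
the observed value.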
Adds HFileBlockPerformanceEvaluation to microbenchmark HFileBlock-related
concerns.
```text
=================================================================================================
PREDICATOR ACCURACY RESULTS
=================================================================================================
Predicator     Compr  Encoding   BlkSize Blocks MeanOnDisk   Stddev        Min        Max    Dev%
-------------- ------ ---------- ------- ------ ---------- -------- ---------- ---------- -------
Uncompressed   none   NONE         65536   2907      66596      926      16681      66613    1.6%
PrevBlock      none   NONE         65536   2907      66596      926      16681      66613    1.6%
EWMA           none   NONE         65536   2907      66596      926      16681      66613    1.6%
Uncompressed   none   FAST_DIFF    65536   2907      64460      896      16171      64481    1.6%
PrevBlock      none   FAST_DIFF    65536   2819      66474      986      14159      66497    1.4%
EWMA           none   FAST_DIFF    65536   2819      66474      986      14159      66497    1.4%
Uncompressed   gz     NONE         65536   2996      22354        5      22338      22369   65.9%
PrevBlock      gz     NONE         65536   2987      44206      400      22350      44233   32.5%
EWMA           gz     NONE         65536   2954      65700     1399       3264      65758    0.2%
Uncompressed   gz     FAST_DIFF    65536   2996      22204        5      22190      22227   66.1%
PrevBlock      gz     FAST_DIFF    65536   2987      45257      422      22193      45289   30.9%
EWMA           gz     FAST_DIFF    65536   2938      65563      999      22202      65616    0.0%
```