apurtell opened a new pull request, #8075:
URL: https://github.com/apache/hbase/pull/8075

   PreviousBlockCompressionRatePredicator has three algorithmic deficiencies 
that cause compressed blocks to systematically undershoot the configured block 
size target: integer division truncation, single-sample estimation, and no 
smoothing of the estimated compression ratio.
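
To make the truncation problem concrete, here is a minimal sketch (not the HBase source; the class and method names are hypothetical). It takes the gz/NONE row of the results table as sample input: a 65536-byte uncompressed block that compresses to about 22354 bytes. With `int` operands the ratio collapses from roughly 2.93 to 2, so a size limit derived from it is too small and the next compressed block falls short of the target.

```java
// Hypothetical illustration of integer-division truncation; not HBase code.
final class RatioTruncationSketch {

    /** Ratio computed with int operands: the fractional part is discarded. */
    static int truncatedRatio(int uncompressedBytes, int compressedBytes) {
        return uncompressedBytes / compressedBytes;
    }

    /** Ratio computed in double precision: the fractional part is kept. */
    static double preciseRatio(int uncompressedBytes, int compressedBytes) {
        return (double) uncompressedBytes / compressedBytes;
    }

    public static void main(String[] args) {
        int targetBlockSize = 65536; // configured block size (sample value)
        int uncompressed = 65536;    // bytes written before compression
        int compressed = 22354;      // bytes on disk after compression

        int truncated = truncatedRatio(uncompressed, compressed);
        double precise = preciseRatio(uncompressed, compressed);

        // Uncompressed-size limit that each ratio implies for the next block.
        long limitFromInt = (long) targetBlockSize * truncated;
        long limitFromDouble = Math.round(targetBlockSize * precise);

        System.out.printf("int ratio=%d -> uncompressed limit=%d%n", truncated, limitFromInt);
        System.out.printf("double ratio=%.3f -> uncompressed limit=%d%n", precise, limitFromDouble);
    }
}
```

Under these assumptions the truncated ratio allows only about two thirds of the uncompressed bytes that the precise ratio would, which is roughly in line with the PrevBlock rows in the table below landing near 44 KB instead of 64 KB.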
   
   EWMABlockSizePredicator addresses these issues with double-precision 
arithmetic and an exponentially weighted moving average (EWMA) of the 
compression ratio. This produces compressed HFile blocks that are closer to 
the configured target block size.
   
   The ratio is smoothed with a default alpha of 0.5, which adapts quickly to 
changing data while dampening single-block variance. After a change in the 
true ratio, the EWMA closes 87.5% of the gap within 3 blocks (1 - 0.5^3). 
Alpha = 0.5 is chosen because HFile blocks within a single file tend to have 
similar compression ratios (same column family, similar data distribution), 
and because predicator state is per-file, fast adaptation matters more than 
long-term smoothing.
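
The smoothing step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code in this pull request: the class name, the `update` method, and the choice to seed the average with the first sample are all hypothetical. It shows one reading of the 87.5% figure: after the true ratio steps from 2.0 to 4.0, three updates with alpha = 0.5 move the estimate to 3.75, which is 87.5% of the way.

```java
// Hypothetical EWMA of a compression ratio; names and seeding are assumptions.
final class EwmaRatioSketch {
    private final double alpha;  // weight given to the newest sample
    private double ratio;        // current smoothed estimate
    private boolean seeded;      // false until the first sample arrives

    EwmaRatioSketch(double alpha) {
        if (alpha <= 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must be in (0, 1]");
        }
        this.alpha = alpha;
    }

    /** Folds one block's observed ratio into the estimate and returns it. */
    double update(int uncompressedBytes, int compressedBytes) {
        double sample = (double) uncompressedBytes / compressedBytes;
        if (!seeded) {
            ratio = sample;      // assumption: first sample seeds the average
            seeded = true;
        } else {
            ratio = alpha * sample + (1.0 - alpha) * ratio;
        }
        return ratio;
    }

    public static void main(String[] args) {
        EwmaRatioSketch ewma = new EwmaRatioSketch(0.5);
        ewma.update(200, 100);                       // seed: ratio 2.0
        for (int block = 1; block <= 3; block++) {   // true ratio is now 4.0
            double estimate = ewma.update(400, 100);
            double gapClosed = (estimate - 2.0) / (4.0 - 2.0);
            System.out.printf("block %d: estimate=%.3f gap closed=%.1f%%%n",
                block, estimate, gapClosed * 100.0);
        }
    }
}
```

A smaller alpha would smooth more but converge more slowly, which matters here because the state is reset for every file.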
   
   Also adds HFileBlockPerformanceEvaluation, a microbenchmark for 
HFileBlock-related concerns.
   
   ```text
   ==========================================================================================
     PREDICATOR ACCURACY RESULTS
   ==========================================================================================

   Predicator     Compr  Encoding   BlkSize Blocks MeanOnDisk   Stddev        Min        Max    Dev%
   -------------- ------ ---------- ------- ------ ---------- -------- ---------- ---------- -------
   Uncompressed   none   NONE         65536   2907      66596      926      16681      66613    1.6%
   PrevBlock      none   NONE         65536   2907      66596      926      16681      66613    1.6%
   EWMA           none   NONE         65536   2907      66596      926      16681      66613    1.6%

   Uncompressed   none   FAST_DIFF    65536   2907      64460      896      16171      64481    1.6%
   PrevBlock      none   FAST_DIFF    65536   2819      66474      986      14159      66497    1.4%
   EWMA           none   FAST_DIFF    65536   2819      66474      986      14159      66497    1.4%

   Uncompressed   gz     NONE         65536   2996      22354        5      22338      22369   65.9%
   PrevBlock      gz     NONE         65536   2987      44206      400      22350      44233   32.5%
   EWMA           gz     NONE         65536   2954      65700     1399       3264      65758    0.2%

   Uncompressed   gz     FAST_DIFF    65536   2996      22204        5      22190      22227   66.1%
   PrevBlock      gz     FAST_DIFF    65536   2987      45257      422      22193      45289   30.9%
   EWMA           gz     FAST_DIFF    65536   2938      65563      999      22202      65616    0.0%
   ```
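
The Dev% column is not defined in the description. The reported values appear consistent with the absolute deviation of MeanOnDisk from BlkSize, expressed as a percentage of BlkSize. The sketch below works from that assumption; the class and method names are hypothetical, and the benchmark may compute the figure differently (for example, from unrounded means).

```java
// Hypothetical helper; assumes Dev% = |MeanOnDisk - BlkSize| / BlkSize * 100.
final class DeviationSketch {

    /** Absolute deviation of the mean on-disk size from the target, in percent. */
    static double devPercent(double meanOnDisk, int targetBlockSize) {
        return Math.abs(meanOnDisk - targetBlockSize) / targetBlockSize * 100.0;
    }

    public static void main(String[] args) {
        int target = 65536;
        // Sample inputs: MeanOnDisk values from the gz/NONE rows of the table.
        System.out.printf("Uncompressed: %.2f%%%n", devPercent(22354, target));
        System.out.printf("PrevBlock:    %.2f%%%n", devPercent(44206, target));
        System.out.printf("EWMA:         %.2f%%%n", devPercent(65700, target));
    }
}
```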

