[ 
https://issues.apache.org/jira/browse/HBASE-17756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17109719#comment-17109719
 ] 

Michael Stack commented on HBASE-17756:
---------------------------------------

Pushed a change that writes key and value size sketches. Of mild interest. 
Changed the pretty printer to dump out 100 quantiles on size, along with the 
50th, 75th, and 95th percentiles. Changed the writer to use the sketches when 
figuring averages.

TODO: measure to see if this slows down writing.

I think we need a RegionPrettyPrinter, like HFilePrettyPrinter, only it reads 
the Region row-wise, accumulating row size and column count sketches. It would 
optionally run HFilePrettyPrinter on each file in the Region, summing the 
per-file key and value sketches. It would write out stuff like min/max and 
then quantiles for row size, row count, key sizes, and value sizes. The output 
would be easy to graph, perhaps by producing graph files.
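
To make the "summing the per-file sketches" idea concrete, here is a rough 
sketch in Python. It is not HBase code and it does not use the DataSketches 
library: as a stand-in for a real quantiles sketch it uses a power-of-two 
histogram, which is also mergeable but much cruder. The names SizeSummary and 
summarize are hypothetical, made up for this illustration.

```python
import math
from collections import Counter
from typing import Iterable, Optional


class SizeSummary:
    """Mergeable size summary (hypothetical stand-in for a quantiles sketch).

    Tracks exact count/total/min/max plus a power-of-two histogram, where
    bucket i holds sizes in [2**(i-1), 2**i - 1] (bucket 0 holds size 0).
    """

    def __init__(self) -> None:
        self.count = 0
        self.total = 0
        self.min: Optional[int] = None
        self.max: Optional[int] = None
        self.buckets: Counter = Counter()  # bucket index -> number of sizes

    def update(self, size: int) -> None:
        """Record one observed size (for example a key or value length)."""
        self.count += 1
        self.total += size
        self.min = size if self.min is None else min(self.min, size)
        self.max = size if self.max is None else max(self.max, size)
        self.buckets[size.bit_length()] += 1

    def merge(self, other: "SizeSummary") -> None:
        """Fold another summary (say, one per file) into this one."""
        if other.count == 0:
            return
        self.count += other.count
        self.total += other.total
        self.min = other.min if self.min is None else min(self.min, other.min)
        self.max = other.max if self.max is None else max(self.max, other.max)
        self.buckets.update(other.buckets)  # Counter.update adds counts

    def quantile(self, q: float) -> int:
        """Upper bound of the bucket containing quantile q, capped at max."""
        if self.count == 0:
            raise ValueError("empty summary")
        target = max(1, math.ceil(q * self.count))
        seen = 0
        for index in sorted(self.buckets):
            seen += self.buckets[index]
            if seen >= target:
                return min((1 << index) - 1, self.max)
        return self.max


def summarize(sizes: Iterable[int]) -> SizeSummary:
    """Build a summary from raw sizes, as a per-file pass might."""
    summary = SizeSummary()
    for size in sizes:
        summary.update(size)
    return summary


# Two made-up "files" worth of value sizes, merged into one region summary.
file_a = summarize([10, 12, 40, 100])
file_b = summarize([8, 300, 5000])
region = SizeSummary()
for per_file in (file_a, file_b):
    region.merge(per_file)

for q in (0.50, 0.75, 0.95):
    print(f"p{int(q * 100)} <= {region.quantile(q)} bytes")
print(f"min={region.min} max={region.max} mean={region.total / region.count:.1f}")
```

The point is only that per-file summaries merge without rereading the data, 
so a region-level report falls out of the per-file ones. A real quantiles 
sketch would give much tighter estimates than these power-of-two buckets.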

> We should have better introspection of HFiles
> ---------------------------------------------
>
>                 Key: HBASE-17756
>                 URL: https://issues.apache.org/jira/browse/HBASE-17756
>             Project: HBase
>          Issue Type: Brainstorming
>          Components: HFile
>            Reporter: Esteban Gutierrez
>            Assignee: Rushabh Shah
>            Priority: Major
>
> [~saint....@gmail.com] suggested using DataSketches 
> (https://datasketches.github.io) to write additional statistics to the 
> HFiles. These could be used to improve our split decisions, to help with 
> troubleshooting, or potentially for other interesting analysis, without 
> having to perform full table scans. The statistics could be stored as part 
> of the HFile, but we could initially improve the visibility of the data by 
> adding some statistics to HFilePrettyPrinter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
