[jira] [Commented] (HBASE-9243) Add more useful statistics in the HFile tool

Alexandre Normand (JIRA) Fri, 16 Aug 2013 10:59:30 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-9243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742463#comment-13742463
 ]


Alexandre Normand commented on HBASE-9243:
------------------------------------------

[~anoop.hbase]: Strictly speaking, those are the number of key lengths and the 
number of value lengths sampled. But since every distinct key is sampled 
exactly once and each key has a sampled length, the key-length count is also 
the number of rows. 

The number of values is slightly more complicated, because the same 
key/family/qualifier/timestamp can occur more than once in the same HFile (I 
have an example at hand). In that case, the reported number includes every 
occurrence, not just the one that would be visible when querying the table. So, 
in general, it is the number of values in the file, which is not necessarily 
the number of values one would see when querying HBase. 
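
To make the distinction concrete, here is a small toy model in Python. It is only a sketch: it does not use any HBase API, the cell tuples and the function names are made up for illustration, and counting distinct coordinates is just one rough proxy for what a query might return (it ignores version limits, delete markers, and TTLs).

```python
# Hypothetical cells as (row, family, qualifier, timestamp, value) tuples.
# This is a toy model, not the HFile format or the HBase client API.
cells = [
    ("row1", "f", "q", 100, b"a"),
    ("row1", "f", "q", 100, b"b"),  # same coordinates stored a second time
    ("row2", "f", "q", 100, b"c"),
]


def count_values_in_file(cells):
    """Every stored occurrence counts, as in the reported statistic."""
    return len(cells)


def count_distinct_coordinates(cells):
    """Occurrences sharing row/family/qualifier/timestamp collapse to one."""
    return len({cell[:4] for cell in cells})


def count_rows(cells):
    """Each distinct row key is counted once."""
    return len({cell[0] for cell in cells})


print(count_values_in_file(cells))
print(count_distinct_coordinates(cells))
print(count_rows(cells))
```

With the duplicated cell above, the count of values in the file is larger than the count of distinct coordinates, which is the caveat described in the comment.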
                
> Add more useful statistics in the HFile tool
> --------------------------------------------
>
>                 Key: HBASE-9243
>                 URL: https://issues.apache.org/jira/browse/HBASE-9243
>             Project: HBase
>          Issue Type: Improvement
>          Components: HFile
>    Affects Versions: 0.96.0
>            Reporter: Alexandre Normand
>            Priority: Minor
>              Labels: newbie
>         Attachments: HBASE-9243-1.patch, HBASE-9243-2.patch, HBASE-9243.patch
>
>
> The [HFile tool|http://hbase.apache.org/book/regions.arch.html#hfile_tool] 
> has been very useful to us recently for getting a better idea of the size of 
> our rows. However, we frequently wished for more statistics to get a more 
> complete picture of the distribution of the row sizes. 
> [~skuehn] requested that feature often enough in private that I decided to 
> give it a go. 
> Here's the patch that adds more nice little stats via yammer's histograms. It 
> was easy enough since {{com.yammer.metrics}} is already in hbase's 
> dependencies.
> Example of the new output from {{org.apache.hadoop.hbase.io.hfile.HFile -s -f 
> ...}}:
> {code}
> Stats:
>       Key length:
>                min = 24.00
>                max = 24.00
>               mean = 24.00
>             stddev = 0.00
>             median = 24.00
>               75% <= 24.00
>               95% <= 24.00
>               98% <= 24.00
>               99% <= 24.00
>             99.9% <= 24.00
>       Row size (bytes):
>                min = 33.00
>                max = 33.00
>               mean = 33.00
>             stddev = 0.00
>             median = 33.00
>               75% <= 33.00
>               95% <= 33.00
>               98% <= 33.00
>               99% <= 33.00
>             99.9% <= 33.00
>       Row size (columns):
>                min = 1.00
>                max = 1.00
>               mean = 1.00
>             stddev = 0.00
>             median = 1.00
>               75% <= 1.00
>               95% <= 1.00
>               98% <= 1.00
>               99% <= 1.00
>             99.9% <= 1.00
>       Val length:
>                min = 1.00
>                max = 1.00
>               mean = 1.00
>             stddev = 0.00
>             median = 1.00
>               75% <= 1.00
>               95% <= 1.00
>               98% <= 1.00
>               99% <= 1.00
>             99.9% <= 1.00
> Key of biggest row: \x00
> {code}  
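
For readers who want to see how summaries like the ones in the example output could be derived, here is a rough Python sketch using only the standard library. It is not the patch's code and not the {{com.yammer.metrics}} implementation: the names `percentile`, `summarize`, and `format_line` are hypothetical, the interpolation rule is an assumption, and the library's histograms may sample and interpolate differently.

```python
import math
import statistics


def percentile(sorted_values, q):
    """Linearly interpolated quantile q (0..1) of an already sorted list."""
    if not sorted_values:
        return 0.0
    pos = q * (len(sorted_values) + 1)
    if pos < 1:
        return float(sorted_values[0])
    if pos >= len(sorted_values):
        return float(sorted_values[-1])
    lower = sorted_values[int(pos) - 1]
    upper = sorted_values[int(pos)]
    return lower + (pos - math.floor(pos)) * (upper - lower)


def summarize(values):
    """Return (label, operator, value) triples for a non-empty sample."""
    ordered = sorted(values)
    # Sample standard deviation is undefined for one value; report 0.0 then.
    stddev = statistics.stdev(ordered) if len(ordered) > 1 else 0.0
    rows = [
        ("min", "=", float(ordered[0])),
        ("max", "=", float(ordered[-1])),
        ("mean", "=", statistics.fmean(ordered)),
        ("stddev", "=", stddev),
        ("median", "=", percentile(ordered, 0.5)),
    ]
    for label, q in [("75%", 0.75), ("95%", 0.95), ("98%", 0.98),
                     ("99%", 0.99), ("99.9%", 0.999)]:
        rows.append((label, "<=", percentile(ordered, q)))
    return rows


def format_line(label, op, value):
    """Right-align 'label op' so the values line up in one column."""
    return f"{label + ' ' + op:>20} {value:.2f}"


# Assumed input: one key length per row, all 24 bytes long.
key_lengths = [24, 24, 24]
print("Key length:")
for label, op, value in summarize(key_lengths):
    print(format_line(label, op, value))
```

Keeping every value in memory is the simplest choice for a sketch; for large files a sampling or streaming approach would be more appropriate.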

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
