[jira] [Updated] (CASSANDRA-8883) Percentile computation should use ceil not floor in EstimatedHistogram

2015-11-23 Thread Carl Yeksigian (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carl Yeksigian updated CASSANDRA-8883:
--
Component/s: Observability

> Percentile computation should use ceil not floor in EstimatedHistogram
> --
>
> Key: CASSANDRA-8883
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8883
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability
>Reporter: Chris Lohfink
>Assignee: Carl Yeksigian
>Priority: Minor
> Fix For: 2.1.5
>
> Attachments: 8883-2.1.txt
>
>
> When computing the pcount, Cassandra uses floor, and the comparison with the 
> accumulated element count is >=. Given a simple example with a total of five 
> elements:
> {code}
> // data
> [1, 1, 1, 1, 1]
> // offsets
> [1, 2, 3, 4, 5]
> {code}
> Cassandra would report the 50th percentile as 2, while 3 is the more 
> expected value. As a comparison, using numpy:
> {code}
> import numpy as np
> np.percentile(np.array([1, 2, 3, 4, 5]), 50)
> ==> 3.0
> {code}
> Percentile computation was added in CASSANDRA-4022 but is now used heavily in 
> the metrics Cassandra reports. I think it should err on the side of 
> overestimating instead of underestimating.
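>
> The effect is easy to reproduce with a minimal Python model of the bucketed 
> lookup described above (a sketch only, not Cassandra's actual Java code in 
> EstimatedHistogram):
> {code}
> import math
>
> # Illustrative model: pcount is rounded from total * percentile, and the scan
> # stops at the first bucket whose accumulated count is >= pcount.
> def bucket_percentile(offsets, counts, perc, rounder):
>     total = sum(counts)                      # 5 elements in the example above
>     pcount = rounder(total * perc / 100.0)   # elements the scan must cover
>     elements = 0
>     for offset, count in zip(offsets, counts):
>         elements += count
>         if elements >= pcount:
>             return offset
>     return offsets[-1]
>
> offsets = [1, 2, 3, 4, 5]
> counts  = [1, 1, 1, 1, 1]
> print(bucket_percentile(offsets, counts, 50, math.floor))  # 2 (current floor behaviour)
> print(bucket_percentile(offsets, counts, 50, math.ceil))   # 3 (with ceil instead)
> {code}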



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-8883) Percentile computation should use ceil not floor in EstimatedHistogram

2015-03-03 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis updated CASSANDRA-8883:
--
Reviewer: Chris Lohfink

[~cnlwsu] to review

> Percentile computation should use ceil not floor in EstimatedHistogram
> --
>
> Key: CASSANDRA-8883
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8883
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Chris Lohfink
>Assignee: Carl Yeksigian
>Priority: Minor
> Fix For: 2.1.4
>
> Attachments: 8883-2.1.txt
>
>
> When computing the pcount, Cassandra uses floor, and the comparison with the 
> accumulated element count is >=. Given a simple example with a total of five 
> elements:
> {code}
> // data
> [1, 1, 1, 1, 1]
> // offsets
> [1, 2, 3, 4, 5]
> {code}
> Cassandra would report the 50th percentile as 2, while 3 is the more 
> expected value. As a comparison, using numpy:
> {code}
> import numpy as np
> np.percentile(np.array([1, 2, 3, 4, 5]), 50)
> ==> 3.0
> {code}
> Percentile computation was added in CASSANDRA-4022 but is now used heavily in 
> the metrics Cassandra reports. I think it should err on the side of 
> overestimating instead of underestimating.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-8883) Percentile computation should use ceil not floor in EstimatedHistogram

2015-03-03 Thread Carl Yeksigian (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carl Yeksigian updated CASSANDRA-8883:
--
Attachment: 8883-2.1.txt

Since numpy has access to the original values, it provides interpolation 
between the points if the percentile isn't exactly on a boundary:
{code}
import numpy as np
np.percentile(np.array([1, 2, 3, 4, 5, 6]), 50)
==> 3.5
{code}
Since we only have the histogram, we don't know exactly where that value lands, 
so we just need to return a value inside the right range. Currently we are 
returning the end of the range before the one in which the percentile occurs.

I've changed EstimatedHistogram to use ceil instead of floor, and updated the 
tests accordingly.
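
Since the histogram can only ever return a bucket boundary, the whole change 
reduces to the rounding direction used when converting the percentile into an 
element count. A tiny Python sketch of the two choices (illustrative only; the 
real change is in the Java EstimatedHistogram in the attached patch):
{code}
import math

def pcount(total, percentile, rounder):
    # number of elements the bucket scan has to cover before it stops
    return int(rounder(total * percentile))

for total in (5, 6):
    print(total,
          pcount(total, 0.50, math.floor),   # behaviour before the patch
          pcount(total, 0.50, math.ceil))    # behaviour after the patch
# 5 -> floor 2, ceil 3: the scan now stops in the bucket that contains the median
# 6 -> floor 3, ceil 3: unchanged when total * percentile is already an integer
{code}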

> Percentile computation should use ceil not floor in EstimatedHistogram
> --
>
> Key: CASSANDRA-8883
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8883
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Chris Lohfink
>Assignee: Carl Yeksigian
>Priority: Minor
> Fix For: 2.1.4
>
> Attachments: 8883-2.1.txt
>
>
> When computing the pcount, Cassandra uses floor, and the comparison with the 
> accumulated element count is >=. Given a simple example with a total of five 
> elements:
> {code}
> // data
> [1, 1, 1, 1, 1]
> // offsets
> [1, 2, 3, 4, 5]
> {code}
> Cassandra would report the 50th percentile as 2, while 3 is the more 
> expected value. As a comparison, using numpy:
> {code}
> import numpy as np
> np.percentile(np.array([1, 2, 3, 4, 5]), 50)
> ==> 3.0
> {code}
> Percentile computation was added in CASSANDRA-4022 but is now used heavily in 
> the metrics Cassandra reports. I think it should err on the side of 
> overestimating instead of underestimating.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)