[ 
https://issues.apache.org/jira/browse/SPARK-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15610192#comment-15610192
 ] 

Zhenhua Wang commented on SPARK-18111:
--------------------------------------

[~srowen] The minimum is skipped not just once over the whole dataset, but 
once per partition.
For example, suppose we have two partitions of data: (1, 1, 3, 3) and 
(5, 5, 7, 7). After the global merge, the samples in QuantileSummaries are 
(1, 3, 3, 5, 7, 7), and the result returned by percentile_approx for the 
percentages (0.25, 0.5, 0.75) is (3.0, 5.0, 7.0), whereas the correct answer 
is (1.0, 3.0, 5.0). Of course we can say it's an approximate algorithm, but 
this error is already "beyond the error bound which the algo provides". We can 
also make the error even larger by constructing more such partitions, and thus 
more skipped minimum elements.
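
The arithmetic in this example can be reproduced with a small toy model. This 
is only a sketch and not Spark's QuantileSummaries implementation: it assumes 
that each partition loses exactly one copy of a duplicated minimum, that the 
total record count still includes the lost records, and that a percentile is 
answered by a nearest-rank lookup. The helper names drop_one_minimum and 
nearest_rank are hypothetical.

```python
import math

def drop_one_minimum(partition):
    """Toy model of the bug: lose one copy of a duplicated partition minimum."""
    values = sorted(partition)
    if len(values) > 1 and values[0] == values[1]:
        return values[1:]
    return values

def nearest_rank(sorted_samples, total_count, p):
    """Return the sample at rank ceil(p * total_count), clamped to the samples kept."""
    rank = max(1, math.ceil(p * total_count))
    return float(sorted_samples[min(rank, len(sorted_samples)) - 1])

partitions = [[1, 1, 3, 3], [5, 5, 7, 7]]
percentages = [0.25, 0.5, 0.75]

# Assumption: the count is taken over all records, including the skipped ones.
total = sum(len(p) for p in partitions)

exact_samples = sorted(v for p in partitions for v in p)
lossy_samples = sorted(v for p in partitions for v in drop_one_minimum(p))

exact = [nearest_rank(exact_samples, total, p) for p in percentages]
lossy = [nearest_rank(lossy_samples, total, p) for p in percentages]

print("samples after merge:", lossy_samples)  # [1, 3, 3, 5, 7, 7]
print("expected:", exact)                     # [1.0, 3.0, 5.0]
print("returned:", lossy)                     # [3.0, 5.0, 7.0]
```

Under the same assumptions, a single partition (1, 1, 2, 2) yields 2.0 for 
percentile 0.5, which matches the result in the issue description.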

> Wrong ApproximatePercentile answer when multiple records have the minimum 
> value
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-18111
>                 URL: https://issues.apache.org/jira/browse/SPARK-18111
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.1
>            Reporter: Zhenhua Wang
>
> When multiple records share the minimum value, ApproximatePercentile 
> returns a wrong answer.
> For example, the following query returns 2.0 for percentile 0.5, but the 
> correct answer is 1.0:
> 0: jdbc:hive2://localhost:10000> select key from src2;
> +------+--+
> | key  |
> +------+--+
> | 1    |
> | 1    |
> | 2    |
> | 2    |
> +------+--+
> 4 rows selected (0.185 seconds)
> 0: jdbc:hive2://localhost:10000> select percentile_approx(key, array(0.5)) 
> from src2;
> +------------------------------------------------------------+--+
> | percentile_approx(CAST(key AS DOUBLE), array(0.5), 10000)  |
> +------------------------------------------------------------+--+
> | [2.0]                                                      |
> +------------------------------------------------------------+--+
> 1 row selected (0.292 seconds)


