[ https://issues.apache.org/jira/browse/SPARK-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611229#comment-15611229 ]
Zhenhua Wang commented on SPARK-18111:
--------------------------------------

Yes, it returns '5'. After the final merge, the samples in QuantileSummaries are (1, 2, 3, 4, 5), and the rank for the 0.5 percentile is computed as quantile * count, i.e. 0.5 * 12 = 6. That rank is larger than the number of samples (5), in which case the query just returns the last sample.

> Wrong ApproximatePercentile answer when multiple records have the minimum value
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-18111
>                 URL: https://issues.apache.org/jira/browse/SPARK-18111
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.1
>            Reporter: Zhenhua Wang
>
> When multiple records have the minimum value, the answer of ApproximatePercentile is wrong.
> Suppose we have a table with 12 records and 4 partitions, and the values of column "col" in these partitions are:
> 1, 1, 2
> 1, 1, 3
> 1, 1, 4
> 1, 1, 5
> If we query percentile_approx(col, array(0.5)), the current answer is "5", which is far from the correct answer "1".
> The test case is as below:
> {code}
> test("percentile_approx, multiple records with the minimum value in a partition") {
>   withTempView(table) {
>     spark.sparkContext.makeRDD(Seq(1, 1, 2, 1, 1, 3, 1, 1, 4, 1, 1, 5), 4).toDF("col")
>       .createOrReplaceTempView(table)
>     checkAnswer(
>       spark.sql(s"SELECT percentile_approx(col, array(0.5)) FROM $table"),
>       Row(Seq(1.0D))
>     )
>   }
> }
> {code}
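
To make the rank arithmetic in the comment concrete, here is a minimal Scala sketch. It is a toy model of the behaviour described above, not Spark's actual QuantileSummaries code: the object `RankSketch` and the helper `naiveQuery` are hypothetical names, and treating the rank as a 1-based position (rounded up with `ceil`) is an assumption made for illustration.

```scala
// Toy model of the lookup described in the comment; NOT Spark's QuantileSummaries code.
object RankSketch {
  // Hypothetical helper: computes rank = quantile * count, treats it as a
  // 1-based position in `samples`, and falls back to the last sample when
  // the rank runs past the end of the sequence.
  def naiveQuery(samples: Seq[Double], count: Long, quantile: Double): Double = {
    val rank = math.ceil(quantile * count).toInt // 0.5 * 12 = 6 in the reported case
    if (rank > samples.length) samples.last
    else samples(math.max(rank - 1, 0))
  }

  def main(args: Array[String]): Unit = {
    // The 12 input values from the issue, in partition order.
    val original = Seq(1, 1, 2, 1, 1, 3, 1, 1, 4, 1, 1, 5).map(_.toDouble)
    // The samples that remain after the final merge, per the comment.
    val merged = Seq(1.0, 2.0, 3.0, 4.0, 5.0)

    // Rank 6 exceeds the 5 merged samples, so the last sample is returned.
    println(naiveQuery(merged, original.length, 0.5))          // prints 5.0
    // Against all 12 sorted values, rank 6 lands on the expected answer.
    println(naiveQuery(original.sorted, original.length, 0.5)) // prints 1.0
  }
}
```

The two calls differ only in which sequence the rank is applied to: the rank is computed from the full count (12), while the merged summary holds only 5 samples, which is why the lookup falls through to the last one.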