L. C. Hsieh updated SPARK-32306:
--------------------------------
    Issue Type: Documentation  (was: Bug)

> `approx_percentile` in Spark SQL gives incorrect results
> --------------------------------------------------------
>
>                 Key: SPARK-32306
>                 URL: https://issues.apache.org/jira/browse/SPARK-32306
>             Project: Spark
>          Issue Type: Documentation
>          Components: PySpark, SQL
>    Affects Versions: 2.4.4, 3.0.0, 3.1.0
>            Reporter: Sean Malory
>            Assignee: Maxim Gekk
>            Priority: Major
>             Fix For: 3.1.0
>
>
> The `approx_percentile` function in Spark SQL does not give the correct
> result. I'm not sure how incorrect it is; it may just be a boundary issue.
> From the docs:
> {quote}The accuracy parameter (default: 10000) is a positive numeric literal
> which controls approximation accuracy at the cost of memory. Higher value of
> accuracy yields better accuracy, 1.0/accuracy is the relative error of the
> approximation.
> {quote}
> This is not true. Here is a minimal example in `pyspark` where, essentially,
> the median of 5 and 8 is calculated as 5:
> {code:python}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as psf
>
> # Works in a standalone script as well as in the pyspark shell.
> spark = SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(
>     [('bar', 5), ('bar', 8)], ['name', 'val']
> )
> # Maximum accuracy setting, so approximation error should be negligible.
> median = psf.expr('percentile_approx(val, 0.5, 2147483647)')
> df.groupBy('name').agg(median.alias('median')).show()  # gives the median as 5
> {code}
> I've tested this with Spark v2.4.4 and pyspark v2.4.5, although I suspect this
> is an issue with the underlying algorithm.
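A note on why this was reclassified from Bug to Documentation: `percentile_approx` does not interpolate between data points the way an exact median does; it returns an actual element of the column, so 5 is an acceptable answer for the 0.5 percentile of {5, 8}. Below is a minimal plain-Python sketch of a non-interpolating selection rule of this kind. It is an illustration only, not Spark's actual Greenwald-Khanna implementation, and Spark's exact boundary semantics may differ.

{code:python}
import math

def non_interpolating_percentile(values, fraction):
    # Return the smallest element v of `values` such that at least
    # `fraction` of all values are <= v. No interpolation is performed,
    # so the result is always a member of `values`.
    ordered = sorted(values)
    n = len(ordered)
    # Smallest 1-based rank k with k / n >= fraction.
    k = max(1, math.ceil(fraction * n))
    return ordered[k - 1]

print(non_interpolating_percentile([5, 8], 0.5))     # 5, matching Spark
print(non_interpolating_percentile([1, 2, 3], 0.5))  # 2
{code}

An exact interpolating median of [5, 8] would be 6.5; an approximate, element-returning percentile trades that precision for bounded memory, which appears to be what the documentation fix slated for 3.1.0 clarifies.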