[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results
[ https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200564#comment-17200564 ]

Sean Malory commented on SPARK-32306:
-------------------------------------

Thank you.

> `approx_percentile` in Spark SQL gives incorrect results
> --------------------------------------------------------
>
>                 Key: SPARK-32306
>                 URL: https://issues.apache.org/jira/browse/SPARK-32306
>             Project: Spark
>          Issue Type: Documentation
>          Components: PySpark, SQL
>    Affects Versions: 2.4.4, 3.0.0, 3.1.0
>            Reporter: Sean Malory
>            Assignee: Maxim Gekk
>            Priority: Major
>             Fix For: 3.1.0
>
>
> The `approx_percentile` function in Spark SQL does not give the correct
> result. I'm not sure how incorrect it is; it may just be a boundary issue.
> From the docs:
> {quote}The accuracy parameter (default: 1) is a positive numeric literal
> which controls approximation accuracy at the cost of memory. Higher value of
> accuracy yields better accuracy, 1.0/accuracy is the relative error of the
> approximation.
> {quote}
> This is not true. Here is a minimum example in `pyspark` where, essentially,
> the median of 5 and 8 is being calculated as 5:
> {code:python}
> import pyspark.sql.functions as psf
> df = spark.createDataFrame(
>     [('bar', 5), ('bar', 8)], ['name', 'val']
> )
> median = psf.expr('percentile_approx(val, 0.5, 2147483647)')
> df.groupBy('name').agg(median.alias('median'))  # gives the median as 5
> {code}
> I've tested this with Spark v2.4.4, pyspark v2.4.5 - although I suspect this
> is an issue with the underlying algorithm.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results
[ https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199846#comment-17199846 ]

Sean Malory commented on SPARK-32306:
-------------------------------------

[~maxgekk], thanks for the definition. Can we please update the docs to state that this is how it's being calculated?
[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results
[ https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199844#comment-17199844 ]

Sean Malory commented on SPARK-32306:
-------------------------------------

Exactly; you should get the median, which is defined, almost universally, as the average of the middle two numbers if there is an even number of elements in the list.

As you've hinted at, it doesn't really matter. If you decide that the percentile should always give you the lower of the two numbers (as it appears to do), that's fine, but I think it should be documented as such.

The way this actually came about was me creating a median function and then testing that the function was doing the right thing by comparing it with the `pandas` equivalent:

{code:python}
import numpy as np
import pandas as pd
import pyspark.sql.functions as psf

# Spark-side "median": the 50th percentile at maximum accuracy
median = psf.expr('percentile_approx(val, 0.5, 2147483647)')

# Two groups of ten random values each (an even count per group)
xs = np.random.rand(10)
ys = np.random.rand(10)
data = [('foo', float(x)) for x in xs] + [('bar', float(y)) for y in ys]

# Per-group median as computed by Spark
sparkdf = spark.createDataFrame(data, ['name', 'val'])
spark_meds = sparkdf.groupBy('name').agg(median.alias('median'))

# Per-group median as computed by pandas, for comparison
pddf = pd.DataFrame(data, columns=['name', 'val'])
pd_meds = pddf.groupby('name')['val'].median()
{code}
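The two conventions discussed in this comment can be shown without Spark. The following is a minimal sketch using only Python's standard `statistics` module; it illustrates the two definitions and is not Spark's implementation. `median` averages the two middle values, while `median_low` returns the lower one, which is the behaviour this thread reports for `percentile_approx`.

```python
import statistics

values = [5, 8]

# Conventional median: average of the two middle values for an even count.
interpolated = statistics.median(values)

# "Low" median: the smaller of the two middle values, the behaviour
# this thread reports for percentile_approx(val, 0.5, ...).
lower = statistics.median_low(values)

print(interpolated, lower)
```

Both are defensible definitions; the request in this thread is only that the documentation say which one is used.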
[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results
[ https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199817#comment-17199817 ]

Sean Malory commented on SPARK-32306:
-------------------------------------

I expect the result to be (5 + 8) / 2 = 6.5.
[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results
[ https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158992#comment-17158992 ]

Sean Malory commented on SPARK-32306:
-------------------------------------

Thanks Ankit.
[jira] [Created] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results
Sean Malory created SPARK-32306:
-----------------------------------

             Summary: `approx_percentile` in Spark SQL gives incorrect results
                 Key: SPARK-32306
                 URL: https://issues.apache.org/jira/browse/SPARK-32306
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 2.4.4
            Reporter: Sean Malory


The `approx_percentile` function in Spark SQL does not give the correct result. I'm not sure how incorrect it is; it may just be a boundary issue. From the docs:
{quote}The accuracy parameter (default: 1) is a positive numeric literal which controls approximation accuracy at the cost of memory. Higher value of accuracy yields better accuracy, 1.0/accuracy is the relative error of the approximation.
{quote}
This is not true. Here is a minimum example in `pyspark` where, essentially, the median of 5 and 8 is being calculated as 5:
{code:python}
import pyspark.sql.functions as psf
df = spark.createDataFrame(
    [('bar', 5), ('bar', 8)], ['name', 'val']
)
median = psf.expr('percentile_approx(val, 0.5, 2147483647)')
df.groupBy('name').agg(median.alias('median'))  # gives the median as 5
{code}
I've tested this with Spark v2.4.4, pyspark v2.4.5 - although I suspect this is an issue with the underlying algorithm.
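For scale, the arithmetic behind the quoted accuracy statement can be worked through directly. This sketch assumes (an assumption, not something stated in the report) that the relative error bounds the rank of the returned element, so that for N rows the rank may be off by at most N / accuracy. The variable names are illustrative only.

```python
# Accuracy value used in the report (the largest 32-bit signed integer).
accuracy = 2147483647

# Number of rows in the 'bar' group of the minimal example.
n_rows = 2

# Relative error as described in the quoted documentation.
relative_error = 1.0 / accuracy

# ASSUMPTION: the error is measured in rank, so the returned element's
# rank can deviate from the target rank by at most n_rows * relative_error.
max_rank_error = n_rows * relative_error

print(f"relative error: {relative_error:.3e}")
print(f"max rank error: {max_rank_error:.3e}")
```

Under that reading, two rows at this accuracy leave essentially no room for approximation error, which suggests the 5 versus 6.5 discrepancy comes from how the percentile is defined rather than from the approximation itself.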