[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results

2020-09-22 Thread Sean Malory (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200564#comment-17200564 ]

Sean Malory commented on SPARK-32306:
--------------------------------------

Thank you.

> `approx_percentile` in Spark SQL gives incorrect results
> --------------------------------------------------------
>
>                 Key: SPARK-32306
>                 URL: https://issues.apache.org/jira/browse/SPARK-32306
>             Project: Spark
>          Issue Type: Documentation
>          Components: PySpark, SQL
>    Affects Versions: 2.4.4, 3.0.0, 3.1.0
>            Reporter: Sean Malory
>            Assignee: Maxim Gekk
>            Priority: Major
>             Fix For: 3.1.0






[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results

2020-09-21 Thread Sean Malory (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199846#comment-17199846 ]

Sean Malory commented on SPARK-32306:
--------------------------------------

[~maxgekk], thanks for the definition. Can we please update the docs to state
that this is how it's calculated?
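
For anyone landing here later, this is my reading of that definition, as a
minimal sketch (the helper is my own, not Spark code): the result is the
smallest value v in the sorted input such that at least `percentage` of the
values are less than or equal to v, with no interpolation between neighbours.

{code:python}
def rank_based_percentile(values, percentage):
    # Smallest value v (in sorted order) whose cumulative share of the data
    # reaches `percentage`; an element of the input is always returned.
    ordered = sorted(values)
    n = len(ordered)
    for i, v in enumerate(ordered, start=1):
        if i / n >= percentage:
            return v
    return ordered[-1]

print(rank_based_percentile([5, 8], 0.5))        # 5, matching percentile_approx
print(rank_based_percentile([1, 2, 3, 4], 0.5))  # 2, never (2 + 3) / 2
{code}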







[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results

2020-09-21 Thread Sean Malory (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199844#comment-17199844 ]

Sean Malory commented on SPARK-32306:
--------------------------------------

Exactly; you should get the median, which is almost universally defined as the
average of the two middle numbers when the list has an even number of elements.

As you've hinted at, it doesn't really matter. If you decide that the 
percentile should always give you the lower of the two numbers (as it appears 
to do), that's fine, but I think it should be documented as such.
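
For comparison, both conventions are in the Python standard library; a quick
illustration (nothing Spark-specific here):

{code:python}
import statistics

values = [5, 8]

# The near-universal definition: average the two middle values.
print(statistics.median(values))      # 6.5

# The "lower of the two" convention that percentile_approx appears to follow.
print(statistics.median_low(values))  # 5
{code}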

This actually came about when I wrote a median function and then tested that it
was doing the right thing by comparing it with the `pandas` equivalent:

{code:python}
import numpy as np
import pandas as pd
import pyspark.sql.functions as psf

# Maximum accuracy, so the approximation error should be negligible.
median = psf.expr('percentile_approx(val, 0.5, 2147483647)')

xs = np.random.rand(10)
ys = np.random.rand(10)
data = [('foo', float(x)) for x in xs] + [('bar', float(y)) for y in ys]

sparkdf = spark.createDataFrame(data, ['name', 'val'])
spark_meds = sparkdf.groupBy('name').agg(median.alias('median'))
spark_meds.show()

pddf = pd.DataFrame(data, columns=['name', 'val'])
pd_meds = pddf.groupby('name')['val'].median()
print(pd_meds)
{code}
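
With ten samples per group, pandas averages the fifth and sixth ordered values
while `percentile_approx` returns an actual element of the column, so
`spark_meds` and `pd_meds` disagree on almost every run.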







[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results

2020-09-21 Thread Sean Malory (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199817#comment-17199817 ]

Sean Malory commented on SPARK-32306:
--------------------------------------

I expect the result to be (5 + 8) / 2 = 6.5.
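
That follows from the usual linear-interpolation rule; spelling the arithmetic
out (just plain Python, not Spark):

{code:python}
values = sorted([5, 8])
# The 0.5 quantile sits at fractional position 0.5 * (n - 1) = 0.5,
# i.e. halfway between the two values: 5 + 0.5 * (8 - 5) = 6.5.
median = values[0] + 0.5 * (values[1] - values[0])
print(median)  # 6.5
{code}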







[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results

2020-07-16 Thread Sean Malory (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158992#comment-17158992 ]

Sean Malory commented on SPARK-32306:
--------------------------------------

Thanks Ankit.







[jira] [Created] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results

2020-07-14 Thread Sean Malory (Jira)
Sean Malory created SPARK-32306:
--------------------------------

             Summary: `approx_percentile` in Spark SQL gives incorrect results
                 Key: SPARK-32306
                 URL: https://issues.apache.org/jira/browse/SPARK-32306
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 2.4.4
            Reporter: Sean Malory


The `approx_percentile` function in Spark SQL does not give the correct result. 
I'm not sure how incorrect it is; it may just be a boundary issue. From the 
docs:
{quote}The accuracy parameter (default: 10000) is a positive numeric literal
which controls approximation accuracy at the cost of memory. Higher value of
accuracy yields better accuracy, 1.0/accuracy is the relative error of the
approximation.
{quote}
This is not true. Here is a minimal example in `pyspark` where, essentially,
the median of 5 and 8 is calculated as 5:
{code:python}
import pyspark.sql.functions as psf

df = spark.createDataFrame(
    [('bar', 5), ('bar', 8)], ['name', 'val']
)
# accuracy is set to Int.MaxValue, the largest value the function accepts
median = psf.expr('percentile_approx(val, 0.5, 2147483647)')

df.groupBy('name').agg(median.alias('median')).show()  # gives the median as 5
{code}
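
For scale, with accuracy at its maximum the quoted contract promises an
essentially exact answer; my own arithmetic, not from the docs:

{code:python}
accuracy = 2147483647            # Int.MaxValue
relative_error = 1.0 / accuracy  # ~4.66e-10, per the quoted docs

# The contract allows the rank of the returned value to be off by at most
# relative_error * N. With N = 2 rows, that is nowhere near a whole rank
# position, so approximation error cannot explain the answer of 5.
print(relative_error * 2)        # ~9.3e-10
{code}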
I've tested this with Spark v2.4.4 and pyspark v2.4.5, although I suspect this
is an issue with the underlying algorithm.


