[jira] [Updated] (SPARK-32908) percentile_approx() returns incorrect results

2020-09-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32908:
----------------------------------
    Affects Version/s: 2.3.4

> percentile_approx() returns incorrect results
> ---------------------------------------------
>
>                 Key: SPARK-32908
>                 URL: https://issues.apache.org/jira/browse/SPARK-32908
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.4, 2.4.7, 3.0.1, 3.1.0
>            Reporter: Maxim Gekk
>            Assignee: Maxim Gekk
>            Priority: Major
>              Labels: correctness
>             Fix For: 2.4.8, 3.0.2, 3.1.0
>
>         Attachments: percentile_approx-input.csv
>
>
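> Read input data from the attached CSV file:
> {code:scala}
>   // Hypothetical view name: the report does not show how `table` was defined.
>   val table = "percentile_approx_input"
>   val df = spark.read.option("header", "true")
>     .option("inferSchema", "true")
>     .csv("/Users/maximgekk/tmp/percentile_approx-input.csv")
>     .repartition(1)
>   df.createOrReplaceTempView(table)
> {code}
> Calculate the 0.77 percentile with a relative error of 1e-05 (accuracy = 100000):
> {code:scala}
>   spark.sql(
>     s"""SELECT
>        |  percentile_approx(tr_rat_resampling_score, 0.77, 100000)
>        |FROM $table
>        """.stripMargin).show
> {code}
> {code}
> +------------------------------------------------------------------------+
> |percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 100000)|
> +------------------------------------------------------------------------+
> |                                                                    1000|
> +------------------------------------------------------------------------+
> {code}
> The same with a larger relative error of 0.001 (accuracy = 1000):
> {code}
> +----------------------------------------------------------------------+
> |percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 1000)|
> +----------------------------------------------------------------------+
> |                                                                    18|
> +----------------------------------------------------------------------+
> {code}
> and with a smaller relative error of 1e-06 (accuracy = 1000000):
> {code}
> +-------------------------------------------------------------------------+
> |percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 1000000)|
> +-------------------------------------------------------------------------+
> |                                                                       17|
> +-------------------------------------------------------------------------+
> {code}
> For a relative error of 1e-05, the result should be around 17-18, not 1000.
> Here is the percentile calculation in Google Sheets for the same input:
> https://docs.google.com/spreadsheets/d/1Y1i4Td6s9jZQ-bD4IRTESLXP3UxKpqJSXGtmx0Q5TA0/edit?usp=sharing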
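
An exact percentile is a useful cross-check that does not depend on percentile_approx at all. The sketch below, in plain Scala without Spark, sorts the values and interpolates linearly between neighbouring ranks, which is assumed here to be comparable to the spreadsheet calculation linked above. `ExactPercentile` and `exactPercentile` are hypothetical names, and the sample values are made up for illustration; they are not taken from the attached CSV.

```scala
object ExactPercentile {
  /** Exact percentile with linear interpolation between the closest ranks.
    * `p` is a fraction in [0, 1]; `values` must be non-empty. */
  def exactPercentile(values: Seq[Double], p: Double): Double = {
    require(values.nonEmpty, "values must be non-empty")
    require(p >= 0.0 && p <= 1.0, "p must be in [0, 1]")
    val sorted = values.toArray
    java.util.Arrays.sort(sorted)
    // Fractional 0-based position of the requested percentile.
    val pos = p * (sorted.length - 1)
    val lower = math.floor(pos).toInt
    val upper = math.ceil(pos).toInt
    val frac = pos - lower
    // Interpolate between the two neighbouring sorted values.
    sorted(lower) + frac * (sorted(upper) - sorted(lower))
  }

  def main(args: Array[String]): Unit = {
    // Illustrative data only (not the attached CSV).
    val sample = Seq(40.0, 10.0, 30.0, 20.0)
    println(exactPercentile(sample, 0.25))
  }
}
```

Running the same function over the `tr_rat_resampling_score` column (for example after collecting it to the driver) would give a reference value to compare against each percentile_approx result.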



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32908) percentile_approx() returns incorrect results

2020-09-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32908:
----------------------------------
    Affects Version/s:     (was: 3.0.2)
                           (was: 2.4.8)
                       2.4.7
                       3.0.1

> percentile_approx() returns incorrect results
> ---------------------------------------------
>
>                 Key: SPARK-32908
>                 URL: https://issues.apache.org/jira/browse/SPARK-32908
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.7, 3.0.1, 3.1.0
>            Reporter: Maxim Gekk
>            Assignee: Maxim Gekk
>            Priority: Major
>              Labels: correctness
>             Fix For: 2.4.8, 3.0.2, 3.1.0
>
>         Attachments: percentile_approx-input.csv






[jira] [Updated] (SPARK-32908) percentile_approx() returns incorrect results

2020-09-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32908:
----------------------------------
    Labels: correctness  (was: )

> percentile_approx() returns incorrect results
> ---------------------------------------------
>
>                 Key: SPARK-32908
>                 URL: https://issues.apache.org/jira/browse/SPARK-32908
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.8, 3.0.2, 3.1.0
>            Reporter: Maxim Gekk
>            Assignee: Maxim Gekk
>            Priority: Major
>              Labels: correctness
>             Fix For: 2.4.8, 3.0.2, 3.1.0
>
>         Attachments: percentile_approx-input.csv






[jira] [Updated] (SPARK-32908) percentile_approx() returns incorrect results

2020-09-17 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-32908:
-------------------------------
Description: 
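Read input data from the attached CSV file:
{code:scala}
  // Hypothetical view name: the report does not show how `table` was defined.
  val table = "percentile_approx_input"
  val df = spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("/Users/maximgekk/tmp/percentile_approx-input.csv")
    .repartition(1)
  df.createOrReplaceTempView(table)
{code}
Calculate the 0.77 percentile with a relative error of 1e-05 (accuracy = 100000):
{code:scala}
  spark.sql(
    s"""SELECT
       |  percentile_approx(tr_rat_resampling_score, 0.77, 100000)
       |FROM $table
       """.stripMargin).show
{code}
{code}
+------------------------------------------------------------------------+
|percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 100000)|
+------------------------------------------------------------------------+
|                                                                    1000|
+------------------------------------------------------------------------+
{code}
The same with a larger relative error of 0.001 (accuracy = 1000):
{code}
+----------------------------------------------------------------------+
|percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 1000)|
+----------------------------------------------------------------------+
|                                                                    18|
+----------------------------------------------------------------------+
{code}
and with a smaller relative error of 1e-06 (accuracy = 1000000):
{code}
+-------------------------------------------------------------------------+
|percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 1000000)|
+-------------------------------------------------------------------------+
|                                                                       17|
+-------------------------------------------------------------------------+
{code}

For a relative error of 1e-05, the result should be around 17-18, not 1000.

Here is the percentile calculation in Google Sheets for the same input:
https://docs.google.com/spreadsheets/d/1Y1i4Td6s9jZQ-bD4IRTESLXP3UxKpqJSXGtmx0Q5TA0/edit?usp=sharing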

  was:
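Read input data from the attached CSV file:
{code:scala}
  // Hypothetical view name: the report does not show how `table` was defined.
  val table = "percentile_approx_input"
  val df = spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("/Users/maximgekk/tmp/tr_rat_resampling_score.csv")
    .repartition(1)
  df.createOrReplaceTempView(table)
{code}
Calculate the 0.77 percentile with a relative error of 1e-05 (accuracy = 100000):
{code:scala}
  spark.sql(
    s"""SELECT
       |  percentile_approx(tr_rat_resampling_score, 0.77, 100000)
       |FROM $table
       """.stripMargin).show
{code}
{code}
+------------------------------------------------------------------------+
|percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 100000)|
+------------------------------------------------------------------------+
|                                                                    1000|
+------------------------------------------------------------------------+
{code}
The same with a larger relative error of 0.001 (accuracy = 1000):
{code}
+----------------------------------------------------------------------+
|percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 1000)|
+----------------------------------------------------------------------+
|                                                                    18|
+----------------------------------------------------------------------+
{code}
and with a smaller relative error of 1e-06 (accuracy = 1000000):
{code}
+-------------------------------------------------------------------------+
|percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 1000000)|
+-------------------------------------------------------------------------+
|                                                                       17|
+-------------------------------------------------------------------------+
{code}

For a relative error of 1e-05, the result should be around 17-18, not 1000.

Here is the percentile calculation in Google Sheets for the same input:
https://docs.google.com/spreadsheets/d/1Y1i4Td6s9jZQ-bD4IRTESLXP3UxKpqJSXGtmx0Q5TA0/edit?usp=sharing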
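
The third argument of percentile_approx is documented as `accuracy`, with 1.0/accuracy being the relative error of the approximation. The sketch below, in plain Scala without Spark, converts an accuracy value into that error and into the band of ranks an answer would be expected to fall in. It assumes that the rank guarantee Spark documents for approxQuantile, floor((p - err) * N) <= rank(x) <= ceil((p + err) * N), also applies here. `AccuracyBounds`, `relativeError` and `rankBand` are hypothetical names, and the row count is a made-up figure, not the size of the attached CSV.

```scala
object AccuracyBounds {
  /** Relative error implied by the `accuracy` argument: 1.0 / accuracy. */
  def relativeError(accuracy: Long): Double = {
    require(accuracy > 0, "accuracy must be positive")
    1.0 / accuracy
  }

  /** Inclusive band of 1-based ranks allowed by the assumed guarantee,
    * clamped to [1, n], for percentile `p` over `n` rows. */
  def rankBand(p: Double, accuracy: Long, n: Long): (Long, Long) = {
    require(p >= 0.0 && p <= 1.0, "p must be in [0, 1]")
    require(n > 0, "n must be positive")
    val err = relativeError(accuracy)
    val lo = math.max(1L, math.floor((p - err) * n).toLong)
    val hi = math.min(n, math.ceil((p + err) * n).toLong)
    (lo, hi)
  }

  def main(args: Array[String]): Unit = {
    val n = 1000000L // hypothetical row count
    for (accuracy <- Seq(1000L, 100000L, 1000000L)) {
      val (lo, hi) = rankBand(0.77, accuracy, n)
      println(s"accuracy=$accuracy relativeError=${relativeError(accuracy)} ranks=[$lo, $hi]")
    }
  }
}
```

Under that assumption a higher accuracy only narrows the band around the same rank, so the answer for accuracy 100000 should lie between the answers for 1000 and 1000000 rather than far outside them.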


> percentile_approx() returns incorrect results
> ---------------------------------------------
>
>                 Key: SPARK-32908
>                 URL: https://issues.apache.org/jira/browse/SPARK-32908
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.8, 3.0.2, 3.1.0
>            Reporter: Maxim Gekk
>            Priority: Major
>         Attachments: percentile_approx-input.csv

[jira] [Updated] (SPARK-32908) percentile_approx() returns incorrect results

2020-09-17 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-32908:
-------------------------------
Attachment: percentile_approx-input.csv

> percentile_approx() returns incorrect results
> ---------------------------------------------
>
>                 Key: SPARK-32908
>                 URL: https://issues.apache.org/jira/browse/SPARK-32908
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.8, 3.0.2, 3.1.0
>            Reporter: Maxim Gekk
>            Priority: Major
>         Attachments: percentile_approx-input.csv
>
>
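> Read input data from the attached CSV file:
> {code:scala}
>   // Hypothetical view name: the report does not show how `table` was defined.
>   val table = "percentile_approx_input"
>   val df = spark.read.option("header", "true")
>     .option("inferSchema", "true")
>     .csv("/Users/maximgekk/tmp/tr_rat_resampling_score.csv")
>     .repartition(1)
>   df.createOrReplaceTempView(table)
> {code}
> Calculate the 0.77 percentile with a relative error of 1e-05 (accuracy = 100000):
> {code:scala}
>   spark.sql(
>     s"""SELECT
>        |  percentile_approx(tr_rat_resampling_score, 0.77, 100000)
>        |FROM $table
>        """.stripMargin).show
> {code}
> {code}
> +------------------------------------------------------------------------+
> |percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 100000)|
> +------------------------------------------------------------------------+
> |                                                                    1000|
> +------------------------------------------------------------------------+
> {code}
> The same with a larger relative error of 0.001 (accuracy = 1000):
> {code}
> +----------------------------------------------------------------------+
> |percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 1000)|
> +----------------------------------------------------------------------+
> |                                                                    18|
> +----------------------------------------------------------------------+
> {code}
> and with a smaller relative error of 1e-06 (accuracy = 1000000):
> {code}
> +-------------------------------------------------------------------------+
> |percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 1000000)|
> +-------------------------------------------------------------------------+
> |                                                                       17|
> +-------------------------------------------------------------------------+
> {code}
> For a relative error of 1e-05, the result should be around 17-18, not 1000.
> Here is the percentile calculation in Google Sheets for the same input:
> https://docs.google.com/spreadsheets/d/1Y1i4Td6s9jZQ-bD4IRTESLXP3UxKpqJSXGtmx0Q5TA0/edit?usp=sharing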


