[ https://issues.apache.org/jira/browse/SPARK-32908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maxim Gekk updated SPARK-32908:
-------------------------------
    Description: 
Read input data from the attached CSV file:
{code:scala}
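      // Hypothetical view name: the report does not show how `table` is defined.
      val table = "percentile_approx_input"
      // Read the CSV with a header row and inferred column types into a single partition.
      val df = spark.read.option("header", "true")
        .option("inferSchema", "true")
        .csv("/Users/maximgekk/tmp/percentile_approx-input.csv")
        .repartition(1)
      df.createOrReplaceTempView(table)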
{code}
Calculate the 0.77 percentile with accuracy 100000 (relative error 1e-05):
{code:scala}
      spark.sql(
        s"""SELECT
           |  percentile_approx(tr_rat_resampling_score, 0.77, 100000)
           |FROM $table
           """.stripMargin).show
{code}
{code}
+------------------------------------------------------------------------+
|percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 100000)|
+------------------------------------------------------------------------+
|                                                                    1000|
+------------------------------------------------------------------------+
{code}
The same query with a lower accuracy of 1000 (relative error 0.001):
{code}
+----------------------------------------------------------------------+
|percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 1000)|
+----------------------------------------------------------------------+
|                                                                    18|
+----------------------------------------------------------------------+
{code} 
And with a higher accuracy of 1000000 (relative error 1e-06):
{code}
+-------------------------------------------------------------------------+
|percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 1000000)|
+-------------------------------------------------------------------------+
|                                                                       17|
+-------------------------------------------------------------------------+
{code}

With accuracy 100000 (relative error 1e-05), the result should be around 17-18, not 1000.
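
The expectation above follows from how the accuracy argument is documented in Spark SQL: 1.0 / accuracy is the relative error of the approximation. The sketch below only illustrates that relationship, under the assumption that the error is measured in rank (position in the sorted input). The object and method names are hypothetical and are not Spark APIs.

```scala
// Illustrative only: ApproxBound and its methods are hypothetical names, not Spark APIs.
object ApproxBound {
  /** Relative error implied by the `accuracy` argument: 1.0 / accuracy. */
  def relativeError(accuracy: Long): Double = {
    require(accuracy > 0, "accuracy must be positive")
    1.0 / accuracy
  }

  /** Largest rank deviation tolerated for n rows, assuming a rank-based error. */
  def maxRankError(n: Long, accuracy: Long): Double = {
    require(accuracy > 0, "accuracy must be positive")
    n.toDouble / accuracy
  }

  /**
   * True if `approx` sits at some rank within maxRankError of the target rank p * n.
   * Ranks are 1-based positions in the sorted input. A value absent from the input
   * is treated as sitting between its two neighbours.
   */
  def withinBound(values: Seq[Double], p: Double, accuracy: Long, approx: Double): Boolean = {
    val n = values.length
    val target = p * n
    val eps = maxRankError(n, accuracy)
    val below = values.count(_ < approx)      // rows strictly smaller than approx
    val atOrBelow = values.count(_ <= approx) // rows smaller than or equal to approx
    // approx covers ranks (below + 1) .. atOrBelow; test overlap with [target - eps, target + eps]
    below + 1 <= target + eps && atOrBelow >= target - eps
  }
}
```

Under this reading, a higher accuracy narrows the band of acceptable ranks, so a result that is acceptable at accuracy 1000 and 1000000 but far away at 100000 cannot be explained by the approximation error alone.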

Here is the percentile calculation in Google Sheets for the same input:
https://docs.google.com/spreadsheets/d/1Y1i4Td6s9jZQ-bD4IRTESLXP3UxKpqJSXGtmx0Q5TA0/edit?usp=sharing
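
As a local alternative to the spreadsheet, an exact percentile can be computed by sorting the values. This is a minimal sketch using the nearest-rank definition, which is an assumption: spreadsheet percentile functions typically interpolate between neighbouring values, so the two may differ slightly. ExactPercentile is a hypothetical helper, not part of Spark.

```scala
// Illustrative only: ExactPercentile is a hypothetical helper, not a Spark API.
object ExactPercentile {
  /** Nearest-rank percentile: the value at 1-based rank ceil(p * n) in sorted order. */
  def nearestRank(values: Seq[Double], p: Double): Double = {
    require(values.nonEmpty, "values must not be empty")
    require(p >= 0.0 && p <= 1.0, "p must be in [0, 1]")
    val sorted = values.sortWith(_ < _).toVector
    // p = 0 would give rank 0, so clamp to the first element.
    val rank = math.max(1, math.ceil(p * sorted.length).toInt)
    sorted(rank - 1)
  }
}
```

To apply this to the reported input, the column values would first have to be collected to the driver, which is only practical for small files.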

  was:
Read input data from the attached CSV file:
{code:scala}
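      // Hypothetical view name: the report does not show how `table` is defined.
      val table = "percentile_approx_input"
      // Read the CSV with a header row and inferred column types into a single partition.
      val df = spark.read.option("header", "true")
        .option("inferSchema", "true")
        .csv("/Users/maximgekk/tmp/tr_rat_resampling_score.csv")
        .repartition(1)
      df.createOrReplaceTempView(table)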
{code}
Calculate the 0.77 percentile with accuracy 100000 (relative error 1e-05):
{code:scala}
      spark.sql(
        s"""SELECT
           |  percentile_approx(tr_rat_resampling_score, 0.77, 100000)
           |FROM $table
           """.stripMargin).show
{code}
{code}
+------------------------------------------------------------------------+
|percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 100000)|
+------------------------------------------------------------------------+
|                                                                    1000|
+------------------------------------------------------------------------+
{code}
The same query with a lower accuracy of 1000 (relative error 0.001):
{code}
+----------------------------------------------------------------------+
|percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 1000)|
+----------------------------------------------------------------------+
|                                                                    18|
+----------------------------------------------------------------------+
{code} 
And with a higher accuracy of 1000000 (relative error 1e-06):
{code}
+-------------------------------------------------------------------------+
|percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 1000000)|
+-------------------------------------------------------------------------+
|                                                                       17|
+-------------------------------------------------------------------------+
{code}

With accuracy 100000 (relative error 1e-05), the result should be around 17-18, not 1000.

Here is the percentile calculation in Google Sheets for the same input:
https://docs.google.com/spreadsheets/d/1Y1i4Td6s9jZQ-bD4IRTESLXP3UxKpqJSXGtmx0Q5TA0/edit?usp=sharing


> percentile_approx() returns incorrect results
> ---------------------------------------------
>
>                 Key: SPARK-32908
>                 URL: https://issues.apache.org/jira/browse/SPARK-32908
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.8, 3.0.2, 3.1.0
>            Reporter: Maxim Gekk
>            Priority: Major
>         Attachments: percentile_approx-input.csv



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
