[jira] [Updated] (SPARK-32908) percentile_approx() returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-32908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-32908:
----------------------------------
    Affects Version/s: 2.3.4

> percentile_approx() returns incorrect results
> ---------------------------------------------
>
>                 Key: SPARK-32908
>                 URL: https://issues.apache.org/jira/browse/SPARK-32908
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.4, 2.4.7, 3.0.1, 3.1.0
>            Reporter: Maxim Gekk
>            Assignee: Maxim Gekk
>            Priority: Major
>              Labels: correctness
>             Fix For: 2.4.8, 3.0.2, 3.1.0
>
>         Attachments: percentile_approx-input.csv
>
>
> Read input data from the attached CSV file:
> {code:scala}
> val df = spark.read.option("header", "true")
>   .option("inferSchema", "true")
>   .csv("/Users/maximgekk/tmp/percentile_approx-input.csv")
>   .repartition(1)
> df.createOrReplaceTempView(table)
> {code}
> Calculate the 0.77 percentile with accuracy 1e-05:
> {code:Scala}
> spark.sql(
>   s"""SELECT
>      |  percentile_approx(tr_rat_resampling_score, 0.77, 100000)
>      |FROM $table
>      """.stripMargin).show
> {code}
> {code}
> +------------------------------------------------------------------------+
> |percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 100000)|
> +------------------------------------------------------------------------+
> |                                                                    1000|
> +------------------------------------------------------------------------+
> {code}
> The same query with the lower accuracy 0.001:
> {code}
> +----------------------------------------------------------------------+
> |percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 1000)|
> +----------------------------------------------------------------------+
> |                                                                    18|
> +----------------------------------------------------------------------+
> {code}
> and with the higher accuracy 1e-06:
> {code}
> +-------------------------------------------------------------------------+
> |percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 1000000)|
> +-------------------------------------------------------------------------+
> |                                                                       17|
> +-------------------------------------------------------------------------+
> {code}
> For the accuracy 1e-05, the result must be around 17-18, not 1000.
> Here is the percentile calculation in Google Sheets for the same input:
> https://docs.google.com/spreadsheets/d/1Y1i4Td6s9jZQ-bD4IRTESLXP3UxKpqJSXGtmx0Q5TA0/edit?usp=sharing

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32908) percentile_approx() returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-32908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-32908:
----------------------------------
    Affects Version/s:     (was: 3.0.2)
                           (was: 2.4.8)
                       2.4.7
                       3.0.1

> percentile_approx() returns incorrect results
> ---------------------------------------------
>
>                 Key: SPARK-32908
>                 URL: https://issues.apache.org/jira/browse/SPARK-32908
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.7, 3.0.1, 3.1.0
>            Reporter: Maxim Gekk
>            Assignee: Maxim Gekk
>            Priority: Major
>              Labels: correctness
>             Fix For: 2.4.8, 3.0.2, 3.1.0
>
>         Attachments: percentile_approx-input.csv
[jira] [Updated] (SPARK-32908) percentile_approx() returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-32908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-32908:
----------------------------------
    Labels: correctness  (was: )

> percentile_approx() returns incorrect results
> ---------------------------------------------
>
>                 Key: SPARK-32908
>                 URL: https://issues.apache.org/jira/browse/SPARK-32908
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.8, 3.0.2, 3.1.0
>            Reporter: Maxim Gekk
>            Assignee: Maxim Gekk
>            Priority: Major
>              Labels: correctness
>             Fix For: 2.4.8, 3.0.2, 3.1.0
>
>         Attachments: percentile_approx-input.csv
[jira] [Updated] (SPARK-32908) percentile_approx() returns incorrect results
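[ https://issues.apache.org/jira/browse/SPARK-32908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maxim Gekk updated SPARK-32908:
-------------------------------
    Description:
Read input data from the attached CSV file:
{code:scala}
val df = spark.read.option("header", "true")
  .option("inferSchema", "true")
  .csv("/Users/maximgekk/tmp/percentile_approx-input.csv")
  .repartition(1)
df.createOrReplaceTempView(table)
{code}
Calculate the 0.77 percentile with accuracy 1e-05:
{code:Scala}
spark.sql(
  s"""SELECT
     |  percentile_approx(tr_rat_resampling_score, 0.77, 100000)
     |FROM $table
     """.stripMargin).show
{code}
{code}
+------------------------------------------------------------------------+
|percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 100000)|
+------------------------------------------------------------------------+
|                                                                    1000|
+------------------------------------------------------------------------+
{code}
The same query with the lower accuracy 0.001:
{code}
+----------------------------------------------------------------------+
|percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 1000)|
+----------------------------------------------------------------------+
|                                                                    18|
+----------------------------------------------------------------------+
{code}
and with the higher accuracy 1e-06:
{code}
+-------------------------------------------------------------------------+
|percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 1000000)|
+-------------------------------------------------------------------------+
|                                                                       17|
+-------------------------------------------------------------------------+
{code}
For the accuracy 1e-05, the result must be around 17-18, not 1000.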
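As a cross-check on the expected value, here is a minimal sketch in plain Scala (no Spark) of an exact percentile, together with the relative error implied by each accuracy argument. The names `PercentileCheck`, `exactPercentile`, and `relativeError` are hypothetical, the sample values are made up rather than taken from the attached CSV, and the nearest-rank definition is an assumption chosen for illustration. Spark documents 1.0/accuracy as the relative error of percentile_approx, so 100000 corresponds to 1e-05.

```scala
// Hypothetical helper for sanity-checking percentile_approx results; not part of Spark.
object PercentileCheck {

  /** Relative error implied by the accuracy argument (documented as 1.0 / accuracy). */
  def relativeError(accuracy: Long): Double = 1.0 / accuracy

  /** Exact nearest-rank percentile: the value at rank ceil(p * n) in sorted order. */
  def exactPercentile(values: Seq[Double], p: Double): Double = {
    require(values.nonEmpty, "values must not be empty")
    require(p >= 0.0 && p <= 1.0, "p must be in [0, 1]")
    val sorted = values.sorted
    // Ranks are 1-based; clamp to 1 so that p = 0 returns the minimum.
    val rank = math.max(1, math.ceil(p * sorted.length).toInt)
    sorted(rank - 1)
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample with a large outlier; NOT the data from the attached CSV.
    val sample = Seq(1.0, 5.0, 9.0, 12.0, 15.0, 17.0, 18.0, 20.0, 250.0, 1000.0)
    println(s"exact 0.77 percentile: ${exactPercentile(sample, 0.77)}")
    Seq(1000L, 100000L, 1000000L).foreach { a =>
      println(s"accuracy $a -> relative error ${relativeError(a)}")
    }
  }
}
```

A result from percentile_approx that is far outside the relative-error band around the exact value, as with 1000 versus 17-18 here, points to a bug rather than to approximation error.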
Here is the percentile calculation in Google Sheets for the same input:
https://docs.google.com/spreadsheets/d/1Y1i4Td6s9jZQ-bD4IRTESLXP3UxKpqJSXGtmx0Q5TA0/edit?usp=sharing

  was:
Read input data from the attached CSV file:
{code:scala}
val df = spark.read.option("header", "true")
  .option("inferSchema", "true")
  .csv("/Users/maximgekk/tmp/tr_rat_resampling_score.csv")
  .repartition(1)
df.createOrReplaceTempView(table)
{code}
Calculate the 0.77 percentile with accuracy 1e-05:
{code:scala}
spark.sql(
  s"""SELECT
     |  percentile_approx(tr_rat_resampling_score, 0.77, 100000)
     |FROM $table
     """.stripMargin).show
{code}
{code}
+------------------------------------------------------------------------+
|percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 100000)|
+------------------------------------------------------------------------+
|                                                                    1000|
+------------------------------------------------------------------------+
{code}
The same query with the lower accuracy 0.001:
{code}
+----------------------------------------------------------------------+
|percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 1000)|
+----------------------------------------------------------------------+
|                                                                    18|
+----------------------------------------------------------------------+
{code}
and with the higher accuracy 1e-06:
{code}
+-------------------------------------------------------------------------+
|percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 1000000)|
+-------------------------------------------------------------------------+
|                                                                       17|
+-------------------------------------------------------------------------+
{code}
For the accuracy 1e-05, the result must be around 17-18, not 1000.
Here is the percentile calculation in Google Sheets for the same input:
https://docs.google.com/spreadsheets/d/1Y1i4Td6s9jZQ-bD4IRTESLXP3UxKpqJSXGtmx0Q5TA0/edit?usp=sharing

> percentile_approx() returns incorrect results
> ---------------------------------------------
>
>                 Key: SPARK-32908
>                 URL: https://issues.apache.org/jira/browse/SPARK-32908
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.8, 3.0.2, 3.1.0
>            Reporter: Maxim Gekk
>            Priority: Major
>         Attachments: percentile_approx-input.csv
>
>
> Read input data from the attached CSV file:
> {code:scala}
> val df = spark.read.option("header", "true")
>   .option("inferSchema", "true")
>   .csv("/Users/maximgekk/tmp/percentile_approx-input.csv")
>   .repartition(1)
> df.createOrReplaceTempView(table)
> {code}
> Calculate the 0.77 percentile with accuracy 1e-05:
[jira] [Updated] (SPARK-32908) percentile_approx() returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-32908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maxim Gekk updated SPARK-32908:
-------------------------------
    Attachment: percentile_approx-input.csv

> percentile_approx() returns incorrect results
> ---------------------------------------------
>
>                 Key: SPARK-32908
>                 URL: https://issues.apache.org/jira/browse/SPARK-32908
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.8, 3.0.2, 3.1.0
>            Reporter: Maxim Gekk
>            Priority: Major
>         Attachments: percentile_approx-input.csv