[ 
https://issues.apache.org/jira/browse/SPARK-29325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Verbus updated SPARK-29325:
---------------------------------
    Attachment: 20191001_example_data_approx_quantile_bug.zip

> approxQuantile() results are incorrect and vary significantly for small 
> changes in relativeError
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-29325
>                 URL: https://issues.apache.org/jira/browse/SPARK-29325
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.2, 2.4.4
>         Environment: I was using OSX 10.14.6.
> I was using Scala 2.11.12 and Spark 2.4.4.
> I also verified the bug exists for Scala 2.11.8 and Spark 2.3.2.
>            Reporter: James Verbus
>            Priority: Major
>              Labels: correctness
>         Attachments: 20191001_example_data_approx_quantile_bug.zip
>
>
> The [approxQuantile() 
> method|https://github.com/apache/spark/blob/3b1674cb1f244598463e879477d89632b0817578/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L40]
> sometimes returns incorrect results that are highly sensitive to the 
> choice of relativeError.
> Below is an example using the latest Spark version (2.4.4). The result 
> jumps between roughly 0.592 and 0.676 for modest changes in the specified 
> relativeError parameter, which is a much larger variation than the 
> magnitude of relativeError itself.
>  
> {code:java}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
>       /_/
>          
> Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_212)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("./20191001_example_data_approx_quantile_bug")
> df: org.apache.spark.sql.DataFrame = [value: double]
> scala> df.stat.approxQuantile("value", Array(0.9), 0)
> res0: Array[Double] = Array(0.5929591082174609)
> scala> df.stat.approxQuantile("value", Array(0.9), 0.001)
> res1: Array[Double] = Array(0.67621027121925)
> scala> df.stat.approxQuantile("value", Array(0.9), 0.002)
> res2: Array[Double] = Array(0.5926195654486178)
> scala> df.stat.approxQuantile("value", Array(0.9), 0.003)
> res3: Array[Double] = Array(0.5924693999048418)
> scala> df.stat.approxQuantile("value", Array(0.9), 0.004)
> res4: Array[Double] = Array(0.67621027121925)
> scala> df.stat.approxQuantile("value", Array(0.9), 0.005)
> res5: Array[Double] = Array(0.5923925937051544)
>  
> {code}
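
One way to judge whether a returned value is actually outside the contract is to compare its exact rank with the bound stated in the approxQuantile() documentation: for N rows, probability p and error err, the rank of the result should satisfy floor((p - err) * N) <= rank <= ceil((p + err) * N). The following is a minimal sketch of that check in plain Scala, not part of the original report. The names QuantileRankCheck, rankBounds, rank and withinGuarantee are hypothetical, and it assumes that "rank of x" means the number of values less than or equal to x.

```scala
object QuantileRankCheck {

  /** Rank interval allowed by the documented guarantee, clamped to [0, n]. */
  def rankBounds(n: Long, p: Double, err: Double): (Long, Long) = {
    val lo = math.floor((p - err) * n).toLong
    val hi = math.ceil((p + err) * n).toLong
    (math.max(lo, 0L), math.min(hi, n))
  }

  /** Assumed rank definition: how many values are <= x. */
  def rank(values: Array[Double], x: Double): Long =
    values.count(_ <= x).toLong

  /** True if `result` has a rank inside the interval allowed for (p, err). */
  def withinGuarantee(values: Array[Double], p: Double, err: Double, result: Double): Boolean = {
    val (lo, hi) = rankBounds(values.length.toLong, p, err)
    val r = rank(values, result)
    lo <= r && r <= hi
  }
}
```

In spark-shell, the column could be collected with something like df.select("value").as[Double].collect() (assuming the data fits in driver memory and contains no nulls) and passed in together with each relativeError and the corresponding returned value, e.g. QuantileRankCheck.withinGuarantee(values, 0.9, 0.001, 0.67621027121925).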
