James Verbus created SPARK-29325:
------------------------------------

             Summary: approxQuantile() results are incorrect and vary significantly for small changes in relativeError
                 Key: SPARK-29325
                 URL: https://issues.apache.org/jira/browse/SPARK-29325
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.4, 2.3.2
         Environment: I was using OSX 10.14.6.

I was using Scala 2.11.12 and Spark 2.4.4.

I also verified the bug exists for Scala 2.11.8 and Spark 2.3.2.
            Reporter: James Verbus
         Attachments: 20191001_example_data_approx_quantile_bug.zip

The [approxQuantile() method|https://github.com/apache/spark/blob/3b1674cb1f244598463e879477d89632b0817578/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L40]
sometimes returns incorrect results that are highly sensitive to the choice of
relativeError.

Below is an example in the latest Spark version (2.4.4). The result varies
significantly for modest changes in the relativeError parameter: for
relativeError values between 0 and 0.005, the returned 0.9 quantile jumps
between roughly 0.592 and 0.676, a far larger swing than the relativeError
values themselves.

 
{code:java}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/
         
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.


scala> val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("./20191001_example_data_approx_quantile_bug")
df: org.apache.spark.sql.DataFrame = [value: double]


scala> df.stat.approxQuantile("value", Array(0.9), 0)
res0: Array[Double] = Array(0.5929591082174609)


scala> df.stat.approxQuantile("value", Array(0.9), 0.001)
res1: Array[Double] = Array(0.67621027121925)


scala> df.stat.approxQuantile("value", Array(0.9), 0.002)
res2: Array[Double] = Array(0.5926195654486178)


scala> df.stat.approxQuantile("value", Array(0.9), 0.003)
res3: Array[Double] = Array(0.5924693999048418)


scala> df.stat.approxQuantile("value", Array(0.9), 0.004)
res4: Array[Double] = Array(0.67621027121925)


scala> df.stat.approxQuantile("value", Array(0.9), 0.005)
res5: Array[Double] = Array(0.5923925937051544)
 
{code}
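One way to make "incorrect" concrete is to test each returned value against the rank bound that the approxQuantile() documentation describes: for N elements, probability p and error err, the returned sample x should satisfy floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The following is only a minimal sketch of such a check in plain Scala, run on a locally collected copy of the column. The object and method names (QuantileBoundCheck, allowedRange, withinBound) are hypothetical and not part of Spark, and the sketch assumes 1-based ranks, no nulls, and data small enough to sort on the driver.

```scala
// Hypothetical helper, not part of Spark: checks a candidate quantile value
// against the rank bound floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
object QuantileBoundCheck {

  /** Smallest and largest values whose 1-based rank lies inside the bound.
    * `sorted` must be non-empty and sorted in ascending order. */
  def allowedRange(sorted: Array[Double], p: Double, err: Double): (Double, Double) = {
    val n = sorted.length
    require(n > 0, "need at least one value")
    // Clamp the ranks to [1, n] so the array lookups stay in range.
    val loRank = math.max(1, math.floor((p - err) * n).toInt)
    val hiRank = math.min(n, math.ceil((p + err) * n).toInt)
    (sorted(loRank - 1), sorted(hiRank - 1))
  }

  /** True if `returned` falls between the values at the lowest and highest
    * permitted ranks (inclusive). */
  def withinBound(data: Seq[Double], p: Double, err: Double, returned: Double): Boolean = {
    val (lo, hi) = allowedRange(data.toArray.sorted, p, err)
    returned >= lo && returned <= hi
  }
}
```

In the spark-shell session above, the column could be collected with something like df.select("value").as[Double].collect() and passed to withinBound together with each relativeError and the corresponding result. Floating-point rounding in (p ± err) * N can widen the permitted rank range by one, so the check is slightly lenient rather than strict.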


