James Verbus created SPARK-29325:
------------------------------------

             Summary: approxQuantile() results are incorrect and vary significantly for small changes in relativeError
                 Key: SPARK-29325
                 URL: https://issues.apache.org/jira/browse/SPARK-29325
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.4, 2.3.2
         Environment: I was using OSX 10.14.6.
I was using Scala 2.11.12 and Spark 2.4.4. I also verified the bug exists for Scala 2.11.8 and Spark 2.3.2.
            Reporter: James Verbus
         Attachments: 20191001_example_data_approx_quantile_bug.zip

The [approxQuantile() method|https://github.com/apache/spark/blob/3b1674cb1f244598463e879477d89632b0817578/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L40] sometimes returns incorrect results that are highly sensitive to the choice of relativeError.

Below is an example in the latest Spark version (2.4.4). The result varies significantly for modest changes in the specified relativeError parameter, and it varies by much more than the magnitude of relativeError itself.

{code:java}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("./20191001_example_data_approx_quantile_bug")
df: org.apache.spark.sql.DataFrame = [value: double]

scala> df.stat.approxQuantile("value", Array(0.9), 0)
res0: Array[Double] = Array(0.5929591082174609)

scala> df.stat.approxQuantile("value", Array(0.9), 0.001)
res1: Array[Double] = Array(0.67621027121925)

scala> df.stat.approxQuantile("value", Array(0.9), 0.002)
res2: Array[Double] = Array(0.5926195654486178)

scala> df.stat.approxQuantile("value", Array(0.9), 0.003)
res3: Array[Double] = Array(0.5924693999048418)

scala> df.stat.approxQuantile("value", Array(0.9), 0.004)
res4: Array[Double] = Array(0.67621027121925)

scala> df.stat.approxQuantile("value", Array(0.9), 0.005)
res5: Array[Double] = Array(0.5923925937051544)
{code}
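One possible way to check whether a returned value violates the documented guarantee is to compare its exact rank against the bound in the approxQuantile() Scaladoc, which says that for N elements, probability p, and error err, the returned sample x should satisfy floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The sketch below is plain Scala with no Spark dependency. The object and method names (QuantileRankCheck, withinRankBound) are hypothetical helpers made up for illustration and are not part of Spark.

```scala
// Hypothetical helper (not part of Spark): checks the rank bound
// documented for approxQuantile against an exact, in-memory count.
object QuantileRankCheck {

  /** True if `result` is an element of `data` whose exact rank lies within
    * floor((p - err) * N) <= rank <= ceil((p + err) * N).
    */
  def withinRankBound(data: Seq[Double], result: Double, p: Double, err: Double): Boolean = {
    val n = data.size
    // With ties, a value occupies a range of 1-based ranks:
    // from (number of smaller values + 1) to (number of values <= it).
    val minRank = data.count(_ < result) + 1
    val maxRank = data.count(_ <= result)
    val lo = math.floor((p - err) * n).toLong
    val hi = math.ceil((p + err) * n).toLong
    // maxRank < minRank means `result` does not occur in `data` at all.
    maxRank >= minRank && maxRank >= lo && minRank <= hi
  }
}

// Assumed usage in spark-shell (collects the column to the driver, so only
// suitable for data that fits in memory):
//   val data = df.select("value").as[Double].collect().toSeq
//   val Array(q) = df.stat.approxQuantile("value", Array(0.9), 0.001)
//   QuantileRankCheck.withinRankBound(data, q, 0.9, 0.001)
```

A result of false for any of the calls above would indicate that the returned value falls outside the documented rank bound, rather than merely differing from the exact quantile.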