[ https://issues.apache.org/jira/browse/SPARK-29325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
James Verbus updated SPARK-29325:
---------------------------------
    Attachment: 20191001_example_data_approx_quantile_bug.zip

> approxQuantile() results are incorrect and vary significantly for small changes in relativeError
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-29325
>                 URL: https://issues.apache.org/jira/browse/SPARK-29325
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.2, 2.4.4
>         Environment: I was using OSX 10.14.6.
> I was using Scala 2.11.12 and Spark 2.4.4.
> I also verified the bug exists for Scala 2.11.8 and Spark 2.3.2.
>            Reporter: James Verbus
>            Priority: Major
>              Labels: correctness
>         Attachments: 20191001_example_data_approx_quantile_bug.zip
>
>
> The [approxQuantile() method|https://github.com/apache/spark/blob/3b1674cb1f244598463e879477d89632b0817578/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L40] sometimes returns incorrect results that depend sensitively on the choice of relativeError.
> Below is an example in the latest Spark version (2.4.4). The result varies significantly for modest changes in the specified relativeError parameter, and it varies by much more than the magnitude of relativeError itself.
>
> {code:java}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/ '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
>       /_/
>
> Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_212)
> Type in expressions to have them evaluated.
> Type :help for more information.
>
> scala> val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("./20191001_example_data_approx_quantile_bug")
> df: org.apache.spark.sql.DataFrame = [value: double]
>
> scala> df.stat.approxQuantile("value", Array(0.9), 0)
> res0: Array[Double] = Array(0.5929591082174609)
>
> scala> df.stat.approxQuantile("value", Array(0.9), 0.001)
> res1: Array[Double] = Array(0.67621027121925)
>
> scala> df.stat.approxQuantile("value", Array(0.9), 0.002)
> res2: Array[Double] = Array(0.5926195654486178)
>
> scala> df.stat.approxQuantile("value", Array(0.9), 0.003)
> res3: Array[Double] = Array(0.5924693999048418)
>
> scala> df.stat.approxQuantile("value", Array(0.9), 0.004)
> res4: Array[Double] = Array(0.67621027121925)
>
> scala> df.stat.approxQuantile("value", Array(0.9), 0.005)
> res5: Array[Double] = Array(0.5923925937051544)
>
> {code}
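
One way to make "incorrect" concrete is to test each returned value against the rank guarantee stated in the approxQuantile() Scaladoc: for N rows, probability p and error err, the returned sample x should satisfy floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The following is a minimal sketch of such a check in plain Scala, assuming the column fits in local memory. The object QuantileRankCheck and the method withinRankBound are hypothetical helpers written for illustration; they are not part of Spark and do not appear in the report.

```scala
// Hypothetical helper (not part of Spark): checks whether a value returned by
// approxQuantile satisfies the documented rank bound
//   floor((p - err) * N) <= rank(x) <= ceil((p + err) * N)
// Assumes the whole column has been collected into local memory.
object QuantileRankCheck {

  def withinRankBound(data: Seq[Double], x: Double, p: Double, err: Double): Boolean = {
    val n = data.length
    // With duplicate values, x occupies a range of 1-based ranks;
    // accept x if any rank in that range falls inside the bound.
    val minRank = data.count(_ < x) + 1
    val maxRank = data.count(_ <= x)
    val lo = math.floor((p - err) * n).toLong
    val hi = math.ceil((p + err) * n).toLong
    minRank <= hi && maxRank >= lo
  }

  def main(args: Array[String]): Unit = {
    // Synthetic data for illustration only: the values 1.0, 2.0, ..., 1000.0.
    val data = (1 to 1000).map(_.toDouble)
    println(withinRankBound(data, 900.0, 0.9, 0.001)) // rank 900, near p * N
    println(withinRankBound(data, 950.0, 0.9, 0.001)) // rank 950, far from p * N
  }
}
```

In spark-shell, the local data could presumably be obtained with something like df.select("value").as[Double].collect(), after which each of the values res0 through res5 above can be passed in as x with the corresponding relativeError as err. A value that fails the check lies outside the documented bound for that relativeError.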