[ https://issues.apache.org/jira/browse/SPARK-29325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946166#comment-16946166 ]
Frank Astier commented on SPARK-29325:
--------------------------------------

I noticed that the algorithm is sensitive to the partitioning, i.e. that it is sensitive to the order of the data. In the following code, 2 partitions make the result vary with the relError, but more than 2 partitions give stable results no matter what relError is.

val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true")
  .load("./src/main/resources/20191001_example_data_approx_quantile_bug")
  .repartition(20)

List(0, 0.001, 0.002, 0.003, 0.004, 0.005, 0.0001, 0.0005, 0.00001, 0.00005, 0.000001, 0.000005)
  .foreach { relError =>
    val pct = df.stat.approxQuantile("value", Array(0.9), relError).mkString("")
    //val pct = multipleApproxQuantiles(df, Array(0.9), relError).mkString("")
    println(s"$relError $pct")
  }

> approxQuantile() results are incorrect and vary significantly for small changes in relativeError
> ------------------------------------------------------------------------------------------------
>
> Key: SPARK-29325
> URL: https://issues.apache.org/jira/browse/SPARK-29325
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.2, 2.4.4
> Environment: I was using OSX 10.14.6.
> I was using Scala 2.11.12 and Spark 2.4.4.
> I also verified the bug exists for Scala 2.11.8 and Spark 2.3.2.
> Reporter: James Verbus
> Priority: Major
> Labels: correctness
> Attachments: 20191001_example_data_approx_quantile_bug.zip
>
>
> The [approxQuantile() method|https://github.com/apache/spark/blob/3b1674cb1f244598463e879477d89632b0817578/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L40] returns sometimes incorrect results that are sensitively dependent upon the choice of the relativeError.
> Below is an example in the latest Spark version (2.4.4). You can see the result varies significantly for modest changes in the specified relativeError parameter.
> The result varies much more than would be expected based upon the relativeError parameter.
>
> {code:java}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
>       /_/
>
> Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_212)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("./20191001_example_data_approx_quantile_bug")
> df: org.apache.spark.sql.DataFrame = [value: double]
> scala> df.stat.approxQuantile("value", Array(0.9), 0)
> res0: Array[Double] = Array(0.5929591082174609)
> scala> df.stat.approxQuantile("value", Array(0.9), 0.001)
> res1: Array[Double] = Array(0.67621027121925)
> scala> df.stat.approxQuantile("value", Array(0.9), 0.002)
> res2: Array[Double] = Array(0.5926195654486178)
> scala> df.stat.approxQuantile("value", Array(0.9), 0.003)
> res3: Array[Double] = Array(0.5924693999048418)
> scala> df.stat.approxQuantile("value", Array(0.9), 0.004)
> res4: Array[Double] = Array(0.67621027121925)
> scala> df.stat.approxQuantile("value", Array(0.9), 0.005)
> res5: Array[Double] = Array(0.5923925937051544)
> {code}
> I attached a zip file containing the data used for the above example demonstrating the bug.
> Also, the following demonstrates that there is data for intermediate quantile values between the 0.5926195654486178 and 0.67621027121925 values observed above.
> {code:java}
> scala> df.stat.approxQuantile("value", Array(0.9), 0.0)
> res10: Array[Double] = Array(0.5929591082174609)
> scala> df.stat.approxQuantile("value", Array(0.91), 0.0)
> res11: Array[Double] = Array(0.5966354540849995)
> scala> df.stat.approxQuantile("value", Array(0.92), 0.0)
> res12: Array[Double] = Array(0.6015676591185595)
> scala> df.stat.approxQuantile("value", Array(0.93), 0.0)
> res13: Array[Double] = Array(0.6029240823799614)
> scala> df.stat.approxQuantile("value", Array(0.94), 0.0)
> res14: Array[Double] = Array(0.6117645471000034)
> scala> df.stat.approxQuantile("value", Array(0.95), 0.0)
> res15: Array[Double] = Array(0.6185162204274052)
> scala> df.stat.approxQuantile("value", Array(0.96), 0.0)
> res16: Array[Double] = Array(0.625983000807062)
> scala> df.stat.approxQuantile("value", Array(0.97), 0.0)
> res17: Array[Double] = Array(0.6306892943412258)
> scala> df.stat.approxQuantile("value", Array(0.98), 0.0)
> res18: Array[Double] = Array(0.6365567375994333)
> scala> df.stat.approxQuantile("value", Array(0.99), 0.0)
> res19: Array[Double] = Array(0.6554479197566019)
> {code}
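
One way to make "varies much more than would be expected" concrete is to test a returned value against the rank bound stated in the approxQuantile() documentation: for N rows, probability p and relative error err, the returned sample x should satisfy floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The sketch below is plain Scala with no Spark dependency. The names RankBoundCheck, rank and withinBound are hypothetical helpers, and the data is synthetic (1.0 to 1000.0), not the attached dataset.

```scala
object RankBoundCheck {
  /** Exact rank of x: the number of elements in `data` that are <= x. */
  def rank(data: Array[Double], x: Double): Long =
    data.count(_ <= x).toLong

  /** True if x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). */
  def withinBound(data: Array[Double], x: Double, p: Double, err: Double): Boolean = {
    val n  = data.length
    val lo = math.floor((p - err) * n).toLong
    val hi = math.ceil((p + err) * n).toLong
    val r  = rank(data, x)
    lo <= r && r <= hi
  }

  def main(args: Array[String]): Unit = {
    // Synthetic stand-in for the "value" column (assumption, not the real data).
    val data = (1 to 1000).map(_.toDouble).toArray
    // A value near the true 0.9 quantile passes the check.
    println(withinBound(data, 900.0, 0.9, 0.004)) // true
    // A value near the 0.99 quantile does not.
    println(withinBound(data, 990.0, 0.9, 0.004)) // false
  }
}
```

Applied to the report, and assuming the relativeError 0.0 results are exact: 0.67621027121925 is larger than the 0.99 quantile (0.6554479197566019), so its rank lies above roughly 0.99 * N. That is well outside the band around 0.9 * N permitted for relativeError 0.001 or 0.004.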