[ 
https://issues.apache.org/jira/browse/SPARK-29325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946166#comment-16946166
 ] 

Frank Astier commented on SPARK-29325:
--------------------------------------

I noticed that the algorithm is sensitive to the partitioning, i.e. to the
order of the data. In the following code, with 2 partitions the result varies
with relError, but with more than 2 partitions the results are stable no
matter what relError is.

    val df = spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("./src/main/resources/20191001_example_data_approx_quantile_bug")
      .repartition(20)

    List(0, 0.001, 0.002, 0.003, 0.004, 0.005, 0.0001, 0.0005, 0.00001,
        0.00005, 0.000001, 0.000005)
      .foreach { relError =>
        val pct = df.stat.approxQuantile("value", Array(0.9), relError).mkString("")
        //val pct = multipleApproxQuantiles(df, Array(0.9), relError).mkString("")
        println(s"$relError $pct")
      }

> approxQuantile() results are incorrect and vary significantly for small 
> changes in relativeError
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-29325
>                 URL: https://issues.apache.org/jira/browse/SPARK-29325
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.2, 2.4.4
>         Environment: I was using OSX 10.14.6.
> I was using Scala 2.11.12 and Spark 2.4.4.
> I also verified the bug exists for Scala 2.11.8 and Spark 2.3.2.
>            Reporter: James Verbus
>            Priority: Major
>              Labels: correctness
>         Attachments: 20191001_example_data_approx_quantile_bug.zip
>
>
> The [approxQuantile() method|https://github.com/apache/spark/blob/3b1674cb1f244598463e879477d89632b0817578/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L40]
> sometimes returns incorrect results that are highly sensitive to the choice
> of relativeError.
> Below is an example in the latest Spark version (2.4.4). The result varies
> significantly for modest changes in the specified relativeError parameter,
> much more than would be expected from the relativeError value itself.
>  
> {code:java}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
>       /_/
>          
> Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_212)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("./20191001_example_data_approx_quantile_bug")
> df: org.apache.spark.sql.DataFrame = [value: double]
> scala> df.stat.approxQuantile("value", Array(0.9), 0)
> res0: Array[Double] = Array(0.5929591082174609)
> scala> df.stat.approxQuantile("value", Array(0.9), 0.001)
> res1: Array[Double] = Array(0.67621027121925)
> scala> df.stat.approxQuantile("value", Array(0.9), 0.002)
> res2: Array[Double] = Array(0.5926195654486178)
> scala> df.stat.approxQuantile("value", Array(0.9), 0.003)
> res3: Array[Double] = Array(0.5924693999048418)
> scala> df.stat.approxQuantile("value", Array(0.9), 0.004)
> res4: Array[Double] = Array(0.67621027121925)
> scala> df.stat.approxQuantile("value", Array(0.9), 0.005)
> res5: Array[Double] = Array(0.5923925937051544) 
> {code}
> I attached a zip file containing the data used in the above example.
> Also, the following shows that the data contains intermediate quantile
> values between the 0.5926195654486178 and 0.67621027121925 values observed
> above.
> {code:java}
> scala> df.stat.approxQuantile("value", Array(0.9), 0.0)
> res10: Array[Double] = Array(0.5929591082174609)
> scala> df.stat.approxQuantile("value", Array(0.91), 0.0)
> res11: Array[Double] = Array(0.5966354540849995)
> scala> df.stat.approxQuantile("value", Array(0.92), 0.0)
> res12: Array[Double] = Array(0.6015676591185595)
> scala> df.stat.approxQuantile("value", Array(0.93), 0.0)
> res13: Array[Double] = Array(0.6029240823799614)
> scala> df.stat.approxQuantile("value", Array(0.94), 0.0)
> res14: Array[Double] = Array(0.6117645471000034)
> scala> df.stat.approxQuantile("value", Array(0.95), 0.0)
> res15: Array[Double] = Array(0.6185162204274052)
> scala> df.stat.approxQuantile("value", Array(0.96), 0.0)
> res16: Array[Double] = Array(0.625983000807062)
> scala> df.stat.approxQuantile("value", Array(0.97), 0.0)
> res17: Array[Double] = Array(0.6306892943412258)
> scala> df.stat.approxQuantile("value", Array(0.98), 0.0)
> res18: Array[Double] = Array(0.6365567375994333)
> scala> df.stat.approxQuantile("value", Array(0.99), 0.0)
> res19: Array[Double] = Array(0.6554479197566019)
> {code}
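
One way to make "varies much more than would be expected" concrete is to
check a returned value against the rank bound that the approxQuantile()
API documentation describes: for N elements, probability p and error err,
the returned sample x should satisfy
floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The following is
only a minimal sketch of such a check in plain Scala, not code from Spark
itself. The names QuantileRankCheck, rank and withinBound are hypothetical,
rank is taken here to mean the number of elements <= x (an assumption, and
ambiguous when there are ties), and the data in the usage lines is
synthetic, not the attached data set.

```scala
object QuantileRankCheck {
  /** Rank of x, assumed here to be the number of elements <= x. */
  def rank(data: Seq[Double], x: Double): Long =
    data.count(_ <= x).toLong

  /**
   * True if x satisfies the documented bound
   * floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
   */
  def withinBound(data: Seq[Double], x: Double, p: Double, err: Double): Boolean = {
    val n  = data.size
    val lo = math.floor((p - err) * n).toLong
    val hi = math.ceil((p + err) * n).toLong
    val r  = rank(data, x)
    lo <= r && r <= hi
  }
}

// Synthetic data for illustration only: the values 1.0, 2.0, ..., 100.0.
val data = (1 to 100).map(_.toDouble)

// 90.0 has rank 90, which lies inside the allowed rank range for p = 0.9.
println(QuantileRankCheck.withinBound(data, 90.0, p = 0.9, err = 0.01))

// 99.0 has rank 99, far above ceil((0.9 + 0.01) * 100), so it violates the bound.
println(QuantileRankCheck.withinBound(data, 99.0, p = 0.9, err = 0.01))
```

Applied to the quoted results, the same reasoning suggests the
0.67621027121925 values are out of bounds: the 0.99 quantile reported above
with relativeError 0.0 is 0.6554479197566019, so a 0.9 quantile of
0.67621027121925 would sit past the 99th percentile, well outside what a
relativeError of 0.001 or 0.004 allows.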



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
