[ https://issues.apache.org/jira/browse/SPARK-29325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Verbus updated SPARK-29325:
---------------------------------
    Description: 
The [approxQuantile() method|https://github.com/apache/spark/blob/3b1674cb1f244598463e879477d89632b0817578/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L40] sometimes returns incorrect results that depend sensitively on the choice of relativeError.

Below is an example in the latest Spark version (2.4.4). The result varies significantly for modest changes in the specified relativeError parameter, far more than the documented relativeError guarantee allows (a check of that guarantee is sketched after the transcript).

 
{code:java}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/
         
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.


scala> val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("./20191001_example_data_approx_quantile_bug")
df: org.apache.spark.sql.DataFrame = [value: double]


scala> df.stat.approxQuantile("value", Array(0.9), 0)
res0: Array[Double] = Array(0.5929591082174609)


scala> df.stat.approxQuantile("value", Array(0.9), 0.001)
res1: Array[Double] = Array(0.67621027121925)


scala> df.stat.approxQuantile("value", Array(0.9), 0.002)
res2: Array[Double] = Array(0.5926195654486178)


scala> df.stat.approxQuantile("value", Array(0.9), 0.003)
res3: Array[Double] = Array(0.5924693999048418)


scala> df.stat.approxQuantile("value", Array(0.9), 0.004)
res4: Array[Double] = Array(0.67621027121925)


scala> df.stat.approxQuantile("value", Array(0.9), 0.005)
res5: Array[Double] = Array(0.5923925937051544) 
{code}
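For context, approxQuantile() documents a rank guarantee: for a quantile q and relativeError err over N rows, the exact rank of the returned value should lie within [floor((q - err) * N), ceil((q + err) * N)]. Below is a minimal sketch of how to check that bound against the exact data, assuming the same df loaded in the transcript above (the variable names are illustrative, not from the original report):
{code:java}
// Sketch: check the documented rank guarantee for a single approxQuantile() call.
// Assumes the same `df` as in the transcript above; names are illustrative.
import org.apache.spark.sql.functions.col

val q = 0.9
val err = 0.001
val n = df.count()

val approx = df.stat.approxQuantile("value", Array(q), err).head

// Exact rank of the returned value in the data.
val rank = df.filter(col("value") <= approx).count()

// Documented bound: floor((q - err) * N) <= rank(x) <= ceil((q + err) * N).
val lower = math.floor((q - err) * n)
val upper = math.ceil((q + err) * n)

println(s"value=$approx rank=$rank allowed=[$lower, $upper] withinBound=${rank >= lower && rank <= upper}")
{code}
Any result falling outside that window is a correctness problem rather than expected approximation error.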
I have attached a zip file containing the data used in the above example.

Also, the following shows that the data contains values between the 0.5926195654486178 and 0.67621027121925 results observed above, so the jump between those two results cannot be explained by a gap in the data.
{code:java}
scala> df.stat.approxQuantile("value", Array(0.9), 0.0)
res10: Array[Double] = Array(0.5929591082174609)

scala> df.stat.approxQuantile("value", Array(0.91), 0.0)
res11: Array[Double] = Array(0.5966354540849995)

scala> df.stat.approxQuantile("value", Array(0.92), 0.0)
res12: Array[Double] = Array(0.6015676591185595)

scala> df.stat.approxQuantile("value", Array(0.93), 0.0)
res13: Array[Double] = Array(0.6029240823799614)

scala> df.stat.approxQuantile("value", Array(0.94), 0.0)
res14: Array[Double] = Array(0.6117645471000034)

scala> df.stat.approxQuantile("value", Array(0.95), 0.0)
res15: Array[Double] = Array(0.6185162204274052)

scala> df.stat.approxQuantile("value", Array(0.96), 0.0)
res16: Array[Double] = Array(0.625983000807062)

scala> df.stat.approxQuantile("value", Array(0.97), 0.0)
res17: Array[Double] = Array(0.6306892943412258)

scala> df.stat.approxQuantile("value", Array(0.98), 0.0)
res18: Array[Double] = Array(0.6365567375994333)

scala> df.stat.approxQuantile("value", Array(0.99), 0.0)
res19: Array[Double] = Array(0.6554479197566019)
{code}
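Since the exact 0.99 quantile above is only 0.6554479197566019, the 0.67621027121925 value returned for quantile 0.9 with relativeError 0.001 lies beyond even the 0.99 quantile. One way to quantify this is to compute the exact quantile level that the suspicious value actually corresponds to; a minimal sketch, again assuming the same df (names illustrative):
{code:java}
// Sketch: exact quantile level of the suspicious result returned for q = 0.9, err = 0.001.
// Assumes the same `df` as above.
import org.apache.spark.sql.functions.col

val suspect = 0.67621027121925
val n = df.count()

// Fraction of rows at or below the suspicious value, i.e. its exact quantile level.
val exactLevel = df.filter(col("value") <= suspect).count().toDouble / n

println(s"$suspect is the exact $exactLevel quantile of the data")
{code}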


> approxQuantile() results are incorrect and vary significantly for small 
> changes in relativeError
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-29325
>                 URL: https://issues.apache.org/jira/browse/SPARK-29325
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.2, 2.4.4
>         Environment: OSX 10.14.6; Scala 2.11.12 and Spark 2.4.4.
> The bug was also verified with Scala 2.11.8 and Spark 2.3.2.
>            Reporter: James Verbus
>            Priority: Major
>              Labels: correctness
>         Attachments: 20191001_example_data_approx_quantile_bug.zip
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
