[jira] [Comment Edited] (SPARK-42905) pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect and inconsistent results for the same DataFrame if it has huge amount of Ties.

2023-08-28 Thread zhangzhenhao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757858#comment-17757858
 ] 

zhangzhenhao edited comment on SPARK-42905 at 8/28/23 11:04 AM:


Minimal reproducible example. The result is incorrect and inconsistent when the
number of tied values is > 10_000_000.

 
{code:scala}
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row
import spark.implicits._ // for toDF; already in scope in spark-shell

// Each column holds N - 1 tied values plus one distinct value, and the two
// columns are perfectly anti-correlated.
val N = 1002
val x = sc.range(0, N).map(i => if (i < N - 1) 1.0 else 2.0)
val y = sc.range(0, N).map(i => if (i < N - 1) 2.0 else 1.0)
// RDD API alternative: org.apache.spark.mllib.stat.Statistics.corr(x, y, "spearman")
val df = x.zip(y)
  .map { case (x, y) => Vectors.dense(x, y) }
  .map(Tuple1.apply)
  .repartition(1)
  .toDF("features")

val Row(coeff1: Matrix) = Correlation.corr(df, "features", "spearman").head
val r = coeff1(0, 1)
println(s"spearman correlation in spark: $r")
// spearman correlation in spark: -9.90476024495E-8 {code}
 

 

The correct result is -1.0.
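For reference, the expected coefficient can be cross-checked without Spark: rank each column, giving tied values the average of the ranks they occupy, then take the Pearson correlation of the ranks. Below is a minimal plain-Scala sketch of that check (the helper functions are illustrative, not Spark APIs):

{code:scala}
// Cross-check: Spearman = Pearson correlation of the rank-transformed data,
// with tied values receiving the average of the ranks they occupy.
def ranks(xs: Array[Double]): Array[Double] = {
  val sorted = xs.zipWithIndex.sortBy(_._1)
  val out = new Array[Double](xs.length)
  var i = 0
  while (i < sorted.length) {
    var j = i
    while (j + 1 < sorted.length && sorted(j + 1)._1 == sorted(i)._1) j += 1
    val avgRank = (i + j + 2) / 2.0 // mean of the 1-based ranks i+1 .. j+1
    (i to j).foreach(k => out(sorted(k)._2) = avgRank)
    i = j + 1
  }
  out
}

def pearson(a: Array[Double], b: Array[Double]): Double = {
  val n = a.length
  val (ma, mb) = (a.sum / n, b.sum / n)
  val cov = a.zip(b).map { case (u, v) => (u - ma) * (v - mb) }.sum
  val sa = math.sqrt(a.map(u => (u - ma) * (u - ma)).sum)
  val sb = math.sqrt(b.map(v => (v - mb) * (v - mb)).sum)
  cov / (sa * sb)
}

val n = 1002
val x = Array.tabulate(n)(i => if (i < n - 1) 1.0 else 2.0)
val y = Array.tabulate(n)(i => if (i < n - 1) 2.0 else 1.0)
println(pearson(ranks(x), ranks(y))) // prints -1.0
{code}

Each column's ranks take only two distinct values that move in opposite directions, so the rank correlation is exactly -1.0.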



> pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect 
> and inconsistent results for the same DataFrame if it has huge amount of Ties.
> -
>
> Key: SPARK-42905
> URL: https://issues.apache.org/jira/browse/SPARK-42905
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: dronzer
>Priority: Critical
>  Labels: correctness
> Attachments: image-2023-03-23-10-51-28-420.png, 
> image-2023-03-23-10-52-11-481.png, image-2023-03-23-10-52-49-392.png, 
> image-2023-03-23-10-53-37-461.png, image-2023-03-23-10-55-26-879.png
>
>
> pyspark.ml.stat.Correlation
> Following is the scenario where the Correlation function fails to give 
> correct Spearman coefficient results.
> Tested example: a Spark DataFrame with 2 columns, A and B.
> !image-2023-03-23-10-55-26-879.png|width=562,height=162!
> Column A has 3 distinct values across a total of 108 million rows.
> Column B has 4 distinct values across a total of 108 million rows.
> If I calculate the correlation for this DataFrame with pandas DataFrame.corr, 
> it gives the correct answer, and running the same code multiple times 
> produces the same answer. (Each column has only 3-4 distinct values.)
> !image-2023-03-23-10-53-37-461.png|width=468,height=287!
>  
> In Spark, however, Spearman correlation produces *different results* for the 
> *same DataFrame* on multiple runs (see below; each column in this DataFrame 
> has only 3-4 distinct values).
> !image-2023-03-23-10-52-49-392.png|width=516,height=322!
>  
> In short, pandas DataFrame.corr gives the same result for the same DataFrame 
> on every run, which is the expected behaviour. Spark, however, gives a 
> different result on the same data, and running the same cell with the same 
> data multiple times produces yet more results, so the output is inconsistent.
> The only property of the data I could point to is ties: only 3-4 distinct 
> values over 108M rows. This scenario does not appear to be handled by Spark's 
> Correlation method, since the same data produces consistent results in Python 
> with df.corr.
> The only workaround we could find that gives consistent output in Spark, 
> matching the Python result, is a Pandas UDF, as shown below:
> !image-2023-03-23-10-52-11-481.png|width=518,height=111!
> !image-2023-03-23-10-51-28-420.png|width=509,height=270!
>  
> We also tried the pyspark.pandas.DataFrame.corr method, and it produces 
> incorrect and inconsistent results for this case too.
> Only the Pandas UDF seems to provide consistent results.
>  
> Another point to note: if I add some random noise to the data, which in turn 
> increases the number of distinct values, Spark again gives consistent results 
> across runs. This makes me believe that the Python version handles ties 
> correctly and gives consistent results no matter how many ties exist.
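To make the run-to-run inconsistency described above easy to check, here is a minimal consistency-check sketch, written in Scala for spark-shell to match the reproduction earlier in this thread (an illustration, not the reporter's attached code). It reruns Correlation.corr on one cached DataFrame and prints the distinct coefficients; a deterministic implementation should print a single value. The repartition(1) from the minimal example is deliberately omitted, on the assumption that multi-partition execution is where the variation appears:

{code:scala}
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row
import spark.implicits._

// Build a DataFrame dominated by ties, as in the minimal example above.
val N = 1002
val df = sc.range(0, N)
  .map(i => if (i < N - 1) (1.0, 2.0) else (2.0, 1.0))
  .map { case (a, b) => Tuple1(Vectors.dense(a, b)) }
  .toDF("features")
  .cache()

// Run the same computation repeatedly; a deterministic implementation
// should yield exactly one distinct coefficient.
val results = (1 to 5).map { _ =>
  val Row(m: Matrix) = Correlation.corr(df, "features", "spearman").head
  m(0, 1)
}
println(s"distinct results across runs: ${results.distinct.mkString(", ")}")
{code}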
