[ https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757858#comment-17757858 ]
zhangzhenhao commented on SPARK-42905:
--------------------------------------

Minimal reproducible example. The result is incorrect and inconsistent when the number of tied values exceeds 10,000,000:

```scala
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

val N = 10000002
// All rows but the last are tied; x and y are perfectly anti-correlated.
val x = sc.range(0, N).map(i => if (i < N - 1) 1.0 else 2.0)
val y = sc.range(0, N).map(i => if (i < N - 1) 2.0 else 1.0)
val df = x.zip(y)
  .map { case (x, y) => Vectors.dense(x, y) }
  .map(Tuple1.apply)
  .repartition(1)
  .toDF("features")

val Row(coeff1: Matrix) = Correlation.corr(df, "features", "spearman").head
val r = coeff1(0, 1)
println(s"spearman correlation in spark: $r")
// spearman correlation in spark: -9.999990476024495E-8
```

The correct result is -1.0.

> pyspark.ml.stat.Correlation - Spearman correlation method giving incorrect
> and inconsistent results for the same DataFrame if it has a huge number of ties.
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-42905
>                 URL: https://issues.apache.org/jira/browse/SPARK-42905
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 3.3.0
>            Reporter: dronzer
>            Priority: Critical
>              Labels: correctness
>         Attachments: image-2023-03-23-10-51-28-420.png,
> image-2023-03-23-10-52-11-481.png, image-2023-03-23-10-52-49-392.png,
> image-2023-03-23-10-53-37-461.png, image-2023-03-23-10-55-26-879.png
>
>
> pyspark.ml.stat.Correlation
> The following scenario shows the Correlation function failing to give the
> correct Spearman coefficient.
> Example: a Spark DataFrame has two columns, A and B.
> !image-2023-03-23-10-55-26-879.png|width=562,height=162!
> Column A has 3 distinct values over 108 million rows; column B has 4
> distinct values over 108 million rows.
> If I calculate the correlation for this DataFrame with pandas DF.corr, it
> gives the correct answer, and rerunning the same code multiple times
> produces the same answer (each column has only 3-4 distinct values).
> !image-2023-03-23-10-53-37-461.png|width=468,height=287!
>
> Spark's Spearman correlation, by contrast, produces *different results* for
> the *same DataFrame* on multiple runs (see below), even though each column
> in this DataFrame has only 3-4 distinct values.
> !image-2023-03-23-10-52-49-392.png|width=516,height=322!
>
> In short, pandas DF.corr gives the same result for the same DataFrame on
> every run, which is the expected behaviour. Spark, on the same data, not
> only gives a different result, but running the same cell with the same data
> multiple times produces different results, so the output is inconsistent.
> The one property of the data I can point to is ties (only 3-4 distinct
> values over 108M rows). This scenario is apparently not handled in Spark's
> Correlation method, since the same data produces consistent results in
> Python via df.corr.
> The only workaround we could find that gives consistent output, matching
> Python, is a Pandas UDF, as shown below (see the sketch after the quoted
> description):
> !image-2023-03-23-10-52-11-481.png|width=518,height=111!
> !image-2023-03-23-10-51-28-420.png|width=509,height=270!
>
> We also tried the pyspark.pandas.DataFrame.corr method, and it produces
> incorrect and inconsistent results for this case too.
> Only a Pandas UDF seems to provide consistent results.
> Another point to note: if I add some random noise to the data, which in
> turn increases the number of distinct values, the results become consistent
> again on every run. This makes me believe that the Python version handles
> ties correctly and gives consistent results no matter how many ties exist,
> whereas the pyspark method is somehow unable to handle a large number of
> ties in the data.
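For reference, here is a minimal sketch of what the Pandas-UDF workaround described above might look like. The reporter's actual UDF exists only in the attached screenshots, so the data-generation step, the column names, and the `spearman_corr` function below are illustrative assumptions rather than the original code; the relevant point is that pandas resolves ties with average ranks, which makes the result deterministic.

```python
# Hypothetical reconstruction of the Pandas-UDF workaround; names and data
# generation are assumptions, not the reporter's original code.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

# Heavily tied data mirroring the Scala repro above: every row but the last
# is tied, and the two columns are perfectly anti-correlated.
N = 10000002
df = spark.range(N).selectExpr(
    f"CASE WHEN id < {N - 1} THEN 1.0 ELSE 2.0 END AS a",
    f"CASE WHEN id < {N - 1} THEN 2.0 ELSE 1.0 END AS b",
)

# Series-to-scalar (grouped aggregate) pandas UDF: pandas assigns average
# ranks to tied values, so the Spearman coefficient it returns is stable.
@pandas_udf("double")
def spearman_corr(a: pd.Series, b: pd.Series) -> float:
    return a.corr(b, method="spearman")

df.agg(spearman_corr("a", "b")).show()
# Expected output for this data: -1.0, on every run.
```

Note that a Series-to-scalar pandas UDF materialises each group's columns in a single pandas process, so at the 108M-row scale from the report this needs sufficient executor memory (or a sampled/partitioned variant); the sketch only demonstrates that the tie handling is consistent.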