[ https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757858#comment-17757858 ]
zhangzhenhao commented on SPARK-42905:
--------------------------------------

Minimal reproducible example. The result is incorrect and inconsistent when the number of tied values exceeds 10,000,000:

```scala
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

val N = 10000002
// All rows but the last are tied; x and y are perfectly anti-correlated.
val x = sc.range(0, N).map(i => if (i < N - 1) 1.0 else 2.0)
val y = sc.range(0, N).map(i => if (i < N - 1) 2.0 else 1.0)
val df = x.zip(y)
  .map { case (x, y) => Vectors.dense(x, y) }
  .map(Tuple1.apply)
  .repartition(1)
  .toDF("features")

val Row(coeff1: Matrix) = Correlation.corr(df, "features", "spearman").head
val r = coeff1(0, 1)
println(s"spearman correlation in spark: $r")
// spearman correlation in spark: -9.999990476024495E-8
```

The correct result is -1.0.

> pyspark.ml.stat.Correlation - Spearman correlation method giving incorrect
> and inconsistent results for the same DataFrame if it has a huge number of ties.
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-42905
>                 URL: https://issues.apache.org/jira/browse/SPARK-42905
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 3.3.0
>            Reporter: dronzer
>            Priority: Critical
>              Labels: correctness
>         Attachments: image-2023-03-23-10-51-28-420.png,
> image-2023-03-23-10-52-11-481.png, image-2023-03-23-10-52-49-392.png,
> image-2023-03-23-10-53-37-461.png, image-2023-03-23-10-55-26-879.png
>
>
> pyspark.ml.stat.Correlation
> The following scenario shows the Correlation function failing to give the
> correct Spearman coefficient.
> Example: a Spark DataFrame has two columns, A and B.
> !image-2023-03-23-10-55-26-879.png|width=562,height=162!
> Column A has 3 distinct values over 108 million rows; column B has 4
> distinct values over 108 million rows.
> If I calculate the correlation for this DataFrame with pandas DF.corr, it
> gives the correct answer, and rerunning the same code multiple times
> produces the same answer (each column has only 3-4 distinct values).
> !image-2023-03-23-10-53-37-461.png|width=468,height=287!
>
> Spark's Spearman correlation, by contrast, produces *different results* for
> the *same DataFrame* on multiple runs (see below), even though each column
> in this DataFrame has only 3-4 distinct values.
> !image-2023-03-23-10-52-49-392.png|width=516,height=322!
>
> In short, pandas DF.corr gives the same result for the same DataFrame on
> every run, which is the expected behaviour. Spark, on the same data, not
> only gives a different result, but running the same cell with the same data
> multiple times produces different results, so the output is inconsistent.
> The one property of the data I can point to is ties (only 3-4 distinct
> values over 108M rows). This scenario is apparently not handled in Spark's
> Correlation method, since the same data produces consistent results in
> Python via df.corr.
> The only workaround we could find that gives consistent output, matching
> Python, is a Pandas UDF, as shown below (see the sketch after the quoted
> description):
> !image-2023-03-23-10-52-11-481.png|width=518,height=111!
> !image-2023-03-23-10-51-28-420.png|width=509,height=270!
>
> We also tried the pyspark.pandas.DataFrame.corr method, and it produces
> incorrect and inconsistent results for this case too.
> Only a Pandas UDF seems to provide consistent results.
> Another point to note: if I add some random noise to the data, which in
> turn increases the number of distinct values, the results become consistent
> again on every run. This makes me believe that the Python version handles
> ties correctly and gives consistent results no matter how many ties exist,
> whereas the pyspark method is somehow unable to handle a large number of
> ties in the data.
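For reference, here is a minimal sketch of what the Pandas-UDF workaround described above might look like. The reporter's actual UDF exists only in the attached screenshots, so the data-generation step, the column names, and the `spearman_corr` function below are illustrative assumptions rather than the original code; the relevant point is that pandas resolves ties with average ranks, which makes the result deterministic.

```python
# Hypothetical reconstruction of the Pandas-UDF workaround; names and data
# generation are assumptions, not the reporter's original code.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

# Heavily tied data mirroring the Scala repro above: every row but the last
# is tied, and the two columns are perfectly anti-correlated.
N = 10000002
df = spark.range(N).selectExpr(
    f"CASE WHEN id < {N - 1} THEN 1.0 ELSE 2.0 END AS a",
    f"CASE WHEN id < {N - 1} THEN 2.0 ELSE 1.0 END AS b",
)

# Series-to-scalar (grouped aggregate) pandas UDF: pandas assigns average
# ranks to tied values, so the Spearman coefficient it returns is stable.
@pandas_udf("double")
def spearman_corr(a: pd.Series, b: pd.Series) -> float:
    return a.corr(b, method="spearman")

df.agg(spearman_corr("a", "b")).show()
# Expected output for this data: -1.0, on every run.
```

Note that a Series-to-scalar pandas UDF materialises each group's columns in a single pandas process, so at the 108M-row scale from the report this needs sufficient executor memory (or a sampled/partitioned variant); the sketch only demonstrates that the tie handling is consistent.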