[jira] [Updated] (SPARK-42905) pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect and inconsistent results for the same DataFrame if it has huge amount of Ties.

dronzer (Jira) Wed, 22 Mar 2023 22:30:12 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


dronzer updated SPARK-42905:
----------------------------
    Labels: correctness  (was: )

> pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect 
> and inconsistent results for the same DataFrame if it has huge amount of Ties.
> ---------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-42905
>                 URL: https://issues.apache.org/jira/browse/SPARK-42905
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 3.3.0
>            Reporter: dronzer
>            Priority: Blocker
>              Labels: correctness
>         Attachments: image-2023-03-23-10-51-28-420.png, 
> image-2023-03-23-10-52-11-481.png, image-2023-03-23-10-52-49-392.png, 
> image-2023-03-23-10-53-37-461.png, image-2023-03-23-10-55-26-879.png
>
>
> pyspark.ml.stat.Correlation
> The following scenario shows the Correlation function failing to return the 
> correct Spearman coefficient.
> Tested example: a Spark DataFrame with 2 columns, A and B.
> !image-2023-03-23-10-55-26-879.png|width=562,height=162!
> Column A has 3 distinct values across a total of 108 million rows.
> Column B has 4 distinct values across a total of 108 million rows.
> If I calculate the correlation for this DataFrame with pandas DF.corr, it 
> gives the correct answer, and running the same code multiple times produces 
> the same answer. (Each column has only 3-4 distinct values.)
> !image-2023-03-23-10-53-37-461.png|width=468,height=287!
>  
> In Spark, Spearman correlation produces *different results* for the *same 
> DataFrame* across multiple runs (see below). (Each column in this DataFrame 
> has only 3-4 distinct values.)
> !image-2023-03-23-10-52-49-392.png|width=516,height=322!
>  
> In short, pandas DF.corr gives the same result for the same DataFrame on 
> every run, which is the expected behaviour. Spark, given the same data, 
> gives a different result, and rerunning the same cell on the same data 
> produces different results each time, so the output is inconsistent.
> The only distinguishing feature of the data I could identify is ties (only 
> 3-4 distinct values over 108M rows). Spark's Correlation method does not 
> appear to handle this scenario, whereas pandas df.corr produces consistent 
> results on the same data.
> The only workaround we found that gives consistent output in Spark, matching 
> the pandas result, is a Pandas UDF, as shown below:
> !image-2023-03-23-10-52-11-481.png|width=518,height=111!
> !image-2023-03-23-10-51-28-420.png|width=509,height=270!
>  
> We also tried the pyspark.pandas.DataFrame.corr method, and it also produces 
> incorrect and inconsistent results for this case.
> Only the Pandas UDF seems to provide consistent results.
>  
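The report attributes the instability to ties but does not identify the mechanism. As a minimal sketch of one possible explanation, and not a description of Spark's actual implementation, the pure-Python example below compares two ranking schemes on hypothetical data shaped like the report's (3 distinct values in A, 4 in B). It assumes that tied values might receive ranks that depend on row order, which is an assumption and not something the report confirms. All function names and the generated data are illustrative only.

```python
import random
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def ordinal_ranks(values):
    """1-based ranks; ties are broken by row position, so row order matters."""
    order = sorted(range(len(values)), key=lambda i: values[i])  # stable sort
    ranks = [0.0] * len(values)
    for position, index in enumerate(order):
        ranks[index] = float(position + 1)
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y, rank=average_ranks):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


def spearman_over_shuffles(rows, rank, trials=5, seed=0):
    """Recompute the coefficient after shuffling row order several times."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        a, b = zip(*shuffled)
        results.append(spearman(list(a), list(b), rank))
    return results


# Hypothetical data: A takes 3 distinct values, B takes 4 distinct values.
data_rng = random.Random(42)
rows = []
for _ in range(2000):
    a = data_rng.choice([0, 1, 2])
    rows.append((a, a + data_rng.choice([0, 1])))

stable = spearman_over_shuffles(rows, average_ranks)
unstable = spearman_over_shuffles(rows, ordinal_ranks)
print("average ranks:", stable)
print("ordinal ranks:", unstable)
```

With average ranks, the coefficient is the same for every row ordering up to floating-point rounding, which matches the consistent behaviour the report describes for pandas. With position-based tie-breaking it changes from one shuffle to the next. Whether this is what happens inside Spark for this issue would need to be checked against the Spark source.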



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
