[ 
https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dronzer updated SPARK-42905:
----------------------------
    Attachment: image-2023-03-23-10-52-49-392.png

> pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect 
> and inconsistent results for the same DataFrame if it has a huge number of ties.
> ---------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-42905
>                 URL: https://issues.apache.org/jira/browse/SPARK-42905
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 3.3.0
>            Reporter: dronzer
>            Priority: Blocker
>         Attachments: image-2023-03-23-10-51-28-420.png, 
> image-2023-03-23-10-52-11-481.png, image-2023-03-23-10-52-49-392.png
>
>
> pyspark.ml.stat.Correlation
> The following scenario shows the Correlation function failing to give correct 
> Spearman coefficient results.
> Tested example: a Spark DataFrame with 2 columns, A and B, each with 108 million rows.
> Column A has 3 distinct values; column B has 4 distinct values.
> If I calculate the correlation for this DataFrame with pandas DataFrame.corr, it 
> gives the correct answer, and running the same code multiple times produces the 
> same answer.
> !image-2023-03-23-10-38-49-071.png|width=526,height=258!
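> For reference, a minimal sketch of the pandas check described above (column names 
> and distinct-value counts are taken from the description; the row count is reduced 
> and the data is synthetic, purely for illustration):
> {code:python}
> import numpy as np
> import pandas as pd
> 
> # Synthetic stand-in for the described data: two columns with very few
> # distinct values (heavy ties). The real DataFrame has 108M rows.
> n = 1_000_000
> rng = np.random.default_rng(42)
> pdf = pd.DataFrame({
>     "A": rng.integers(0, 3, size=n),  # 3 distinct values
>     "B": rng.integers(0, 4, size=n),  # 4 distinct values
> })
> 
> # pandas returns the same Spearman coefficient on every run for the same data.
> print(pdf.corr(method="spearman"))
> {code}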
>  
> Running the same computation in Spark with the Spearman correlation produces 
> *different results* for the *same DataFrame* across multiple runs (see below).
> !image-2023-03-23-10-41-38-696.png|width=527,height=329!
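> A sketch of the Spark call in question, assuming the data is in a DataFrame 
> 'df' with numeric columns A and B (the exact code is in the attached screenshot):
> {code:python}
> from pyspark.ml.feature import VectorAssembler
> from pyspark.ml.stat import Correlation
> 
> # Correlation.corr expects a single vector column, so assemble A and B first.
> assembler = VectorAssembler(inputCols=["A", "B"], outputCol="features")
> features = assembler.transform(df).select("features")
> 
> # Re-running this on the same DataFrame returns different Spearman matrices,
> # which is the inconsistency reported here.
> matrix = Correlation.corr(features, "features", method="spearman").head()[0]
> print(matrix)
> {code}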
>  
> In short, pandas DataFrame.corr gives the same result for the same DataFrame on 
> every run, which is the expected behaviour. In Spark, the same data gives a 
> different result; moreover, re-running the same cell with the same data multiple 
> times produces different results, so the output is inconsistent.
> The only characteristic of the data I can point to is the ties: there are only 
> 3-4 distinct values across 108M rows. This case does not seem to be handled by 
> the Spark Correlation method, since the same data produces consistent results 
> in Python with df.corr.
> The only workaround we could find that gives consistent output in Spark, matching 
> the Python result, is to use a Pandas UDF, as shown below:
> !image-2023-03-23-10-48-01-045.png|width=554,height=94!
> !image-2023-03-23-10-48-55-922.png|width=568,height=301!
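> For illustration, one way to express that workaround with applyInPandas (the 
> grouping key, output schema, and function name here are illustrative, not the 
> exact code from the screenshots):
> {code:python}
> import pandas as pd
> from pyspark.sql import functions as F
> from pyspark.sql.types import DoubleType, StructField, StructType
> 
> out_schema = StructType([StructField("spearman_AB", DoubleType())])
> 
> def spearman_corr(pdf: pd.DataFrame) -> pd.DataFrame:
>     # Delegate the Spearman computation to pandas, which handles the ties
>     # deterministically.
>     return pd.DataFrame(
>         {"spearman_AB": [pdf["A"].corr(pdf["B"], method="spearman")]}
>     )
> 
> # Put all rows into a single group so pandas sees the whole dataset
> # (note: this pulls everything onto one worker, which limits scalability).
> result = (df.withColumn("g", F.lit(1))
>             .groupBy("g")
>             .applyInPandas(spearman_corr, schema=out_schema))
> result.show()
> {code}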
>  
> We also tried the pyspark.pandas.DataFrame.corr method, and it produces incorrect 
> and inconsistent results for this case as well.
> Only the Pandas UDF approach seems to give consistent results.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
