[ https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dronzer updated SPARK-42905:
----------------------------
    Description: 
pyspark.ml.stat.Correlation

The following is a scenario where the Correlation function fails to give correct 
Spearman coefficient results.

Tested example: a Spark DataFrame with two columns, A and B.

!image-2023-03-23-10-55-26-879.png|width=562,height=162!

Column A has 3 distinct values over a total of 108 million rows.

Column B has 4 distinct values over a total of 108 million rows.

If I calculate the correlation for this DataFrame with pandas DataFrame.corr, it 
gives the correct answer, and running the same code multiple times produces the 
same answer. (Each column has only 3-4 distinct values.)

!image-2023-03-23-10-53-37-461.png|width=468,height=287!
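
For reference, a minimal sketch of the pandas baseline (the real data has 108M rows; the toy values below are assumptions for illustration only):

{code:python}
import pandas as pd

# Toy stand-in for the real data: few distinct values => many ties
pdf = pd.DataFrame({"A": [1, 2, 3] * 4, "B": [1, 2, 3, 4] * 3})

# pandas assigns tied values the average of the ranks they span,
# so repeated runs always return the same Spearman coefficient
print(pdf.corr(method="spearman"))
{code}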

 

In Spark, by contrast, Spearman correlation produces *different results* for the 
*same DataFrame* across multiple runs (see below; each column in this DataFrame 
has only 3-4 distinct values).

!image-2023-03-23-10-52-49-392.png|width=516,height=322!
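
A minimal sketch of the Spark path that exhibits the problem (session setup and toy values are assumptions; Correlation expects a single vector column, hence the assembler step):

{code:python}
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 1.0), (2.0, 3.0), (3.0, 2.0)] * 4, ["A", "B"])

# Assemble the two columns into the single vector column Correlation expects
vec_df = VectorAssembler(inputCols=["A", "B"], outputCol="features").transform(df)

# On the real 108M-row data, repeated runs of this line return different values
print(Correlation.corr(vec_df, "features", method="spearman").head()[0])
{code}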

 

In short, pandas DataFrame.corr gives the same result for the same DataFrame on 
every run, which is the expected behaviour. Spark, however, gives a different 
result on the same data, and running the same cell with the same data multiple 
times produces different results, i.e. the output is inconsistent.

Looking at the data, the only cause I can point to is ties: there are only 3-4 
distinct values over 108M rows. This scenario does not appear to be handled by 
Spark's Correlation method, since the same data produces consistent results in 
pandas df.corr. An illustration of why ties matter follows below.
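
Spearman's coefficient is the Pearson correlation of the ranks, and tied values should all receive the average of the ranks they span. A small pandas illustration:

{code:python}
import pandas as pd

# Six values with heavy ties: the three 2s span ranks 3, 4 and 5,
# so each should receive the average rank (3 + 4 + 5) / 3 = 4.0
s = pd.Series([1, 1, 2, 2, 2, 3])
print(s.rank())  # 1.5, 1.5, 4.0, 4.0, 4.0, 6.0
{code}

If the ranks assigned to tied values vary from run to run (for example, with partition or sort order), the resulting coefficient will vary too, which would be consistent with the behaviour reported here.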

The only workaround we could find that gives consistent output in Spark, 
matching the pandas result, is a Pandas UDF, as shown below:

!image-2023-03-23-10-52-11-481.png|width=518,height=111!

!image-2023-03-23-10-51-28-420.png|width=509,height=270!
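
The exact workaround is in the screenshots above; one possible shape of it, sketched under the assumption of a single-group applyInPandas (the group key, schema, and function name here are illustrative, not the code from the screenshots):

{code:python}
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 1.0), (2.0, 3.0), (3.0, 2.0)] * 4, ["A", "B"])

def spearman(pdf: pd.DataFrame) -> pd.DataFrame:
    # Delegate the ranking to pandas so ties get average ranks
    return pd.DataFrame({"corr": [pdf["A"].corr(pdf["B"], method="spearman")]})

# A constant key collapses everything into one group, so the whole DataFrame
# reaches pandas in a single call (memory-bound, but deterministic)
df.withColumn("g", lit(1)) \
  .groupBy("g") \
  .applyInPandas(spearman, schema="corr double") \
  .show()
{code}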

 

We also tried the pyspark.pandas.DataFrame.corr method, and it produces 
incorrect and inconsistent results for this case as well (sketch below).
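
For completeness, the pandas-on-Spark call we tried (again with assumed toy values):

{code:python}
import pyspark.pandas as ps

psdf = ps.DataFrame({"A": [1, 2, 3] * 4, "B": [1, 2, 3, 4] * 3})

# Despite the pandas-style API, this executes on Spark and showed the
# same inconsistency on the real, heavily tied data
print(psdf.corr(method="spearman"))
{code}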

Only the Pandas UDF approach seems to provide consistent results.

 

> pyspark.ml.stat.Correlation - Spearman correlation method giving incorrect 
> and inconsistent results for the same DataFrame if it has a huge number of ties.
> ---------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-42905
>                 URL: https://issues.apache.org/jira/browse/SPARK-42905
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 3.3.0
>            Reporter: dronzer
>            Priority: Blocker
>         Attachments: image-2023-03-23-10-51-28-420.png, 
> image-2023-03-23-10-52-11-481.png, image-2023-03-23-10-52-49-392.png, 
> image-2023-03-23-10-53-37-461.png, image-2023-03-23-10-55-26-879.png
>


