[jira] [Updated] (SPARK-45834) Fix Pearson correlation calculation more stable

Jiayi Liu (Jira) Tue, 07 Nov 2023 21:47:54 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-45834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jiayi Liu updated SPARK-45834:
------------------------------
    Description: 
Spark uses the formula {{ck / sqrt(xMk * yMk)}} to calculate the Pearson 
Correlation Coefficient. If {{xMk}} and {{yMk}} are very small, their product 
can underflow to 0 in double-precision arithmetic, which makes the denominator 
0. The calculation then returns Infinity.

For example, when columns a and b of a table hold identical values, as in the 
table below, the computed correlation is Infinity, but the correlation of 
identical columns should be 1.0.
||a||b||
|1e-200|1e-200|
|1e-200|1e-200|
|1e-100|1e-100|
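
A minimal Python sketch of the failure on the table above. It assumes {{ck}} is the sum of cross deviations and {{xMk}}, {{yMk}} are the sums of squared deviations (the ticket does not define them), and it uses a plain two-pass computation rather than Spark's own update code. The helpers co_moments and ieee_div are hypothetical; ieee_div only mimics JVM double division, since Python raises on division by zero.

```python
import math

def co_moments(xs, ys):
    """Two-pass cross-deviation sum and sums of squared deviations."""
    n = len(xs)
    x_avg = sum(xs) / n
    y_avg = sum(ys) / n
    ck = sum((x - x_avg) * (y - y_avg) for x, y in zip(xs, ys))
    x_mk = sum((x - x_avg) * (x - x_avg) for x in xs)
    y_mk = sum((y - y_avg) * (y - y_avg) for y in ys)
    return ck, x_mk, y_mk

def ieee_div(num, den):
    """Mimic JVM double division, which yields Infinity or NaN instead of raising."""
    if den != 0.0:
        return num / den
    if num == 0.0 or math.isnan(num):
        return math.nan
    return math.copysign(math.inf, num) * math.copysign(1.0, den)

a = [1e-200, 1e-200, 1e-100]
b = [1e-200, 1e-200, 1e-100]

ck, x_mk, y_mk = co_moments(a, b)

# Each moment is tiny but nonzero; their product is below the smallest
# representable double and underflows.
product = x_mk * y_mk                      # 0.0
result = ieee_div(ck, math.sqrt(product))  # inf

print(x_mk, y_mk, product, result)
```

Each moment on its own is representable, so the zero denominator comes only from the multiplication.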

Modifying the formula to {{ck / sqrt(xMk) / sqrt(yMk)}} solves this issue and 
improves the stability of the calculation. The modification splits the square 
root of the denominator into two factors, {{sqrt(xMk)}} and {{sqrt(yMk)}}, so 
the product {{xMk * yMk}} is never formed. This avoids both multiplication 
overflow and the case where the product of extremely small values becomes zero.
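
The difference between the two formulas can be sketched in Python as follows. The moment values are illustrative assumptions, chosen to be roughly the magnitude that the example table produces; they are not taken from Spark.

```python
import math

# Illustrative magnitudes (assumption): identical columns give ck == xMk == yMk.
ck = x_mk = y_mk = 6.67e-201

# Original form: the product is formed first and underflows to zero.
combined_denominator = math.sqrt(x_mk * y_mk)

# Proposed form: each square root is taken separately, so no tiny product is formed.
split_result = ck / math.sqrt(x_mk) / math.sqrt(y_mk)

print(combined_denominator)  # 0.0
print(split_result)          # approximately 1.0
```

Dividing by the two square roots in sequence keeps every intermediate value near 1e-100, well inside the range of a double.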
 
 

  was:
Spark uses the formula {{ck / sqrt(xMk * yMk)}} to calculate the Pearson 
Correlation Coefficient. If {{xMk}} and {{yMk}} are very small, it can lead to 
double multiplication overflow, resulting in a denominator of 0. This leads to 
a NaN result in the calculation.

For example, when calculating the correlation for the same columns a and b in a 
table, the result will be Infinity, but the correlation for identical columns 
should be 1.0 instead.
||a||b||
|1e-200|1e-200|
|1e-200|1e-200|
|1e-100|1e-100|

Modifying the formula to {{ck / sqrt(xMk) / sqrt(yMk)}} can indeed solve this 
issue and improve the stability of the calculation. The benefit of this 
modification is that it splits the square root of the denominator into two 
parts: {{sqrt(xMk)}} and {{sqrt(yMk)}}. This helps avoid multiplication 
overflow or cases where the product of extremely small values becomes zero.
 
 


> Fix Pearson correlation calculation more stable
> -----------------------------------------------
>
>                 Key: SPARK-45834
>                 URL: https://issues.apache.org/jira/browse/SPARK-45834
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Jiayi Liu
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
