[ https://issues.apache.org/jira/browse/SPARK-45834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jiayi Liu updated SPARK-45834:
------------------------------
    Description:
Spark uses the formula {{ck / sqrt(xMk * yMk)}} to calculate the Pearson correlation coefficient. If {{xMk}} and {{yMk}} are very small, the product {{xMk * yMk}} can underflow to 0 in double-precision arithmetic, so the denominator becomes 0 and the result is Infinity.

For example, when calculating the correlation between two identical columns a and b in the table below, the result is Infinity, although the correlation of identical columns should be 1.0.

||a||b||
|1e-200|1e-200|
|1e-200|1e-200|
|1e-100|1e-100|

Changing the formula to {{ck / sqrt(xMk) / sqrt(yMk)}} solves this issue and improves the numerical stability of the calculation. The change splits the square root in the denominator into two factors, {{sqrt(xMk)}} and {{sqrt(yMk)}}, so the product {{xMk * yMk}} is never formed. This avoids both overflow of the product and the case where the product of extremely small values becomes zero.

  was:
Spark uses the formula {{ck / sqrt(xMk * yMk)}} to calculate the Pearson correlation coefficient. If {{xMk}} and {{yMk}} are very small, the product {{xMk * yMk}} can underflow to 0 in double-precision arithmetic, so the denominator becomes 0 and the result is NaN.

For example, when calculating the correlation between two identical columns a and b in the table below, the result is Infinity, although the correlation of identical columns should be 1.0.

||a||b||
|1e-200|1e-200|
|1e-200|1e-200|
|1e-100|1e-100|

Changing the formula to {{ck / sqrt(xMk) / sqrt(yMk)}} solves this issue and improves the numerical stability of the calculation. The change splits the square root in the denominator into two factors, {{sqrt(xMk)}} and {{sqrt(yMk)}}, so the product {{xMk * yMk}} is never formed. This avoids both overflow of the product and the case where the product of extremely small values becomes zero.
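The effect described in this issue can be reproduced outside Spark. The following is a minimal Python sketch, not Spark's implementation: it assumes a plain two-pass computation of {{ck}}, {{xMk}} and {{yMk}} (Spark's own aggregate may compute these moments differently), and the function names are hypothetical. Python raises an error on float division by zero, so the sketch maps that case to Infinity/NaN by hand to mirror IEEE 754 division as performed on the JVM.

```python
import math


def co_moments(xs, ys):
    """Two-pass computation of (ck, xMk, yMk) for two equal-length columns.

    ck  = sum of (x - mean(x)) * (y - mean(y))
    xMk = sum of (x - mean(x))^2, yMk likewise for y.
    Assumption: this is an illustration only, not Spark's update rule.
    """
    n = len(xs)
    x_avg = sum(xs) / n
    y_avg = sum(ys) / n
    ck = sum((x - x_avg) * (y - y_avg) for x, y in zip(xs, ys))
    x_mk = sum((x - x_avg) * (x - x_avg) for x in xs)
    y_mk = sum((y - y_avg) * (y - y_avg) for y in ys)
    return ck, x_mk, y_mk


def corr_product_form(ck, x_mk, y_mk):
    """ck / sqrt(xMk * yMk): the product can underflow to 0.0."""
    denom = math.sqrt(x_mk * y_mk)
    if denom == 0.0:
        # Emulate IEEE 754 division by zero (Python would raise instead).
        return math.nan if ck == 0.0 else math.copysign(math.inf, ck)
    return ck / denom


def corr_split_form(ck, x_mk, y_mk):
    """ck / sqrt(xMk) / sqrt(yMk): the product xMk * yMk is never formed."""
    return ck / math.sqrt(x_mk) / math.sqrt(y_mk)


if __name__ == "__main__":
    a = [1e-200, 1e-200, 1e-100]
    b = [1e-200, 1e-200, 1e-100]
    ck, x_mk, y_mk = co_moments(a, b)
    print("xMk * yMk    =", x_mk * y_mk)  # underflows to 0.0
    print("product form =", corr_product_form(ck, x_mk, y_mk))
    print("split form   =", corr_split_form(ck, x_mk, y_mk))
```

With the table from the description, {{xMk}} and {{yMk}} are each on the order of 1e-201, so their product is far below the smallest representable double and becomes 0. The split form divides by two values on the order of 1e-100 instead and returns a result close to 1.0.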
> Fix Pearson correlation calculation more stable
> -----------------------------------------------
>
>                 Key: SPARK-45834
>                 URL: https://issues.apache.org/jira/browse/SPARK-45834
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Jiayi Liu
>            Priority: Major
>
> Spark uses the formula {{ck / sqrt(xMk * yMk)}} to calculate the Pearson
> correlation coefficient. If {{xMk}} and {{yMk}} are very small, the product
> {{xMk * yMk}} can underflow to 0 in double-precision arithmetic, so the
> denominator becomes 0 and the result is Infinity.
> For example, when calculating the correlation between two identical columns
> a and b in the table below, the result is Infinity, although the correlation
> of identical columns should be 1.0.
> ||a||b||
> |1e-200|1e-200|
> |1e-200|1e-200|
> |1e-100|1e-100|
> Changing the formula to {{ck / sqrt(xMk) / sqrt(yMk)}} solves this issue and
> improves the numerical stability of the calculation. The change splits the
> square root in the denominator into two factors, {{sqrt(xMk)}} and
> {{sqrt(yMk)}}, so the product {{xMk * yMk}} is never formed. This avoids both
> overflow of the product and the case where the product of extremely small
> values becomes zero.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)