[ https://issues.apache.org/jira/browse/SPARK-39664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-39664:
---------------------------------
    Component/s: ML
                     (was: Pandas API on Spark)

> RowMatrix(...).computeCovariance() VS Correlation.corr(..., ...)
> ----------------------------------------------------------------
>
>                 Key: SPARK-39664
>                 URL: https://issues.apache.org/jira/browse/SPARK-39664
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>    Affects Versions: 3.2.1
>            Reporter: igal l
>            Priority: Major
>
> I have a PySpark DF with one column. The column type is Vector and the
> values are DenseVectors of size 768. The DF has 1 million rows.
> I want to calculate the covariance matrix of this set of vectors.
> When I calculate it with
> `RowMatrix(df.rdd.map(list)).computeCovariance()`, it takes 1.57 minutes.
> When I calculate the correlation matrix with `Correlation.corr(df,
> '_1')`, it takes 33 seconds.
> The formulas for covariance and correlation are nearly the same, so I
> don't understand the gap between them.
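
For reference, the relationship between the two quantities that the report alludes to can be sketched in plain Python. This is only an illustration on a tiny in-memory data set, not Spark's implementation, and it does not explain the timing gap. The helper names `covariance_matrix` and `correlation_from_covariance` are hypothetical. It assumes the sample covariance (denominator n - 1) and the Pearson correlation, which is the covariance rescaled by the per-column standard deviations.

```python
import math


def covariance_matrix(rows):
    """Sample covariance (n - 1 denominator) of equal-length row vectors."""
    n = len(rows)
    d = len(rows[0])
    # Column means.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    # Accumulate outer products of the centered rows.
    for r in rows:
        centered = [r[j] - means[j] for j in range(d)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += centered[i] * centered[j]
    return [[c / (n - 1) for c in row] for row in cov]


def correlation_from_covariance(cov):
    """Pearson correlation: cov[i][j] / (std[i] * std[j])."""
    d = len(cov)
    std = [math.sqrt(cov[i][i]) for i in range(d)]
    return [[cov[i][j] / (std[i] * std[j]) for j in range(d)] for i in range(d)]


# Hypothetical toy data: 4 rows, 2 columns.
rows = [[1.0, 2.0], [2.0, 4.5], [3.0, 5.5], [4.0, 8.0]]
cov = covariance_matrix(rows)
corr = correlation_from_covariance(cov)
```

Once the covariance matrix is available, the correlation matrix costs only an extra pass over its d x d entries, which is the sense in which the two computations are nearly the same.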