[ https://issues.apache.org/jira/browse/SPARK-39664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562438#comment-17562438 ]
igal l edited comment on SPARK-39664 at 7/5/22 7:52 AM:
--------------------------------------------------------

Sure [~hyukjin.kwon]

{code:python}
import numpy as np
import random

from pyspark.ml.linalg import DenseVector
from pyspark.mllib.linalg.distributed import RowMatrix, IndexedRowMatrix
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation
from pyspark.ml.clustering import GaussianMixture
{code}
{code:python}
# Create the RDD and DataFrame of random 768-dimensional embeddings.
def generate_10_embeddings(t):
    for c in range(0, 10):
        yield (DenseVector([random.uniform(-1, 1) for i in range(768)]),)

rdd = spark.sparkContext.parallelize([(1,)])

# Increase the size of this loop to create more data.
# The number of rows will be 10 ^ n.
for _ in range(0, 6):
    rdd = rdd.flatMap(generate_10_embeddings)

rdd = rdd.repartition(int(spark.conf.get('spark.sql.shuffle.partitions')))
print(rdd.count())

df = spark.createDataFrame(rdd, ["features"])
df.cache()
{code}
{code:python}
# The rows are one-element tuples, so unwrap them into plain vectors
# before handing the RDD to RowMatrix.
cov = RowMatrix(rdd.map(list)).computeCovariance()
cov
{code}
{code:python}
cor = Correlation.corr(df, 'features')
cor
{code}

> RowMatrix(...).computeCovariance() VS Correlation.corr(..., ...)
> ----------------------------------------------------------------
>
>                 Key: SPARK-39664
>                 URL: https://issues.apache.org/jira/browse/SPARK-39664
>             Project: Spark
>          Issue Type: Bug
>          Components: Pandas API on Spark, PySpark
>    Affects Versions: 3.2.1
>            Reporter: igal l
>            Priority: Major
>
> I have a PySpark DataFrame with one column. The column's type is Vector and its values are DenseVectors of size 768. The DataFrame has 1 million rows.
> I want to calculate the covariance matrix of this set of vectors.
> When I calculate it with `RowMatrix(df.rdd.map(list)).computeCovariance()`, it takes 1.57 minutes.
> When I calculate the correlation matrix with `Correlation.corr(df, '_1')`, it takes 33 seconds.
> The covariance and correlation formulas are pretty much the same, so I don't understand the gap between them.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
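On the last point in the description: the correlation matrix really is just the covariance matrix rescaled by the per-dimension standard deviations, corr = D^-1 cov D^-1 with D = diag(std). A small standalone NumPy sketch (illustrative only, not Spark code, and not part of the reproduction above) of that identity:

```python
import numpy as np

# Small stand-in for the embedding data: 1000 rows, 4 dimensions.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 4))

cov = np.cov(X, rowvar=False)        # sample covariance matrix
corr = np.corrcoef(X, rowvar=False)  # Pearson correlation matrix

# Rescale the covariance by the per-dimension standard deviations:
# corr[i, j] == cov[i, j] / (std[i] * std[j])
std = np.sqrt(np.diag(cov))
rescaled = cov / np.outer(std, std)

assert np.allclose(rescaled, corr)
```

So the two computations are algebraically a diagonal rescaling apart; any large runtime gap between them would come from implementation differences, not from the formulas themselves.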