[ 
https://issues.apache.org/jira/browse/SPARK-39664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562438#comment-17562438
 ] 

igal l edited comment on SPARK-39664 at 7/5/22 7:52 AM:
--------------------------------------------------------

Sure [~hyukjin.kwon] 
{code:java}
import numpy as np
import random
from pyspark.ml.linalg import DenseVector
from pyspark.mllib.linalg.distributed import RowMatrix, IndexedRowMatrix
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation
from pyspark.ml.clustering import GaussianMixture{code}
{code:java}
# Create RDD and DataFrame
# Each call expands one input row into 10 rows of random 768-dim vectors
def generate_10_embeddings(t):
    for c in range(0, 10):
        yield (DenseVector([random.uniform(-1, 1) for i in range(768)]),)

rdd = spark.sparkContext.parallelize([(1,)])
# Increase the range of this loop to create more data.
# The number of rows will be 10 ^ n.
for _ in range(0, 6):
    rdd = rdd.flatMap(generate_10_embeddings)
    rdd = rdd.repartition(int(spark.conf.get('spark.sql.shuffle.partitions')))
    print(rdd.count())
    
# One-column DataFrame of vectors, cached for reuse
df = spark.createDataFrame(rdd, ["features"])
df.cache()
{code}
{code:java}
# Covariance via the RDD-based MLlib RowMatrix API
cov = RowMatrix(rdd).computeCovariance()
cov {code}
{code:java}
# Pearson correlation via the DataFrame-based ML API
cor = Correlation.corr(df, 'features')
cor {code}
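A minimal sketch for timing the two paths on the same data (assuming the {{rdd}} and {{df}} built above; the added {{.head()}} is only there to force the otherwise lazy correlation job, so both calls actually compute):
{code:java}
import time

# Rough wall-clock timing of the RDD-based covariance path (eager)
start = time.perf_counter()
cov = RowMatrix(rdd).computeCovariance()
print("computeCovariance:", time.perf_counter() - start, "seconds")

# Rough wall-clock timing of the DataFrame-based correlation path;
# .head() forces the lazy correlation job to run
start = time.perf_counter()
cor = Correlation.corr(df, 'features').head()[0]
print("Correlation.corr:", time.perf_counter() - start, "seconds")
{code}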
> RowMatrix(...).computeCovariance() VS Correlation.corr(..., ...)
> ----------------------------------------------------------------
>
>                 Key: SPARK-39664
>                 URL: https://issues.apache.org/jira/browse/SPARK-39664
>             Project: Spark
>          Issue Type: Bug
>          Components: Pandas API on Spark, PySpark
>    Affects Versions: 3.2.1
>            Reporter: igal l
>            Priority: Major
>
> I have a PySpark DataFrame with one column. The column type is Vector and the
> values are DenseVectors of size 768. The DataFrame has 1 million rows.
> I want to calculate the covariance matrix of this set of vectors.
> When I calculate it with
> `RowMatrix(df.rdd.map(list)).computeCovariance()`, it takes 1.57 minutes.
> When I calculate the correlation matrix with `Correlation.corr(df, '_1')`,
> it takes 33 seconds.
> The formulas for covariance and correlation are almost the same, so I
> don't understand the gap between them.
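For reference, correlation is covariance rescaled by the per-dimension standard deviations, corr(i, j) = cov(i, j) / (std_i * std_j). A small NumPy-only sketch of that relationship, independent of Spark:
{code:java}
import numpy as np

# Toy data: 1000 samples of 4-dimensional vectors
x = np.random.rand(1000, 4)

cov = np.cov(x, rowvar=False)             # covariance matrix
std = np.sqrt(np.diag(cov))               # per-dimension standard deviations
corr_from_cov = cov / np.outer(std, std)  # corr(i, j) = cov(i, j) / (std_i * std_j)

# Matches NumPy's own correlation matrix
print(np.allclose(corr_from_cov, np.corrcoef(x, rowvar=False)))
{code}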


