[ https://issues.apache.org/jira/browse/SPARK-35423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17355293#comment-17355293 ]
Apache Spark commented on SPARK-35423:
--------------------------------------

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/32734

> The output of PCA is inconsistent
> ---------------------------------
>
>                 Key: SPARK-35423
>                 URL: https://issues.apache.org/jira/browse/SPARK-35423
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 3.1.1
>         Environment: Spark Version: 3.1.1
>            Reporter: cqfrog
>            Priority: Major
>
> 1. The example from the docs
>
> {code:java}
> import org.apache.spark.ml.feature.PCA
> import org.apache.spark.ml.linalg.Vectors
>
> val data = Array(
>   Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
>   Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
>   Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
> )
> val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
> val pca = new PCA()
>   .setInputCol("features")
>   .setOutputCol("pcaFeatures")
>   .setK(3)
>   .fit(df)
> val result = pca.transform(df).select("pcaFeatures")
> result.show(false)
> {code}
>
> The output shows:
> {code:java}
> +-----------------------------------------------------------+
> |pcaFeatures                                                |
> +-----------------------------------------------------------+
> |[1.6485728230883807,-4.013282700516296,-5.524543751369388] |
> |[-4.645104331781534,-1.1167972663619026,-5.524543751369387]|
> |[-6.428880535676489,-5.337951427775355,-5.524543751369389] |
> +-----------------------------------------------------------+
> {code}
>
> 2. Change the Vector format
>
> I changed the first row from "Vectors.sparse(5, Seq((1, 1.0), (3, 7.0)))" to
> "Vectors.dense(0.0,1.0,0.0,7.0,0.0)", which holds the same values, but the
> output now shows:
> {code:java}
> +------------------------------------------------------------+
> |pcaFeatures                                                 |
> +------------------------------------------------------------+
> |[1.6485728230883814,-4.0132827005162985,-1.0091435193998504]|
> |[-4.645104331781533,-1.1167972663619048,-1.0091435193998501]|
> |[-6.428880535676488,-5.337951427775359,-1.009143519399851]  |
> +------------------------------------------------------------+
> {code}
>
> It's strange that the two outputs are inconsistent. Why?
> Thanks.
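
The two runs in the report can be compared side by side with a sketch like the one below. It assumes a spark-shell session in which `spark` is already defined, as the reporter's snippet does. The helper name `project` is hypothetical and not part of the report or of Spark. The sketch first checks that the sparse and dense forms of the first row hold the same values, then prints the absolute difference between the two projections for each component. Because only three rows are fitted, the centered data has rank at most two, so the third component is the one worth watching. This is a way to observe the discrepancy, not an explanation of its cause.

```scala
// Assumes spark-shell, where `spark` is the predefined SparkSession.
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Rows shared by both runs (values taken from the report).
val rest: Seq[Vector] = Seq(
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)

// The first row in its two representations.
val sparseFirst: Vector = Vectors.sparse(5, Seq((1, 1.0), (3, 7.0)))
val denseFirst: Vector = Vectors.dense(0.0, 1.0, 0.0, 7.0, 0.0)

// Both forms hold the same five values.
assert(sparseFirst.toArray.sameElements(denseFirst.toArray))

// Hypothetical helper: fit PCA with k = 3 and return the projected rows.
def project(first: Vector): Array[Array[Double]] = {
  val df = spark.createDataFrame((first +: rest).map(Tuple1.apply)).toDF("features")
  val model = new PCA()
    .setInputCol("features")
    .setOutputCol("pcaFeatures")
    .setK(3)
    .fit(df)
  model.transform(df)
    .select("pcaFeatures")
    .collect()
    .map(_.getAs[Vector](0).toArray)
}

val fromSparse = project(sparseFirst)
val fromDense = project(denseFirst)

// Print |sparse - dense| for every component of every row.
fromSparse.zip(fromDense).zipWithIndex.foreach { case ((sp, de), row) =>
  val diffs = sp.zip(de).map { case (a, b) => math.abs(a - b) }
  println(s"row $row: " + diffs.mkString(", "))
}
```

In spark-shell, enter the block with `:paste` so that the chained `.setInputCol(...)` calls are parsed as one expression.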