Dear all,

I was exploring a use case of PCA and found that the results of Spark ML and R differ.
More precisely:
1) eigenMatrix_Spark equals eigenMatrix_R
2) transformedData_Spark does NOT equal transformedData_R

Sample Spark Code
----------------------------------
PCAModel pca = new PCA()
    .setInputCol("features")
    .setOutputCol("pcaFeatures")
    .setK(numberOfCol)
    .fit(inputDataset);
DenseMatrix eigenMatrix_Spark = pca.pc();
Dataset<Row> transformedData_Spark = pca.transform(inputDataset.select("features"));

Sample R Code
---------------------------------
pc <- prcomp(mydata)
eigenMatrix_R <- pc$rotation
transformedData_R <- pc$x

**********************************************************************

After further analysis, I found the following:

- By default, R first mean-centers the input dataset and then uses this centered dataset to calculate both the Eigen Matrix and the Transformed Data. [It uses the parameter 'center = TRUE' for mean-centering.]

- Spark, on the other hand, probably mean-centers the input data only to calculate the Eigen Matrix, and then uses the original (uncentered) dataset to compute the Transformed Data. [Generally, Transformed Data = Eigen Matrix * Dataset.]

That would explain why the Eigen Matrix is the same in Spark and R, while the Transformed Data differs between the two.

So, can anyone please point out why Spark does not use the mean-centered input data when computing the Transformed Data (although it does so when computing the Eigen Matrix), as opposed to R? [Mean-centering the input data first is recommended for a good PCA analysis by many technical papers, and it is what R does.]

With Best Regards
Amlan Jyoti
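P.S. The centering hypothesis above can be checked numerically without Spark or R. What follows is only a minimal sketch in plain Java, not Spark's or R's actual implementation: the class name, the helper methods, the sample data, and the matrix `w` are all hypothetical (`w` is just an orthonormal stand-in for an Eigen Matrix, not the true principal components of the sample data). It illustrates that if the same matrix is applied to centered and to uncentered data, the two results differ, for every row, by the same constant vector: the projection of the column means.

```java
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
class PcaCenteringSketch {

    // Column means of a row-major data matrix (rows are observations).
    static double[] columnMeans(double[][] x) {
        int d = x[0].length;
        double[] mean = new double[d];
        for (double[] row : x) {
            for (int j = 0; j < d; j++) {
                mean[j] += row[j];
            }
        }
        for (int j = 0; j < d; j++) {
            mean[j] /= x.length;
        }
        return mean;
    }

    // Subtract the column means from every row (mean-centering).
    static double[][] center(double[][] x, double[] mean) {
        double[][] out = new double[x.length][mean.length];
        for (int i = 0; i < x.length; i++) {
            for (int j = 0; j < mean.length; j++) {
                out[i][j] = x[i][j] - mean[j];
            }
        }
        return out;
    }

    // Project each row of x onto the columns of w, i.e. compute x * w.
    static double[][] project(double[][] x, double[][] w) {
        int d = w.length;
        int k = w[0].length;
        double[][] out = new double[x.length][k];
        for (int i = 0; i < x.length; i++) {
            for (int c = 0; c < k; c++) {
                for (int j = 0; j < d; j++) {
                    out[i][c] += x[i][j] * w[j][c];
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample data: 3 observations, 2 features.
        double[][] x = {{2, 1}, {4, 3}, {6, 8}};
        // Assumed orthonormal matrix standing in for the Eigen Matrix.
        double[][] w = {{0.6, -0.8}, {0.8, 0.6}};

        double[] mean = columnMeans(x);
        double[][] uncentered = project(x, w);                 // original data projected
        double[][] centered = project(center(x, mean), w);     // centered data projected
        double[] offset = project(new double[][] {mean}, w)[0]; // projected column means

        System.out.println("uncentered: " + Arrays.deepToString(uncentered));
        System.out.println("centered:   " + Arrays.deepToString(centered));
        System.out.println("offset:     " + Arrays.toString(offset));
    }
}
```

If the hypothesis holds, subtracting the projected column means of the input from transformedData_Spark should reproduce transformedData_R, up to the sign of each component (eigenvector signs are arbitrary and may differ between implementations).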