Help in Parsing 'Categorical' type of data
Hi,

I am trying to run a Naive Bayes model using the Spark ML libraries, in Java. A sample snippet of the dataset is given below:

Raw Data - [data snippet omitted]

Since the input data needs to be numeric, I am using a one-hot encoder on the Gender field [m -> (0,1), f -> (1,0)], and finally the resulting 'features' vector is fed to the model, and I could get the output. A sketch of this pipeline is given at the end of this mail.

Transformed Data - [data snippet omitted]

But the model results are not correct, because the 'Gender' field (originally categorical) is now treated as a continuous field after the one-hot encoding transformation. The expectation is that, for continuous data, the mean and variance are calculated; and for categorical data, the number of occurrences of the different categories is counted. [In my case, mean and variance are calculated even for the Gender field.]

So, is there any way by which I can indicate to the model that a particular data field is 'categorical' by nature?
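For reference, here is a minimal sketch of the encoding pipeline described above. It assumes the Spark 2.x ML API (where OneHotEncoder is a plain Transformer) and uses placeholder column names -- "gender", "age", "label" -- standing in for my actual schema:

    import org.apache.spark.ml.Pipeline;
    import org.apache.spark.ml.PipelineModel;
    import org.apache.spark.ml.PipelineStage;
    import org.apache.spark.ml.classification.NaiveBayes;
    import org.apache.spark.ml.feature.OneHotEncoder;
    import org.apache.spark.ml.feature.StringIndexer;
    import org.apache.spark.ml.feature.VectorAssembler;

    // trainingData: a Dataset<Row> with columns
    // gender (string), age (numeric), label (numeric) -- placeholder schema.

    // Map the string categories to numeric indices (e.g. f -> 0.0, m -> 1.0).
    StringIndexer genderIndexer = new StringIndexer()
        .setInputCol("gender")
        .setOutputCol("genderIndex");

    // Expand the index into a 0/1 indicator vector; with dropLast(false)
    // this yields the full [m -> (0,1), f -> (1,0)] mapping described above.
    OneHotEncoder genderEncoder = new OneHotEncoder()
        .setInputCol("genderIndex")
        .setOutputCol("genderVec")
        .setDropLast(false);

    // Assemble the encoded gender with the remaining numeric columns.
    VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[] {"genderVec", "age"})
        .setOutputCol("features");

    NaiveBayes nb = new NaiveBayes()
        .setLabelCol("label")
        .setFeaturesCol("features"); // modelType is "multinomial" by default

    PipelineModel model = new Pipeline()
        .setStages(new PipelineStage[] {genderIndexer, genderEncoder, assembler, nb})
        .fit(trainingData);

Thanks
Best Regards
Amlan Jyoti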
Different Results When Performing PCA with Spark and R
Dear all,

I was exploring a use case of PCA, and found out that the results of Spark ML and R are different. More clearly:

1) eigenMatrix_Spark EQUALS eigenMatrix_R
2) transformedData_Spark NOT-EQUAL-TO transformedData_R

Sample Spark code (Java) --

    PCAModel pca = new PCA()
        .setInputCol("features")
        .setOutputCol("pcaFeatures")
        .setK(numberOfCol)
        .fit(inputDataset);
    DenseMatrix eigenMatrix_Spark = pca.pc();
    Dataset<Row> transformedData_Spark = pca.transform(inputDataset.select("features"));

Sample R code --

    pc <- prcomp(mydata)
    eigenMatrix_R <- pc$rotation
    transformedData_R <- pc$x

After further analysis, I found out that:

- By default, R first performs mean-centering on the input dataset and then uses this centered dataset for calculating both the eigenvector matrix and the transformed data. [prcomp uses the parameter 'center = TRUE' for this.]
- Whereas Spark probably performs mean-centering on the input data only when calculating the eigenvector matrix, and uses the original dataset to compute the transformed data. [Generally, transformed data = dataset * eigenvector matrix.]

That is why the eigenvector matrices of Spark and R are the same, whereas the transformed datasets differ between the two.

So, can anyone please point out why Spark does not use the mean-centered input data for the transformed-data calculation (although it does while calculating the eigenvector matrix), as opposed to R? [Mean-centering the input data is recommended for a good PCA analysis by many technical papers, and is R's default.]
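For what it's worth, here is a sketch of a possible workaround (assuming the Spark 2.x ML API; treat it as untested): mean-center the input myself with StandardScaler before fitting PCA, so that both the eigenvector matrix and the projection are computed from the same centered data, just as prcomp(center = TRUE) does:

    import org.apache.spark.ml.feature.PCA;
    import org.apache.spark.ml.feature.PCAModel;
    import org.apache.spark.ml.feature.StandardScaler;
    import org.apache.spark.ml.feature.StandardScalerModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Subtract the per-column means but do not rescale, matching prcomp's
    // defaults (center = TRUE, scale. = FALSE). Note that withMean(true)
    // requires dense feature vectors.
    StandardScalerModel centerer = new StandardScaler()
        .setInputCol("features")
        .setOutputCol("centeredFeatures")
        .setWithMean(true)
        .setWithStd(false)
        .fit(inputDataset);
    Dataset<Row> centeredData = centerer.transform(inputDataset);

    PCAModel pcaCentered = new PCA()
        .setInputCol("centeredFeatures")
        .setOutputCol("pcaFeatures")
        .setK(numberOfCol)
        .fit(centeredData);
    Dataset<Row> transformedData = pcaCentered.transform(centeredData);
    // transformedData should now agree with pc$x from R,
    // up to the sign of each principal component.

With Best Regards
Amlan Jyoti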