Help in Parsing 'Categorical' type of data

2017-05-31 Thread Amlan Jyoti
Hi,

I am trying to run a Naive Bayes model using the Spark ML libraries in Java. 
A sample snippet of the dataset is given below:

Raw Data -


However, as the input data needs to be numeric, I am applying a one-hot 
encoder to the Gender field [m->0,1] [f->1,0]. The resulting 'features' 
vector is then passed to the model, and I do get output.

Transformed Data - 


But the model results are not correct, because the 'Gender' field 
(originally categorical) is now treated as a continuous field after the 
one-hot encoding transformation. 

My expectation is that the mean and variance are calculated for 
'continuous data', and the number of occurrences of each category for 
'categorical data'. [In my case, the mean and variance are calculated even 
for the Gender field.]
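
To make the expectation concrete, here is a minimal plain-Java sketch of the two kinds of per-class statistics described above. It does not use the Spark API; the class name, method names and sample values are hypothetical and only illustrate the difference between summarising a continuous field and counting a categorical one.

```java
import java.util.HashMap;
import java.util.Map;

public class FeatureStats {
    /** Mean and population variance of a continuous feature. */
    static double[] meanAndVariance(double[] values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        double mean = sum / values.length;
        double squaredDeviations = 0.0;
        for (double v : values) squaredDeviations += (v - mean) * (v - mean);
        return new double[] { mean, squaredDeviations / values.length };
    }

    /** Number of occurrences of each category of a categorical feature. */
    static Map<String, Integer> categoryCounts(String[] values) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : values) counts.merge(v, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical rows belonging to one class label (not the original dataset).
        double[] age = { 20.0, 30.0, 40.0 };
        String[] gender = { "m", "f", "m" };

        double[] stats = meanAndVariance(age);
        System.out.println("age: mean=" + stats[0] + ", variance=" + stats[1]);
        System.out.println("gender counts: " + categoryCounts(gender));
    }
}
```

With one-hot encoding, each Gender column would instead go through the same path as 'age', which is the behaviour I would like to avoid.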

So, is there any way by which I can indicate to the model that a 
particular data field is 'categorical' by nature?

Thanks

Best Regards
Amlan Jyoti


=-=-=
Notice: The information contained in this e-mail
message and/or attachments to it may contain 
confidential or privileged information. If you are 
not the intended recipient, any dissemination, use, 
review, distribution, printing or copying of the 
information contained in this e-mail message 
and/or attachments to it are strictly prohibited. If 
you have received this communication in error, 
please notify us by reply e-mail or telephone and 
immediately and permanently delete the message 
and any attachments. Thank you




Different Results When Performing PCA with Spark and R

2017-02-14 Thread Amlan Jyoti
Dear all,

I was exploring a use case of PCA and found that the results of 
Spark ML and R are different. 

More clearly,
 1) eigenMatrix_Spark EQUALS-TO eigenMatrix_R
 2) transformedData_Spark NOT-EQUALS-TO transformedData_R
 
Sample Spark Code
--
PCAModel pca = new PCA()
    .setInputCol("features")
    .setOutputCol("pcaFeatures")
    .setK(numberOfCol)
    .fit(inputDataset);
// pc() returns the principal components matrix
DenseMatrix eigenMatrix_Spark = pca.pc();
Dataset<Row> transformedData_Spark =
    pca.transform(inputDataset.select("features"));

Sample R Code
- 
pc <- prcomp(mydata)
eigenMatrix_R <- pc$rotation
transformedData_R <- pc$x

After further analysis, I found out that:

- By default, R first mean-centers the input dataset and then uses this 
centered dataset to calculate both the Eigen Matrix and the Transformed 
Data. [It uses the parameter 'center = TRUE' for mean-centering.]
 
- Spark, on the other hand, probably mean-centers the input data only to 
calculate the Eigen Matrix, and uses the original dataset to compute the 
Transformed Data. [Generally, Transformed data = Eigen Matrix * Dataset ]
 
That is why the Eigen Matrix results of Spark and R are the same, whereas 
the Transformed Data results differ between the two.
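
The effect described above can be reproduced without Spark or R. The following is a minimal plain-Java sketch under stated assumptions: the data values are made up, and the component is an arbitrary unit-length vector standing in for one column of the Eigen Matrix (it is not a computed eigenvector). It projects the same data onto the same component twice, once after mean-centering and once on the original values.

```java
public class PcaCenteringDemo {
    /** Column means of a row-major data matrix (one observation per row). */
    static double[] columnMeans(double[][] data) {
        int d = data[0].length;
        double[] mean = new double[d];
        for (double[] row : data)
            for (int j = 0; j < d; j++) mean[j] += row[j] / data.length;
        return mean;
    }

    /** Projects each row onto one component, subtracting the column means first if center is true. */
    static double[] project(double[][] data, double[] component, boolean center) {
        double[] mean = center ? columnMeans(data) : new double[component.length];
        double[] scores = new double[data.length];
        for (int i = 0; i < data.length; i++)
            for (int j = 0; j < component.length; j++)
                scores[i] += (data[i][j] - mean[j]) * component[j];
        return scores;
    }

    public static void main(String[] args) {
        // Hypothetical data and component, for illustration only.
        double[][] data = { { 1.0, 2.0 }, { 3.0, 4.0 }, { 5.0, 6.0 } };
        double[] component = { 0.6, 0.8 };

        double[] centered = project(data, component, true);   // analogous to R with center = TRUE
        double[] raw = project(data, component, false);       // projection of the original data

        for (int i = 0; i < data.length; i++)
            System.out.println("centered=" + centered[i] + " raw=" + raw[i]
                    + " difference=" + (raw[i] - centered[i]));
    }
}
```

In this sketch the two sets of scores differ by the same amount for every row, namely the projection of the column means onto the component. If Spark behaves as I suspect, the Spark and R outputs should differ by such a constant offset per component.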

So, can anyone please point out why Spark does not use the mean-centered 
input data when calculating the Transformed Data [although it does when 
calculating the Eigen Matrix], as opposed to R?
 [Initial mean-centering of the input data is recommended for a good PCA 
analysis, as pointed out by many technical papers and as done in R.]


With Best Regards
Amlan Jyoti