The categorical features must be encoded as indices starting from 0: 0, 1, ..., numCategories - 1. You can then pass the categoricalFeaturesInfo map to specify which columns contain categorical features and how many categories each one has.

Joseph is updating the user guide. If you want to try something now, take a look at the docs of DecisionTree.trainClassifier and trainRegressor: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala#L360
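
As a rough illustration of the encoding described above, here is a minimal sketch in plain Python (no Spark required). The toy rows, the column positions, and the helper names are all made up for the example. It assumes the map is keyed by the feature's position in the feature vector, so please check that against the linked docs before relying on it.

```python
# Hypothetical toy data: column 0 is numeric, columns 1 and 2 are categorical.
rows = [
    [5.1, "red", "small"],
    [4.9, "blue", "large"],
    [6.3, "red", "medium"],
    [5.8, "green", "small"],
]
categorical_columns = [1, 2]  # assumption: positions within the feature vector


def build_category_indices(rows, columns):
    """Map each distinct value in each categorical column to 0..k-1."""
    mappings = {}
    for col in columns:
        distinct = sorted({row[col] for row in rows})  # sorted for a stable order
        mappings[col] = {value: idx for idx, value in enumerate(distinct)}
    return mappings


def encode_rows(rows, mappings):
    """Replace categorical values with their index; leave numeric values as floats."""
    encoded = []
    for row in rows:
        encoded.append([
            float(mappings[col][value]) if col in mappings else float(value)
            for col, value in enumerate(row)
        ])
    return encoded


mappings = build_category_indices(rows, categorical_columns)
encoded_rows = encode_rows(rows, mappings)

# Feature index -> number of categories. This is the shape of the map that
# would be handed to the tree trainer as categoricalFeaturesInfo.
categorical_features_info = {col: len(m) for col, m in mappings.items()}

print(encoded_rows)
print(categorical_features_info)
```

The encoded rows would then be wrapped as labeled feature vectors and passed to the trainer along with the map. Any column not listed in the map is treated as continuous.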

-Xiangrui

On Tue, Aug 19, 2014 at 4:24 PM, Sameer Tilak <ssti...@live.com> wrote:
> Hi All,
>
> Is there any example of MLlib decision tree handling categorical variables?
> My dataset includes few categorical variables (20 out of 100 features) so
> was interested in knowing how I can use the current version of decision tree
> implementation to handle this situation? I looked at the LabeledData and not
> sure if that the way to go..