Categorical features must be encoded as indices starting from 0:
0, 1, ..., numCategories - 1. You can then pass the
categoricalFeaturesInfo map (feature index -> number of categories)
to specify which columns contain categorical features and how many
categories each one has. Joseph is updating the user guide. In the
meantime, if you want to try it now, take a look at the docs for
DecisionTree.trainClassifier and trainRegressor:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala#L360
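
As a rough illustration of the encoding step only, here is a minimal sketch in plain Python (no Spark needed). The helper name encode_categorical and the sample rows are hypothetical, invented for this example. It assigns each distinct value in a categorical column an index 0, 1, ..., numCategories - 1 in order of first appearance, and builds the feature-index -> category-count map that would be passed as categoricalFeaturesInfo.

```python
def encode_categorical(rows, categorical_columns):
    """Encode the given columns as indices 0..numCategories-1.

    Hypothetical helper for illustration; not part of MLlib.
    Returns the encoded rows and a {column index: number of categories} map.
    """
    # One value -> index dictionary per categorical column.
    mappings = {col: {} for col in categorical_columns}
    encoded = []
    for row in rows:
        new_row = list(row)
        for col in categorical_columns:
            mapping = mappings[col]
            value = row[col]
            # Assign the next unused index the first time a value is seen.
            if value not in mapping:
                mapping[value] = len(mapping)
            # Feature vectors hold doubles, so store the index as a float.
            new_row[col] = float(mapping[value])
        encoded.append(new_row)
    # Map each categorical column to its number of distinct categories.
    info = {col: len(mappings[col]) for col in categorical_columns}
    return encoded, info


# Made-up sample data: column 0 is continuous, columns 1 and 2 are categorical.
rows = [
    [1.5, "red", "small"],
    [0.3, "blue", "large"],
    [2.2, "red", "medium"],
]
encoded_rows, categorical_features_info = encode_categorical(rows, [1, 2])
```

Columns that do not appear in the map are treated as continuous. On a real dataset the value -> index mapping would have to be computed once over the whole dataset and reused for any later data, so the same category always gets the same index.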

-Xiangrui

On Tue, Aug 19, 2014 at 4:24 PM, Sameer Tilak <ssti...@live.com> wrote:
> Hi All,
>
> Is there any example of the MLlib decision tree handling categorical
> variables? My dataset includes a few categorical variables (20 out of 100
> features), so I was interested in knowing how I can use the current version
> of the decision tree implementation to handle this situation. I looked at
> the LabeledData and am not sure if that is the way to go.
