From: ssti...@live.com To: men...@gmail.com Subject: RE: Decision tree: categorical variables Date: Wed, 20 Aug 2014 12:09:52 -0700
Hi Xiangrui,

My data is in the following format:

0,1,5,A,8,1,M
0,1,5,B,4,1,M
1,0,2,B,7,0,U
0,1,3,C,8,0,M
0,0,5,C,1,0,M
1,1,5,C,8,0,U
0,0,5,B,8,0,M
1,0,3,B,2,1,M
0,1,5,B,8,0,F
1,0,2,B,4,0,F
0,1,5,A,8,0,F

I can create a map like this:

    val catmap = Map(3 -> 3, 6 -> 2)

However, I am not sure what I should do when I parse the data. In the default case, I parse it like this:

    val parsedData = data.map { line =>
      val parts = line.split(',').map(_.toDouble)
      LabeledPoint(parts(0), Vectors.dense(parts.tail))
    }

Do I need to do something explicit for columns 3 and 6, or will specifying the map suffice?

> Date: Tue, 19 Aug 2014 16:45:35 -0700
> Subject: Re: Decision tree: categorical variables
> From: men...@gmail.com
> To: ssti...@live.com
> CC: user@spark.apache.org
>
> The categorical features must be encoded into indices starting from 0:
> 0, 1, ..., numCategories - 1. Then you can provide the
> categoricalFeaturesInfo map to specify which columns contain
> categorical features and the number of categories in each. Joseph is
> updating the user guide. But if you want to try something now, you can
> take a look at the docs of DecisionTree.trainClassifier and
> trainRegressor:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala#L360
>
> -Xiangrui
>
> On Tue, Aug 19, 2014 at 4:24 PM, Sameer Tilak <ssti...@live.com> wrote:
> > Hi All,
> >
> > Is there any example of MLlib decision trees handling categorical variables?
> > My dataset includes a few categorical variables (20 out of 100 features), so
> > I was interested in knowing how I can use the current version of the
> > decision tree implementation to handle this situation. I looked at
> > LabeledPoint and am not sure if that is the way to go.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
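[Editor's note: to make the answer in the thread concrete — the map alone is not enough; `line.split(',').map(_.toDouble)` will throw on the string values, so the letters must be explicitly converted to 0-based indices during parsing. Below is a minimal sketch of just the encoding step, in plain Scala with no Spark dependency so it runs standalone. The category sets (A/B/C and M/U/F) are read off the sample data above; `parseFeatures` is a hypothetical helper name, not part of MLlib.]

```scala
// 0-based index encodings for the two categorical columns in the sample data.
// Raw column 3 (A/B/C) and raw column 7 (M/U/F); after dropping the label
// in column 0, these sit at feature indices 2 and 5 of the feature vector.
val col3Codes = Map("A" -> 0.0, "B" -> 1.0, "C" -> 2.0)
val col6Codes = Map("M" -> 0.0, "U" -> 1.0, "F" -> 2.0)

def parseFeatures(line: String): (Double, Array[Double]) = {
  val parts = line.split(',')
  val label = parts(0).toDouble
  val features = parts.tail.zipWithIndex.map {
    case (v, 2) => col3Codes(v)  // categorical feature at vector index 2
    case (v, 5) => col6Codes(v)  // categorical feature at vector index 5
    case (v, _) => v.toDouble    // all other columns are already numeric
  }
  (label, features)
}
```

With vectors encoded this way, each `(label, features)` pair would be wrapped in a `LabeledPoint(label, Vectors.dense(features))` and, per Xiangrui's reply, the categorical columns declared to `DecisionTree.trainClassifier` via its `categoricalFeaturesInfo` argument. Note that (if my reading of the linked docs is right) that map is keyed by the feature-vector index, not the raw column position, so it would be `Map(2 -> 3, 5 -> 3)` here: features 2 and 5, each with 3 categories.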