Re: Please assist: migrating RandomForestExample from MLLib to ML
Many thanks, Sean!

kr
Marco

On Wed, Sep 14, 2016 at 10:33 PM, Sean Owen wrote:
> If it helps, I've already updated that code for the 2nd edition, which
> will be based on ~Spark 2.1:
>
> https://github.com/sryza/aas/blob/master/ch04-rdf/src/main/scala/com/cloudera/datascience/rdf/RunRDF.scala#L220
>
> This should be an equivalent working example that deals with
> categoricals via VectorIndexer.
Re: Please assist: migrating RandomForestExample from MLLib to ML
If it helps, I've already updated that code for the 2nd edition, which
will be based on ~Spark 2.1:

https://github.com/sryza/aas/blob/master/ch04-rdf/src/main/scala/com/cloudera/datascience/rdf/RunRDF.scala#L220

This should be an equivalent working example that deals with
categoricals via VectorIndexer.

You're right that you must use it because it adds the metadata that
says it's categorical. I'm not sure of another way to do it?

Sean

On Wed, Sep 14, 2016 at 10:18 PM, Marco Mistroni wrote:
> Apparently i should use VectorIndexer, but VectorIndexer assumes only one
> input column for features.
> I am at the moment using VectorAssembler instead, but i cannot find a way
> to achieve the same
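
A minimal sketch of the approach described above, assuming Spark 2.x: VectorAssembler first merges all input columns into a single vector column, and VectorIndexer then runs on that one column and attaches the categorical metadata. The column names Col11 and Col12 (standing in for the two categorical features), the output column name indexedFeatures, and the object name are assumptions made for illustration; they are not taken from the linked RunRDF.scala.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}

object CategoricalFeatureStages {
  // Assumption: Col11 and Col12 are hypothetical names for the two
  // categorical features (4 and 40 distinct values in the MLlib call).
  val inputCols: Array[String] = (1 to 12).map(i => s"Col$i").toArray

  // Step 1: merge every input column into one vector column.
  val assembler = new VectorAssembler()
    .setInputCols(inputCols)
    .setOutputCol("features")

  // Step 2: VectorIndexer needs only that single vector column. Every
  // feature inside it with at most 40 distinct values is marked as
  // categorical in the output column's metadata.
  val indexer = new VectorIndexer()
    .setInputCol("features")
    .setOutputCol("indexedFeatures")
    .setMaxCategories(40)

  // Step 3: train on the indexed column so the forest sees the metadata.
  val rf = new RandomForestClassifier()
    .setLabelCol("indexedLabel")
    .setFeaturesCol("indexedFeatures")
    .setMaxBins(300)

  val stages: Array[PipelineStage] = Array(assembler, indexer, rf)
  val pipeline: Pipeline = new Pipeline().setStages(stages)
}
```

The labelIndexer and labelConverter stages from the original code would go before and after these three stages. Two caveats: setMaxCategories applies to every feature in the vector, so a numeric column with 40 or fewer distinct values would also be treated as categorical, and maxBins has to be at least as large as the biggest category count.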
Please assist: migrating RandomForestExample from MLLib to ML
Hi all,
I have been toying around with this well-known RandomForestExample code:

    val forest = RandomForest.trainClassifier(
      trainData, 7, Map(10 -> 4, 11 -> 40), 20,
      "auto", "entropy", 30, 300)

This comes from this link
(https://www.safaribooksonline.com/library/view/advanced-analytics-with/9781491912751/ch04.html),
and also Sean Owen's presentation
(https://www.youtube.com/watch?v=ObiCMJ24ezs)

and now I want to migrate it to use the ML libraries.
The problem I have is that the MLLib example has categorical features, and
I cannot find a way to use categorical features with ML.
Apparently I should use VectorIndexer, but VectorIndexer assumes only one
input column for features.
I am at the moment using VectorAssembler instead, but I cannot find a way to
achieve the same.
I have checked the Spark samples, but all I can see is RandomForestClassifier
using VectorIndexer for 1 feature.

Could anyone assist?
This is my current code. What do I need to add to take into account
categorical features?

    val labelIndexer = new StringIndexer()
      .setInputCol("Col0")
      .setOutputCol("indexedLabel")
      .fit(data)

    val features = new VectorAssembler()
      .setInputCols(Array(
        "Col1", "Col2", "Col3", "Col4", "Col5",
        "Col6", "Col7", "Col8", "Col9", "Col10"))
      .setOutputCol("features")

    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

    val rf = new RandomForestClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("features")
      .setNumTrees(20)
      .setMaxDepth(30)
      .setMaxBins(300)
      .setImpurity("entropy")

    println("Kicking off pipeline..")

    val pipeline = new Pipeline()
      .setStages(Array(labelIndexer, features, rf, labelConverter))

Thanks in advance and regards,
Marco
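
For reference, here is a rough sketch of how the positional arguments of the MLlib call above line up with the ML setters, assuming the Spark 2.x signature trainClassifier(input, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins). The object name is made up for illustration. The two arguments with no setter (numClasses and categoricalFeaturesInfo) are, as far as I understand, picked up from the label and feature column metadata instead.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

object ForestParams {
  // MLlib: RandomForest.trainClassifier(
  //   trainData, 7, Map(10 -> 4, 11 -> 40), 20, "auto", "entropy", 30, 300)
  //
  // numClasses = 7                  -> no setter; taken from the label column
  // categoricalFeaturesInfo =
  //   Map(10 -> 4, 11 -> 40)        -> no setter; feature index 10 has 4
  //                                    categories and index 11 has 40, which
  //                                    ML reads from feature vector metadata
  val rf = new RandomForestClassifier()
    .setLabelCol("indexedLabel")
    .setFeaturesCol("features")
    .setNumTrees(20)                  // numTrees = 20
    .setFeatureSubsetStrategy("auto") // featureSubsetStrategy = "auto"
    .setImpurity("entropy")           // impurity = "entropy"
    .setMaxDepth(30)                  // maxDepth = 30
    .setMaxBins(300)                  // maxBins = 300
}
```

Note that the map refers to feature indices 10 and 11, so the feature vector would need at least 12 entries for those two categorical features to be present at all.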