If it helps, I've already updated that code for the 2nd edition, which
will be based on ~Spark 2.1:

https://github.com/sryza/aas/blob/master/ch04-rdf/src/main/scala/com/cloudera/datascience/rdf/RunRDF.scala#L220

This should be an equivalent working example that deals with
categoricals via VectorIndexer.

You're right that you need to use it, because it's what adds the metadata
that marks a feature as categorical. I'm not sure of another way to do it.
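
Roughly, it would look something like this (untested sketch; the column
names just mirror your snippet, and the maxCategories value is a guess
that has to be at least as large as your biggest categorical feature,
40 in your original example):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}

// Assemble the raw columns into one vector column first ...
val assembler = new VectorAssembler()
  .setInputCols(Array(
    "Col1", "Col2", "Col3", "Col4", "Col5",
    "Col6", "Col7", "Col8", "Col9", "Col10"))
  .setOutputCol("rawFeatures")

// ... then let VectorIndexer decide which components of that vector are
// categorical (anything with <= maxCategories distinct values) and attach
// the metadata that the tree learners read.
val featureIndexer = new VectorIndexer()
  .setInputCol("rawFeatures")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(40)

val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(20)
  .setMaxDepth(30)
  .setMaxBins(300)        // must be >= the largest number of categories
  .setImpurity("entropy")

// labelIndexer and labelConverter as in your code below
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, assembler, featureIndexer, rf, labelConverter))

VectorIndexer is fit as part of the pipeline, so it learns the category
values from the training data; any feature with more than maxCategories
distinct values is left as continuous.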

Sean


On Wed, Sep 14, 2016 at 10:18 PM, Marco Mistroni <mmistr...@gmail.com> wrote:
> Hi all,
> I have been toying around with this well-known RandomForest example code
> (the RDD-based MLlib API):
>
> // trainData: RDD[LabeledPoint]; 7 classes; features 10 and 11 are
> // categorical, with 4 and 40 values; 20 trees, "auto" feature subset
> // strategy, entropy impurity, max depth 30, max bins 300
> val forest = RandomForest.trainClassifier(
>   trainData, 7, Map(10 -> 4, 11 -> 40), 20,
>   "auto", "entropy", 30, 300)
>
> This comes from this link
> (https://www.safaribooksonline.com/library/view/advanced-analytics-with/9781491912751/ch04.html),
> and also Sean Owen's presentation
>
> (https://www.youtube.com/watch?v=ObiCMJ24ezs)
>
>
>
> and now I want to migrate it to the spark.ml (DataFrame-based) API.
> The problem I have is that the MLlib example has categorical features,
> and I cannot find a way to use categorical features with spark.ml.
> Apparently I should use VectorIndexer, but VectorIndexer assumes a single
> input column for the features.
> At the moment I am using VectorAssembler instead, but I cannot find a way
> to achieve the same result.
> I have checked the Spark samples, but all I can see is
> RandomForestClassifier using VectorIndexer on a single features column.
>
>
>
> Could anyone assist?
> This is my current code. What do I need to add to take categorical
> features into account?
>
> val labelIndexer = new StringIndexer()
>   .setInputCol("Col0")
>   .setOutputCol("indexedLabel")
>   .fit(data)
>
> val features = new VectorAssembler()
>   .setInputCols(Array(
>     "Col1", "Col2", "Col3", "Col4", "Col5",
>     "Col6", "Col7", "Col8", "Col9", "Col10"))
>   .setOutputCol("features")
>
> val labelConverter = new IndexToString()
>   .setInputCol("prediction")
>   .setOutputCol("predictedLabel")
>   .setLabels(labelIndexer.labels)
>
> val rf = new RandomForestClassifier()
>   .setLabelCol("indexedLabel")
>   .setFeaturesCol("features")
>   .setNumTrees(20)
>   .setMaxDepth(30)
>   .setMaxBins(300)
>   .setImpurity("entropy")
>
> println("Kicking off pipeline..")
>
> val pipeline = new Pipeline()
>   .setStages(Array(labelIndexer, features, rf, labelConverter))
>
> thanks in advance and regards
>  Marco
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
