Hi all,
I have been toying around with this well-known RandomForestExample code:

val forest = RandomForest.trainClassifier(
  trainData, 7, Map(10 -> 4, 11 -> 40), 20,
  "auto", "entropy", 30, 300)

This comes from this link
(https://www.safaribooksonline.com/library/view/advanced-analytics-with/9781491912751/ch04.html)
and also from Sean Owen's presentation
(https://www.youtube.com/watch?v=ObiCMJ24ezs).
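
To make the positional arguments of that MLlib call easier to read, here is the same call with named parameters. This is only a sketch: the object name, the helper `trainToyForest` and the toy data standing in for `trainData` are made up for illustration, and the parameter names are the ones I understand `RandomForest.trainClassifier` to use in the Scala API.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel

object NamedForestArgs {

  // Hypothetical helper: trains on made-up data shaped like the example
  // (12 features, 7 classes), NOT on the real data set.
  def trainToyForest(sc: SparkContext): RandomForestModel = {
    val trainData = sc.parallelize(0 until 280).map { i =>
      // Features 0..9 are treated as continuous.
      val continuous = Array.tabulate(10)(j => ((i * (j + 1)) % 17).toDouble)
      // Feature 10 takes values 0..3, feature 11 takes values 0..39.
      val categorical = Array((i % 4).toDouble, (i % 40).toDouble)
      LabeledPoint((i % 7).toDouble, Vectors.dense(continuous ++ categorical))
    }

    RandomForest.trainClassifier(
      input = trainData,
      numClasses = 7,
      // feature index -> number of categories (values must be 0 .. k-1)
      categoricalFeaturesInfo = Map(10 -> 4, 11 -> 40),
      numTrees = 20,
      featureSubsetStrategy = "auto",
      impurity = "entropy",
      maxDepth = 30,
      maxBins = 300)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("NamedForestArgs")
    val sc = new SparkContext(conf)
    try {
      val forest = trainToyForest(sc)
      println(s"Trained ${forest.numTrees} trees")
    } finally {
      sc.stop()
    }
  }
}
```

So `Map(10 -> 4, 11 -> 40)` is the part that tells MLlib that feature 10 is categorical with 4 values and feature 11 is categorical with 40 values. That is the piece I do not know how to express with ML.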



Now I want to migrate it to the ML (spark.ml) API.
The problem is that the MLlib example has categorical features, and
I cannot find a way to declare categorical features with ML.
Apparently I should use VectorIndexer, but VectorIndexer seems to accept
only one input column for features.
At the moment I am using VectorAssembler instead, but I cannot find a way
to achieve the same result with it.
I have checked the Spark samples, but all I can see is
RandomForestClassifier using VectorIndexer on a single input column.



Could anyone assist?
This is my current code. What do I need to add to take categorical
features into account?

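import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}

// Index the label column (Col0) into numeric class indices.
val labelIndexer = new StringIndexer()
  .setInputCol("Col0")
  .setOutputCol("indexedLabel")
  .fit(data)

// Assemble the ten input columns into a single feature vector.
val features = new VectorAssembler()
  .setInputCols(Array(
    "Col1", "Col2", "Col3", "Col4", "Col5",
    "Col6", "Col7", "Col8", "Col9", "Col10"))
  .setOutputCol("features")

// Map predicted indices back to the original label strings.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Same hyperparameters as the MLlib call above.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")
  .setNumTrees(20)
  .setMaxDepth(30)
  .setMaxBins(300)
  .setImpurity("entropy")

println("Kicking off pipeline..")

val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, features, rf, labelConverter))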
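
For reference, this is a minimal sketch of the direction I am considering: running VectorIndexer on the vector column produced by VectorAssembler, so that it sees all assembled features at once. It assumes Spark 2.x (for `SparkSession`); the object name, the helper `categoricalFeatureIndices`, the toy columns and the `maxCategories` value of 4 are all made up for illustration and are not taken from my real data.

```scala
import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}
import org.apache.spark.sql.SparkSession

object CategoricalFeaturesSketch {

  // Hypothetical helper: returns the indices (within the assembled vector)
  // of the features that VectorIndexer decided to treat as categorical.
  def categoricalFeatureIndices(spark: SparkSession): Seq[Int] = {
    import spark.implicits._

    // Toy data: Col1 has 6 distinct values, Col2 has only 3.
    val data = Seq(
      (0.1, 0.0), (1.7, 1.0), (2.3, 2.0),
      (3.9, 0.0), (4.5, 1.0), (5.2, 2.0)
    ).toDF("Col1", "Col2")

    // Step 1: assemble the individual columns into one vector column.
    val assembled = new VectorAssembler()
      .setInputCols(Array("Col1", "Col2"))
      .setOutputCol("rawFeatures")
      .transform(data)

    // Step 2: index the assembled vector. Features with at most
    // maxCategories distinct values are treated as categorical.
    val indexerModel = new VectorIndexer()
      .setInputCol("rawFeatures")
      .setOutputCol("features")
      .setMaxCategories(4)
      .fit(assembled)

    indexerModel.categoryMaps.keys.toSeq.sorted
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("CategoricalFeaturesSketch")
      .getOrCreate()
    try {
      println(categoricalFeatureIndices(spark))
    } finally {
      spark.stop()
    }
  }
}
```

If this is the right idea, the indexer would presumably sit between the assembler and the classifier in the pipeline, with the classifier reading the indexer's output column. What I am not sure about is whether a single `maxCategories` threshold can reproduce the explicit per-feature counts from the MLlib call.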

Thanks in advance and regards,
 Marco
