SparkML RandomForest

Pengcheng Wed, 10 Aug 2016 20:43:07 -0700

Hi There,

I was comparing random forests in Spark ML (org.apache.spark.ml.classification)
and Spark MLlib (org.apache.spark.mllib.tree) using the same datasets and the
same parameter settings. Spark MLlib always gives me better results on the
test data sets.

I was wondering:

1. Has anyone else noticed a similar performance difference?
2. How do I output the parameters of a PipelineModel?

For example, I want to output the parameters learned for the
RandomForestClassifier. None of these (model.params.toString,
model.explainParams() or model.extractParamMap()) output meaningful
values such as totalNumNodes.

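import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.StringIndexer

// Random forest reading the "features" and "label" columns
val rf = new RandomForestClassifier()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setNumTrees(100)
  .setFeatureSubsetStrategy("auto")
  .setImpurity("entropy")
  .setMaxDepth(4)
  .setMaxBins(32)

// Maps the string column "category" to the numeric "label" column
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("label")

val pipeline = new Pipeline().setStages(Array(indexer, rf))

// trainingData is a DataFrame with "category" and "features" columns
val model: PipelineModel = pipeline.fit(trainingData)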
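
To make question 2 concrete, this is roughly the output I am after. It is only a sketch that I have not verified against my pipeline: it assumes the fitted forest can be pulled out of model.stages as a RandomForestClassificationModel, and ForestSummary and describe are names I made up for illustration.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.RandomForestClassificationModel

object ForestSummary {
  // Prints tree-level details for every fitted random forest stage
  // found in the pipeline (assumed: there is at least one such stage).
  def describe(model: PipelineModel): Unit = {
    model.stages
      .collect { case forest: RandomForestClassificationModel => forest }
      .foreach { forest =>
        println(s"number of trees:    ${forest.trees.length}")
        println(s"totalNumNodes:      ${forest.totalNumNodes}")
        println(s"featureImportances: ${forest.featureImportances}")
        // One-line summary of the ensemble
        println(forest.toDebugString)
      }
  }
}
```

With the pipeline above, the call would be ForestSummary.describe(model).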


thanks,
pengcheng
