Hi There,

I was comparing RandomForest in spark.ml (org.apache.spark.ml.classification)
and spark.mllib (org.apache.spark.mllib.tree) using the same datasets and the
same parameter settings. spark.mllib always gives me better results on the
test data sets.
I was wondering:

1. Has anyone else noticed a similar performance difference?
2. How can I output the parameters of a PipelineModel?

For example, I want to output the trained parameters of the
RandomForestClassifier. None of these (model.params.toString,
model.explainParams() or model.extractParamMap()) output meaningful
parameters such as totalNumNodes, etc.

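import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.StringIndexer

// Random forest: 100 trees, entropy impurity, depth at most 4.
val rf = new RandomForestClassifier()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setNumTrees(100)
  .setFeatureSubsetStrategy("auto")
  .setImpurity("entropy")
  .setMaxDepth(4)
  .setMaxBins(32)

// Index the string column "category" into the numeric "label" column.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("label")

val pipeline = new Pipeline().setStages(Array(indexer, rf))

// trainingData is my training DataFrame (not shown here).
val model: PipelineModel = pipeline.fit(trainingData)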
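
The closest thing I can sketch is pulling the fitted forest out of the
pipeline stages and reading its members directly. This assumes the fitted
stage is a RandomForestClassificationModel and that members such as
totalNumNodes, trees and featureImportances are available in the Spark
version in use. findForest and forestSummary are names I made up, and I am
not sure this is the intended way to do it:

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.RandomForestClassificationModel

// Hypothetical helper (not part of Spark): look for the fitted forest among
// the pipeline stages instead of assuming its position.
def findForest(model: PipelineModel): Option[RandomForestClassificationModel] =
  model.stages.collectFirst { case m: RandomForestClassificationModel => m }

// Hypothetical helper: summarise the learned structure of the forest,
// or return None if the pipeline contains no fitted random forest.
def forestSummary(model: PipelineModel): Option[String] =
  findForest(model).map { forest =>
    // One line per tree with its depth and node count.
    val perTree = forest.trees.map(t => s"depth=${t.depth}, nodes=${t.numNodes}")
    s"total nodes: ${forest.totalNumNodes}\n" +
      s"feature importances: ${forest.featureImportances}\n" +
      perTree.mkString("\n")
  }
```

With the model fitted above, forestSummary(model).foreach(println) would
print the summary, and findForest(model).foreach(m => println(m.toDebugString))
would dump the full trees.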


thanks,
pengcheng

