Coming back to this, I believe I found some reasons.
Basically, the main logic sits inside ProbabilisticClassificationModel.
It has a transform method which takes a DataFrame (the vectors to classify)
and appends columns computed by UDFs which actually do the prediction.
The thing is that this DataFrame exec
ah, so with that much serialization happening, you might actually need
*fewer* workers! :)
in the next couple of releases of Spark ML, we should see better
scoring/prediction functionality using a single node, for exactly this
reason.
to get there, we need model.save/load support (PMML?), opti
The biggest difference is definitely the maxDepth of the trees. With
values less than or equal to 5, the time drops into the milliseconds.
The number of trees affects performance, but not that much.
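A quick back-of-envelope model may help explain why maxDepth dominates while the number of trees matters less. This is a simplified sketch (it assumes full binary trees and ignores Spark's task scheduling; the numbers are illustrative, not measurements from this cluster):

```python
# Rough cost model for a random forest: traversal work per row grows
# linearly with depth, but the *size* of the model (and therefore the
# serialization cost the profiler showed) grows exponentially with depth.

def nodes_in_full_tree(depth):
    # A full binary tree of the given depth has 2^(depth+1) - 1 nodes.
    return 2 ** (depth + 1) - 1

def comparisons_per_row(num_trees, max_depth):
    # Scoring one row walks each tree root-to-leaf: at most max_depth tests.
    return num_trees * max_depth

# The configuration from the original post vs. a shallower variant.
deep = comparisons_per_row(num_trees=100, max_depth=11)    # 1100
shallow = comparisons_per_row(num_trees=100, max_depth=5)  # 500

# Per-row traversal work only roughly doubles...
print(deep / shallow)  # 2.2

# ...but the worst-case model size per tree grows ~65x, and the model is
# what gets serialized and shipped around with the tasks.
print(nodes_in_full_tree(11) / nodes_in_full_tree(5))  # 65.0
```

If serialization dominates, shrinking maxDepth shrinks the serialized payload far more than it shrinks per-row traversal work, which would match depth ≤ 5 bringing scoring into milliseconds.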
I tried to profile the app and I see a significant amount of time spent in serialization.
I'm wondering if Spark isn't som
so it looks like you're increasing num trees by 5x and you're seeing an 8x
increase in runtime, correct?
did you analyze the Spark cluster resources to monitor the memory usage,
spillage, disk I/O, etc?
you may need more Workers.
On Tue, Dec 22, 2015 at 8:57 AM, Alexander Ratnikov <
ratnikov.ale
Hi All,
It would be good to get some tips on tuning Apache Spark for Random
Forest classification.
Currently, we have a model that looks like:
featureSubsetStrategy all
impurity gini
maxBins 32
maxDepth 11
numberOfClasses 2
numberOfTrees 100
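For reference, the parameters above expressed against the spark.ml RandomForestClassifier API (a configuration sketch only; the "features"/"label" column names are assumptions, and parameter names may differ slightly across Spark versions):

```python
# Configuration sketch mirroring the parameters listed above.
# Assumes a SparkContext/SQLContext is already set up (Spark 1.5.x-era API).
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(
    featuresCol="features",       # assumed column name
    labelCol="label",             # assumed column name
    featureSubsetStrategy="all",
    impurity="gini",
    maxBins=32,
    maxDepth=11,
    numTrees=100,
)
# numberOfClasses is not set explicitly: spark.ml infers the number of
# classes from the label column's metadata.

# model = rf.fit(trainingDF)
# scored = model.transform(testDF)  # appends prediction/probability columns
```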
We are running Spark 1.5.1 as a standalone cluster.