ah, so with that much serialization happening, you might actually need
*fewer* workers!  :)

in the next couple of releases of Spark ML, we should see better
scoring/prediction functionality on a single node for exactly this
reason.

to get there, we need model.save/load support (PMML?), optimized
single-node linear algebra support, and a few other goodies.
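
in the meantime, a rough workaround (just a sketch -- the hdfs path and the
dummy feature vector below are made up, and sc / model are assumed to already
exist) is to train with the old mllib API, save the RandomForestModel, and
score single vectors on the driver with predict(), which doesn't launch a job
per prediction:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.model.RandomForestModel

// assumes sc is an existing SparkContext and model was trained with the
// old mllib RandomForest API; the hdfs path is just an example
model.save(sc, "hdfs:///models/rf-v1")

// later: load once, keep it on the driver, and score vectors locally
val loaded = RandomForestModel.load(sc, "hdfs:///models/rf-v1")
val features = Vectors.dense(Array.fill(246)(0.0)) // 246 attributes, dummy values
val prediction = loaded.predict(features)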

useNodeIdCache only affects training.

btw, are you checkpointing per this
<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala#L151>?
 (code snippet from DecisionTreeParams copy/pasted below for convenience)

/**
 * Specifies how often to checkpoint the cached node IDs.
 * E.g. 10 means that the cache will get checkpointed every 10 iterations.
 * This is only used if cacheNodeIds is true and if the checkpoint directory
 * is set in [[org.apache.spark.SparkContext]].
 * Must be >= 1.
 * (default = 10)
 * @group expertSetParam
 */
def setCheckpointInterval(value: Int): this.type =
  set(checkpointInterval, value)
i'm not actually sure how this will affect training performance with the
new ml.RandomForest impl, but i'm curious to hear what you find.
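
for reference, here's a minimal sketch (not tested -- the checkpoint path,
sc, and trainingData are placeholders) of turning on node-id caching and
checkpointing with the new ml.RandomForestClassifier, using the same params
from your setup below:

import org.apache.spark.ml.classification.RandomForestClassifier

// checkpointInterval only kicks in if cacheNodeIds is true AND a
// checkpoint dir is set on the SparkContext (path is just an example)
sc.setCheckpointDir("hdfs:///tmp/rf-checkpoints")

val rf = new RandomForestClassifier()
  .setFeatureSubsetStrategy("all")
  .setImpurity("gini")
  .setMaxBins(32)
  .setMaxDepth(11)
  .setNumTrees(100)
  .setCacheNodeIds(true)       // useNodeIdCache -- affects training only
  .setCheckpointInterval(10)   // checkpoint cached node IDs every 10 iterations

// trainingData is assumed to be a DataFrame with "label" and "features" columns
val rfModel = rf.fit(trainingData)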


On Fri, Dec 25, 2015 at 6:03 PM, Alexander Ratnikov <
ratnikov.alexan...@gmail.com> wrote:

> Definitely the biggest difference is the maxDepth of the trees. With
> values smaller or equal to 5 the time goes into milliseconds.
> The amount of trees affects the performance but not that much.
> I tried to profile the app and I see a decent amount of time spent in
> serialization.
> I'm wondering if Spark isn't somehow caching the model on workers during
> classification?
>
> useNodeIdCache is ON, but the docs aren't clear whether Spark uses it
> only during training.
> Also, I must say we didn't have this problem with the old mllib API, so
> it might be something in the new ml API that I'm missing.
> I will dig deeper into the problem after holidays.
>
> 2015-12-25 16:26 GMT+01:00 Chris Fregly <ch...@fregly.com>:
> > so it looks like you're increasing num trees by 5x and seeing an 8x
> > increase in runtime, correct?
> >
> > did you analyze the Spark cluster resources to monitor the memory usage,
> > spillage, disk I/O, etc?
> >
> > you may need more Workers.
> >
> > On Tue, Dec 22, 2015 at 8:57 AM, Alexander Ratnikov
> > <ratnikov.alexan...@gmail.com> wrote:
> >>
> >> Hi All,
> >>
> >> It would be good to get some tips on tuning Apache Spark for Random
> >> Forest classification.
> >> Currently, we have a model that looks like:
> >>
> >> featureSubsetStrategy all
> >> impurity gini
> >> maxBins 32
> >> maxDepth 11
> >> numberOfClasses 2
> >> numberOfTrees 100
> >>
> >> We are running Spark 1.5.1 as a standalone cluster.
> >>
> >> 1 Master and 2 Worker nodes.
> >> The amount of RAM is 32GB on each node with 4 Cores.
> >> The classification takes 440ms.
> >>
> >> When we increase the number of trees to 500, it already takes 8 seconds.
> >> We tried to reduce the depth, but then the error rate is higher. We have
> >> around 246 attributes.
> >>
> >> Probably we are doing something wrong. Any ideas on how we could
> >> improve the performance?
> >>
> >>
> >>
> >>
> >
> >
> >
> > --
> >
> > Chris Fregly
> > Principal Data Solutions Engineer
> > IBM Spark Technology Center, San Francisco, CA
> > http://spark.tc | http://advancedspark.com
>



-- 

*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
