ah, so with that much serialization happening, you might actually need *fewer* workers! :)
in the next couple of releases of Spark ML, we should see better scoring/prediction
functionality using a single node for exactly this reason. to get there, we need
model.save/load support (PMML?), optimized single-node linear algebra support, and a
few other goodies.

useNodeIdCache only affects training.

btw, are you checkpointing per this
<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala#L151>?
(code snippet from DecisionTreeParams copy/pasted below for convenience; i've also
put a rough config sketch at the end of this mail.)

    /**
     * Specifies how often to checkpoint the cached node IDs.
     * E.g. 10 means that the cache will get checkpointed every 10 iterations.
     * This is only used if cacheNodeIds is true and if the checkpoint directory is set in
     * [[org.apache.spark.SparkContext]].
     * Must be >= 1.
     * (default = 10)
     * @group expertSetParam
     */
    def setCheckpointInterval(value: Int): this.type = set(checkpointInterval, value)

i'm not actually sure how this will affect training performance with the new
ml.RandomForest impl, but i'm curious to hear what you find.

On Fri, Dec 25, 2015 at 6:03 PM, Alexander Ratnikov <ratnikov.alexan...@gmail.com> wrote:

> Definitely the biggest difference is the maxDepth of the trees. With
> values smaller than or equal to 5, the time goes down to milliseconds.
> The number of trees affects the performance, but not that much.
> I tried to profile the app and I see decent time spent in serialization.
> I'm wondering if Spark isn't somehow caching the model on workers during
> classification?
>
> useNodeIdCache is ON, but the docs aren't clear on whether Spark uses it
> only for training.
> Also, I must say we didn't have this problem in the old mllib API, so
> it might be something in the new ml API that I'm missing.
> I will dig deeper into the problem after the holidays.
>
> 2015-12-25 16:26 GMT+01:00 Chris Fregly <ch...@fregly.com>:
> > so it looks like you're increasing num trees by 5x and you're seeing an
> > 8x increase in runtime, correct?
> >
> > did you analyze the Spark cluster resources to monitor the memory usage,
> > spillage, disk I/O, etc?
> >
> > you may need more Workers.
> >
> > On Tue, Dec 22, 2015 at 8:57 AM, Alexander Ratnikov
> > <ratnikov.alexan...@gmail.com> wrote:
> >>
> >> Hi All,
> >>
> >> It would be good to get some tips on tuning Apache Spark for Random
> >> Forest classification.
> >> Currently, we have a model that looks like:
> >>
> >>   featureSubsetStrategy  all
> >>   impurity               gini
> >>   maxBins                32
> >>   maxDepth               11
> >>   numberOfClasses        2
> >>   numberOfTrees          100
> >>
> >> We are running Spark 1.5.1 as a standalone cluster:
> >> 1 Master and 2 Worker nodes.
> >> Each node has 32GB of RAM and 4 cores.
> >> Classification takes 440 ms.
> >>
> >> When we increase the number of trees to 500, it already takes 8 seconds.
> >> We tried to reduce the depth, but then the error rate is higher. We have
> >> around 246 attributes.
> >>
> >> Probably we are doing something wrong. Any ideas how we could improve
> >> the performance?
> >>
> >> --
> >> View this message in context:
> >> http://apache-spark-user-list.1001560.n3.nabble.com/Tips-for-Spark-s-Random-Forest-slow-performance-tp25766.html
> >> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > --
> > Chris Fregly
> > Principal Data Solutions Engineer
> > IBM Spark Technology Center, San Francisco, CA
> > http://spark.tc | http://advancedspark.com

--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
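p.s. here's the rough config sketch i mentioned above: a minimal example of wiring up
the node ID cache plus checkpointing in the new spark.ml API, using the parameters from
the original mail. this is just a sketch, assuming Spark 1.5.x and RandomForestClassifier;
the checkpoint directory path and the trainingData name are made up for illustration.

    import org.apache.spark.ml.classification.RandomForestClassifier

    // checkpointInterval is only honored when cacheNodeIds is true AND a checkpoint
    // directory is set on the SparkContext (this path is hypothetical)
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

    val rf = new RandomForestClassifier()
      .setNumTrees(100)
      .setMaxDepth(11)
      .setMaxBins(32)
      .setImpurity("gini")
      .setFeatureSubsetStrategy("all")
      .setCacheNodeIds(true)        // cache node IDs between iterations during training
      .setCheckpointInterval(10)    // checkpoint the node ID cache every 10 iterations

    // trainingData: a DataFrame with "label" and "features" columns (name is illustrative)
    val model = rf.fit(trainingData)

as noted above, this only affects training; scoring latency is a separate question.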