Coming back to this, I believe I found some reasons.
Basically, the main logic sits inside ProbabilisticClassificationModel.
Its transform method takes a DataFrame (the vectors to classify) and
appends columns computed by UDFs, which is where the actual prediction happens.

The thing is that this DataFrame execution is not local.
That makes it different from the old mllib API, which doesn't send anything to
the executors and simply evaluates the model in the driver's memory, which
keeps it fast.
With the new ml API, Spark basically needs to serialize/deserialize the
RandomForestClassificationModel on each prediction, which makes it notably
slow even for my small model (around 100 MB).
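For comparison, the old RDD-based API keeps everything on the driver, so a
single vector is scored with a plain method call and no job is submitted.
Roughly (just a sketch; the load path and feature values are made up):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.model.RandomForestModel

val oldModel = RandomForestModel.load(sc, "/models/rf")   // loaded once into driver memory
val features = Vectors.dense(Array.fill(246)(0.0))        // dummy values for the 246 attributes
val prediction = oldModel.predict(features)               // pure in-memory call, no executors involved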
There's something in JIRA regarding the topic:
https://issues.apache.org/jira/browse/SPARK-10014

I believe there should be a way to cache the model and avoid rebroadcasting
it on every prediction, but I didn't manage to find it.
Could you advise me how to do it?
Btw, I made a local fix and I'm happy to provide a pull request, but first
I'd like to make sure I'm not missing the intended way.
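In the meantime, a partial workaround seems to be amortizing the cost by
scoring in batches, so the model gets shipped once per transform call instead
of once per vector. Roughly (a sketch; the method and column names are just
for illustration):

import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.SQLContext

def scoreBatch(model: RandomForestClassificationModel,
               sqlContext: SQLContext,
               batch: Seq[Vector]): Array[Double] = {
  // one DataFrame and one job per batch, so the model is serialized to the
  // executors once per batch rather than once per prediction
  val df = sqlContext.createDataFrame(batch.map(Tuple1.apply)).toDF("features")
  model.transform(df).select("prediction").collect().map(_.getDouble(0))
}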


2015-12-26 2:03 GMT+01:00 Chris Fregly <ch...@fregly.com>:

> ah, so with that much serialization happening, you might actually need
> *fewer* workers!  :)
>
> in the next couple of releases of Spark ML, we should see better
> scoring/predicting functionality on a single node for exactly this
> reason.
>
> to get there, we need model.save/load support (PMML?), optimized
> single-node linear algebra support, and a few other goodies.
>
> useNodeIdCache only affects training.
>
> btw, are you checkpointing per this
> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala#L151>?
>  (code snippet from DecisionTreeParams copy/pasted below for convenience)
>
> /**
>  * Specifies how often to checkpoint the cached node IDs.
>  * E.g. 10 means that the cache will get checkpointed every 10 iterations.
>  * This is only used if cacheNodeIds is true and if the checkpoint directory
>  * is set in [[org.apache.spark.SparkContext]].
>  * Must be >= 1.
>  * (default = 10)
>  * @group expertSetParam
>  */
> def setCheckpointInterval(value: Int): this.type =
>   set(checkpointInterval, value)
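>
> something like this, assuming rf is your new-API RandomForestClassifier and
> a writable checkpoint dir (both just illustrative):
>
> sc.setCheckpointDir("/tmp/spark-checkpoints")  // needed for the node-ID cache to checkpoint
> rf.setCacheNodeIds(true)
>   .setCheckpointInterval(10)                   // checkpoint the cached node IDs every 10 iterations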
> i'm not actually sure how this will affect training performance with the
> new ml.RandomForest impl, but i'm curious to hear what you find.
>
>
> On Fri, Dec 25, 2015 at 6:03 PM, Alexander Ratnikov <
> ratnikov.alexan...@gmail.com> wrote:
>
>> Definitely the biggest difference is the maxDepth of the trees. With
>> values smaller than or equal to 5, the time drops to milliseconds.
>> The number of trees affects the performance, but not that much.
>> I tried to profile the app and I see a decent amount of time spent in
>> serialization. I'm wondering if Spark isn't somehow caching the model on
>> the workers during classification?
>>
>> useNodeIdCache is ON, but the docs aren't clear on whether Spark uses it
>> only during training.
>> Also, I must say we didn't have this problem with the old mllib API, so
>> it might be something in the new ml API that I'm missing.
>> I will dig deeper into the problem after holidays.
>>
>> 2015-12-25 16:26 GMT+01:00 Chris Fregly <ch...@fregly.com>:
>> > so it looks like you're increasing the number of trees by 5x and you're
>> > seeing an 8x increase in runtime, correct?
>> >
>> > did you analyze the Spark cluster resources to monitor the memory usage,
>> > spillage, disk I/O, etc?
>> >
>> > you may need more Workers.
>> >
>> > On Tue, Dec 22, 2015 at 8:57 AM, Alexander Ratnikov
>> > <ratnikov.alexan...@gmail.com> wrote:
>> >>
>> >> Hi All,
>> >>
>> >> It would be good to get some tips on tuning Apache Spark for Random
>> >> Forest classification.
>> >> Currently, we have a model that looks like:
>> >>
>> >> featureSubsetStrategy all
>> >> impurity gini
>> >> maxBins 32
>> >> maxDepth 11
>> >> numberOfClasses 2
>> >> numberOfTrees 100
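>> >>
>> >> With the new ml API that maps roughly to the following (just a sketch, the
>> >> rest of the pipeline is omitted; numberOfClasses comes from the label
>> >> column metadata, e.g. via StringIndexer):
>> >>
>> >> import org.apache.spark.ml.classification.RandomForestClassifier
>> >>
>> >> val rf = new RandomForestClassifier()
>> >>   .setNumTrees(100)
>> >>   .setMaxDepth(11)
>> >>   .setMaxBins(32)
>> >>   .setImpurity("gini")
>> >>   .setFeatureSubsetStrategy("all")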
>> >>
>> >> We are running Spark 1.5.1 as a standalone cluster.
>> >>
>> >> 1 Master and 2 Worker nodes.
>> >> Each node has 32 GB of RAM and 4 cores.
>> >> Classification takes 440 ms.
>> >>
>> >> When we increase the number of trees to 500, it already takes 8 seconds.
>> >> We tried to reduce the depth, but then the error rate is higher. We have
>> >> around 246 attributes.
>> >>
>> >> We are probably doing something wrong. Any ideas on how we could improve
>> >> the performance?
>> >>
>> >>
>> >>
>> >>
>> >
>> >
>> >
>> > --
>> >
>> > Chris Fregly
>> > Principal Data Solutions Engineer
>> > IBM Spark Technology Center, San Francisco, CA
>> > http://spark.tc | http://advancedspark.com
>>
>
>
>
> --
>
> *Chris Fregly*
> Principal Data Solutions Engineer
> IBM Spark Technology Center, San Francisco, CA
> http://spark.tc | http://advancedspark.com
>
