Re: SparkML. RandomForest predict performance for small dataset.

Yanbo Liang Fri, 11 Dec 2015 23:55:43 -0800

I think you are looking for the ability to predict on a single instance.
That feature is under development; please refer to SPARK-10413.
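
To illustrate what single-instance prediction means here, below is a minimal
sketch in plain Java. It does not use Spark at all: LocalTree, SplitNode,
LeafNode, LocalForest and LocalForestDemo are hypothetical names invented for
this example, not Spark classes, and the plain majority vote is a simplifying
assumption rather than Spark's actual aggregation. The point is only that
once the trees are held in driver memory, scoring one row is a handful of
comparisons and needs no job, broadcast, or shuffle.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for one trained decision tree (not a Spark class). */
interface LocalTree {
    int predict(double[] features);
}

/** Internal node: compares one feature with a threshold and descends. */
final class SplitNode implements LocalTree {
    private final int featureIndex;
    private final double threshold;
    private final LocalTree left;
    private final LocalTree right;

    SplitNode(int featureIndex, double threshold, LocalTree left, LocalTree right) {
        this.featureIndex = featureIndex;
        this.threshold = threshold;
        this.left = left;
        this.right = right;
    }

    @Override
    public int predict(double[] features) {
        // Values at or below the threshold go left, everything else goes right.
        return features[featureIndex] <= threshold
                ? left.predict(features)
                : right.predict(features);
    }
}

/** Leaf node: always returns its class label. */
final class LeafNode implements LocalTree {
    private final int label;

    LeafNode(int label) {
        this.label = label;
    }

    @Override
    public int predict(double[] features) {
        return label;
    }
}

/** Hypothetical in-memory forest: plain majority vote over its trees. */
final class LocalForest {
    private final List<LocalTree> trees;
    private final int numClasses;

    LocalForest(List<LocalTree> trees, int numClasses) {
        this.trees = trees;
        this.numClasses = numClasses;
    }

    int predict(double[] features) {
        int[] votes = new int[numClasses];
        for (LocalTree tree : trees) {
            votes[tree.predict(features)]++;
        }
        // Pick the class with the most votes; ties go to the lower index.
        int best = 0;
        for (int c = 1; c < numClasses; c++) {
            if (votes[c] > votes[best]) {
                best = c;
            }
        }
        return best;
    }
}

class LocalForestDemo {
    public static void main(String[] args) {
        LocalTree t1 = new SplitNode(0, 0.5, new LeafNode(0), new LeafNode(1));
        LocalTree t2 = new SplitNode(1, 0.3, new LeafNode(1), new LeafNode(0));
        LocalTree t3 = new SplitNode(0, 0.8, new LeafNode(0), new LeafNode(1));
        LocalForest forest = new LocalForest(Arrays.asList(t1, t2, t3), 2);

        // One row is scored directly in this JVM, with no cluster round trip.
        System.out.println(forest.predict(new double[] {0.9, 0.1}));
    }
}
```

How a trained Spark model would be converted into such an in-memory structure
is outside the scope of this sketch.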


2015-12-10 4:37 GMT+08:00 Eugene Morozov <evgeny.a.moro...@gmail.com>:

> Hello,
>
> I'm using a RandomForest pipeline (ml package). Everything is working fine
> (training models, prediction, etc.), but I'd like to tune it for the case
> where I predict on a small dataset.
> My issue is that when I apply
>
> ((PipelineModel) model).transform(dataset)
>
> The model consists of the following stages:
>
> StringIndexerModel labelIndexer = new StringIndexer()...
> RandomForestClassifier classifier = new RandomForestClassifier()...
> IndexToString labelConverter = new IndexToString()...
> Pipeline pipeline = new Pipeline().setStages(new 
> PipelineStage[]{labelIndexer, classifier, labelConverter});
>
> it obviously takes some time to predict, but when my dataset consists of
> just one record I'd expect it to be really fast.
>
> My observation is that even though I use a small dataset, Spark broadcasts
> something over and over again. That's fine when I load my (serialized)
> model from disk and use it just once for prediction, but when I use the
> same model in a loop on the same dataset, I'd say that everything should
> already be on the worker nodes, so I'd expect prediction to be fast.
> It takes 20 seconds to predict on the dataset once (with one input row),
> and all subsequent predictions over the same dataset with the same model
> take roughly 10 seconds.
> My goal is a response time of 0.5 to 1 second.
>
> My intention was to keep the trained model on the driver (which stays
> online with its SparkContext) and use it for any subsequent predictions,
> but these 10-second predictions basically kill the whole idea.
>
> Is it possible to somehow distribute the model over the cluster up front
> so that prediction is really fast?
> Are there any specific params to apply to the PipelineModel so that it
> stays resident on the worker nodes? Anything to keep and reuse broadcast
> data?
>
> Thanks in advance.
> --
> Be well!
> Jean Morozov
>
