Re: SparkML. RandomForest predict performance for small dataset.

2015-12-11 Thread Yanbo Liang
I think you are looking for the ability to predict on a single instance. That
feature is under development; please refer to SPARK-10413.
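
For reference, once that feature lands, a single-instance prediction becomes a
plain method call on the driver: no job scheduling, no broadcast, no DataFrame.
A minimal sketch, assuming a Spark release that includes SPARK-10413 (the model
path and feature values below are placeholders):

import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.classification.RandomForestClassificationModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;

// Load the fitted pipeline once and keep it on the driver.
PipelineModel model = PipelineModel.load("/path/to/model");

// Stage 1 of the pipeline quoted below is the random forest itself.
RandomForestClassificationModel rf =
    (RandomForestClassificationModel) model.stages()[1];

// Predict on a single feature vector, entirely on the driver.
Vector features = Vectors.dense(0.1, 2.3, 4.5);
double prediction = rf.predict(features);

Note that this returns the indexed label, so the IndexToString mapping from the
pipeline would still have to be applied by hand.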

2015-12-10 4:37 GMT+08:00 Eugene Morozov:

> Hello,
>
> I'm using a RandomForest pipeline (ml package). Everything is working fine
> (learning models, prediction, etc.), but I'd like to tune it for the case
> when I predict with a small dataset.
> My issue is with applying
>
> ((PipelineModel) model).transform(dataset)
>
> The model consists of the following stages:
>
> StringIndexerModel labelIndexer = new StringIndexer()...
> RandomForestClassifier classifier = new RandomForestClassifier()...
> IndexToString labelConverter = new IndexToString()...
> Pipeline pipeline = new Pipeline().setStages(
>     new PipelineStage[]{labelIndexer, classifier, labelConverter});
>
> The transform call obviously takes some time, but when my dataset consists
> of just one record I'd expect it to be really fast.
>
> My observation is that even though I use a small dataset, Spark broadcasts
> something over and over again. That's fine when I load my (serialized)
> model from disk and use it just once for prediction, but when I use the
> same model in a loop on the very same dataset, everything should already
> be on the worker nodes, so I'd expect prediction to be fast.
> It takes 20 seconds to predict the dataset once (with one input row), and
> every subsequent prediction over the same dataset with the same model
> takes roughly 10 seconds.
> My goal is a 0.5 - 1 second response time.
>
> My intention was to keep the learned model on the driver (which stays
> online with its SparkContext) and use it for any subsequent predictions,
> but these 10-second predictions basically kill the whole idea.
>
> Is it somehow possible to distribute the model over the cluster upfront so
> that prediction is really fast?
> Are there any specific params to apply to the PipelineModel so that it
> stays resident on the worker nodes? Anything to keep and reuse broadcast
> data?
>
> Thanks in advance.
> --
> Be well!
> Jean Morozov
>


SparkML. RandomForest predict performance for small dataset.

2015-12-09 Thread Eugene Morozov
Hello,

I'm using a RandomForest pipeline (ml package). Everything is working fine
(learning models, prediction, etc.), but I'd like to tune it for the case
when I predict with a small dataset.
My issue is with applying

((PipelineModel) model).transform(dataset)

The model consists of the following stages:

StringIndexerModel labelIndexer = new StringIndexer()...
RandomForestClassifier classifier = new RandomForestClassifier()...
IndexToString labelConverter = new IndexToString()...
Pipeline pipeline = new Pipeline().setStages(
    new PipelineStage[]{labelIndexer, classifier, labelConverter});
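
For context, the pipeline is fit and applied roughly like this (trainingData,
testData, and the predictedLabel column name are placeholders, since the setter
calls are elided above; in Spark 1.x the result type is DataFrame, Dataset<Row>
in 2.x+):

// Fit once on the training data; every prediction then goes through transform.
PipelineModel model = pipeline.fit(trainingData);

// transform builds a plan over the input; nothing runs until an action.
DataFrame predictions = model.transform(testData);

// Assumes IndexToString was configured with setOutputCol("predictedLabel").
predictions.select("predictedLabel").show();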

The transform call obviously takes some time, but when my dataset consists
of just one record I'd expect it to be really fast.

My observation is that even though I use a small dataset, Spark broadcasts
something over and over again. That's fine when I load my (serialized)
model from disk and use it just once for prediction, but when I use the
same model in a loop on the very same dataset, everything should already
be on the worker nodes, so I'd expect prediction to be fast.
It takes 20 seconds to predict the dataset once (with one input row), and
every subsequent prediction over the same dataset with the same model takes
roughly 10 seconds.
My goal is a 0.5 - 1 second response time.
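
The loop I time looks roughly like this (model and the one-row dataset already
exist; transform is lazy, so every action schedules a fresh Spark job, which is
where the recurring cost comes from):

// Each iteration triggers a new Spark job; count() forces execution.
for (int i = 0; i < 5; i++) {
    long start = System.nanoTime();
    model.transform(dataset).count();
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    System.out.println("prediction " + i + ": " + elapsedMs + " ms");
}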

My intention was to keep the learned model on the driver (which stays online
with its SparkContext) and use it for any subsequent predictions, but these
10-second predictions basically kill the whole idea.

Is it somehow possible to distribute the model over the cluster upfront so
that prediction is really fast?
Are there any specific params to apply to the PipelineModel so that it stays
resident on the worker nodes? Anything to keep and reuse broadcast data?

Thanks in advance.
--
Be well!
Jean Morozov