Hi,

Did you try applying the model with Akka instead of Spark?
https://spark-summit.org/eu-2015/events/real-time-anomaly-detection-with-spark-ml-and-akka/

On 18 Oct 2016 at 5:58 AM, "Aseem Bansal" <asmbans...@gmail.com> wrote:

> @Nicolas
>
> No, ours is different. We required predictions within a 10 ms time frame, so
> we needed much lower latency than that.
>
> Every algorithm has some parameters, correct? We took the parameters from
> the mllib and used them to create ml package's model. ml package's model's
> prediction time was much faster compared to mllib package's transformation.
> So essentially: use Spark's distributed machine learning library to train
> the model, save it to S3, load it from S3 in a different system, and then
> convert it into the vector-based API model for actual predictions.
>
> There were obviously some transformations involved, but we didn't use
> Pipeline for those transformations. Instead, we rewrote them for the
> vector-based API. I know it's not perfect, but if we had used the
> transformations within the pipeline, that would have made us dependent on
> Spark's distributed API, and we didn't see how we would reach our latency
> requirements. It would have been much simpler and more DRY if the
> PipelineModel had a predict method based on vectors and was not distributed.
>
> As you can guess, it is very much model-specific and more work. If we
> decide to use another type of Model, we will have to add conversion
> code/transformation code for that also. If only Spark exposed a prediction
> method that is as fast as the old machine learning package.
>
> On Sat, Oct 15, 2016 at 8:42 PM, Nicolas Long <nicolasl...@gmail.com>
> wrote:
>
>> Hi Sean and Aseem,
>>
>> thanks both. A simple thing which sped things up greatly was simply to
>> load our sql (for one record effectively) directly and then convert to a
>> dataframe, rather than using Spark to load it. Sounds stupid, but this took
>> us from > 5 seconds to ~1 second on a very small instance.
>>
>> Aseem: can you explain your solution a bit more? I'm not sure I
>> understand it.
>> At the moment we load our models from S3
>> (RandomForestClassificationModel.load(..) ) and then store that in an
>> object property so that it persists across requests - this is in Scala. Is
>> this essentially what you mean?
>>
>> On 12 October 2016 at 10:52, Aseem Bansal <asmbans...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> Faced a similar issue. Our solution was to load the model, cache it
>>> after converting it to a model from mllib and then use that instead of ml
>>> model.
>>>
>>> On Tue, Oct 11, 2016 at 10:22 PM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> I don't believe it will ever scale to spin up a whole distributed job
>>>> to serve one request. You can look possibly at the bits in mllib-local. You
>>>> might do well to export as something like PMML either with Spark's export
>>>> or JPMML and then load it into a web container and score it, without Spark
>>>> (possibly also with JPMML, OpenScoring)
>>>>
>>>> On Tue, Oct 11, 2016, 17:53 Nicolas Long <nicolasl...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> so I have a model which has been stored in S3. And I have a Scala
>>>>> webapp which for certain requests loads the model and transforms submitted
>>>>> data against it.
>>>>>
>>>>> I'm not sure how to run this quickly on a single instance though. At
>>>>> the moment Spark is being bundled up with the web app in an uberjar (sbt
>>>>> assembly).
>>>>>
>>>>> But the process is quite slow. I'm aiming for responses < 1 sec so
>>>>> that the webapp can respond quickly to requests.
>>>>> When I look at the Spark UI I see:
>>>>>
>>>>> Summary Metrics for 1 Completed Tasks
>>>>>
>>>>> Metric                     Min    25th percentile  Median  75th percentile  Max
>>>>> Duration                   94 ms  94 ms            94 ms   94 ms            94 ms
>>>>> Scheduler Delay            0 ms   0 ms             0 ms    0 ms             0 ms
>>>>> Task Deserialization Time  3 s    3 s              3 s     3 s              3 s
>>>>> GC Time                    2 s    2 s              2 s     2 s              2 s
>>>>> Result Serialization Time  0 ms   0 ms             0 ms    0 ms             0 ms
>>>>> Getting Result Time        0 ms   0 ms             0 ms    0 ms             0 ms
>>>>> Peak Execution Memory      0.0 B  0.0 B            0.0 B   0.0 B            0.0 B
>>>>>
>>>>> I don't really understand why deserialization and GC should take so
>>>>> long when the models are already loaded. Is this evidence I am doing
>>>>> something wrong? And where can I get a better understanding of how Spark
>>>>> works under the hood here, and how best to do a standalone/bundled jar
>>>>> deployment?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Nic
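
For readers following the thread, the idea discussed above (train with Spark, export the learned parameters, then score single vectors in-process and keep the scorer in an object so it persists across requests) might look roughly like the sketch below. It is only an illustration under stated assumptions: it uses a linear (logistic) model for brevity rather than the random forest mentioned in the thread, it does not call any Spark API, and the names `LinearParams`, `LocalScorer` and `ModelCache` as well as the parameter values are hypothetical. A tree ensemble would need its own conversion code, as Aseem notes.

```scala
// Hypothetical names throughout: LinearParams, LocalScorer and ModelCache are
// illustrative only and are not part of Spark.

// Parameters exported once from a trained model (for example, the
// coefficients and intercept of a logistic regression).
final case class LinearParams(weights: Array[Double], intercept: Double)

// Scores a single feature vector in-process, without launching a Spark job.
final class LocalScorer(params: LinearParams) {

  // Logistic function applied to the linear margin w . x + b.
  def probability(features: Array[Double]): Double = {
    require(features.length == params.weights.length,
      "feature/weight length mismatch")
    var margin = params.intercept
    var i = 0
    while (i < features.length) {
      margin += params.weights(i) * features(i)
      i += 1
    }
    1.0 / (1.0 + math.exp(-margin))
  }

  // Returns 1.0 when the probability reaches the threshold, otherwise 0.0.
  def predict(features: Array[Double], threshold: Double = 0.5): Double =
    if (probability(features) >= threshold) 1.0 else 0.0
}

// Holds the scorer for the lifetime of the web app, so it is built once
// rather than on every request.
object ModelCache {
  // Placeholder values (an assumption); in practice these would be read from
  // wherever the trained parameters were saved.
  lazy val scorer: LocalScorer =
    new LocalScorer(LinearParams(Array(0.5, -0.25), 0.1))
}
```

A request handler would then call `ModelCache.scorer.predict(features)` directly. The `lazy val` means the scorer is constructed on first use and reused afterwards, which is the "store it in an object property" pattern Nicolas describes, minus the per-request Spark job.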