Hi Nick, I started this thread. IMHO we need something like Spark to train our models, but the resulting models are typically small enough to easily fit on a single machine. My real-time production system is not built on Spark, yet it needs to use the model to make predictions in real time.
Use case: "high-frequency stock trading". Use Spark to train a model. There is no way I could use Spark Streaming in the real-time production system. I need some way to easily move the model trained using Spark to a non-Spark environment so I can make predictions in real time. "Credit card fraud detection" is another similar use case.

Kind regards

Andy

From: Nick Pentreath <nick.pentre...@gmail.com>
Date: Wednesday, November 18, 2015 at 4:03 AM
To: DB Tsai <dbt...@dbtsai.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: thought experiment: use spark ML to real time prediction

> One such "lightweight PMML in JSON" is here -
> https://github.com/bigmlcom/json-pml. At least for the schema definitions, but
> nothing is available in terms of evaluation/scoring. Perhaps this is something
> that could form the basis for such a new undertaking.
>
> I agree that distributed models are only really applicable in the case of
> massive-scale factor models - and even then, for latency purposes, one needs
> to use LSH or something similar to achieve sufficiently real-time performance.
> These days one can easily spin up a single very powerful server to handle even
> very large models.
>
> On Tue, Nov 17, 2015 at 11:34 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>> I was thinking about working on a better version of PMML, JMML in JSON, but as
>> you said, this requires a dedicated team to define the standard, which would be
>> a huge amount of work. However, options (b) and (c) still don't address the
>> distributed models issue. In fact, most models in production have to be small
>> enough to return results to users within reasonable latency, so I doubt the
>> usefulness of distributed models in real production use cases. For R
>> and Python, we can build a wrapper on top of the lightweight
>> "spark-ml-common" project.
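The pattern Andy describes - train in Spark, score elsewhere - only requires getting the model's parameters out of the cluster. Here is a minimal sketch of the non-Spark scoring side, assuming the training job has exported a logistic regression model's coefficients to JSON; the field names and coefficient values below are hypothetical, not any actual Spark export format:

```python
import json
import math

# Coefficients as they might be exported by a Spark training job.
# The values here are made up purely for illustration.
exported = json.dumps({
    "type": "logistic_regression",
    "weights": [0.8, -1.2, 0.5],
    "intercept": 0.1,
})

def score(model_json, features):
    """Score one feature vector with no Spark dependency at all."""
    model = json.loads(model_json)
    margin = model["intercept"] + sum(
        w * x for w, x in zip(model["weights"], features))
    return 1.0 / (1.0 + math.exp(-margin))  # logistic sigmoid

p = score(exported, [1.0, 0.5, 2.0])
print(round(p, 4))
```

A real-time system would load the JSON once at startup and call `score` per request; nothing on this path needs the Spark jars.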
>>
>> Sincerely,
>>
>> DB Tsai
>> ----------------------------------------------------------
>> Web: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>>
>> On Tue, Nov 17, 2015 at 2:29 AM, Nick Pentreath <nick.pentre...@gmail.com>
>> wrote:
>>> I think the issue with pulling in all of spark-core is often the
>>> dependencies (and versions) conflicting with the web framework (or Akka in
>>> many cases). Plus it really is quite heavy if you just want a fairly
>>> lightweight model-serving app. For example, we've built a fairly simple but
>>> scalable ALS factor model server on Scalatra, Akka and Breeze. So all you
>>> really need is the web framework and Breeze (or an alternative linear
>>> algebra lib).
>>>
>>> I definitely hear the pain point that PMML might not be able to handle some
>>> types of transformations or models that exist in Spark. However, here's an
>>> example from scikit-learn -> PMML that may be instructive
>>> (https://github.com/scikit-learn/scikit-learn/issues/1596 and
>>> https://github.com/jpmml/jpmml-sklearn), where a fairly impressive list of
>>> estimators and transformers is supported (including e.g. scaling,
>>> encoding, and PCA).
>>>
>>> I definitely think the current model I/O and "export" or "deploy to
>>> production" situation needs to be improved substantially. However, you are
>>> left with the following options:
>>>
>>> (a) Build out a lightweight "spark-ml-common" project that brings in the
>>> dependencies needed for production scoring / transformation in independent
>>> apps. However, here you only support Scala/Java - what about R and Python?
>>> Also, what about distributed models? Perhaps "local" wrappers can be
>>> created, though this may not work for very large factor or LDA models. See
>>> also the H2O example: http://docs.h2o.ai/h2oclassic/userguide/scorePOJO.html
>>>
>>> (b) Build out Spark's PMML support, and add missing pieces to PMML where
>>> possible.
>>> The benefit here is an existing standard with various tools for
>>> scoring (via REST server, Java app, Pig, Hive, and various language support).
>>>
>>> (c) Build out a more comprehensive I/O, serialization and scoring framework.
>>> Here you face the issue of supporting various predictors and transformers
>>> generically, across platforms and versions - i.e. you're re-creating a new
>>> standard like PMML.
>>>
>>> Option (a) is doable, but I'm a bit concerned that it may be too "Spark
>>> specific", or even too "Scala / Java" specific. But it is still potentially
>>> very useful to Spark users to build this out and have a somewhat standard
>>> production serving framework and/or library (there are obviously existing
>>> options like PredictionIO etc.).
>>>
>>> Option (b) is really building out the existing PMML support within Spark, so
>>> a lot of the initial work has already been done. I know some folks had (or
>>> have) licensing issues with some components of JPMML (e.g. the evaluator and
>>> REST server). But perhaps the solution here is to build an Apache2-licensed
>>> evaluator framework.
>>>
>>> Option (c) is obviously interesting - "let's build a better PMML (that uses
>>> JSON or whatever instead of XML)!" But it also seems like a huge amount of
>>> reinventing the wheel, and like any new standard it would take time to garner
>>> wide support (if at all).
>>>
>>> It would be really useful to start to understand what the main missing
>>> pieces are in PMML - perhaps the lowest-hanging fruit is simply to
>>> contribute improvements or additions to PMML.
>>>
>>> On Fri, Nov 13, 2015 at 11:46 AM, Sabarish Sasidharan
>>> <sabarish.sasidha...@manthan.com> wrote:
>>>> That may not be an issue if the app using the models runs by itself (not
>>>> bundled into an existing app), which may actually be the right way to
>>>> design it considering separation of concerns.
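To make option (c) concrete: a "PMML in JSON" interchange format would essentially be an ordered list of pipeline stages plus a small evaluator that walks them. A toy sketch follows; the stage names and schema are invented for illustration and are not any existing standard:

```python
import json

# Hypothetical pipeline description: an ordered list of stages, each
# with a "kind" tag and its fitted parameters (all values made up).
pipeline_json = json.dumps({
    "stages": [
        {"kind": "standard_scaler", "mean": [2.0, 10.0], "std": [1.0, 5.0]},
        {"kind": "linear_model", "weights": [0.5, -0.25], "intercept": 1.0},
    ]
})

def evaluate(pipeline, features):
    """Run a feature vector through every stage in order."""
    x = list(features)
    for stage in json.loads(pipeline)["stages"]:
        if stage["kind"] == "standard_scaler":
            x = [(v - m) / s
                 for v, m, s in zip(x, stage["mean"], stage["std"])]
        elif stage["kind"] == "linear_model":
            return stage["intercept"] + sum(
                w * v for w, v in zip(stage["weights"], x))
        else:
            raise ValueError("unsupported stage: " + stage["kind"])

print(evaluate(pipeline_json, [3.0, 15.0]))
```

The hard part a real standard faces is exactly what the thread describes: agreeing on the set of `kind`s and their semantics across platforms and versions.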
>>>>
>>>> Regards
>>>> Sab
>>>>
>>>> On Fri, Nov 13, 2015 at 9:59 AM, DB Tsai <dbt...@dbtsai.com> wrote:
>>>>> This will bring in the whole dependency tree of Spark, which may break
>>>>> the web app.
>>>>>
>>>>> Sincerely,
>>>>>
>>>>> DB Tsai
>>>>> ----------------------------------------------------------
>>>>> Web: https://www.dbtsai.com
>>>>> PGP Key ID: 0xAF08DF8D
>>>>>
>>>>> On Thu, Nov 12, 2015 at 8:15 PM, Nirmal Fernando <nir...@wso2.com> wrote:
>>>>>>
>>>>>> On Fri, Nov 13, 2015 at 2:04 AM, darren <dar...@ontrenet.com> wrote:
>>>>>>> I agree 100%. Making the model requires large data and many CPUs.
>>>>>>> Using it does not.
>>>>>>>
>>>>>>> This is a very useful side effect of ML models.
>>>>>>>
>>>>>>> If MLlib can't use models outside Spark, that's a real shame.
>>>>>>
>>>>>> Well, you can, as mentioned earlier. You don't need the Spark runtime for
>>>>>> predictions: save the serialized model and deserialize it to use. (You do
>>>>>> need the Spark jars on the classpath, though.)
>>>>>>>
>>>>>>> Sent from my Verizon Wireless 4G LTE smartphone
>>>>>>>
>>>>>>> -------- Original message --------
>>>>>>> From: "Kothuvatiparambil, Viju"
>>>>>>> <viju.kothuvatiparam...@bankofamerica.com>
>>>>>>> Date: 11/12/2015 3:09 PM (GMT-05:00)
>>>>>>> To: DB Tsai <dbt...@dbtsai.com>, Sean Owen <so...@cloudera.com>
>>>>>>> Cc: Felix Cheung <felixcheun...@hotmail.com>, Nirmal Fernando
>>>>>>> <nir...@wso2.com>, Andy Davidson <a...@santacruzintegration.com>, Adrian
>>>>>>> Tanase <atan...@adobe.com>, "user @spark" <user@spark.apache.org>,
>>>>>>> Xiangrui Meng <men...@gmail.com>, hol...@pigscanfly.ca
>>>>>>> Subject: RE: thought experiment: use spark ML to real time prediction
>>>>>>>
>>>>>>> I am glad to see DB's comments; they make me feel I am not the only one
>>>>>>> facing these issues. If we were able to use MLlib to load the model in
>>>>>>> web applications (outside the Spark cluster), that would have solved the
>>>>>>> issue.
>>>>>>> I understand Spark is mainly for processing big data in a
>>>>>>> distributed mode. But there is no point in training a model using
>>>>>>> MLlib if we are not able to use it in the applications that need to
>>>>>>> access the model.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Viju
>>>>>>>
>>>>>>> From: DB Tsai [mailto:dbt...@dbtsai.com]
>>>>>>> Sent: Thursday, November 12, 2015 11:04 AM
>>>>>>> To: Sean Owen
>>>>>>> Cc: Felix Cheung; Nirmal Fernando; Andy Davidson; Adrian Tanase; user
>>>>>>> @spark; Xiangrui Meng; hol...@pigscanfly.ca
>>>>>>> Subject: Re: thought experiment: use spark ML to real time prediction
>>>>>>>
>>>>>>> I think the use case can be quite different from PMML.
>>>>>>>
>>>>>>> Having a Spark-platform-independent ML jar would empower users to do
>>>>>>> the following:
>>>>>>>
>>>>>>> 1) PMML doesn't cover all the models we have in MLlib. Also, for an ML
>>>>>>> pipeline trained by Spark, most of the time PMML is not expressive enough
>>>>>>> to do all the transformations we have in Spark ML. As a result, if we were
>>>>>>> able to serialize the entire Spark ML pipeline after training, and then
>>>>>>> load it back in an app without any Spark platform for production
>>>>>>> scoring, this would be very useful for production deployment of Spark ML
>>>>>>> models. The only issue is transformers that involve a shuffle;
>>>>>>> we need to figure out a way to handle those. When I chatted with Xiangrui
>>>>>>> about this, he suggested that we may tag whether a transformer is
>>>>>>> shuffle ready. Currently, at Netflix, we are not able to use ML pipelines
>>>>>>> because of these issues, and we have to write our own scorers for
>>>>>>> production, which is quite a lot of duplicated work.
>>>>>>>
>>>>>>> 2) If users could use Spark's linear algebra, like the vector or matrix
>>>>>>> code, in their applications, this would be very useful.
>>>>>>> This can help share code between the Spark training pipeline and
>>>>>>> production deployment. Also, lots of the good stuff in Spark's MLlib
>>>>>>> doesn't depend on the Spark platform, and people could use it in their
>>>>>>> applications without pulling in lots of dependencies. In fact, in my
>>>>>>> project, I have to copy & paste code from MLlib into my own project to
>>>>>>> use those goodies in apps.
>>>>>>>
>>>>>>> 3) Currently, MLlib depends on GraphX, which means that in GraphX there
>>>>>>> is no way to use MLlib's vector or matrix. And
>>>>>>
>>>>>> --
>>>>>> Thanks & regards,
>>>>>> Nirmal
>>>>>>
>>>>>> Team Lead - WSO2 Machine Learner
>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>> Mobile: +94715779733
>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>
>>>> --
>>>> Architect - Big Data
>>>> Ph: +91 99805 99458
>>>>
>>>> Manthan Systems | Company of the year - Analytics (2014 Frost and Sullivan
>>>> India ICT)
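DB's point (2), like Nick's earlier ALS server example, comes down to the observation that serving a factor model mostly needs a few dot products, not the Spark platform. A hypothetical sketch of a local, dependency-free scorer for an ALS-style model whose user and item factors were exported from a training job (all factor values below are made up):

```python
# Factors as they might be exported after ALS training in Spark:
# plain lists of floats keyed by id (values invented for illustration).
user_factors = {"u1": [0.9, 0.1]}
item_factors = {"i1": [0.8, 0.2], "i2": [0.1, 0.9], "i3": [0.5, 0.5]}

def dot(a, b):
    """The only linear algebra a scorer like this needs."""
    return sum(x * y for x, y in zip(a, b))

def recommend(user_id, top_n=2):
    """Rank items for a user by factor dot product - no Spark required."""
    u = user_factors[user_id]
    ranked = sorted(item_factors,
                    key=lambda i: dot(u, item_factors[i]),
                    reverse=True)
    return ranked[:top_n]

print(recommend("u1"))
```

For very large factor sets, the exhaustive sort above is where something like LSH (as Nick mentions) would replace the brute-force ranking.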