Hi Nick, I started this thread. IMHO we need something like Spark to train our models, but the resulting models are typically small enough to easily fit on a single machine. My real-time production system is not built on Spark, yet it needs to use the model to make predictions in real time.
Use case: "high-frequency stock trading". Use Spark to train a model. There is no way I could use Spark Streaming in the real-time production system. I need some way to easily move the model trained using Spark to a non-Spark environment so I can make predictions in real time. "Credit card fraud detection" is another similar use case.

Kind regards

Andy

From: Nick Pentreath <nick.pentre...@gmail.com>
Date: Wednesday, November 18, 2015 at 4:03 AM
To: DB Tsai <dbt...@dbtsai.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: thought experiment: use spark ML to real time prediction

> One such "lightweight PMML in JSON" is here -
> https://github.com/bigmlcom/json-pml. At least for the schema definitions, but
> nothing is available in terms of evaluation/scoring. Perhaps this is something
> that could form the basis for such a new undertaking.
>
> I agree that distributed models are only really applicable in the case of
> massive-scale factor models - and even then, for latency purposes, one needs
> to use LSH or something similar to achieve sufficiently real-time performance.
> These days one can easily spin up a single very powerful server to handle even
> very large models.
>
> On Tue, Nov 17, 2015 at 11:34 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>> I was thinking about working on a better version of PMML, JMML in JSON, but as
>> you said, this requires a dedicated team to define the standard, which would be
>> a huge amount of work. However, options (b) and (c) still don't address the
>> distributed models issue. In fact, most models in production have to be small
>> enough to return results to users within reasonable latency, so I doubt the
>> usefulness of distributed models in real production use cases. For R
>> and Python, we can build a wrapper on top of the lightweight
>> "spark-ml-common" project.
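The pattern Andy describes - train in Spark, score elsewhere - only requires getting the model's parameters out of the cluster. Here is a minimal sketch of the non-Spark scoring side, assuming the training job has exported a logistic regression model's coefficients to JSON; the field names and coefficient values below are hypothetical, not any actual Spark export format:

```python
import json
import math

# Coefficients as they might be exported by a Spark training job.
# The values here are made up purely for illustration.
exported = json.dumps({
    "type": "logistic_regression",
    "weights": [0.8, -1.2, 0.5],
    "intercept": 0.1,
})

def score(model_json, features):
    """Score one feature vector with no Spark dependency at all."""
    model = json.loads(model_json)
    margin = model["intercept"] + sum(
        w * x for w, x in zip(model["weights"], features))
    return 1.0 / (1.0 + math.exp(-margin))  # logistic sigmoid

p = score(exported, [1.0, 0.5, 2.0])
print(round(p, 4))
```

A real-time system would load the JSON once at startup and call `score` per request; nothing on this path needs the Spark jars.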
>>
>> Sincerely,
>>
>> DB Tsai
>> ----------------------------------------------------------
>> Web: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>>
>> On Tue, Nov 17, 2015 at 2:29 AM, Nick Pentreath <nick.pentre...@gmail.com>
>> wrote:
>>> I think the issue with pulling in all of spark-core is often the
>>> dependencies (and versions) conflicting with the web framework (or Akka in
>>> many cases). Plus it really is quite heavy if you just want a fairly
>>> lightweight model-serving app. For example, we've built a fairly simple but
>>> scalable ALS factor model server on Scalatra, Akka and Breeze. So all you
>>> really need is the web framework and Breeze (or an alternative linear
>>> algebra lib).
>>>
>>> I definitely hear the pain point that PMML might not be able to handle some
>>> types of transformations or models that exist in Spark. However, here's an
>>> example from scikit-learn -> PMML that may be instructive
>>> (https://github.com/scikit-learn/scikit-learn/issues/1596 and
>>> https://github.com/jpmml/jpmml-sklearn), where a fairly impressive list of
>>> estimators and transformers is supported (including e.g. scaling,
>>> encoding, and PCA).
>>>
>>> I definitely think the current model I/O and "export" or "deploy to
>>> production" situation needs to be improved substantially. However, you are
>>> left with the following options:
>>>
>>> (a) Build out a lightweight "spark-ml-common" project that brings in the
>>> dependencies needed for production scoring / transformation in independent
>>> apps. However, here you only support Scala/Java - what about R and Python?
>>> Also, what about distributed models? Perhaps "local" wrappers can be
>>> created, though this may not work for very large factor or LDA models. See
>>> also the H2O example: http://docs.h2o.ai/h2oclassic/userguide/scorePOJO.html
>>>
>>> (b) Build out Spark's PMML support, and add missing pieces to PMML where
>>> possible.
>>> The benefit here is an existing standard with various tools for
>>> scoring (via REST server, Java app, Pig, Hive, and various language support).
>>>
>>> (c) Build out a more comprehensive I/O, serialization and scoring framework.
>>> Here you face the issue of supporting various predictors and transformers
>>> generically, across platforms and versions - i.e. you're re-creating a new
>>> standard like PMML.
>>>
>>> Option (a) is doable, but I'm a bit concerned that it may be too "Spark
>>> specific", or even too "Scala / Java" specific. But it is still potentially
>>> very useful to Spark users to build this out and have a somewhat standard
>>> production serving framework and/or library (there are obviously existing
>>> options like PredictionIO etc.).
>>>
>>> Option (b) is really building out the existing PMML support within Spark, so
>>> a lot of the initial work has already been done. I know some folks had (or
>>> have) licensing issues with some components of JPMML (e.g. the evaluator and
>>> REST server). But perhaps the solution here is to build an Apache2-licensed
>>> evaluator framework.
>>>
>>> Option (c) is obviously interesting - "let's build a better PMML (that uses
>>> JSON or whatever instead of XML)!" But it also seems like a huge amount of
>>> reinventing the wheel, and like any new standard it would take time to garner
>>> wide support (if at all).
>>>
>>> It would be really useful to start to understand what the main missing
>>> pieces are in PMML - perhaps the lowest-hanging fruit is simply to
>>> contribute improvements or additions to PMML.
>>>
>>> On Fri, Nov 13, 2015 at 11:46 AM, Sabarish Sasidharan
>>> <sabarish.sasidha...@manthan.com> wrote:
>>>> That may not be an issue if the app using the models runs by itself (not
>>>> bundled into an existing app), which may actually be the right way to
>>>> design it considering separation of concerns.
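To make option (c) concrete: a "PMML in JSON" interchange format would essentially be an ordered list of pipeline stages plus a small evaluator that walks them. A toy sketch follows; the stage names and schema are invented for illustration and are not any existing standard:

```python
import json

# Hypothetical pipeline description: an ordered list of stages, each
# with a "kind" tag and its fitted parameters (all values made up).
pipeline_json = json.dumps({
    "stages": [
        {"kind": "standard_scaler", "mean": [2.0, 10.0], "std": [1.0, 5.0]},
        {"kind": "linear_model", "weights": [0.5, -0.25], "intercept": 1.0},
    ]
})

def evaluate(pipeline, features):
    """Run a feature vector through every stage in order."""
    x = list(features)
    for stage in json.loads(pipeline)["stages"]:
        if stage["kind"] == "standard_scaler":
            x = [(v - m) / s
                 for v, m, s in zip(x, stage["mean"], stage["std"])]
        elif stage["kind"] == "linear_model":
            return stage["intercept"] + sum(
                w * v for w, v in zip(stage["weights"], x))
        else:
            raise ValueError("unsupported stage: " + stage["kind"])

print(evaluate(pipeline_json, [3.0, 15.0]))
```

The hard part a real standard faces is exactly what the thread describes: agreeing on the set of `kind`s and their semantics across platforms and versions.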
>>>>
>>>> Regards
>>>> Sab
>>>>
>>>> On Fri, Nov 13, 2015 at 9:59 AM, DB Tsai <dbt...@dbtsai.com> wrote:
>>>>> This will bring in the whole dependency tree of Spark, which may break
>>>>> the web app.
>>>>>
>>>>> Sincerely,
>>>>>
>>>>> DB Tsai
>>>>> ----------------------------------------------------------
>>>>> Web: https://www.dbtsai.com
>>>>> PGP Key ID: 0xAF08DF8D
>>>>>
>>>>> On Thu, Nov 12, 2015 at 8:15 PM, Nirmal Fernando <nir...@wso2.com> wrote:
>>>>>>
>>>>>> On Fri, Nov 13, 2015 at 2:04 AM, darren <dar...@ontrenet.com> wrote:
>>>>>>> I agree 100%. Making the model requires large data and many CPUs.
>>>>>>> Using it does not.
>>>>>>>
>>>>>>> This is a very useful side effect of ML models.
>>>>>>>
>>>>>>> If MLlib can't use models outside Spark, that's a real shame.
>>>>>>
>>>>>> Well, you can, as mentioned earlier. You don't need the Spark runtime for
>>>>>> predictions: save the serialized model and deserialize it to use. (You do
>>>>>> need the Spark jars on the classpath, though.)
>>>>>>>
>>>>>>> Sent from my Verizon Wireless 4G LTE smartphone
>>>>>>>
>>>>>>> -------- Original message --------
>>>>>>> From: "Kothuvatiparambil, Viju"
>>>>>>> <viju.kothuvatiparam...@bankofamerica.com>
>>>>>>> Date: 11/12/2015 3:09 PM (GMT-05:00)
>>>>>>> To: DB Tsai <dbt...@dbtsai.com>, Sean Owen <so...@cloudera.com>
>>>>>>> Cc: Felix Cheung <felixcheun...@hotmail.com>, Nirmal Fernando
>>>>>>> <nir...@wso2.com>, Andy Davidson <a...@santacruzintegration.com>, Adrian
>>>>>>> Tanase <atan...@adobe.com>, "user @spark" <user@spark.apache.org>,
>>>>>>> Xiangrui Meng <men...@gmail.com>, hol...@pigscanfly.ca
>>>>>>> Subject: RE: thought experiment: use spark ML to real time prediction
>>>>>>>
>>>>>>> I am glad to see DB's comments; they make me feel I am not the only one
>>>>>>> facing these issues. If we were able to use MLlib to load the model in
>>>>>>> web applications (outside the Spark cluster), that would have solved the
>>>>>>> issue.
>>>>>>> I understand Spark is mainly for processing big data in a
>>>>>>> distributed mode. But there is no point in training a model using
>>>>>>> MLlib if we are not able to use it in the applications that need to
>>>>>>> access the model.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Viju
>>>>>>>
>>>>>>> From: DB Tsai [mailto:dbt...@dbtsai.com]
>>>>>>> Sent: Thursday, November 12, 2015 11:04 AM
>>>>>>> To: Sean Owen
>>>>>>> Cc: Felix Cheung; Nirmal Fernando; Andy Davidson; Adrian Tanase; user
>>>>>>> @spark; Xiangrui Meng; hol...@pigscanfly.ca
>>>>>>> Subject: Re: thought experiment: use spark ML to real time prediction
>>>>>>>
>>>>>>> I think the use case can be quite different from PMML.
>>>>>>>
>>>>>>> Having a Spark-platform-independent ML jar would empower users to do
>>>>>>> the following:
>>>>>>>
>>>>>>> 1) PMML doesn't cover all the models we have in MLlib. Also, for an ML
>>>>>>> pipeline trained by Spark, most of the time PMML is not expressive enough
>>>>>>> to do all the transformations we have in Spark ML. As a result, if we were
>>>>>>> able to serialize the entire Spark ML pipeline after training, and then
>>>>>>> load it back in an app without any Spark platform for production
>>>>>>> scoring, this would be very useful for production deployment of Spark ML
>>>>>>> models. The only issue is transformers that involve a shuffle;
>>>>>>> we need to figure out a way to handle those. When I chatted with Xiangrui
>>>>>>> about this, he suggested that we may tag whether a transformer is
>>>>>>> shuffle ready. Currently, at Netflix, we are not able to use ML pipelines
>>>>>>> because of these issues, and we have to write our own scorers for
>>>>>>> production, which is quite a lot of duplicated work.
>>>>>>>
>>>>>>> 2) If users could use Spark's linear algebra, like the vector or matrix
>>>>>>> code, in their applications, this would be very useful.
>>>>>>> This can help share code between the Spark training pipeline and
>>>>>>> production deployment. Also, lots of the good stuff in Spark's MLlib
>>>>>>> doesn't depend on the Spark platform, and people could use it in their
>>>>>>> applications without pulling in lots of dependencies. In fact, in my
>>>>>>> project, I have to copy & paste code from MLlib into my own project to
>>>>>>> use those goodies in apps.
>>>>>>>
>>>>>>> 3) Currently, MLlib depends on GraphX, which means that in GraphX there
>>>>>>> is no way to use MLlib's vector or matrix. And
>>>>>>
>>>>>> --
>>>>>> Thanks & regards,
>>>>>> Nirmal
>>>>>>
>>>>>> Team Lead - WSO2 Machine Learner
>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>> Mobile: +94715779733
>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>
>>>> --
>>>> Architect - Big Data
>>>> Ph: +91 99805 99458
>>>>
>>>> Manthan Systems | Company of the year - Analytics (2014 Frost and Sullivan
>>>> India ICT)
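DB's point (2), like Nick's earlier ALS server example, comes down to the observation that serving a factor model mostly needs a few dot products, not the Spark platform. A hypothetical sketch of a local, dependency-free scorer for an ALS-style model whose user and item factors were exported from a training job (all factor values below are made up):

```python
# Factors as they might be exported after ALS training in Spark:
# plain lists of floats keyed by id (values invented for illustration).
user_factors = {"u1": [0.9, 0.1]}
item_factors = {"i1": [0.8, 0.2], "i2": [0.1, 0.9], "i3": [0.5, 0.5]}

def dot(a, b):
    """The only linear algebra a scorer like this needs."""
    return sum(x * y for x, y in zip(a, b))

def recommend(user_id, top_n=2):
    """Rank items for a user by factor dot product - no Spark required."""
    u = user_factors[user_id]
    ranked = sorted(item_factors,
                    key=lambda i: dot(u, item_factors[i]),
                    reverse=True)
    return ranked[:top_n]

print(recommend("u1"))
```

For very large factor sets, the exhaustive sort above is where something like LSH (as Nick mentions) would replace the brute-force ranking.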