I agree with you that this is needed. There is a JIRA for it: https://issues.apache.org/jira/browse/SPARK-10413
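Until that JIRA lands, the usual workaround for single-instance scoring in the new ml package is to wrap the features in a one-row DataFrame and call transform. A minimal sketch in Scala; the model path, feature values, and column name are illustrative assumptions, not from this thread:

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Local session just for scoring; in a real service this would be shared.
val spark = SparkSession.builder().master("local[1]").appName("score").getOrCreate()
import spark.implicits._

// Load a previously trained PipelineModel (path is hypothetical).
val model = PipelineModel.load("/path/to/saved/pipeline-model")

// Wrap the single feature vector in a one-row DataFrame.
val single = Seq(Tuple1(Vectors.dense(0.5, 1.2, 3.4))).toDF("features")

// transform() runs every fitted stage; the per-call overhead of planning
// a DataFrame job is exactly what a raw predict(vector) API would avoid.
val prediction = model.transform(single).select("prediction").head.getDouble(0)
```

This works today, but as discussed below, the DataFrame round-trip dominates latency for one-row requests, which is the motivation for both SPARK-10413 and MLeap.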
On Sun, Feb 5, 2017 at 11:21 PM, Debasish Das <debasish.da...@gmail.com> wrote:

> Hi Aseem,
>
> Due to our production deploy we did not upgrade to 2.0 yet, but that's a
> critical item on our list.
>
> For exposing models out of PipelineModel, let me look into the ML tasks.
> We should add it, since a DataFrame should not be a must for model
> scoring. Models are often scored on API or streaming paths that have no
> micro-batching involved; data lands directly from HTTP or Kafka/message
> queues. For such cases raw access to the ml model is essential, similar
> to mllib model access.
>
> Thanks,
> Deb
>
> On Feb 4, 2017 9:58 PM, "Aseem Bansal" <asmbans...@gmail.com> wrote:
>
>> @Debasish
>>
>> I see that the Spark version used in the project you mentioned is
>> 1.6.0. I would suggest taking a look at some blogs about Spark 2.0
>> Pipelines and Models in the new ml package. As of the latest Spark
>> 2.1.0 release, the new ml package's API has no way to call predict on
>> a single vector; no such API is exposed. It is work in progress but
>> not yet released.
>>
>> On Sat, Feb 4, 2017 at 11:07 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>
>>> If we expose an API to access the raw models out of PipelineModel,
>>> can't we call predict directly on it from an API? Is there a task
>>> open to expose the model out of PipelineModel so that predict can be
>>> called on it? There is no dependency on a Spark context in an ml
>>> model.
>>>
>>> On Feb 4, 2017 9:11 AM, "Aseem Bansal" <asmbans...@gmail.com> wrote:
>>>
>>>> - In Spark 2.0 there is a class called PipelineModel. I know that
>>>>   the title says pipeline, but it is actually about a PipelineModel
>>>>   trained via a Pipeline.
>>>> - Why PipelineModel instead of Pipeline? Because doing ML usually
>>>>   involves a series of steps that warrants an ordered sequence of
>>>>   operations. Read the new Spark ML docs or one of the Databricks
>>>>   blogs related to Spark pipelines.
>>>>   If you have used Python's sklearn library, the concept is
>>>>   inspired from there.
>>>> - "once model is deserialized as ml model from the store of choice
>>>>   within ms" - The time taken to load the model was not what I was
>>>>   referring to when I was talking about timing.
>>>> - "it can be used on incoming features to score through
>>>>   spark.ml.Model predict API" - The predict API is in the old mllib
>>>>   package, not the new ml package.
>>>> - "why r we using dataframe and not the ML model directly from API"
>>>>   - Because as of now the new ml package does not have a direct API.
>>>>
>>>> On Sat, Feb 4, 2017 at 10:24 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>
>>>>> I am not sure why I would use a pipeline to do scoring. The idea is
>>>>> to build a model, use the model ser/deser feature to put it in the
>>>>> row or column store of choice, and provide API access to the model.
>>>>> We support these primitives in github.com/Verizon/trapezium; the
>>>>> API has access to a Spark context in local or distributed mode.
>>>>> Once model is deserialized as ml model from the store of choice
>>>>> within ms, it can be used on incoming features to score through
>>>>> spark.ml.Model predict API. I am not clear on the 2200x speedup:
>>>>> why r we using dataframe and not the ML model directly from API?
>>>>>
>>>>> On Feb 4, 2017 7:52 AM, "Aseem Bansal" <asmbans...@gmail.com> wrote:
>>>>>
>>>>>> Does this support Java 7?
>>>>>> What is your timezone in case someone wanted to talk?
>>>>>>
>>>>>> On Fri, Feb 3, 2017 at 10:23 PM, Hollin Wilkins <hol...@combust.ml> wrote:
>>>>>>
>>>>>>> Hey Aseem,
>>>>>>>
>>>>>>> We have built pipelines that execute several string indexers,
>>>>>>> one-hot encoders, scaling, and a random forest or linear
>>>>>>> regression at the end. Execution time for the linear regression
>>>>>>> was on the order of 11 microseconds, a bit longer for random
>>>>>>> forest.
>>>>>>> This can be further optimized to around 2-3 microseconds by
>>>>>>> using row-based transformations if your pipeline is simple. The
>>>>>>> pipeline operated on roughly 12 input features, and by the time
>>>>>>> all the processing was done we had somewhere around 1000
>>>>>>> features going into the linear regression after one-hot encoding
>>>>>>> and everything else.
>>>>>>>
>>>>>>> Hope this helps,
>>>>>>> Hollin
>>>>>>>
>>>>>>> On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal <asmbans...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Does this support Java 7?
>>>>>>>>
>>>>>>>> On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal <asmbans...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Is computational time for predictions on the order of a few
>>>>>>>>> milliseconds (< 10 ms) like the old mllib library?
>>>>>>>>>
>>>>>>>>> On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins <hol...@combust.ml> wrote:
>>>>>>>>>
>>>>>>>>>> Hey everyone,
>>>>>>>>>>
>>>>>>>>>> Some of you may have seen Mikhail and me talk at Spark/Hadoop
>>>>>>>>>> Summits about MLeap and how you can use it to build
>>>>>>>>>> production services from your Spark-trained ML pipelines.
>>>>>>>>>> MLeap is an open-source technology that allows data
>>>>>>>>>> scientists and engineers to deploy Spark-trained ML pipelines
>>>>>>>>>> and models to a scoring engine instantly. The MLeap execution
>>>>>>>>>> engine has no dependencies on a Spark context, and the
>>>>>>>>>> serialization format is entirely based on Protobuf 3 and
>>>>>>>>>> JSON.
>>>>>>>>>>
>>>>>>>>>> The recent 0.5.0 release provides serialization and inference
>>>>>>>>>> support for close to 100% of Spark transformers (we don't yet
>>>>>>>>>> support ALS and LDA).
>>>>>>>>>> MLeap is open-source; take a look at our GitHub page:
>>>>>>>>>> https://github.com/combust/mleap
>>>>>>>>>>
>>>>>>>>>> Or join the conversation on Gitter:
>>>>>>>>>> https://gitter.im/combust/mleap
>>>>>>>>>>
>>>>>>>>>> We have a set of documentation to help get you started here:
>>>>>>>>>> http://mleap-docs.combust.ml/
>>>>>>>>>>
>>>>>>>>>> We even have a set of demos for training ML pipelines and
>>>>>>>>>> linear, logistic, and random forest models:
>>>>>>>>>> https://github.com/combust/mleap-demo
>>>>>>>>>>
>>>>>>>>>> Check out our latest MLeap-serving Docker image, which allows
>>>>>>>>>> you to expose a REST interface to your Spark ML pipeline
>>>>>>>>>> models:
>>>>>>>>>> http://mleap-docs.combust.ml/mleap-serving/
>>>>>>>>>>
>>>>>>>>>> Several companies are using MLeap in production, and even
>>>>>>>>>> more are currently evaluating it. Take a look and tell us
>>>>>>>>>> what you think! We hope to talk with you soon and welcome
>>>>>>>>>> feedback/suggestions!
>>>>>>>>>>
>>>>>>>>>> Sincerely,
>>>>>>>>>> Hollin and Mikhail
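For readers following along, the kind of pipeline Hollin describes earlier in the thread (string indexers, one-hot encoders, scaling, then a linear regression) looks roughly like this in Spark 2.1's ml package. This is a sketch only: the column names ("category", "num1", "num2", "label") and the training DataFrame are illustrative assumptions, not taken from the thread:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StandardScaler, StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.DataFrame

// Categorical column -> integer index -> one-hot vector.
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIdx")
val encoder = new OneHotEncoder().setInputCol("categoryIdx").setOutputCol("categoryVec")

// Assemble the encoded vector plus numeric columns, then scale.
val assembler = new VectorAssembler()
  .setInputCols(Array("categoryVec", "num1", "num2"))
  .setOutputCol("rawFeatures")
val scaler = new StandardScaler().setInputCol("rawFeatures").setOutputCol("features")

// Linear regression at the end, as in the pipelines described above.
val lr = new LinearRegression().setFeaturesCol("features").setLabelCol("label")

val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, scaler, lr))

// fit() on a Pipeline (an Estimator) returns a PipelineModel (a
// Transformer) -- the distinction discussed earlier in the thread.
def train(trainingDf: DataFrame) = pipeline.fit(trainingDf)
```

A PipelineModel produced this way is what MLeap serializes to its Spark-independent bundle format for low-latency scoring; see the MLeap docs linked above for the serialization API.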