I think the use case can be quite different from PMML.

Having a Spark-platform-independent ML jar would empower users to do the
following:

1) PMML doesn't cover all the models we have in MLlib, and for an ML
pipeline trained by Spark, PMML is usually not expressive enough to
represent all the transformations we have in Spark ML. As a result, if we
could serialize an entire Spark ML pipeline after training, and then load
it back in an application without any Spark platform for production
scoring, this would be very useful for production deployment of Spark ML
models. The only issue is when a transformer involves a shuffle; we need
to figure out a way to handle that. When I chatted with Xiangrui about
this, he suggested that we could tag whether a transformer is shuffle
ready. Currently, at Netflix, we are not able to use ML pipelines because
of these issues, and we have to write our own scorers for production,
which is quite a lot of duplicated work.
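To illustrate the duplicated-scorer problem: without a Spark-free way to load a trained pipeline, every production service ends up re-implementing the scoring math by hand. A minimal sketch (plain Python, hypothetical coefficients, no Spark API involved) of what such a hand-written logistic-regression scorer looks like:

```python
import math

def predict_proba(weights, intercept, features):
    """Hand-rolled logistic regression scoring: sigmoid(w . x + b).
    This is exactly the kind of logic that gets duplicated outside Spark."""
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    return 1.0 / (1.0 + math.exp(-margin))

# Coefficients as they might be exported from a trained model (made up here)
weights = [0.5, -1.2, 0.3]
intercept = 0.1

p = predict_proba(weights, intercept, [1.0, 0.5, 2.0])
```

Every feature transformation in the pipeline (scaling, hashing, one-hot encoding, ...) has to be re-implemented the same way, which is where the duplication really hurts.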

2) If users could use Spark's linear algebra code, such as its vector and
matrix types, in their applications, that would be very useful: it would
let code be shared between the Spark training pipeline and the production
deployment. Also, a lot of the good stuff in Spark's MLlib doesn't depend
on the Spark platform at all, and people could use it in their
applications without pulling in lots of dependencies. In fact, in my
project, I have to copy and paste code from MLlib into my project to use
those goodies in apps.
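As an example of the kind of platform-independent "goodie" that ends up copied around, here is a sparse-vector dot product in the (sorted indices, values) layout that MLlib's SparseVector also uses. This is a generic sketch of the technique, not MLlib's actual implementation:

```python
def sparse_dot(indices_a, values_a, indices_b, values_b):
    """Dot product of two sparse vectors, each given as parallel lists of
    sorted indices and values. Pure local computation, no Spark needed."""
    i = j = 0
    acc = 0.0
    while i < len(indices_a) and j < len(indices_b):
        if indices_a[i] == indices_b[j]:
            acc += values_a[i] * values_b[j]
            i += 1
            j += 1
        elif indices_a[i] < indices_b[j]:
            i += 1
        else:
            j += 1
    return acc

# v1 = [0, 2.0, 0, 3.0] and v2 = [1.0, 4.0, 0, 0]: only index 1 overlaps
result = sparse_dot([1, 3], [2.0, 3.0], [0, 1], [1.0, 4.0])  # 2.0 * 4.0 = 8.0
```

Nothing here needs a SparkContext, which is the point: shipping such utilities in a standalone jar would avoid both the copy-pasting and the heavy dependency tree.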

3) Currently, MLlib depends on GraphX, which means there is no way to use
MLlib's vector or matrix types inside GraphX. At Netflix, we implemented
parallel personalized PageRank, which requires a sparse vector as part of
its public API; since GraphX has no access to MLlib's basic types, we had
to use Breeze instead. Before we contribute it back to the open-source
community, we need to address this.
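For context on the algorithm mentioned above: personalized PageRank is a power iteration in which the teleport probability mass returns to a single source vertex, so the rank vector is naturally sparse around that source. A minimal single-machine sketch on a toy graph (this is only the textbook iteration, not the Netflix or GraphX implementation; it also assumes no dangling vertices):

```python
def personalized_pagerank(adj, source, alpha=0.15, iters=50):
    """Power iteration for personalized PageRank on an adjacency dict.
    With probability alpha the random walk teleports back to `source`."""
    nodes = list(adj)
    rank = {v: (1.0 if v == source else 0.0) for v in nodes}
    for _ in range(iters):
        # Teleport mass goes only to the source vertex, not spread uniformly.
        nxt = {v: (alpha if v == source else 0.0) for v in nodes}
        for v, neighbors in adj.items():
            if not neighbors:
                continue
            share = (1.0 - alpha) * rank[v] / len(neighbors)
            for u in neighbors:
                nxt[u] += share
        rank = nxt
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = personalized_pagerank(graph, "a")
```

The distributed version propagates these per-edge shares as messages between partitions, which is where sparse vectors in the public API come in.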

Sincerely,

DB Tsai
----------------------------------------------------------
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D

On Thu, Nov 12, 2015 at 3:42 AM, Sean Owen <so...@cloudera.com> wrote:

> This is all starting to sound a lot like what's already implemented in
> Java-based PMML parsing/scoring libraries like JPMML and OpenScoring. I'm
> not clear it helps a lot to reimplement this in Spark.
>
> On Thu, Nov 12, 2015 at 8:05 AM, Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>
>> +1 on that. It would be useful to use the model outside of Spark.
>>
>>
>> _____________________________
>> From: DB Tsai <dbt...@dbtsai.com>
>> Sent: Wednesday, November 11, 2015 11:57 PM
>> Subject: Re: thought experiment: use spark ML to real time prediction
>> To: Nirmal Fernando <nir...@wso2.com>
>> Cc: Andy Davidson <a...@santacruzintegration.com>, Adrian Tanase <
>> atan...@adobe.com>, user @spark <user@spark.apache.org>
>>
>>
>>
>> Do you think it would be useful to separate those models and the model
>> loader/writer code into a separate spark-ml-common jar, without any Spark
>> platform dependencies, so users can load models trained by Spark ML in
>> their applications and run predictions?
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> ----------------------------------------------------------
>> Web: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>>
>> On Wed, Nov 11, 2015 at 3:14 AM, Nirmal Fernando <nir...@wso2.com>
>> wrote:
>>
>>> As of now, we are basically serializing the ML model and then
>>> deserializing it for prediction in real time.
>>>
>>> On Wed, Nov 11, 2015 at 4:39 PM, Adrian Tanase <atan...@adobe.com>
>>> wrote:
>>>
>>>> I don’t think this answers your question, but here’s how you would
>>>> evaluate the model in real time in a streaming app:
>>>>
>>>> https://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/predict.html
>>>>
>>>> Maybe you can find a way to extract portions of MLlib and run them
>>>> outside of Spark – loading the precomputed model and calling .predict on
>>>> it…
>>>>
>>>> -adrian
>>>>
>>>> From: Andy Davidson
>>>> Date: Tuesday, November 10, 2015 at 11:31 PM
>>>> To: "user @spark"
>>>> Subject: thought experiment: use spark ML to real time prediction
>>>>
>>>> Let’s say I have used Spark ML to train a linear model. I know I can
>>>> save and load the model to disk. I am not sure how I can use the model in
>>>> a real time environment. For example, I do not think I can easily return a
>>>> “prediction” to the client using Spark Streaming. Also, for some
>>>> applications the extra latency created by the batch process might not be
>>>> acceptable.
>>>>
>>>> If I were not using Spark, I would re-implement the model I trained in
>>>> my batch environment in a language like Java and implement a REST service
>>>> that uses the model to create a prediction and return the prediction to
>>>> the client. Many models make predictions using linear algebra, and
>>>> implementing predictions is relatively easy if you have a good vectorized
>>>> LA package. Is there a way to use a model I trained with Spark ML outside
>>>> of Spark?
>>>>
>>>> As a motivating example: even if it’s possible to return data to the
>>>> client using Spark Streaming, I think the mini-batch latency would not be
>>>> acceptable for a high-frequency stock trading system.
>>>>
>>>> Kind regards
>>>>
>>>> Andy
>>>>
>>>> P.S. The examples I have seen so far use Spark Streaming to
>>>> “preprocess” predictions. For example, a recommender system might use what
>>>> current users are watching to calculate “trending recommendations”. These
>>>> are stored on disk and served up to users when they use the “movie guide”.
>>>> If a recommendation were a couple of minutes old, it would not affect the
>>>> end users’ experience.
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> Thanks & regards,
>>> Nirmal
>>>
>>> Team Lead - WSO2 Machine Learner
>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>> Mobile: +94715779733
>>> Blog: http://nirmalfdo.blogspot.com/
>>>
>>>
>>>
>>
>>
>>
>
