Re: Revisiting Online serving of Spark models?

Holden Karau Mon, 21 May 2018 14:53:42 -0700

I like that idea. I’ll be around Spark Summit.

On Mon, May 21, 2018 at 1:52 PM Joseph Bradley <jos...@databricks.com>
wrote:


> Regarding model reading and writing, I'll give quick thoughts here:
> * Our approach was to use the same format but write JSON instead of
> Parquet.  It's easier to parse JSON without Spark, and using the same
> format simplifies architecture.  Plus, some people want to check files into
> version control, and JSON is nice for that.
> * The reader/writer APIs could be extended to take format parameters (just
> like DataFrame reader/writers) to handle JSON (and maybe, eventually,
> handle Parquet in the online serving setting).
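
A minimal sketch of the JSON idea above, in plain Python with no Spark dependency. The file layout (`class`, `params`, `coefficients`, `intercept`) and the helper names `save_model`, `load_model` and `predict` are hypothetical and chosen for illustration; this is not the format Spark ML actually writes. The `fmt` argument only mirrors the suggestion that readers/writers take a format parameter.

```python
import json
import os
import tempfile


def save_model(path, model, fmt="json"):
    """Write a model description; only the JSON format is sketched here."""
    if fmt != "json":
        raise NotImplementedError("unsupported format: " + fmt)
    with open(path, "w", encoding="utf-8") as f:
        # Sorted keys and indentation keep diffs small under version control.
        json.dump(model, f, indent=2, sort_keys=True)
        f.write("\n")


def load_model(path, fmt="json"):
    """Read the model description back using only the standard library."""
    if fmt != "json":
        raise NotImplementedError("unsupported format: " + fmt)
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def predict(model, features):
    """Linear prediction: dot(coefficients, features) + intercept."""
    coefficients = model["coefficients"]
    if len(coefficients) != len(features):
        raise ValueError("feature vector has the wrong length")
    return sum(c * x for c, x in zip(coefficients, features)) + model["intercept"]


# Hypothetical layout, for illustration only.
example_model = {
    "class": "LinearRegressionModel",
    "params": {"featuresCol": "features", "predictionCol": "prediction"},
    "coefficients": [0.5, -2.0],
    "intercept": 1.0,
}

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "model.json")
        save_model(path, example_model)
        restored = load_model(path)
        print(predict(restored, [2.0, 1.0]))
```

The serving side here needs nothing beyond a JSON parser, which is the main attraction over Parquet for this use case.
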
>
> This would be a big project, so proposing a SPIP might be best.  If people
> are around at the Spark Summit, that could be a good time to meet up & then
> post notes back to the dev list.
>
> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>
>> Specifically I’d like to bring part of the discussion to Model and
>> PipelineModel, and various ModelReader and SharedReadWrite implementations
>> that rely on SparkContext. This is a big blocker on reusing trained models
>> outside of Spark for online serving.
>>
>> What’s the next step? Would folks be interested in getting together to
>> discuss/get some feedback?
>>
>>
>> _____________________________
>> From: Felix Cheung <felixcheun...@hotmail.com>
>> Sent: Thursday, May 10, 2018 10:10 AM
>> Subject: Re: Revisiting Online serving of Spark models?
>> To: Holden Karau <hol...@pigscanfly.ca>, Joseph Bradley <
>> jos...@databricks.com>
>> Cc: dev <dev@spark.apache.org>
>>
>>
>>
>> Huge +1 on this!
>>
>> ------------------------------
>> *From:* holden.ka...@gmail.com <holden.ka...@gmail.com> on behalf of
>> Holden Karau <hol...@pigscanfly.ca>
>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>> *To:* Joseph Bradley
>> *Cc:* dev
>> *Subject:* Re: Revisiting Online serving of Spark models?
>>
>>
>>
>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jos...@databricks.com>
>> wrote:
>>
>>> Thanks for bringing this up Holden!  I'm a strong supporter of this.
>>
>> Awesome! I'm glad other folks think something like this belongs in Spark.
>>
>>> This was one of the original goals for mllib-local: to have local
>>> versions of MLlib models which could be deployed without the big Spark JARs
>>> and without a SparkContext or SparkSession.  There are related commercial
>>> offerings like this : ) but the overhead of maintaining those offerings is
>>> pretty high.  Building good APIs within MLlib to avoid copying logic across
>>> libraries will be well worth it.
>>>
>>> We've talked about this need at Databricks and have also been syncing
>>> with the creators of MLeap.  It'd be great to get this functionality into
>>> Spark itself.  Some thoughts:
>>> * It'd be valuable to have this go beyond adding transform() methods
>>> taking a Row to the current Models.  Instead, it would be ideal to have
>>> local, lightweight versions of models in mllib-local, outside of the main
>>> mllib package (for easier deployment with smaller & fewer dependencies).
>>> * Supporting Pipelines is important.  For this, it would be ideal to
>>> utilize elements of Spark SQL, particularly Rows and Types, which could be
>>> moved into a local sql package.
>>> * As things stand, this architecture may require some awkward APIs to have
>>> model prediction logic in mllib-local, local model classes in mllib-local,
>>> and regular (DataFrame-friendly) model classes in mllib.  We might find it
>>> helpful to break some DeveloperApis in Spark 3.0 to facilitate this
>>> architecture while making it feasible for 3rd party developers to extend
>>> MLlib APIs (especially in Java).
>>>
>> I agree this could be interesting, and feed into the other discussion
>> around when (or if) we should be considering Spark 3.0.
>> I _think_ we could probably do it with optional traits people could mix
>> in to avoid breaking the current APIs but I could be wrong on that point.
>>
>>> * It could also be worth discussing local DataFrames.  They might not be
>>> as important as per-Row transformations, but they would be helpful for
>>> batching for higher throughput.
>>>
>> That could be interesting as well.
>>
>>>
>>> I'll be interested to hear others' thoughts too!
>>>
>>> Joseph
>>>
>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <hol...@pigscanfly.ca>
>>> wrote:
>>>
>>>> Hi y'all,
>>>>
>>>> With the renewed interest in ML in Apache Spark, now seems like as good a
>>>> time as any to revisit the online serving situation in Spark ML. DB &
>>>> others have done some excellent work moving a lot of the necessary
>>>> tools into a local linear algebra package that doesn't depend on having a
>>>> SparkContext.
>>>>
>>>> There are a few different commercial and non-commercial solutions around
>>>> this, but currently our individual transform/predict methods are private,
>>>> so those projects need to either copy or re-implement them (or put
>>>> themselves in org.apache.spark to access them). How would folks feel about
>>>> adding a new trait that ML pipeline stages could expose for transforming
>>>> single-element inputs (or local collections), optionally implemented by
>>>> the stages which support this? That way we can have less copy-and-paste
>>>> code possibly getting out of sync with our model training.
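
To make the proposed shape concrete, here is a rough sketch in plain Python (the real thing would be a Scala trait in Spark ML). Every name in it (`LocalTransformer`, `transform_one`, `transform_local`, `LocalPipeline`, and the toy `Scaler` and `Threshold` stages) is hypothetical and invented for illustration; none of them exist in Spark. Rows are modeled as plain dicts, which is also an assumption.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


class LocalTransformer(ABC):
    """Hypothetical opt-in interface for stages that can run without Spark."""

    @abstractmethod
    def transform_one(self, row: Row) -> Row:
        """Return a new row with this stage's output column added."""

    def transform_local(self, rows: Iterable[Row]) -> List[Row]:
        """Default for local collections: apply transform_one to each row."""
        return [self.transform_one(row) for row in rows]


class Scaler(LocalTransformer):
    """Toy stage: multiplies one numeric column by a fixed factor."""

    def __init__(self, input_col: str, output_col: str, factor: float):
        self.input_col = input_col
        self.output_col = output_col
        self.factor = factor

    def transform_one(self, row: Row) -> Row:
        out = dict(row)  # copy so the caller's row is left untouched
        out[self.output_col] = row[self.input_col] * self.factor
        return out


class Threshold(LocalTransformer):
    """Toy model: predicts 1.0 when the input column exceeds a threshold."""

    def __init__(self, input_col: str, output_col: str, threshold: float):
        self.input_col = input_col
        self.output_col = output_col
        self.threshold = threshold

    def transform_one(self, row: Row) -> Row:
        out = dict(row)
        out[self.output_col] = 1.0 if row[self.input_col] > self.threshold else 0.0
        return out


class LocalPipeline(LocalTransformer):
    """Chains stages; rejects any stage that has not opted in."""

    def __init__(self, stages: List[Any]):
        unsupported = [s for s in stages if not isinstance(s, LocalTransformer)]
        if unsupported:
            raise TypeError("stages without local support: %r" % unsupported)
        self.stages = list(stages)

    def transform_one(self, row: Row) -> Row:
        for stage in self.stages:
            row = stage.transform_one(row)
        return row


if __name__ == "__main__":
    pipeline = LocalPipeline([
        Scaler("x", "scaled", 2.0),
        Threshold("scaled", "prediction", 5.0),
    ])
    print(pipeline.transform_one({"x": 3.0}))
    print(pipeline.transform_local([{"x": 1.0}, {"x": 4.0}]))
```

Because the interface is optional, a pipeline containing a stage that does not implement it fails fast at construction time rather than at serving time.
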
>>>>
>>>> I think continuing to have on-line serving grow in different projects
>>>> is probably the right path forward (folks have different needs), but I'd
>>>> love to see us make it simpler for other projects to build reliable serving
>>>> tools.
>>>>
>>>> I realize this may put some of the folks in an awkward position with
>>>> their own commercial offerings, but hopefully if we make it easier for
>>>> everyone the commercial vendors can benefit as well.
>>>>
>>>> Cheers,
>>>>
>>>> Holden :)
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> Joseph Bradley
>>>
>>> Software Engineer - Machine Learning
>>>
>>> Databricks, Inc.
>>>
>>> [image: http://databricks.com] <http://databricks.com/>
>>>
>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>>
>>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> [image: http://databricks.com] <http://databricks.com/>
>
-- 
Twitter: https://twitter.com/holdenkarau
