On Sat, Jun 2, 2018 at 8:39 PM, Maximiliano Felice <maximilianofel...@gmail.com> wrote:
> Hi!
>
> We're already in San Francisco waiting for the summit. We even think that we spotted @holdenk this afternoon.

Unless you happened to be walking by my garage, probably not super likely; I spent the day working on scooters/motorcycles (my style is a little less unique in SF :)). Also, if you see me, feel free to say hi, unless I look like I haven't had my first coffee of the day. I love chatting with folks IRL :)

> @chris, we're really interested in the Meetup you're hosting. My team will probably join it from the beginning if you have room for us, and I'll join it later after discussing the topics on this thread. I'll send you an email regarding this request.
>
> Thanks
>
> On Fri, Jun 1, 2018 at 7:26 AM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
>
>> @Chris This sounds fantastic, please send summary notes for Seattle folks.
>>
>> @Felix I work in downtown Seattle; I'm wondering if we should do a tech meetup around model serving in Spark at my work or somewhere else close by. Thoughts? I'm actually in the midst of building microservices to manage models, and when I say models I mean much more than machine learning models (think OR and process models as well).
>>
>> Regards
>>
>> Sent from my iPhone
>>
>> On May 31, 2018, at 10:32 PM, Chris Fregly <ch...@fregly.com> wrote:
>>
>> Hey everyone!
>>
>> @Felix: thanks for putting this together. I sent some of you a quick calendar event - mostly for me, so I don't forget! :)
>>
>> Coincidentally, this is the focus of the *Advanced Spark and TensorFlow Meetup* @5:30pm on June 6th (same night) here in SF!
>>
>> Everybody is welcome to come.
>> Here's the link to the meetup, which includes the signup link: https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/
>>
>> We have an awesome lineup of speakers covering a lot of deep, technical ground.
>>
>> For those who can't attend in person, we'll be broadcasting live - and posting the recording afterward.
>>
>> All details are in the meetup link above…
>>
>> @holden/felix/nick/joseph/maximiliano/saikat/leif: you're more than welcome to give a talk. I can move things around to make room.
>>
>> @joseph: I'd personally like an update on the direction of the Databricks proprietary ML Serving export format, which is similar to PMML but not a standard in any way.
>>
>> Also, the Databricks ML Serving Runtime is only available to Databricks customers. This seems in conflict with the community efforts described here. Can you comment on behalf of Databricks?
>>
>> Look forward to your response, Joseph.
>>
>> See you all soon!
>>
>> --
>> *Chris Fregly*
>> Founder @ *PipelineAI* <https://pipeline.ai/> (100,000 Users)
>> Organizer @ *Advanced Spark and TensorFlow Meetup* <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/> (85,000 Global Members)
>>
>> *San Francisco - Chicago - Austin - Washington DC - London - Dusseldorf*
>> *Try our PipelineAI Community Edition with GPUs and TPUs! <http://community.pipeline.ai/>*
>>
>> On May 30, 2018, at 9:32 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>
>> Hi!
>>
>> Thank you!
>> Let's meet then:
>>
>> June 6, 4pm
>> Moscone West Convention Center
>> 800 Howard Street, San Francisco, CA 94103
>>
>> Ground floor (outside of the conference area - should be accessible to all). We will meet and decide where to go.
>>
>> (I won't send an invite because that would be too much noise for dev@.)
>>
>> To paraphrase Joseph, we will use this to kick off the discussion, post notes afterward, and follow up online. As for Seattle, I would be very interested to meet in person later and discuss ;)
>>
>> _____________________________
>> From: Saikat Kanjilal <sxk1...@hotmail.com>
>> Sent: Tuesday, May 29, 2018 11:46 AM
>> Subject: Re: Revisiting Online serving of Spark models?
>> To: Maximiliano Felice <maximilianofel...@gmail.com>
>> Cc: Felix Cheung <felixcheun...@hotmail.com>, Holden Karau <hol...@pigscanfly.ca>, Joseph Bradley <jos...@databricks.com>, Leif Walsh <leif.wa...@gmail.com>, dev <dev@spark.apache.org>
>>
>> Would love to join, but I'm in Seattle. Thoughts on how to make this work?
>>
>> Regards
>>
>> Sent from my iPhone
>>
>> On May 29, 2018, at 10:35 AM, Maximiliano Felice <maximilianofel...@gmail.com> wrote:
>>
>> Big +1 to a meeting with fresh air.
>>
>> Could anyone send the invites? I don't really know which place Holden is talking about.
>>
>> 2018-05-29 14:27 GMT-03:00 Felix Cheung <felixcheun...@hotmail.com>:
>>
>>> You had me at blue bottle!
>>>
>>> _____________________________
>>> From: Holden Karau <hol...@pigscanfly.ca>
>>> Sent: Tuesday, May 29, 2018 9:47 AM
>>> Subject: Re: Revisiting Online serving of Spark models?
>>> To: Felix Cheung <felixcheun...@hotmail.com>
>>> Cc: Saikat Kanjilal <sxk1...@hotmail.com>, Maximiliano Felice <maximilianofel...@gmail.com>, Joseph Bradley <jos...@databricks.com>, Leif Walsh <leif.wa...@gmail.com>, dev <dev@spark.apache.org>
>>>
>>> I'm down for that. We could all go for a walk, maybe to the Mint Plaza Blue Bottle, and grab coffee (and if the weather holds, have our design meeting outside :p)?
>>>
>>> On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>
>>>> Bump.
>>>>
>>>> ------------------------------
>>>> *From:* Felix Cheung <felixcheun...@hotmail.com>
>>>> *Sent:* Saturday, May 26, 2018 1:05:29 PM
>>>> *To:* Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
>>>> *Cc:* Leif Walsh; Holden Karau; dev
>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>
>>>> Hi! How about we meet with the community and discuss on June 6 at 4pm at (or near) the Summit?
>>>>
>>>> (I propose we meet at the venue entrance so we can accommodate people who might not be in the conference.)
>>>>
>>>> ------------------------------
>>>> *From:* Saikat Kanjilal <sxk1...@hotmail.com>
>>>> *Sent:* Tuesday, May 22, 2018 7:47:07 AM
>>>> *To:* Maximiliano Felice
>>>> *Cc:* Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>
>>>> I'm in the exact same boat as Maximiliano. I have use cases for model serving as well and would love to join this discussion.
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On May 22, 2018, at 6:39 AM, Maximiliano Felice <maximilianofel...@gmail.com> wrote:
>>>>
>>>> Hi!
>>>>
>>>> I don't usually write a lot on this list, but I keep up to date with the discussions and I'm a heavy user of Spark. This topic caught my attention, as we're currently facing this issue at work. I'm attending the summit and was wondering if it would be possible for me to join that meeting.
>>>> I might be able to share some helpful use cases and ideas.
>>>>
>>>> Thanks,
>>>> Maximiliano Felice
>>>>
>>>> On Tue, May 22, 2018 at 9:14 AM, Leif Walsh <leif.wa...@gmail.com> wrote:
>>>>
>>>>> I'm with you on JSON being more readable than Parquet, but we've had success using pyarrow's Parquet reader and have been quite happy with it so far. If your target is Python (and probably, if not now then soon, R), you should look into it.
>>>>>
>>>>> On Mon, May 21, 2018 at 16:52 Joseph Bradley <jos...@databricks.com> wrote:
>>>>>
>>>>>> Regarding model reading and writing, I'll give quick thoughts here:
>>>>>> * Our approach was to use the same format but write JSON instead of Parquet. It's easier to parse JSON without Spark, and using the same format simplifies the architecture. Plus, some people want to check files into version control, and JSON is nice for that.
>>>>>> * The reader/writer APIs could be extended to take format parameters (just like DataFrame readers/writers) to handle JSON (and maybe, eventually, handle Parquet in the online serving setting).
>>>>>>
>>>>>> This would be a big project, so proposing a SPIP might be best. If people are around at the Spark Summit, that could be a good time to meet up and then post notes back to the dev list.
>>>>>>
>>>>>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>>>>
>>>>>>> Specifically, I'd like to bring part of the discussion to Model and PipelineModel, and the various ModelReader and SharedReadWrite implementations that rely on SparkContext. This is a big blocker on reusing trained models outside of Spark for online serving.
>>>>>>>
>>>>>>> What's the next step? Would folks be interested in getting together to discuss/get some feedback?
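[To make the JSON-instead-of-Parquet point above concrete: the sketch below shows why JSON model metadata is easy to consume without Spark, using only the Python standard library. The field names and values here are illustrative assumptions, not the actual Spark ML persistence schema.]

```python
import json

# Hypothetical metadata for a fitted model, loosely modeled on the kind
# of information Spark ML persists (class name, uid, fitted params).
# The exact schema here is an illustrative assumption.
metadata = {
    "class": "org.apache.spark.ml.classification.LogisticRegressionModel",
    "uid": "logreg_example",
    "paramMap": {"regParam": 0.01, "maxIter": 100},
}

# JSON round-trips with nothing but the standard library -- the property
# that makes it easy to parse outside Spark, and diff-friendly enough to
# check into version control.
serialized = json.dumps(metadata, indent=2, sort_keys=True)
restored = json.loads(serialized)

assert restored["paramMap"]["regParam"] == 0.01
```

A serving process written in any language with a JSON parser could load this without pulling in the Spark JARs, which is the architectural upside being discussed.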
>>>>>>> _____________________________
>>>>>>> From: Felix Cheung <felixcheun...@hotmail.com>
>>>>>>> Sent: Thursday, May 10, 2018 10:10 AM
>>>>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>>>>> To: Holden Karau <hol...@pigscanfly.ca>, Joseph Bradley <jos...@databricks.com>
>>>>>>> Cc: dev <dev@spark.apache.org>
>>>>>>>
>>>>>>> Huge +1 on this!
>>>>>>>
>>>>>>> ------------------------------
>>>>>>> *From:* holden.ka...@gmail.com <holden.ka...@gmail.com> on behalf of Holden Karau <hol...@pigscanfly.ca>
>>>>>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>>>>>>> *To:* Joseph Bradley
>>>>>>> *Cc:* dev
>>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>>>
>>>>>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jos...@databricks.com> wrote:
>>>>>>>
>>>>>>>> Thanks for bringing this up, Holden! I'm a strong supporter of this.
>>>>>>>
>>>>>>> Awesome! I'm glad other folks think something like this belongs in Spark.
>>>>>>>
>>>>>>>> This was one of the original goals for mllib-local: to have local versions of MLlib models which could be deployed without the big Spark JARs and without a SparkContext or SparkSession. There are related commercial offerings like this :) but the overhead of maintaining those offerings is pretty high. Building good APIs within MLlib to avoid copying logic across libraries will be well worth it.
>>>>>>>>
>>>>>>>> We've talked about this need at Databricks and have also been syncing with the creators of MLeap. It'd be great to get this functionality into Spark itself. Some thoughts:
>>>>>>>> * It'd be valuable to have this go beyond adding transform() methods taking a Row to the current Models.
>>>>>>>> Instead, it would be ideal to have local, lightweight versions of models in mllib-local, outside of the main mllib package (for easier deployment with smaller and fewer dependencies).
>>>>>>>> * Supporting Pipelines is important. For this, it would be ideal to utilize elements of Spark SQL, particularly Rows and Types, which could be moved into a local sql package.
>>>>>>>> * This architecture may currently require some awkward APIs, with model prediction logic in mllib-local, local model classes in mllib-local, and regular (DataFrame-friendly) model classes in mllib. We might find it helpful to break some DeveloperApis in Spark 3.0 to facilitate this architecture while making it feasible for 3rd-party developers to extend MLlib APIs (especially in Java).
>>>>>>>
>>>>>>> I agree this could be interesting, and it feeds into the other discussion around when (or if) we should be considering Spark 3.0. I _think_ we could probably do it with optional traits people could mix in to avoid breaking the current APIs, but I could be wrong on that point.
>>>>>>>
>>>>>>>> * It could also be worth discussing local DataFrames. They might not be as important as per-Row transformations, but they would be helpful for batching for higher throughput.
>>>>>>>
>>>>>>> That could be interesting as well.
>>>>>>>
>>>>>>>> I'll be interested to hear others' thoughts too!
>>>>>>>>
>>>>>>>> Joseph
>>>>>>>>
>>>>>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <hol...@pigscanfly.ca> wrote:
>>>>>>>>
>>>>>>>>> Hi y'all,
>>>>>>>>>
>>>>>>>>> With the renewed interest in ML in Apache Spark, now seems like as good a time as any to revisit the online serving situation in Spark ML.
>>>>>>>>> DB and others have done some excellent work moving a lot of the necessary tools into a local linear algebra package that doesn't depend on having a SparkContext.
>>>>>>>>>
>>>>>>>>> There are a few different commercial and non-commercial solutions around this, but currently our individual transform/predict methods are private, so those solutions either need to copy or re-implement them (or put themselves in org.apache.spark) to access them. How would folks feel about adding a new trait for ML pipeline stages to expose transformation of single-element inputs (or local collections), which could be optionally implemented by stages that support this? That way we can have less copy-and-paste code that could get out of sync with our model training.
>>>>>>>>>
>>>>>>>>> I think continuing to have online serving grow in different projects is probably the right path forward (folks have different needs), but I'd love to see us make it simpler for other projects to build reliable serving tools.
>>>>>>>>>
>>>>>>>>> I realize this may put some folks in an awkward position with their own commercial offerings, but hopefully if we make it easier for everyone, the commercial vendors can benefit as well.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Holden :)
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>
>>>>>>>> --
>>>>>>>> Joseph Bradley
>>>>>>>> Software Engineer - Machine Learning
>>>>>>>> Databricks, Inc.
>>>>>
>>>>> --
>>>>> Cheers,
>>>>> Leif

--
Twitter: https://twitter.com/holdenkarau
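[The optional single-element-transform trait Holden proposes in this thread could take roughly the following shape. This is a hedged sketch only, written in Python for brevity (Spark's actual pipeline APIs are Scala/JVM); the names `LocalTransformSupport`, `transform_local`, and `ScalingStage` are invented for illustration and are not Spark APIs.]

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class LocalTransformSupport(ABC):
    """Optional mixin for pipeline stages that can transform a single
    element (or a small local collection) without a SparkContext.
    Class and method names here are illustrative, not Spark APIs."""

    @abstractmethod
    def transform_local(self, row: Dict[str, Any]) -> Dict[str, Any]:
        """Transform one input row locally."""

    def transform_local_batch(
        self, rows: List[Dict[str, Any]]
    ) -> List[Dict[str, Any]]:
        # Default batch path defined in terms of the single-element path,
        # so stages only have to implement transform_local.
        return [self.transform_local(r) for r in rows]


class ScalingStage(LocalTransformSupport):
    """Toy stage: multiplies the 'feature' column by a fitted scale."""

    def __init__(self, scale: float) -> None:
        self.scale = scale

    def transform_local(self, row: Dict[str, Any]) -> Dict[str, Any]:
        out = dict(row)
        out["feature"] = row["feature"] * self.scale
        return out


stage = ScalingStage(scale=2.0)
result = stage.transform_local_batch([{"feature": 1.5}, {"feature": 3.0}])
# result is [{"feature": 3.0}, {"feature": 6.0}]
```

Because the mixin is opt-in, stages that cannot meaningfully serve single elements simply don't implement it, which mirrors the "optionally implemented by stages which support this" framing in the thread.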