We’re by the registration sign and are going to start walking over at 4:05.

On Wed, Jun 6, 2018 at 2:43 PM Maximiliano Felice <maximilianofel...@gmail.com> wrote:
> Hi!
>
> Do we meet at the entrance?
>
> See you
>
> On Tue, Jun 5, 2018 at 3:07 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>
>> I will aim to join up at 4pm tomorrow (Wed) too. Look forward to it.
>>
>> On Sun, 3 Jun 2018 at 00:24 Holden Karau <hol...@pigscanfly.ca> wrote:
>>
>>> On Sat, Jun 2, 2018 at 8:39 PM, Maximiliano Felice <maximilianofel...@gmail.com> wrote:
>>>
>>>> Hi!
>>>>
>>>> We're already in San Francisco waiting for the summit. We even think that we spotted @holdenk this afternoon.
>>>>
>>> Unless you happened to be walking by my garage, probably not super likely; I spent the day working on scooters/motorcycles (my style is a little less unique in SF :)). Also, if you see me, feel free to say hi unless I look like I haven't had my first coffee of the day. Love chatting with folks IRL :)
>>>>
>>>> @chris, we're really interested in the Meetup you're hosting. My team will probably join it from the beginning if you have room for us, and I'll join it later after discussing the topics on this thread. I'll send you an email regarding this request.
>>>>
>>>> Thanks
>>>>
>>>> On Fri, Jun 1, 2018 at 7:26 AM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
>>>>
>>>>> @Chris This sounds fantastic, please send summary notes for Seattle folks.
>>>>>
>>>>> @Felix I work in downtown Seattle and am wondering if we should host a tech meetup around model serving in Spark at my work or somewhere else close by. Thoughts? I'm actually in the midst of building microservices to manage models, and when I say models I mean much more than machine learning models (think OR and process models as well).
>>>>>
>>>>> Regards
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On May 31, 2018, at 10:32 PM, Chris Fregly <ch...@fregly.com> wrote:
>>>>>
>>>>> Hey everyone!
>>>>>
>>>>> @Felix: thanks for putting this together. I sent some of you a quick calendar event - mostly for me, so I don't forget!
>>>>> :)
>>>>>
>>>>> Coincidentally, this is the focus of the June 6th *Advanced Spark and TensorFlow Meetup* <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/> @5:30pm (same night) here in SF!
>>>>>
>>>>> Everybody is welcome to come. Here's the link to the meetup, which includes the signup link: https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/
>>>>>
>>>>> We have an awesome lineup of speakers covering a lot of deep, technical ground.
>>>>>
>>>>> For those who can't attend in person, we'll be broadcasting live - and posting the recording afterward.
>>>>>
>>>>> All details are in the meetup link above…
>>>>>
>>>>> @holden/felix/nick/joseph/maximiliano/saikat/leif: you're more than welcome to give a talk. I can move things around to make room.
>>>>>
>>>>> @joseph: I'd personally like an update on the direction of the Databricks proprietary ML Serving export format, which is similar to PMML but not a standard in any way.
>>>>>
>>>>> Also, the Databricks ML Serving Runtime is only available to Databricks customers. This seems in conflict with the community efforts described here. Can you comment on behalf of Databricks?
>>>>>
>>>>> Look forward to your response, Joseph.
>>>>>
>>>>> See you all soon!
>>>>>
>>>>> —
>>>>>
>>>>> *Chris Fregly*, Founder @ *PipelineAI* <https://pipeline.ai/> (100,000 Users)
>>>>> Organizer @ *Advanced Spark and TensorFlow Meetup* <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/> (85,000 Global Members)
>>>>>
>>>>> *San Francisco - Chicago - Austin - Washington DC - London - Dusseldorf*
>>>>> *Try our PipelineAI Community Edition with GPUs and TPUs!!
>>>>> <http://community.pipeline.ai/>*
>>>>>
>>>>> On May 30, 2018, at 9:32 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>>>
>>>>> Hi!
>>>>>
>>>>> Thank you! Let's meet then.
>>>>>
>>>>> June 6, 4pm
>>>>>
>>>>> Moscone West Convention Center
>>>>> 800 Howard Street, San Francisco, CA 94103 <https://maps.google.com/?q=800+Howard+Street,+San+Francisco,+CA+94103&entry=gmail&source=g>
>>>>>
>>>>> Ground floor (outside of the conference area - should be accessible to all) - we will meet and decide where to go.
>>>>>
>>>>> (I won't send an invite because that would be too much noise for dev@.)
>>>>>
>>>>> To paraphrase Joseph, we will use this to kick off the discussion, post notes afterward, and follow up online. As for Seattle, I would be very interested to meet in person later and discuss ;)
>>>>>
>>>>> _____________________________
>>>>> From: Saikat Kanjilal <sxk1...@hotmail.com>
>>>>> Sent: Tuesday, May 29, 2018 11:46 AM
>>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>>> To: Maximiliano Felice <maximilianofel...@gmail.com>
>>>>> Cc: Felix Cheung <felixcheun...@hotmail.com>, Holden Karau <hol...@pigscanfly.ca>, Joseph Bradley <jos...@databricks.com>, Leif Walsh <leif.wa...@gmail.com>, dev <dev@spark.apache.org>
>>>>>
>>>>> Would love to join but am in Seattle. Thoughts on how to make this work?
>>>>>
>>>>> Regards
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On May 29, 2018, at 10:35 AM, Maximiliano Felice <maximilianofel...@gmail.com> wrote:
>>>>>
>>>>> Big +1 to a meeting with fresh air.
>>>>>
>>>>> Could anyone send the invites? I don't really know which place Holden is talking about.
>>>>>
>>>>> 2018-05-29 14:27 GMT-03:00 Felix Cheung <felixcheun...@hotmail.com>:
>>>>>
>>>>>> You had me at blue bottle!
>>>>>>
>>>>>> _____________________________
>>>>>> From: Holden Karau <hol...@pigscanfly.ca>
>>>>>> Sent: Tuesday, May 29, 2018 9:47 AM
>>>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>>>> To: Felix Cheung <felixcheun...@hotmail.com>
>>>>>> Cc: Saikat Kanjilal <sxk1...@hotmail.com>, Maximiliano Felice <maximilianofel...@gmail.com>, Joseph Bradley <jos...@databricks.com>, Leif Walsh <leif.wa...@gmail.com>, dev <dev@spark.apache.org>
>>>>>>
>>>>>> I'm down for that. We could all go for a walk, maybe to the Mint Plaza Blue Bottle, and grab coffee (and if the weather holds, have our design meeting outside :p)?
>>>>>>
>>>>>> On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>>>>
>>>>>>> Bump.
>>>>>>>
>>>>>>> ------------------------------
>>>>>>> *From:* Felix Cheung <felixcheun...@hotmail.com>
>>>>>>> *Sent:* Saturday, May 26, 2018 1:05:29 PM
>>>>>>> *To:* Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
>>>>>>> *Cc:* Leif Walsh; Holden Karau; dev
>>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>>>
>>>>>>> Hi! How about we meet with the community and discuss on June 6 at 4pm at (near) the Summit?
>>>>>>>
>>>>>>> (I propose we meet at the venue entrance so we can accommodate people who might not be in the conference.)
>>>>>>>
>>>>>>> ------------------------------
>>>>>>> *From:* Saikat Kanjilal <sxk1...@hotmail.com>
>>>>>>> *Sent:* Tuesday, May 22, 2018 7:47:07 AM
>>>>>>> *To:* Maximiliano Felice
>>>>>>> *Cc:* Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
>>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>>>
>>>>>>> I'm in the exact same boat as Maximiliano, have use cases for model serving as well, and would love to join this discussion.
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>> On May 22, 2018, at 6:39 AM, Maximiliano Felice <maximilianofel...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi!
>>>>>>>
>>>>>>> I don't usually write a lot on this list, but I keep up to date with the discussions and I'm a heavy user of Spark. This topic caught my attention, as we're currently facing this issue at work. I'm attending the summit and was wondering if it would be possible for me to join that meeting. I might be able to share some helpful use cases and ideas.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Maximiliano Felice
>>>>>>>
>>>>>>> On Tue, May 22, 2018 at 9:14 AM, Leif Walsh <leif.wa...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I'm with you on JSON being more readable than Parquet, but we've had success using pyarrow's Parquet reader and have been quite happy with it so far. If your target is Python (and probably, if not now then soon, R), you should look into it.
>>>>>>>>
>>>>>>>> On Mon, May 21, 2018 at 16:52 Joseph Bradley <jos...@databricks.com> wrote:
>>>>>>>>
>>>>>>>>> Regarding model reading and writing, I'll give quick thoughts here:
>>>>>>>>> * Our approach was to use the same format but write JSON instead of Parquet. It's easier to parse JSON without Spark, and using the same format simplifies the architecture. Plus, some people want to check files into version control, and JSON is nice for that.
>>>>>>>>> * The reader/writer APIs could be extended to take format parameters (just like DataFrame readers/writers) to handle JSON (and maybe, eventually, handle Parquet in the online serving setting).
>>>>>>>>>
>>>>>>>>> This would be a big project, so proposing a SPIP might be best.
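[To make the write-the-same-format-as-JSON idea above concrete, here is a minimal, dependency-free sketch. Everything in it is hypothetical: `LinearModelParams` and `ModelJsonExport` are made-up names, not actual Spark ML APIs, and real code would use a JSON library rather than hand-rolled serialization.]

```java
// Hypothetical sketch: model parameters serialized as JSON so they can
// be parsed without Spark or a SparkContext. Not a real Spark ML API.
import java.util.Arrays;
import java.util.stream.Collectors;

final class LinearModelParams {
    final double[] coefficients;
    final double intercept;

    LinearModelParams(double[] coefficients, double intercept) {
        this.coefficients = coefficients;
        this.intercept = intercept;
    }

    // Hand-rolled JSON to keep the sketch dependency-free; production
    // code would use a proper JSON library.
    String toJson() {
        String coefs = Arrays.stream(coefficients)
                .mapToObj(Double::toString)
                .collect(Collectors.joining(",", "[", "]"));
        return "{\"coefficients\":" + coefs + ",\"intercept\":" + intercept + "}";
    }
}

public class ModelJsonExport {
    public static void main(String[] args) {
        LinearModelParams m = new LinearModelParams(new double[]{0.5, -1.25}, 2.0);
        System.out.println(m.toJson());
        // prints {"coefficients":[0.5,-1.25],"intercept":2.0}
    }
}
```

[The point of the sketch is that the same metadata Spark writes as Parquet could be emitted in a human-readable, version-control-friendly form.]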
>>>>>>>>> If people are around at the Spark Summit, that could be a good time to meet up and then post notes back to the dev list.
>>>>>>>>>
>>>>>>>>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Specifically, I'd like to bring part of the discussion to Model and PipelineModel, and the various ModelReader and SharedReadWrite implementations that rely on SparkContext. This is a big blocker on reusing trained models outside of Spark for online serving.
>>>>>>>>>>
>>>>>>>>>> What's the next step? Would folks be interested in getting together to discuss/get some feedback?
>>>>>>>>>>
>>>>>>>>>> _____________________________
>>>>>>>>>> From: Felix Cheung <felixcheun...@hotmail.com>
>>>>>>>>>> Sent: Thursday, May 10, 2018 10:10 AM
>>>>>>>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>>>>>>>> To: Holden Karau <hol...@pigscanfly.ca>, Joseph Bradley <jos...@databricks.com>
>>>>>>>>>> Cc: dev <dev@spark.apache.org>
>>>>>>>>>>
>>>>>>>>>> Huge +1 on this!
>>>>>>>>>>
>>>>>>>>>> ------------------------------
>>>>>>>>>> *From:* holden.ka...@gmail.com <holden.ka...@gmail.com> on behalf of Holden Karau <hol...@pigscanfly.ca>
>>>>>>>>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>>>>>>>>>> *To:* Joseph Bradley
>>>>>>>>>> *Cc:* dev
>>>>>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>>>>>>
>>>>>>>>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jos...@databricks.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for bringing this up, Holden! I'm a strong supporter of this.
>>>>>>>>>>>
>>>>>>>>>> Awesome! I'm glad other folks think something like this belongs in Spark.
>>>>>>>>>>
>>>>>>>>>>> This was one of the original goals for mllib-local: to have local versions of MLlib models which could be deployed without the big Spark JARs and without a SparkContext or SparkSession. There are related commercial offerings like this :) but the overhead of maintaining those offerings is pretty high. Building good APIs within MLlib to avoid copying logic across libraries will be well worth it.
>>>>>>>>>>>
>>>>>>>>>>> We've talked about this need at Databricks and have also been syncing with the creators of MLeap. It'd be great to get this functionality into Spark itself. Some thoughts:
>>>>>>>>>>> * It'd be valuable to have this go beyond adding transform() methods taking a Row to the current Models. Instead, it would be ideal to have local, lightweight versions of models in mllib-local, outside of the main mllib package (for easier deployment with smaller & fewer dependencies).
>>>>>>>>>>> * Supporting Pipelines is important. For this, it would be ideal to utilize elements of Spark SQL, particularly Rows and Types, which could be moved into a local sql package.
>>>>>>>>>>> * This architecture may currently require some awkward APIs to have model prediction logic in mllib-local, local model classes in mllib-local, and regular (DataFrame-friendly) model classes in mllib. We might find it helpful to break some DeveloperApis in Spark 3.0 to facilitate this architecture while making it feasible for 3rd-party developers to extend MLlib APIs (especially in Java).
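[A minimal sketch of the "local, lightweight model" idea above, assuming made-up names: `LocalRow` and `LocalLinearModel` are illustrative stand-ins, not real mllib-local classes. The point is that nothing here touches a SparkContext, a DataFrame, or the main mllib package.]

```java
// Hypothetical sketch of a local model in the spirit of mllib-local:
// plain Java, no cluster dependencies, single-row prediction.
final class LocalRow {
    final double[] features;
    LocalRow(double[] features) { this.features = features; }
}

final class LocalLinearModel {
    private final double[] coefficients;
    private final double intercept;

    LocalLinearModel(double[] coefficients, double intercept) {
        this.coefficients = coefficients;
        this.intercept = intercept;
    }

    // Single-row prediction: dot(coefficients, features) + intercept.
    double predict(LocalRow row) {
        if (row.features.length != coefficients.length) {
            throw new IllegalArgumentException("feature length mismatch");
        }
        double acc = intercept;
        for (int i = 0; i < coefficients.length; i++) {
            acc += coefficients[i] * row.features[i];
        }
        return acc;
    }
}

public class LocalModelDemo {
    public static void main(String[] args) {
        LocalLinearModel m = new LocalLinearModel(new double[]{2.0, 3.0}, 1.0);
        System.out.println(m.predict(new LocalRow(new double[]{1.0, 1.0})));
        // prints 6.0
    }
}
```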
>>>>>>>>>>
>>>>>>>>>> I agree this could be interesting, and it feeds into the other discussion around when (or if) we should be considering Spark 3.0. I _think_ we could probably do it with optional traits people could mix in to avoid breaking the current APIs, but I could be wrong on that point.
>>>>>>>>>>
>>>>>>>>>>> * It could also be worth discussing local DataFrames. They might not be as important as per-Row transformations, but they would be helpful for batching for higher throughput.
>>>>>>>>>>>
>>>>>>>>>> That could be interesting as well.
>>>>>>>>>>
>>>>>>>>>>> I'll be interested to hear others' thoughts too!
>>>>>>>>>>>
>>>>>>>>>>> Joseph
>>>>>>>>>>>
>>>>>>>>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <hol...@pigscanfly.ca> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi y'all,
>>>>>>>>>>>>
>>>>>>>>>>>> With the renewed interest in ML in Apache Spark, now seems like as good a time as any to revisit the online serving situation in Spark ML. DB & others have done some excellent work moving a lot of the necessary tools into a local linear algebra package that doesn't depend on having a SparkContext.
>>>>>>>>>>>>
>>>>>>>>>>>> There are a few different commercial and non-commercial solutions around this, but currently our individual transform/predict methods are private, so they either need to copy or re-implement them (or put themselves in org.apache.spark) to access them. How would folks feel about adding a new trait for ML pipeline stages to expose transformation of single-element inputs (or local collections) that could be optionally implemented by stages which support this?
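[A sketch of what such an opt-in capability could look like, rendered here as a Java interface with a default method (roughly what a Scala trait with a default implementation would compile down to). `LocalTransform` and `ScalerStage` are made-up names for illustration only; nothing here is an actual Spark API.]

```java
// Hypothetical sketch of the optional-trait idea: stages that can score
// a single element without a SparkContext opt in by implementing this.
import java.util.List;
import java.util.stream.Collectors;

// The opt-in capability: single-element transform, plus a default
// local-collection transform built on top of it.
interface LocalTransform<I, O> {
    O transformOne(I input);

    default List<O> transformLocal(List<I> inputs) {
        return inputs.stream().map(this::transformOne).collect(Collectors.toList());
    }
}

// A stage opting in; its (imagined) DataFrame-based transform would be
// left untouched, so existing APIs would not break.
class ScalerStage implements LocalTransform<Double, Double> {
    private final double scale;
    ScalerStage(double scale) { this.scale = scale; }

    @Override
    public Double transformOne(Double input) { return input * scale; }
}

public class LocalTransformDemo {
    public static void main(String[] args) {
        ScalerStage s = new ScalerStage(2.0);
        System.out.println(s.transformOne(3.0));                 // prints 6.0
        System.out.println(s.transformLocal(List.of(1.0, 2.0))); // prints [2.0, 4.0]
    }
}
```

[Because the interface is optional, stages that cannot support local transformation simply don't implement it, and callers can feature-test with `instanceof`.]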
>>>>>>>>>>>> That way we can have less copy-and-paste code that could get out of sync with our model training.
>>>>>>>>>>>>
>>>>>>>>>>>> I think continuing to have online serving grow in different projects is probably the right path forward (folks have different needs), but I'd love to see us make it simpler for other projects to build reliable serving tools.
>>>>>>>>>>>>
>>>>>>>>>>>> I realize this maybe puts some of the folks in an awkward position with their own commercial offerings, but hopefully if we make it easier for everyone, the commercial vendors can benefit as well.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>
>>>>>>>>>>>> Holden :)
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Joseph Bradley
>>>>>>>>>>> Software Engineer - Machine Learning
>>>>>>>>>>> Databricks, Inc.
>>>>>>>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Cheers,
>>>>>>>> Leif

--
Twitter: https://twitter.com/holdenkarau