So I kicked off a thread on user@ to collect people's feedback there, but I'll also summarize the offline results later this week.
On Tue, Jun 12, 2018, 5:03 AM Liang-Chi Hsieh <vii...@gmail.com> wrote:
>
> Hi,
>
> It'd be great if there can be any sharing of the offline discussion.
> Thanks!
>
> Holden Karau wrote
> > We’re by the registration sign going to start walking over at 4:05
> >
> > On Wed, Jun 6, 2018 at 2:43 PM Maximiliano Felice <maximilianofelice@> wrote:
> >
> >> Hi!
> >>
> >> Do we meet at the entrance?
> >>
> >> See you
> >>
> >> On Tue, Jun 5, 2018, 3:07 PM, Nick Pentreath <nick.pentreath@> wrote:
> >>
> >>> I will aim to join up at 4pm tomorrow (Wed) too. Look forward to it.
> >>>
> >>> On Sun, 3 Jun 2018 at 00:24 Holden Karau <holden@> wrote:
> >>>
> >>>> On Sat, Jun 2, 2018 at 8:39 PM, Maximiliano Felice <maximilianofelice@> wrote:
> >>>>
> >>>>> Hi!
> >>>>>
> >>>>> We're already in San Francisco waiting for the summit. We even think
> >>>>> that we spotted @holdenk this afternoon.
> >>>>>
> >>>> Unless you happened to be walking by my garage, probably not super
> >>>> likely; I spent the day working on scooters/motorcycles (my style is a
> >>>> little less unique in SF :)). Also if you see me feel free to say hi
> >>>> unless I look like I haven't had my first coffee of the day, love
> >>>> chatting with folks IRL :)
> >>>>
> >>>>> @chris, we're really interested in the Meetup you're hosting. My team
> >>>>> will probably join it from the beginning if you have room for us, and
> >>>>> I'll join it later after discussing the topics on this thread. I'll
> >>>>> send you an email regarding this request.
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>> On Fri, Jun 1, 2018, 7:26 AM, Saikat Kanjilal <sxk1969@> wrote:
> >>>>>
> >>>>>> @Chris This sounds fantastic, please send summary notes for Seattle
> >>>>>> folks
> >>>>>>
> >>>>>> @Felix I work in downtown Seattle, am wondering if we should hold a
> >>>>>> tech meetup around model serving in spark at my work or elsewhere
> >>>>>> close, thoughts? I’m actually in the midst of building microservices
> >>>>>> to manage models and when I say models I mean much more than machine
> >>>>>> learning models (think OR, process models as well)
> >>>>>>
> >>>>>> Regards
> >>>>>>
> >>>>>> Sent from my iPhone
> >>>>>>
> >>>>>> On May 31, 2018, at 10:32 PM, Chris Fregly <chris@> wrote:
> >>>>>>
> >>>>>> Hey everyone!
> >>>>>>
> >>>>>> @Felix: thanks for putting this together. i sent some of you a quick
> >>>>>> calendar event - mostly for me, so i don’t forget! :)
> >>>>>>
> >>>>>> Coincidentally, this is the focus of June 6th's Advanced Spark and
> >>>>>> TensorFlow Meetup
> >>>>>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/>
> >>>>>> @5:30pm on June 6th (same night) here in SF!
> >>>>>>
> >>>>>> Everybody is welcome to come. Here’s the link to the meetup that
> >>>>>> includes the signup link:
> >>>>>> https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/
> >>>>>>
> >>>>>> We have an awesome lineup of speakers covering a lot of deep,
> >>>>>> technical ground.
> >>>>>>
> >>>>>> For those who can’t attend in person, we’ll be broadcasting live -
> >>>>>> and posting the recording afterward.
> >>>>>>
> >>>>>> All details are in the meetup link above…
> >>>>>>
> >>>>>> @holden/felix/nick/joseph/maximiliano/saikat/leif: you’re more than
> >>>>>> welcome to give a talk. I can move things around to make room.
> >>>>>>
> >>>>>> @joseph: I’d personally like an update on the direction of the
> >>>>>> Databricks proprietary ML Serving export format which is similar to
> >>>>>> PMML but not a standard in any way.
> >>>>>>
> >>>>>> Also, the Databricks ML Serving Runtime is only available to
> >>>>>> Databricks customers. This seems in conflict with the community
> >>>>>> efforts described here. Can you comment on behalf of Databricks?
> >>>>>>
> >>>>>> Look forward to your response, joseph.
> >>>>>>
> >>>>>> See you all soon!
> >>>>>>
> >>>>>> —
> >>>>>>
> >>>>>> Chris Fregly
> >>>>>> Founder @ PipelineAI <https://pipeline.ai/> (100,000 Users)
> >>>>>> Organizer @ Advanced Spark and TensorFlow Meetup
> >>>>>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/>
> >>>>>> (85,000 Global Members)
> >>>>>>
> >>>>>> San Francisco - Chicago - Austin - Washington DC - London - Dusseldorf
> >>>>>> Try our PipelineAI Community Edition with GPUs and TPUs!!
> >>>>>> <http://community.pipeline.ai/>
> >>>>>>
> >>>>>> On May 30, 2018, at 9:32 AM, Felix Cheung <felixcheung_m@> wrote:
> >>>>>>
> >>>>>> Hi!
> >>>>>>
> >>>>>> Thank you! Let’s meet then
> >>>>>>
> >>>>>> June 6 4pm
> >>>>>>
> >>>>>> Moscone West Convention Center
> >>>>>> 800 Howard Street, San Francisco, CA 94103
> >>>>>> <https://maps.google.com/?q=800+Howard+Street,+San+Francisco,+CA+94103&entry=gmail&source=g>
> >>>>>>
> >>>>>> Ground floor (outside of conference area - should be available for
> >>>>>> all) - we will meet and decide where to go
> >>>>>>
> >>>>>> (Would not send invite because that would be too much noise for dev@)
> >>>>>>
> >>>>>> To paraphrase Joseph, we will use this to kick off the discussion and
> >>>>>> post notes after and follow up online.
> >>>>>> As for Seattle, I would be very interested to meet in person later
> >>>>>> and discuss ;)
> >>>>>>
> >>>>>> _____________________________
> >>>>>> From: Saikat Kanjilal <sxk1969@>
> >>>>>> Sent: Tuesday, May 29, 2018 11:46 AM
> >>>>>> Subject: Re: Revisiting Online serving of Spark models?
> >>>>>> To: Maximiliano Felice <maximilianofelice@>
> >>>>>> Cc: Felix Cheung <felixcheung_m@>, Holden Karau <holden@>,
> >>>>>> Joseph Bradley <joseph@>, Leif Walsh <leif.walsh@>, dev <dev@.apache>
> >>>>>>
> >>>>>> Would love to join but am in Seattle, thoughts on how to make this
> >>>>>> work?
> >>>>>>
> >>>>>> Regards
> >>>>>>
> >>>>>> Sent from my iPhone
> >>>>>>
> >>>>>> On May 29, 2018, at 10:35 AM, Maximiliano Felice <maximilianofelice@> wrote:
> >>>>>>
> >>>>>> Big +1 to a meeting with fresh air.
> >>>>>>
> >>>>>> Could anyone send the invites? I don't really know which is the place
> >>>>>> Holden is talking about.
> >>>>>>
> >>>>>> 2018-05-29 14:27 GMT-03:00 Felix Cheung <felixcheung_m@>:
> >>>>>>
> >>>>>>> You had me at blue bottle!
> >>>>>>>
> >>>>>>> _____________________________
> >>>>>>> From: Holden Karau <holden@>
> >>>>>>> Sent: Tuesday, May 29, 2018 9:47 AM
> >>>>>>> Subject: Re: Revisiting Online serving of Spark models?
> >>>>>>> To: Felix Cheung <felixcheung_m@>
> >>>>>>> Cc: Saikat Kanjilal <sxk1969@>, Maximiliano Felice
> >>>>>>> <maximilianofelice@>, Joseph Bradley <joseph@>, Leif Walsh
> >>>>>>> <leif.walsh@>, dev <dev@.apache>
> >>>>>>>
> >>>>>>> I'm down for that, we could all go for a walk maybe to the Mint
> >>>>>>> Plaza Blue Bottle and grab coffee (if the weather holds have our
> >>>>>>> design meeting outside :p)?
> >>>>>>>
> >>>>>>> On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <felixcheung_m@> wrote:
> >>>>>>>
> >>>>>>>> Bump.
> >>>>>>>>
> >>>>>>>> ------------------------------
> >>>>>>>> From: Felix Cheung <felixcheung_m@>
> >>>>>>>> Sent: Saturday, May 26, 2018 1:05:29 PM
> >>>>>>>> To: Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
> >>>>>>>> Cc: Leif Walsh; Holden Karau; dev
> >>>>>>>> Subject: Re: Revisiting Online serving of Spark models?
> >>>>>>>>
> >>>>>>>> Hi! How about we meet the community and discuss on June 6 4pm at
> >>>>>>>> (near) the Summit?
> >>>>>>>>
> >>>>>>>> (I propose we meet at the venue entrance so we could accommodate
> >>>>>>>> people who might not be in the conference)
> >>>>>>>>
> >>>>>>>> ------------------------------
> >>>>>>>> From: Saikat Kanjilal <sxk1969@>
> >>>>>>>> Sent: Tuesday, May 22, 2018 7:47:07 AM
> >>>>>>>> To: Maximiliano Felice
> >>>>>>>> Cc: Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
> >>>>>>>> Subject: Re: Revisiting Online serving of Spark models?
> >>>>>>>>
> >>>>>>>> I’m in the same exact boat as Maximiliano and have use cases as
> >>>>>>>> well for model serving and would love to join this discussion.
> >>>>>>>>
> >>>>>>>> Sent from my iPhone
> >>>>>>>>
> >>>>>>>> On May 22, 2018, at 6:39 AM, Maximiliano Felice <maximilianofelice@> wrote:
> >>>>>>>>
> >>>>>>>> Hi!
> >>>>>>>>
> >>>>>>>> I don't usually write a lot on this list but I keep up to date
> >>>>>>>> with the discussions and I'm a heavy user of Spark. This topic
> >>>>>>>> caught my attention, as we're currently facing this issue at work.
> >>>>>>>> I'm attending the summit and was wondering if it would be possible
> >>>>>>>> for me to join that meeting. I might be able to share some helpful
> >>>>>>>> use cases and ideas.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Maximiliano Felice
> >>>>>>>>
> >>>>>>>> On Tue, May 22, 2018, 9:14 AM, Leif Walsh <leif.walsh@> wrote:
> >>>>>>>>
> >>>>>>>>> I’m with you on json being more readable than parquet, but we’ve
> >>>>>>>>> had success using pyarrow’s parquet reader and have been quite
> >>>>>>>>> happy with it so far. If your target is python (and probably if
> >>>>>>>>> not now, then soon, R), you should look into it.
> >>>>>>>>>
> >>>>>>>>> On Mon, May 21, 2018 at 16:52 Joseph Bradley <joseph@> wrote:
> >>>>>>>>>
> >>>>>>>>>> Regarding model reading and writing, I'll give quick thoughts
> >>>>>>>>>> here:
> >>>>>>>>>> * Our approach was to use the same format but write JSON instead
> >>>>>>>>>>   of Parquet. It's easier to parse JSON without Spark, and using
> >>>>>>>>>>   the same format simplifies architecture. Plus, some people
> >>>>>>>>>>   want to check files into version control, and JSON is nice for
> >>>>>>>>>>   that.
> >>>>>>>>>> * The reader/writer APIs could be extended to take format
> >>>>>>>>>>   parameters (just like DataFrame reader/writers) to handle JSON
> >>>>>>>>>>   (and maybe, eventually, handle Parquet in the online serving
> >>>>>>>>>>   setting).
> >>>>>>>>>>
> >>>>>>>>>> This would be a big project, so proposing a SPIP might be best.
> >>>>>>>>>> If people are around at the Spark Summit, that could be a good
> >>>>>>>>>> time to meet up & then post notes back to the dev list.
> >>>>>>>>>>
> >>>>>>>>>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <felixcheung_m@> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Specifically I’d like to bring part of the discussion to Model
> >>>>>>>>>>> and PipelineModel, and various ModelReader and SharedReadWrite
> >>>>>>>>>>> implementations that rely on SparkContext.
> >>>>>>>>>>> This is a big blocker on reusing trained models outside of
> >>>>>>>>>>> Spark for online serving.
> >>>>>>>>>>>
> >>>>>>>>>>> What’s the next step? Would folks be interested in getting
> >>>>>>>>>>> together to discuss/get some feedback?
> >>>>>>>>>>>
> >>>>>>>>>>> _____________________________
> >>>>>>>>>>> From: Felix Cheung <felixcheung_m@>
> >>>>>>>>>>> Sent: Thursday, May 10, 2018 10:10 AM
> >>>>>>>>>>> Subject: Re: Revisiting Online serving of Spark models?
> >>>>>>>>>>> To: Holden Karau <holden@>, Joseph Bradley <joseph@>
> >>>>>>>>>>> Cc: dev <dev@.apache>
> >>>>>>>>>>>
> >>>>>>>>>>> Huge +1 on this!
> >>>>>>>>>>>
> >>>>>>>>>>> ------------------------------
> >>>>>>>>>>> From: holden.karau@ on behalf of Holden Karau <holden@>
> >>>>>>>>>>> Sent: Thursday, May 10, 2018 9:39:26 AM
> >>>>>>>>>>> To: Joseph Bradley
> >>>>>>>>>>> Cc: dev
> >>>>>>>>>>> Subject: Re: Revisiting Online serving of Spark models?
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <joseph@> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Thanks for bringing this up Holden! I'm a strong supporter of
> >>>>>>>>>>>> this.
> >>>>>>>>>>>>
> >>>>>>>>>>> Awesome! I'm glad other folks think something like this belongs
> >>>>>>>>>>> in Spark.
> >>>>>>>>>>>
> >>>>>>>>>>>> This was one of the original goals for mllib-local: to have
> >>>>>>>>>>>> local versions of MLlib models which could be deployed without
> >>>>>>>>>>>> the big Spark JARs and without a SparkContext or SparkSession.
> >>>>>>>>>>>> There are related commercial offerings like this : ) but the
> >>>>>>>>>>>> overhead of maintaining those offerings is pretty high.
> >>>>>>>>>>>> Building good APIs within MLlib to avoid copying logic across
> >>>>>>>>>>>> libraries will be well worth it.
> >>>>>>>>>>>>
> >>>>>>>>>>>> We've talked about this need at Databricks and have also been
> >>>>>>>>>>>> syncing with the creators of MLeap. It'd be great to get this
> >>>>>>>>>>>> functionality into Spark itself. Some thoughts:
> >>>>>>>>>>>> * It'd be valuable to have this go beyond adding transform()
> >>>>>>>>>>>>   methods taking a Row to the current Models. Instead, it
> >>>>>>>>>>>>   would be ideal to have local, lightweight versions of models
> >>>>>>>>>>>>   in mllib-local, outside of the main mllib package (for
> >>>>>>>>>>>>   easier deployment with smaller & fewer dependencies).
> >>>>>>>>>>>> * Supporting Pipelines is important. For this, it would be
> >>>>>>>>>>>>   ideal to utilize elements of Spark SQL, particularly Rows
> >>>>>>>>>>>>   and Types, which could be moved into a local sql package.
> >>>>>>>>>>>> * This architecture may require some awkward APIs currently to
> >>>>>>>>>>>>   have model prediction logic in mllib-local, local model
> >>>>>>>>>>>>   classes in mllib-local, and regular (DataFrame-friendly)
> >>>>>>>>>>>>   model classes in mllib. We might find it helpful to break
> >>>>>>>>>>>>   some DeveloperApis in Spark 3.0 to facilitate this
> >>>>>>>>>>>>   architecture while making it feasible for 3rd party
> >>>>>>>>>>>>   developers to extend MLlib APIs (especially in Java).
> >>>>>>>>>>>>
> >>>>>>>>>>> I agree this could be interesting, and feed into the other
> >>>>>>>>>>> discussion around when (or if) we should be considering Spark
> >>>>>>>>>>> 3.0.
> >>>>>>>>>>> I _think_ we could probably do it with optional traits people
> >>>>>>>>>>> could mix in to avoid breaking the current APIs but I could be
> >>>>>>>>>>> wrong on that point.
> >>>>>>>>>>>
> >>>>>>>>>>>> * It could also be worth discussing local DataFrames.
> >>>>>>>>>>>>   They might not be as important as per-Row transformations,
> >>>>>>>>>>>>   but they would be helpful for batching for higher
> >>>>>>>>>>>>   throughput.
> >>>>>>>>>>>>
> >>>>>>>>>>> That could be interesting as well.
> >>>>>>>>>>>
> >>>>>>>>>>>> I'll be interested to hear others' thoughts too!
> >>>>>>>>>>>>
> >>>>>>>>>>>> Joseph
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <holden@> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi y'all,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> With the renewed interest in ML in Apache Spark now seems
> >>>>>>>>>>>>> like as good a time as any to revisit the online serving
> >>>>>>>>>>>>> situation in Spark ML. DB & others have done some excellent
> >>>>>>>>>>>>> work moving a lot of the necessary tools into a local linear
> >>>>>>>>>>>>> algebra package that doesn't depend on having a
> >>>>>>>>>>>>> SparkContext.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> There are a few different commercial and non-commercial
> >>>>>>>>>>>>> solutions around this, but currently our individual
> >>>>>>>>>>>>> transform/predict methods are private so they either need to
> >>>>>>>>>>>>> copy or re-implement (or put themselves in org.apache.spark)
> >>>>>>>>>>>>> to access them. How would folks feel about adding a new
> >>>>>>>>>>>>> trait for ML pipeline stages to expose to do transformation
> >>>>>>>>>>>>> of single element inputs (or local collections) that could
> >>>>>>>>>>>>> be optionally implemented by stages which support this? That
> >>>>>>>>>>>>> way we can have less copy and paste code possibly getting
> >>>>>>>>>>>>> out of sync with our model training.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I think continuing to have on-line serving grow in different
> >>>>>>>>>>>>> projects is probably the right path forward (folks have
> >>>>>>>>>>>>> different needs), but I'd love to see us make it simpler for
> >>>>>>>>>>>>> other projects to build reliable serving tools.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I realize this maybe puts some of the folks in an awkward
> >>>>>>>>>>>>> position with their own commercial offerings, but hopefully
> >>>>>>>>>>>>> if we make it easier for everyone the commercial vendors can
> >>>>>>>>>>>>> benefit as well.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Holden :)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> --
> >>>>>>>>>>>>> Twitter: https://twitter.com/holdenkarau
> >>>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Joseph Bradley
> >>>>>>>>>>>> Software Engineer - Machine Learning
> >>>>>>>>>>>> Databricks, Inc.
> >>>>>>>>>>>> [image: http://databricks.com] <http://databricks.com/>
> >>>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Twitter: https://twitter.com/holdenkarau
> >>>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Joseph Bradley
> >>>>>>>>>> Software Engineer - Machine Learning
> >>>>>>>>>> Databricks, Inc.
> >>>>>>>>>> [image: http://databricks.com] <http://databricks.com/>
> >>>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Cheers,
> >>>>>>>>> Leif
> >>>>>>>>>
> >>>>>>> --
> >>>>>>> Twitter: https://twitter.com/holdenkarau
> >>>>>>>
> >>>> --
> >>>> Twitter: https://twitter.com/holdenkarau
> >>>>
> > --
> > Twitter: https://twitter.com/holdenkarau
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
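
For anyone skimming the quoted history: the proposal that opened this thread was an optional trait that ML pipeline stages could implement to transform single-element inputs (or local collections) without a SparkContext. A rough sketch of that shape follows. It is plain Python rather than Scala so that it runs without Spark, and every name in it (`LocalTransformer`, `transform_row`, `transform_rows`, `Scaler`, `Threshold`, `ClusterOnlyStage`, `serve`) is hypothetical and made up for illustration; none of them are Spark APIs, and nothing here reflects an agreed design.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

# Stand-in for a single input record; a real design might reuse Spark SQL Rows.
Row = Dict[str, Any]


class LocalTransformer(ABC):
    """Hypothetical opt-in interface: a stage that can transform one record
    locally, with no SparkContext or SparkSession involved."""

    @abstractmethod
    def transform_row(self, row: Row) -> Row:
        """Transform a single record and return a new record."""

    def transform_rows(self, rows: List[Row]) -> List[Row]:
        # Default for local collections: apply the single-record logic to each.
        return [self.transform_row(r) for r in rows]


class Scaler(LocalTransformer):
    """Toy stage: multiplies column "x" by a fixed factor."""

    def __init__(self, factor: float) -> None:
        self.factor = factor

    def transform_row(self, row: Row) -> Row:
        out = dict(row)  # copy so the caller's record is not mutated
        out["scaled"] = row["x"] * self.factor
        return out


class Threshold(LocalTransformer):
    """Toy stage: labels a record 1 when "scaled" reaches the cutoff."""

    def __init__(self, cutoff: float) -> None:
        self.cutoff = cutoff

    def transform_row(self, row: Row) -> Row:
        out = dict(row)
        out["label"] = 1 if row["scaled"] >= self.cutoff else 0
        return out


class ClusterOnlyStage:
    """Toy stage that did not opt in to local serving."""


def serve(stages: List[object], row: Row) -> Row:
    """Run one record through a pipeline, rejecting stages that did not opt in."""
    for stage in stages:
        if not isinstance(stage, LocalTransformer):
            raise TypeError(f"{type(stage).__name__} does not support local serving")
        row = stage.transform_row(row)
    return row


if __name__ == "__main__":
    print(serve([Scaler(2.0), Threshold(5.0)], {"x": 3.0}))
```

Because the interface is optional, existing stages keep working unchanged, and a serving tool can check up front whether every stage in a pipeline supports local use instead of copying or re-implementing the prediction logic.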