Big +1 to a meeting with fresh air. Could anyone send the invites? I don't really know which place Holden is talking about.

2018-05-29 14:27 GMT-03:00 Felix Cheung <felixcheun...@hotmail.com>:

> You had me at Blue Bottle!
>
> _____________________________
> From: Holden Karau <hol...@pigscanfly.ca>
> Sent: Tuesday, May 29, 2018 9:47 AM
> Subject: Re: Revisiting Online serving of Spark models?
> To: Felix Cheung <felixcheun...@hotmail.com>
> Cc: Saikat Kanjilal <sxk1...@hotmail.com>, Maximiliano Felice <
> maximilianofel...@gmail.com>, Joseph Bradley <jos...@databricks.com>,
> Leif Walsh <leif.wa...@gmail.com>, dev <dev@spark.apache.org>
>
> I'm down for that, we could all go for a walk maybe to the Mint Plaza
> Blue Bottle and grab coffee (if the weather holds have our design meeting
> outside :p)?
>
> On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>
>> Bump.
>>
>> ------------------------------
>> *From:* Felix Cheung <felixcheun...@hotmail.com>
>> *Sent:* Saturday, May 26, 2018 1:05:29 PM
>> *To:* Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
>> *Cc:* Leif Walsh; Holden Karau; dev
>> *Subject:* Re: Revisiting Online serving of Spark models?
>>
>> Hi! How about we meet the community and discuss on June 6 4pm at (near)
>> the Summit?
>>
>> (I propose we meet at the venue entrance so we could accommodate people
>> who might not be in the conference)
>>
>> ------------------------------
>> *From:* Saikat Kanjilal <sxk1...@hotmail.com>
>> *Sent:* Tuesday, May 22, 2018 7:47:07 AM
>> *To:* Maximiliano Felice
>> *Cc:* Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
>> *Subject:* Re: Revisiting Online serving of Spark models?
>>
>> I’m in the same exact boat as Maximiliano and have use cases as well for
>> model serving and would love to join this discussion.
>>
>> Sent from my iPhone
>>
>> On May 22, 2018, at 6:39 AM, Maximiliano Felice <
>> maximilianofel...@gmail.com> wrote:
>>
>> Hi!
>>
>> I don't usually write a lot on this list but I keep up to date with the
>> discussions and I'm a heavy user of Spark. This topic caught my attention,
>> as we're currently facing this issue at work. I'm attending the summit
>> and was wondering if it would be possible for me to join that meeting. I
>> might be able to share some helpful use cases and ideas.
>>
>> Thanks,
>> Maximiliano Felice
>>
>> On Tue, May 22, 2018 at 9:14 AM, Leif Walsh <leif.wa...@gmail.com>
>> wrote:
>>
>>> I’m with you on json being more readable than parquet, but we’ve had
>>> success using pyarrow’s parquet reader and have been quite happy with it so
>>> far. If your target is python (and probably if not now, then soon, R), you
>>> should look into it.
>>>
>>> On Mon, May 21, 2018 at 16:52 Joseph Bradley <jos...@databricks.com>
>>> wrote:
>>>
>>>> Regarding model reading and writing, I'll give quick thoughts here:
>>>> * Our approach was to use the same format but write JSON instead of
>>>> Parquet. It's easier to parse JSON without Spark, and using the same
>>>> format simplifies architecture. Plus, some people want to check files into
>>>> version control, and JSON is nice for that.
>>>> * The reader/writer APIs could be extended to take format parameters
>>>> (just like DataFrame reader/writers) to handle JSON (and maybe, eventually,
>>>> handle Parquet in the online serving setting).
>>>>
>>>> This would be a big project, so proposing a SPIP might be best. If
>>>> people are around at the Spark Summit, that could be a good time to meet up
>>>> & then post notes back to the dev list.
>>>>
>>>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <
>>>> felixcheun...@hotmail.com> wrote:
>>>>
>>>>> Specifically I’d like to bring part of the discussion to Model and
>>>>> PipelineModel, and various ModelReader and SharedReadWrite implementations
>>>>> that rely on SparkContext. This is a big blocker on reusing trained
>>>>> models outside of Spark for online serving.
>>>>>
>>>>> What’s the next step? Would folks be interested in getting together to
>>>>> discuss/get some feedback?
>>>>>
>>>>> _____________________________
>>>>> From: Felix Cheung <felixcheun...@hotmail.com>
>>>>> Sent: Thursday, May 10, 2018 10:10 AM
>>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>>> To: Holden Karau <hol...@pigscanfly.ca>, Joseph Bradley <
>>>>> jos...@databricks.com>
>>>>> Cc: dev <dev@spark.apache.org>
>>>>>
>>>>> Huge +1 on this!
>>>>>
>>>>> ------------------------------
>>>>> *From:* holden.ka...@gmail.com <holden.ka...@gmail.com> on behalf of
>>>>> Holden Karau <hol...@pigscanfly.ca>
>>>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>>>>> *To:* Joseph Bradley
>>>>> *Cc:* dev
>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>
>>>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jos...@databricks.com
>>>>> > wrote:
>>>>>
>>>>>> Thanks for bringing this up Holden! I'm a strong supporter of this.
>>>>>>
>>>>> Awesome! I'm glad other folks think something like this belongs in
>>>>> Spark.
>>>>>
>>>>>> This was one of the original goals for mllib-local: to have local
>>>>>> versions of MLlib models which could be deployed without the big Spark
>>>>>> JARs and without a SparkContext or SparkSession. There are related
>>>>>> commercial offerings like this : ) but the overhead of maintaining
>>>>>> those offerings is pretty high. Building good APIs within MLlib to
>>>>>> avoid copying logic across libraries will be well worth it.
>>>>>>
>>>>>> We've talked about this need at Databricks and have also been syncing
>>>>>> with the creators of MLeap. It'd be great to get this functionality into
>>>>>> Spark itself. Some thoughts:
>>>>>> * It'd be valuable to have this go beyond adding transform() methods
>>>>>> taking a Row to the current Models. Instead, it would be ideal to have
>>>>>> local, lightweight versions of models in mllib-local, outside of the main
>>>>>> mllib package (for easier deployment with smaller & fewer dependencies).
>>>>>> * Supporting Pipelines is important. For this, it would be ideal to
>>>>>> utilize elements of Spark SQL, particularly Rows and Types, which could
>>>>>> be moved into a local sql package.
>>>>>> * This architecture may require some awkward APIs currently to have
>>>>>> model prediction logic in mllib-local, local model classes in
>>>>>> mllib-local, and regular (DataFrame-friendly) model classes in mllib.
>>>>>> We might find it helpful to break some DeveloperApis in Spark 3.0 to
>>>>>> facilitate this architecture while making it feasible for 3rd party
>>>>>> developers to extend MLlib APIs (especially in Java).
>>>>>>
>>>>> I agree this could be interesting, and feed into the other discussion
>>>>> around when (or if) we should be considering Spark 3.0.
>>>>> I _think_ we could probably do it with optional traits people could
>>>>> mix in to avoid breaking the current APIs but I could be wrong on that
>>>>> point.
>>>>>
>>>>>> * It could also be worth discussing local DataFrames. They might not
>>>>>> be as important as per-Row transformations, but they would be helpful for
>>>>>> batching for higher throughput.
>>>>>>
>>>>> That could be interesting as well.
>>>>>
>>>>>> I'll be interested to hear others' thoughts too!
>>>>>>
>>>>>> Joseph
>>>>>>
>>>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <hol...@pigscanfly.ca>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi y'all,
>>>>>>>
>>>>>>> With the renewed interest in ML in Apache Spark now seems like as
>>>>>>> good a time as any to revisit the online serving situation in Spark ML.
>>>>>>> DB & others have done some excellent work moving a lot of the necessary
>>>>>>> tools into a local linear algebra package that doesn't depend on having
>>>>>>> a SparkContext.
>>>>>>>
>>>>>>> There are a few different commercial and non-commercial solutions
>>>>>>> around this, but currently our individual transform/predict methods are
>>>>>>> private so they either need to copy or re-implement (or put themselves
>>>>>>> in org.apache.spark) to access them. How would folks feel about adding a
>>>>>>> new trait for ML pipeline stages to expose to do transformation of single
>>>>>>> element inputs (or local collections) that could be optionally
>>>>>>> implemented by stages which support this? That way we can have less copy
>>>>>>> and paste code possibly getting out of sync with our model training.
>>>>>>>
>>>>>>> I think continuing to have on-line serving grow in different
>>>>>>> projects is probably the right path forward (folks have different
>>>>>>> needs), but I'd love to see us make it simpler for other projects to
>>>>>>> build reliable serving tools.
>>>>>>>
>>>>>>> I realize this maybe puts some of the folks in an awkward position
>>>>>>> with their own commercial offerings, but hopefully if we make it easier
>>>>>>> for everyone the commercial vendors can benefit as well.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Holden :)
>>>>>>>
>>>>>>> --
>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Joseph Bradley
>>>>>> Software Engineer - Machine Learning
>>>>>> Databricks, Inc.
>>>>>> <http://databricks.com/>
>>>>>>
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>
>>>>
>>>> --
>>>> Joseph Bradley
>>>> Software Engineer - Machine Learning
>>>> Databricks, Inc.
>>>> <http://databricks.com/>
>>>>
>>> --
>>> Cheers,
>>> Leif
>>>
>
> --
> Twitter: https://twitter.com/holdenkarau
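
For anyone skimming the thread before the meeting, here is a rough sketch of the idea quoted above: an optional interface that a stage can implement to transform a single element without a SparkContext, with the model parameters read from JSON. Everything in it is hypothetical and only for illustration: `LocalTransformer`, `LocalLogisticRegression`, `serve`, and the JSON layout are made-up names and are not part of Spark's API or its on-disk model format. It is plain Python with no Spark dependency.

```python
import json
import math
from typing import Any, Dict, List, Protocol, runtime_checkable


@runtime_checkable
class LocalTransformer(Protocol):
    """Hypothetical optional interface: transform one row, no SparkContext needed."""

    def transform_row(self, row: Dict[str, Any]) -> Dict[str, Any]:
        ...


class LocalLogisticRegression:
    """Hypothetical local model. The JSON layout it reads is an assumption,
    not the format Spark ML actually writes."""

    def __init__(self, coefficients: List[float], intercept: float) -> None:
        self.coefficients = coefficients
        self.intercept = intercept

    @classmethod
    def from_json(cls, text: str) -> "LocalLogisticRegression":
        # Plain JSON can be parsed with the standard library alone.
        params = json.loads(text)
        return cls(params["coefficients"], params["intercept"])

    def transform_row(self, row: Dict[str, Any]) -> Dict[str, Any]:
        features = row["features"]
        margin = self.intercept + sum(
            w * x for w, x in zip(self.coefficients, features)
        )
        probability = 1.0 / (1.0 + math.exp(-margin))
        # Return a new row so the caller's input is left untouched.
        out = dict(row)
        out["probability"] = probability
        out["prediction"] = 1.0 if probability >= 0.5 else 0.0
        return out


def serve(stages: List[Any], row: Dict[str, Any]) -> Dict[str, Any]:
    """Run one row through every stage; fail on stages without local support."""
    for stage in stages:
        if not isinstance(stage, LocalTransformer):
            raise TypeError(f"{type(stage).__name__} has no local transform")
        row = stage.transform_row(row)
    return row


if __name__ == "__main__":
    model = LocalLogisticRegression.from_json(
        '{"coefficients": [1.0, -1.0], "intercept": 0.0}'
    )
    print(serve([model], {"features": [2.0, 1.0]}))
```

Because the interface is optional, stages that cannot support single-row transformation simply do not implement it, and a serving layer can detect that up front instead of copying the prediction logic.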