I'm down for that. We could all walk over to the Mint Plaza Blue Bottle and grab coffee (and if the weather holds, have our design meeting outside :p)?
On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
> Bump.
>
> ------------------------------
> *From:* Felix Cheung <felixcheun...@hotmail.com>
> *Sent:* Saturday, May 26, 2018 1:05:29 PM
> *To:* Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
> *Cc:* Leif Walsh; Holden Karau; dev
> *Subject:* Re: Revisiting Online serving of Spark models?
>
> Hi! How about we meet the community and discuss on June 6 at 4pm at (near)
> the Summit?
>
> (I propose we meet at the venue entrance so we can accommodate people who
> might not be in the conference.)
>
> ------------------------------
> *From:* Saikat Kanjilal <sxk1...@hotmail.com>
> *Sent:* Tuesday, May 22, 2018 7:47:07 AM
> *To:* Maximiliano Felice
> *Cc:* Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
> *Subject:* Re: Revisiting Online serving of Spark models?
>
> I'm in exactly the same boat as Maximiliano and have use cases for model
> serving as well; I would love to join this discussion.
>
> Sent from my iPhone
>
> On May 22, 2018, at 6:39 AM, Maximiliano Felice
> <maximilianofel...@gmail.com> wrote:
>
> Hi!
>
> I don't usually write a lot on this list, but I keep up to date with the
> discussions and I'm a heavy user of Spark. This topic caught my attention,
> as we're currently facing this issue at work. I'm attending the Summit and
> was wondering if it would be possible for me to join that meeting. I might
> be able to share some helpful use cases and ideas.
>
> Thanks,
> Maximiliano Felice
>
> On Tue, May 22, 2018 at 9:14 AM, Leif Walsh <leif.wa...@gmail.com> wrote:
>
>> I'm with you on JSON being more readable than Parquet, but we've had
>> success using pyarrow's Parquet reader and have been quite happy with it
>> so far. If your target is Python (and probably, if not now then soon, R),
>> you should look into it.
>>
>> On Mon, May 21, 2018 at 16:52, Joseph Bradley <jos...@databricks.com> wrote:
>>
>>> Regarding model reading and writing, I'll give quick thoughts here:
>>> * Our approach was to use the same format but write JSON instead of
>>> Parquet. It's easier to parse JSON without Spark, and using the same
>>> format simplifies the architecture. Plus, some people want to check
>>> files into version control, and JSON is nice for that.
>>> * The reader/writer APIs could be extended to take format parameters
>>> (just like DataFrame readers/writers) to handle JSON (and maybe,
>>> eventually, handle Parquet in the online serving setting).
>>>
>>> This would be a big project, so proposing a SPIP might be best. If
>>> people are around at the Spark Summit, that could be a good time to
>>> meet up and then post notes back to the dev list.
>>>
>>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung
>>> <felixcheun...@hotmail.com> wrote:
>>>
>>>> Specifically, I'd like to bring part of the discussion to Model and
>>>> PipelineModel, and the various ModelReader and SharedReadWrite
>>>> implementations that rely on SparkContext. This is a big blocker on
>>>> reusing trained models outside of Spark for online serving.
>>>>
>>>> What's the next step? Would folks be interested in getting together
>>>> to discuss and get some feedback?
>>>>
>>>> _____________________________
>>>> From: Felix Cheung <felixcheun...@hotmail.com>
>>>> Sent: Thursday, May 10, 2018 10:10 AM
>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>> To: Holden Karau <hol...@pigscanfly.ca>, Joseph Bradley
>>>> <jos...@databricks.com>
>>>> Cc: dev <dev@spark.apache.org>
>>>>
>>>> Huge +1 on this!
>>>>
>>>> ------------------------------
>>>> *From:* holden.ka...@gmail.com <holden.ka...@gmail.com> on behalf of
>>>> Holden Karau <hol...@pigscanfly.ca>
>>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>>>> *To:* Joseph Bradley
>>>> *Cc:* dev
>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>
>>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley
>>>> <jos...@databricks.com> wrote:
>>>>
>>>>> Thanks for bringing this up, Holden! I'm a strong supporter of this.
>>>>>
>>>> Awesome! I'm glad other folks think something like this belongs in
>>>> Spark.
>>>>
>>>>> This was one of the original goals for mllib-local: to have local
>>>>> versions of MLlib models which could be deployed without the big
>>>>> Spark JARs and without a SparkContext or SparkSession. There are
>>>>> related commercial offerings like this :) but the overhead of
>>>>> maintaining those offerings is pretty high. Building good APIs within
>>>>> MLlib to avoid copying logic across libraries will be well worth it.
>>>>>
>>>>> We've talked about this need at Databricks and have also been syncing
>>>>> with the creators of MLeap. It'd be great to get this functionality
>>>>> into Spark itself. Some thoughts:
>>>>> * It'd be valuable to have this go beyond adding transform() methods
>>>>> taking a Row to the current Models. Instead, it would be ideal to
>>>>> have local, lightweight versions of models in mllib-local, outside of
>>>>> the main mllib package (for easier deployment with smaller and fewer
>>>>> dependencies).
>>>>> * Supporting Pipelines is important. For this, it would be ideal to
>>>>> utilize elements of Spark SQL, particularly Rows and Types, which
>>>>> could be moved into a local sql package.
>>>>> * This architecture may require some awkward APIs currently, to have
>>>>> model prediction logic in mllib-local, local model classes in
>>>>> mllib-local, and regular (DataFrame-friendly) model classes in mllib.
>>>>> We might find it helpful to break some DeveloperApis in Spark 3.0 to
>>>>> facilitate this architecture while making it feasible for 3rd-party
>>>>> developers to extend MLlib APIs (especially in Java).
>>>>>
>>>> I agree this could be interesting, and it feeds into the other
>>>> discussion around when (or if) we should be considering Spark 3.0.
>>>> I _think_ we could probably do it with optional traits people could
>>>> mix in to avoid breaking the current APIs, but I could be wrong on
>>>> that point.
>>>>
>>>>> * It could also be worth discussing local DataFrames. They might not
>>>>> be as important as per-Row transformations, but they would be helpful
>>>>> for batching for higher throughput.
>>>>>
>>>> That could be interesting as well.
>>>>
>>>>> I'll be interested to hear others' thoughts too!
>>>>>
>>>>> Joseph
>>>>>
>>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <hol...@pigscanfly.ca>
>>>>> wrote:
>>>>>
>>>>>> Hi y'all,
>>>>>>
>>>>>> With the renewed interest in ML in Apache Spark, now seems like as
>>>>>> good a time as any to revisit the online serving situation in Spark
>>>>>> ML. DB and others have done some excellent work moving a lot of the
>>>>>> necessary tools into a local linear algebra package that doesn't
>>>>>> depend on having a SparkContext.
>>>>>>
>>>>>> There are a few different commercial and non-commercial solutions
>>>>>> around this, but currently our individual transform/predict methods
>>>>>> are private, so other projects either need to copy or re-implement
>>>>>> them (or put themselves in org.apache.spark) to access them. How
>>>>>> would folks feel about adding a new trait for ML pipeline stages to
>>>>>> expose transformation of single-element inputs (or local
>>>>>> collections), which could be optionally implemented by stages that
>>>>>> support this? That way we'd have less copy-and-paste code that could
>>>>>> get out of sync with our model training.
>>>>>>
>>>>>> I think continuing to have online serving grow in different projects
>>>>>> is probably the right path forward (folks have different needs), but
>>>>>> I'd love to see us make it simpler for other projects to build
>>>>>> reliable serving tools.
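[Editor's note: a minimal sketch of the optional single-element-transform
mixin Holden proposes above. Everything here is hypothetical: the names
`LocalTransformer`, `transform_local`, and the toy scaler stage are
illustrations, not actual Spark APIs, and the sketch is in plain Python
rather than Scala to stay self-contained.]

```python
# Hypothetical sketch of an optional "local transform" mixin for pipeline
# stages, as discussed in the thread. No SparkContext involved: a stage
# that opts in exposes a per-element transform usable for online serving.
from abc import ABC, abstractmethod


class LocalTransformer(ABC):
    """Optional mixin a pipeline stage could implement to support
    single-element transformation without a SparkContext."""

    @abstractmethod
    def transform_local(self, row: dict) -> dict:
        """Transform one input row (a plain dict standing in for a Row)."""


class ToyScalerStage(LocalTransformer):
    """Toy stage: standardizes one numeric column using precomputed stats,
    reusing the same logic that batch transform() would use."""

    def __init__(self, mean: float, std: float, col: str = "feature"):
        self.mean, self.std, self.col = mean, std, col

    def transform_local(self, row: dict) -> dict:
        out = dict(row)  # leave other columns untouched
        out[self.col] = (row[self.col] - self.mean) / self.std
        return out


stage = ToyScalerStage(mean=10.0, std=2.0)
print(stage.transform_local({"feature": 14.0, "id": 7}))
# → {'feature': 2.0, 'id': 7}
```

The point of the mixin shape is exactly the one made in the thread: serving
code can check for the capability (`isinstance(stage, LocalTransformer)`)
without any change to stages that never opt in.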
>>>>>>
>>>>>> I realize this maybe puts some of the folks here in an awkward
>>>>>> position with their own commercial offerings, but hopefully if we
>>>>>> make it easier for everyone, the commercial vendors can benefit as
>>>>>> well.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Holden :)
>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>
>>>>> --
>>>>> Joseph Bradley
>>>>> Software Engineer - Machine Learning
>>>>> Databricks, Inc.
>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>
>>> --
>>> Joseph Bradley
>>> Software Engineer - Machine Learning
>>> Databricks, Inc.
>>> [image: http://databricks.com] <http://databricks.com/>
>>
>> --
>> Cheers,
>> Leif
>
--
Twitter: https://twitter.com/holdenkarau
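[Editor's note: a quick illustration of Joseph's point earlier in the thread
that JSON-serialized model parameters are easy to consume without Spark. The
JSON layout below is invented for the example (real MLlib metadata files
differ), and only the Python standard library is used.]

```python
# Hypothetical JSON export of a binary logistic regression model, scored
# locally with no Spark dependency -- just json and math from the stdlib.
import json
import math

# Invented schema for illustration; not MLlib's actual persistence format.
model_json = '{"intercept": -1.0, "coefficients": [0.5, 2.0]}'
params = json.loads(model_json)


def predict_proba(features):
    """Probability of the positive class via the logistic link."""
    margin = params["intercept"] + sum(
        w * x for w, x in zip(params["coefficients"], features)
    )
    return 1.0 / (1.0 + math.exp(-margin))


# margin = -1.0 + 0.5*2.0 + 2.0*0.0 = 0.0, so probability is 0.5
print(round(predict_proba([2.0, 0.0]), 3))  # → 0.5
```

A serving process only needs the handful of lines above plus the exported
file, which is the readability and version-control argument for JSON made in
the thread; the open question discussed here is keeping such logic inside
mllib-local instead of copied into every serving project.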