I like that idea. I’ll be around Spark Summit.

On Mon, May 21, 2018 at 1:52 PM Joseph Bradley <jos...@databricks.com> wrote:
> Regarding model reading and writing, I'll give quick thoughts here:
> * Our approach was to use the same format but write JSON instead of
> Parquet. It's easier to parse JSON without Spark, and using the same
> format simplifies architecture. Plus, some people want to check files into
> version control, and JSON is nice for that.
> * The reader/writer APIs could be extended to take format parameters (just
> like DataFrame reader/writers) to handle JSON (and maybe, eventually,
> handle Parquet in the online serving setting).
>
> This would be a big project, so proposing a SPIP might be best. If people
> are around at the Spark Summit, that could be a good time to meet up & then
> post notes back to the dev list.
>
> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>
>> Specifically I’d like to bring part of the discussion to Model and
>> PipelineModel, and various ModelReader and SharedReadWrite implementations
>> that rely on SparkContext. This is a big blocker on reusing trained models
>> outside of Spark for online serving.
>>
>> What’s the next step? Would folks be interested in getting together to
>> discuss/get some feedback?
>>
>>
>> _____________________________
>> From: Felix Cheung <felixcheun...@hotmail.com>
>> Sent: Thursday, May 10, 2018 10:10 AM
>> Subject: Re: Revisiting Online serving of Spark models?
>> To: Holden Karau <hol...@pigscanfly.ca>, Joseph Bradley <
>> jos...@databricks.com>
>> Cc: dev <dev@spark.apache.org>
>>
>>
>>
>> Huge +1 on this!
>>
>> ------------------------------
>> *From:* holden.ka...@gmail.com <holden.ka...@gmail.com> on behalf of
>> Holden Karau <hol...@pigscanfly.ca>
>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>> *To:* Joseph Bradley
>> *Cc:* dev
>> *Subject:* Re: Revisiting Online serving of Spark models?
>>
>>
>>
>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jos...@databricks.com>
>> wrote:
>>
>>> Thanks for bringing this up Holden! I'm a strong supporter of this.
>>>
>>> Awesome! I'm glad other folks think something like this belongs in
>>> Spark.
>>
>>> This was one of the original goals for mllib-local: to have local
>>> versions of MLlib models which could be deployed without the big Spark JARs
>>> and without a SparkContext or SparkSession. There are related commercial
>>> offerings like this : ) but the overhead of maintaining those offerings is
>>> pretty high. Building good APIs within MLlib to avoid copying logic across
>>> libraries will be well worth it.
>>>
>>> We've talked about this need at Databricks and have also been syncing
>>> with the creators of MLeap. It'd be great to get this functionality into
>>> Spark itself. Some thoughts:
>>> * It'd be valuable to have this go beyond adding transform() methods
>>> taking a Row to the current Models. Instead, it would be ideal to have
>>> local, lightweight versions of models in mllib-local, outside of the main
>>> mllib package (for easier deployment with smaller & fewer dependencies).
>>> * Supporting Pipelines is important. For this, it would be ideal to
>>> utilize elements of Spark SQL, particularly Rows and Types, which could be
>>> moved into a local sql package.
>>> * This architecture may require some awkward APIs currently to have
>>> model prediction logic in mllib-local, local model classes in mllib-local,
>>> and regular (DataFrame-friendly) model classes in mllib. We might find it
>>> helpful to break some DeveloperApis in Spark 3.0 to facilitate this
>>> architecture while making it feasible for 3rd party developers to extend
>>> MLlib APIs (especially in Java).
>>>
>> I agree this could be interesting, and feed into the other discussion
>> around when (or if) we should be considering Spark 3.0.
>> I _think_ we could probably do it with optional traits people could mix
>> in to avoid breaking the current APIs but I could be wrong on that point.
>>
>>> * It could also be worth discussing local DataFrames. They might not be
>>> as important as per-Row transformations, but they would be helpful for
>>> batching for higher throughput.
>>>
>> That could be interesting as well.
>>
>>>
>>> I'll be interested to hear others' thoughts too!
>>>
>>> Joseph
>>>
>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <hol...@pigscanfly.ca>
>>> wrote:
>>>
>>>> Hi y'all,
>>>>
>>>> With the renewed interest in ML in Apache Spark, now seems like as good a
>>>> time as any to revisit the online serving situation in Spark ML. DB &
>>>> others have done some excellent work moving a lot of the necessary
>>>> tools into a local linear algebra package that doesn't depend on having a
>>>> SparkContext.
>>>>
>>>> There are a few different commercial and non-commercial solutions around
>>>> this, but currently our individual transform/predict methods are private so
>>>> they either need to copy or re-implement (or put themselves in
>>>> org.apache.spark) to access them. How would folks feel about adding a new
>>>> trait for ML pipeline stages to expose to do transformation of single
>>>> element inputs (or local collections) that could be optionally implemented
>>>> by stages which support this? That way we can have less copy and paste code
>>>> possibly getting out of sync with our model training.
>>>>
>>>> I think continuing to have on-line serving grow in different projects
>>>> is probably the right path forward (folks have different needs), but I'd
>>>> love to see us make it simpler for other projects to build reliable serving
>>>> tools.
>>>>
>>>> I realize this maybe puts some of the folks in an awkward position with
>>>> their own commercial offerings, but hopefully if we make it easier for
>>>> everyone the commercial vendors can benefit as well.
>>>>
>>>> Cheers,
>>>>
>>>> Holden :)
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> Joseph Bradley
>>>
>>> Software Engineer - Machine Learning
>>>
>>> Databricks, Inc.
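For anyone following the thread, here is a minimal sketch of the optional single-row interface Holden proposes, combined with Joseph's point that model files written as JSON can be parsed without Spark. It is illustrative only and is not an MLlib API: LocalTransformer, LocalLinearModel, LocalPipeline, transform_row, transform_rows, and the JSON layout are all hypothetical names invented for this example, and Python is used purely for brevity.

```python
import json
from typing import Any, Dict, List, Protocol, runtime_checkable

# Stand-in for a Spark SQL Row; a plain dict is an assumption of this sketch.
Row = Dict[str, Any]


@runtime_checkable
class LocalTransformer(Protocol):
    """Hypothetical optional interface: transform one row without a SparkContext."""

    def transform_row(self, row: Row) -> Row: ...


class LocalLinearModel:
    """Toy linear model whose parameters are loaded from a plain JSON document."""

    def __init__(self, coefficients: List[float], intercept: float) -> None:
        self.coefficients = coefficients
        self.intercept = intercept

    @classmethod
    def from_json(cls, text: str) -> "LocalLinearModel":
        # The JSON layout is invented for illustration; it is not Spark's format.
        params = json.loads(text)
        return cls(params["coefficients"], params["intercept"])

    def transform_row(self, row: Row) -> Row:
        # Dot product of coefficients and features, plus the intercept.
        dot = sum(c * x for c, x in zip(self.coefficients, row["features"]))
        return {**row, "prediction": dot + self.intercept}


class LocalPipeline:
    """Chains stages, rejecting any stage that lacks the optional interface."""

    def __init__(self, stages: List[Any]) -> None:
        unsupported = [s for s in stages if not isinstance(s, LocalTransformer)]
        if unsupported:
            raise TypeError(f"stages without local support: {unsupported}")
        self.stages = stages

    def transform_row(self, row: Row) -> Row:
        for stage in self.stages:
            row = stage.transform_row(row)
        return row

    def transform_rows(self, rows: List[Row]) -> List[Row]:
        # Local-collection variant, e.g. for small batches.
        return [self.transform_row(r) for r in rows]


model = LocalLinearModel.from_json('{"coefficients": [2.0, -1.0], "intercept": 0.5}')
pipeline = LocalPipeline([model])
result = pipeline.transform_row({"features": [3.0, 4.0]})
```

A stage that cannot run locally simply does not implement transform_row, and the pipeline refuses it up front. That opt-in shape is one way to read the "optional traits people could mix in" suggestion, since existing stages would not need to change.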