I’m in the same boat as Maximiliano, with use cases for model serving as well, 
and would love to join this discussion.

Sent from my iPhone

On May 22, 2018, at 6:39 AM, Maximiliano Felice 
<maximilianofel...@gmail.com> wrote:

Hi!

I don't usually write a lot on this list, but I keep up to date with the 
discussions and I'm a heavy user of Spark. This topic caught my attention, as 
we're currently facing this issue at work. I'm attending the summit and was 
wondering if it would be possible for me to join that meeting. I might be 
able to share some helpful use cases and ideas.

Thanks,
Maximiliano Felice

On Tue., May 22, 2018 at 9:14 AM, Leif Walsh 
<leif.wa...@gmail.com> wrote:
I’m with you on JSON being more readable than Parquet, but we’ve had success 
using pyarrow’s Parquet reader and have been quite happy with it so far. If 
your target is Python (and probably, if not now, then soon, R), you should 
look into it.

On Mon, May 21, 2018 at 16:52 Joseph Bradley 
<jos...@databricks.com> wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  
It's easier to parse JSON without Spark, and using the same format simplifies 
architecture.  Plus, some people want to check files into version control, and 
JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like 
DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet 
in the online serving setting).
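To make the "easier to parse JSON without Spark" point concrete, a hypothetical sketch of round-tripping model parameters through JSON with only the standard library (the field names are invented for illustration, not MLlib's actual on-disk schema):

```python
# Hypothetical sketch: persist model parameters as JSON so any runtime can
# reload them without a SparkContext. Field names are illustrative only,
# not MLlib's real persistence schema.
import json

params = {
    "class": "LogisticRegressionModel",
    "coefficients": [0.25, -1.3, 0.7],
    "intercept": 0.1,
}

serialized = json.dumps(params, indent=2)   # human-readable, diffs well in VCS
restored = json.loads(serialized)           # no Spark needed to parse

assert restored == params
```

The indented form is also what makes checking model files into version control pleasant: diffs stay line-oriented and reviewable.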

This would be a big project, so proposing a SPIP might be best.  If people are 
around at the Spark Summit, that could be a good time to meet up & then post 
notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung 
<felixcheun...@hotmail.com> wrote:
Specifically, I’d like to bring part of the discussion to Model and 
PipelineModel, and the various ModelReader and SharedReadWrite implementations 
that rely on SparkContext. This is a big blocker for reusing trained models 
outside of Spark for online serving.

What’s the next step? Would folks be interested in getting together to 
discuss/get some feedback?


_____________________________
From: Felix Cheung <felixcheun...@hotmail.com>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau <hol...@pigscanfly.ca>, Joseph 
Bradley <jos...@databricks.com>
Cc: dev <dev@spark.apache.org>



Huge +1 on this!

________________________________
From: holden.ka...@gmail.com <holden.ka...@gmail.com> on behalf of Holden 
Karau <hol...@pigscanfly.ca>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?



On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley 
<jos...@databricks.com> wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of 
MLlib models which could be deployed without the big Spark JARs and without a 
SparkContext or SparkSession.  There are related commercial offerings like this 
:), but the overhead of maintaining those offerings is pretty high.  Building 
good APIs within MLlib to avoid copying logic across libraries will be well 
worth it.

We've talked about this need at Databricks and have also been syncing with the 
creators of MLeap.  It'd be great to get this functionality into Spark itself.  
Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a 
Row to the current Models.  Instead, it would be ideal to have local, 
lightweight versions of models in mllib-local, outside of the main mllib 
package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize 
elements of Spark SQL, particularly Rows and Types, which could be moved into a 
local sql package.
* With the current APIs, this architecture may require some awkward 
workarounds to have model prediction logic and local model classes in 
mllib-local, and regular (DataFrame-friendly) model classes in mllib.  We 
might find it helpful to break some DeveloperApis in Spark 3.0 to facilitate 
this architecture while making it feasible for 3rd-party developers to extend 
MLlib APIs (especially in Java).
I agree this could be interesting, and it feeds into the other discussion 
around when (or if) we should be considering Spark 3.0.
I _think_ we could probably do it with optional traits people could mix in to 
avoid breaking the current APIs, but I could be wrong on that point.
* It could also be worth discussing local DataFrames.  They might not be as 
important as per-Row transformations, but they would be helpful for batching 
for higher throughput.
That could be interesting as well.
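As a rough illustration of why batching helps throughput (this is not a Spark API, just a sketch with NumPy standing in for whatever local math library a lightweight model would use): scoring a batch is one vectorized matrix product instead of a Python-level loop, and both paths must agree:

```python
# Illustrative sketch (not a Spark API): per-row vs. batched scoring for a
# local linear model. NumPy stands in for the local linear algebra package.
import numpy as np

coef = np.array([0.5, -0.25])

def predict_row(row):
    """Score a single input row -- the per-Row transform() case."""
    return float(np.dot(coef, row))

rows = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

per_row = [predict_row(r) for r in rows]  # one Python call per row
batched = rows @ coef                     # one vectorized call for the batch

assert np.allclose(per_row, batched)
```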

I'll be interested to hear others' thoughts too!

Joseph

On Wed, May 9, 2018 at 7:18 AM, Holden Karau 
<hol...@pigscanfly.ca> wrote:
Hi y'all,

With the renewed interest in ML in Apache Spark, now seems like as good a time 
as any to revisit the online serving situation in Spark ML. DB & others have 
done some excellent work moving a lot of the necessary tools into a local 
linear algebra package that doesn't depend on having a SparkContext.

There are a few different commercial and non-commercial solutions around this, 
but currently our individual transform/predict methods are private, so these 
solutions either need to copy or re-implement them (or put themselves in 
org.apache.spark) to get access. How would folks feel about adding a new trait 
for ML pipeline stages that exposes transformation of single-element inputs 
(or local collections), which could be optionally implemented by stages that 
support it? That way we'd have less copy-and-paste code that can drift out of 
sync with our model training.
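Sketching the opt-in trait idea in Python for concreteness (all names here are hypothetical; in Spark itself this would presumably be a Scala trait mixed into pipeline stages):

```python
# Hypothetical sketch of the opt-in trait: stages that can transform a
# single element implement transform_row; all names are illustrative.
class LocalTransformSupport:
    """Mixin for pipeline stages that support single-element transformation."""

    def transform_row(self, row):
        raise NotImplementedError


class ScalerStage(LocalTransformSupport):
    """Toy stage: the same per-element logic would back the batch transform,
    so serving code cannot drift out of sync with training code."""

    def __init__(self, factor):
        self.factor = factor

    def transform_row(self, row):
        return [x * self.factor for x in row]


stage = ScalerStage(2.0)
assert stage.transform_row([1.0, 3.0]) == [2.0, 6.0]
assert isinstance(stage, LocalTransformSupport)  # serving code can feature-test
```

Because the trait is opt-in, stages that can't meaningfully score a single element simply don't mix it in, and existing APIs are untouched.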

I think continuing to have online serving grow in different projects is 
probably the right path forward (folks have different needs), but I'd love to 
see us make it simpler for other projects to build reliable serving tools.

I realize this may put some folks in an awkward position with their own 
commercial offerings, but hopefully if we make it easier for everyone, the 
commercial vendors can benefit as well.

Cheers,

Holden :)

--
Twitter: https://twitter.com/holdenkarau



--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

http://databricks.com




--
Cheers,
Leif
