Thanks for bringing this up, Holden!  I'm a strong supporter of this.

This was one of the original goals for mllib-local: to have local versions
of MLlib models which could be deployed without the big Spark JARs and
without a SparkContext or SparkSession.  There are related commercial
offerings like this : ) but the overhead of maintaining those offerings is
pretty high.  Building good APIs within MLlib to avoid copying logic across
libraries will be well worth it.

We've talked about this need at Databricks and have also been syncing with
the creators of MLeap.  It'd be great to get this functionality into Spark
itself.  Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods that
take a Row to the current Models.  Instead, it would be ideal to have
local, lightweight versions of models in mllib-local, outside the main
mllib package, for easier deployment with smaller and fewer dependencies.
(A rough sketch of what this could look like follows this list.)
* Supporting Pipelines is important.  For this, it would be ideal to
utilize elements of Spark SQL, particularly Rows and Types, which could be
moved into a local sql package.
* This architecture may require some awkward layering given the current
APIs: model prediction logic in mllib-local, local model classes in
mllib-local, and regular (DataFrame-friendly) model classes in mllib.  We
might find it helpful to break some DeveloperApis in Spark 3.0 to
facilitate this architecture while keeping it feasible for 3rd-party
developers to extend MLlib APIs (especially in Java).
* It could also be worth discussing local DataFrames.  They might not be
as important as per-Row transformations, but they would help with batching
for higher throughput (see the short example after the sketch below).
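
To make that more concrete, here is a rough strawman of what an optional
single-row trait plus a local model might look like.  To be clear, this is
a sketch for discussion, not a proposed API: LocalTransformer, LocalRow,
and LocalLinearRegressionModel are made-up names, and a real design would
presumably reuse Rows and Types moved into a local sql package as noted
above.  The only dependency below is the existing mllib-local linalg
package.

  // Strawman sketch only -- every name here is hypothetical.
  import org.apache.spark.ml.linalg.{Vector, Vectors}  // already in mllib-local

  // Hypothetical stand-in for a Row type moved into a local sql package.
  case class LocalRow(values: Map[String, Any]) {
    def getVector(field: String): Vector = values(field).asInstanceOf[Vector]
  }

  // Hypothetical trait that pipeline stages could optionally implement to
  // expose single-row prediction without a SparkContext or SparkSession.
  trait LocalTransformer {
    def transformRow(row: LocalRow): LocalRow

    // Naive default for local collections; stages could override this with
    // a batched implementation for higher throughput.
    def transformBatch(rows: Seq[LocalRow]): Seq[LocalRow] =
      rows.map(transformRow)
  }

  // Toy local counterpart of a fitted linear regression model.
  class LocalLinearRegressionModel(coefficients: Vector, intercept: Double)
      extends LocalTransformer {

    private def predict(features: Vector): Double = {
      var dot = 0.0
      features.foreachActive { (i, v) => dot += coefficients(i) * v }
      dot + intercept
    }

    override def transformRow(row: LocalRow): LocalRow =
      LocalRow(row.values + ("prediction" -> predict(row.getVector("features"))))
  }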
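
And a quick usage example for the batching point above: scoring a small
local collection with no Spark runtime involved (again, all hypothetical
names from the sketch).

  // Hypothetical usage of the strawman above; no SparkContext needed.
  val model = new LocalLinearRegressionModel(Vectors.dense(0.5, -1.0), 0.1)
  val batch = Seq(
    LocalRow(Map("features" -> Vectors.dense(1.0, 2.0))),
    LocalRow(Map("features" -> Vectors.dense(3.0, 4.0)))
  )
  val scored = model.transformBatch(batch)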

I'll be interested to hear others' thoughts too!

Joseph

On Wed, May 9, 2018 at 7:18 AM, Holden Karau <hol...@pigscanfly.ca> wrote:

> Hi y'all,
>
> With the renewed interest in ML in Apache Spark, now seems like as good a
> time as any to revisit the online serving situation in Spark ML. DB &
> others have done some excellent work moving a lot of the necessary tools
> into a local linear algebra package that doesn't depend on having a
> SparkContext.
>
> There are a few different commercial and non-commercial solutions around
> this, but currently our individual transform/predict methods are private,
> so those solutions either need to copy or re-implement the logic (or put
> themselves in org.apache.spark) to access it. How would folks feel about
> adding a new trait for ML pipeline stages that exposes transformation of
> single-element inputs (or local collections), optionally implemented by
> stages which support this? That way we'd have less copy-and-paste code
> that could drift out of sync with our model training.
>
> I think continuing to have on-line serving grow in different projects is
> probably the right path forward (folks have different needs), but I'd love
> to see us make it simpler for other projects to build reliable serving
> tools.
>
> I realize this may put some of the folks with their own commercial
> offerings in an awkward position, but hopefully if we make it easier for
> everyone, the commercial vendors can benefit as well.
>
> Cheers,
>
> Holden :)
>
> --
> Twitter: https://twitter.com/holdenkarau
>



-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

http://databricks.com/
