I'm skeptical. Serving synchronous queries from a model at scale is a
fundamentally different activity. As you note, it doesn't logically involve
Spark. If scoring has to happen in milliseconds, it's going to be in-core.
Scoring even 10 qps with a Spark job per request is probably a non-starter;
think of the thousands of tasks per second and the overhead of just
tracking them.

When you say the RDD-based models support point prediction, I think you mean
that those older models expose a method to score a single Vector; they are
not somehow exposing distributed point prediction. You could add the same
thing to the newer models, but that raises the question of how to construct
the Row to feed it. The .mllib API punts on this and assumes you can
construct the Vector yourself.
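
To make the contrast concrete, here's a rough sketch (oldModel, newModel and
spark are placeholders for an already-fitted .mllib model, its .ml
counterpart, and a SparkSession):

    import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
    import org.apache.spark.ml.linalg.{Vectors => NewVectors}

    // .mllib: the model object scores a single Vector in-process, no job launched
    val score: Double = oldModel.predict(OldVectors.dense(0.1, 2.0, 0.5))

    // .ml: the only public scoring path is transform() on a DataFrame,
    // which schedules Spark tasks even for a single row
    val df = spark.createDataFrame(Seq(Tuple1(NewVectors.dense(0.1, 2.0, 0.5))))
      .toDF("features")
    val prediction = newModel.transform(df).select("prediction").head()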

I think this sweeps a lot under the rug in assuming that there can simply be
a "local" version of every Transformer -- but, even if there could be,
consider how much extra implementation that is. Many of them probably could
be made local, but I'm not sure all of them can.
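
To give a sense of what that would mean, here's a purely hypothetical sketch
of a "local" counterpart for one Transformer. None of these types exist in
Spark today, and every Transformer would need something like it written and
kept in sync by hand:

    // Hypothetical local API, not part of Spark
    trait LocalTransformer {
      def transform(row: Map[String, Any]): Map[String, Any]
    }

    // Hand-written local twin of CountVectorizerModel
    class LocalCountVectorizerModel(vocabulary: Array[String]) extends LocalTransformer {
      private val index = vocabulary.zipWithIndex.toMap
      override def transform(row: Map[String, Any]): Map[String, Any] = {
        val tokens = row("tokens").asInstanceOf[Seq[String]]
        val counts = tokens.groupBy(identity).collect {
          case (term, occurrences) if index.contains(term) =>
            index(term) -> occurrences.size.toDouble
        }
        row + ("features" -> counts.toSeq.sortBy(_._1))
      }
    }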

The bigger problem, in my experience, is that Pipelines don't generally
encapsulate the entire path from source data to score. They encapsulate the
part after the underlying features have been computed. That is, if one of
your features is "total clicks from this user", that feature is the product
of a DataFrame operation that precedes the Pipeline, and it can't be turned
into a non-distributed, non-Spark local version.
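
For example (clicks, events, and pipeline are placeholders), the feature
computation lives in plain DataFrame code that runs before the Pipeline ever
sees the data:

    import org.apache.spark.sql.functions.count

    // Aggregation over the full click log -- there's no meaningful "local",
    // single-request version of this step
    val userFeatures = clicks.groupBy("userId").agg(count("*").as("totalClicks"))

    val trainingData = events.join(userFeatures, "userId")
    val model = pipeline.fit(trainingData)   // the Pipeline only starts here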

Solving subsets of this problem could still be useful, and you've
highlighted some external projects that try. I'd also highlight PMML as an
established interchange format for just the model part; for cases that don't
involve much or any pipeline, it's a better fit when paired with a library
that can score from PMML.
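
For the .mllib models that implement PMMLExportable, the export itself is a
one-liner; the file path and training data below are just placeholders:

    import org.apache.spark.mllib.clustering.KMeans

    // trainingVectors: RDD[Vector] prepared elsewhere
    val kmeansModel = KMeans.train(trainingVectors, 10, 20)

    // Export just the model -- no pipeline, no Spark needed to score it later
    kmeansModel.toPMML("/tmp/kmeans.pmml")

A separate serving process can then load that file with a PMML evaluator
library and answer single requests without any Spark dependency.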

I think this is one of those things that could live outside the project,
because it's more not-Spark than Spark. Remember too that building one
solution into the project blesses it at the expense of others.


On Sun, Mar 12, 2017 at 10:15 PM Asher Krim <ak...@hubspot.com> wrote:

> Hi All,
>
> I spent a lot of time at Spark Summit East this year talking with Spark
> developers and committers about challenges with productizing Spark. One of
> the biggest shortcomings I've encountered in Spark ML pipelines is the lack
> of a way to serve single requests with any reasonable performance.
> SPARK-10413 explores adding methods for single item prediction, but I'd
> like to explore a more holistic approach - a separate local api, with
> models that support transformations without depending on Spark at all.
>
> I've written up a doc
> <https://docs.google.com/document/d/1Ha4DRMio5A7LjPqiHUnwVzbaxbev6ys04myyz6nDgI4/edit?usp=sharing>
> detailing the approach, and I'm happy to discuss alternatives. If this
> gains traction, I can create a branch with a minimal example on a simple
> transformer (probably something like CountVectorizerModel) so we have
> something concrete to continue the discussion on.
>
> Thanks,
> Asher Krim
> Senior Software Engineer
>
