Although I love Asher's idea, I'd rather +1 Sean's view; I think it would be much better for this to live outside of the project.

Best,
Dongjin

On Mon, Mar 13, 2017 at 5:39 PM, Sean Owen <so...@cloudera.com> wrote:

> I'm skeptical. Serving synchronous queries from a model at scale is a
> fundamentally different activity. As you note, it doesn't logically involve
> Spark. If it has to happen in milliseconds it's going to be in-core.
> Scoring even 10qps with a Spark job per request is probably a non-starter;
> think of the thousands of tasks per second and the overhead of just
> tracking them.
>
> When you say the RDDs support point prediction, I think you mean that
> those older models expose a method to score a Vector. They are not somehow
> exposing distributed point prediction. You could add this to the newer
> models, but it raises the question of how to make the Row to feed it? the
> .mllib punts on this and assumes you can construct the Vector.
>
> I think this sweeps a lot under the rug in assuming that there can just be
> a "local" version of every Transformer -- but, even if there could be,
> consider how much extra implementation that is. Lots of them probably could
> be but I'm not sure that all can.
>
> The bigger problem in my experience is the Pipelines don't generally
> encapsulate the entire pipeline from source data to score. They encapsulate
> the part after computing underlying features. That is, if one of your
> features is "total clicks from this user", that's the product of a
> DataFrame operation that precedes a Pipeline. This can't be turned into a
> non-distributed, non-Spark local version.
>
> Solving subsets of this problem could still be useful, and you've
> highlighted some external projects that try. I'd also highlight PMML as an
> established interchange format for just the model part, and for cases that
> don't involve much or any pipeline, it's a better fit paired with a library
> that can score from PMML.
>
> I think this is one of those things that could live outside the project,
> because it's more not-Spark than Spark. Remember too that building a
> solution into the project blesses one at the expense of others.
>
>
> On Sun, Mar 12, 2017 at 10:15 PM Asher Krim <ak...@hubspot.com> wrote:
>
>> Hi All,
>>
>> I spent a lot of time at Spark Summit East this year talking with Spark
>> developers and committers about challenges with productizing Spark. One of
>> the biggest shortcomings I've encountered in Spark ML pipelines is the lack
>> of a way to serve single requests with any reasonable performance.
>> SPARK-10413 explores adding methods for single item prediction, but I'd
>> like to explore a more holistic approach - a separate local api, with
>> models that support transformations without depending on Spark at all.
>>
>> I've written up a doc
>> <https://docs.google.com/document/d/1Ha4DRMio5A7LjPqiHUnwVzbaxbev6ys04myyz6nDgI4/edit?usp=sharing>
>> detailing the approach, and I'm happy to discuss alternatives. If this
>> gains traction, I can create a branch with a minimal example on a simple
>> transformer (probably something like CountVectorizerModel) so we have
>> something concrete to continue the discussion on.
>>
>> Thanks,
>> Asher Krim
>> Senior Software Engineer
>>
>

--
*Dongjin Lee*

Software developer in Line+.
So interested in massive-scale machine learning.

facebook: www.facebook.com/dongjin.lee.kr
linkedin: kr.linkedin.com/in/dongjinleekr
github: github.com/dongjinleekr
twitter: www.twitter.com/dongjinleekr