Re: Spark Local Pipelines

2017-05-18 Thread Asher Krim
Hi Cristian,

There's a jira (https://issues.apache.org/jira/browse/SPARK-16365) where
this issue has been discussed as well. I feel very strongly about the need
for this feature. I've been implementing local versions of transformers as
needed, which has made working with Spark ml much less pleasant and safe
(due to possible train-serve skews) than it can be. Internally, the lack of
this feature has caused debates about how appropriate Spark really is for
production ML.

Asher Krim
Senior Software Engineer

On Thu, May 18, 2017 at 4:24 AM, Cristian Opris wrote:

> Reviving an old thread: there's another potential argument for exposing
> 'local' (by which I assume it's meant non-distributed) implementations of
> the algorithms: sometimes it's useful to apply the algorithm on relatively
> small groupings of data in a very large dataset. In this case Spark would
> only serve to distribute the data and apply the algorithm locally on each
> partition/grouping of data, perhaps as an UDF.
>
> This perhaps could currently be achieved with the scikit integration, but
> would be useful to consider making it possible to use the Spark
> implementation of the algorithm, where that algorithm is not an inherently
> distributed implementation. CountVectorizer is a good example, nothing in
> there inherently requires a DataFrame.
>
> In practice this would only require to expose the core implementation of
> the algorithms where possible.
>
> On 13 March 2017 at 16:28, Asher Krim  wrote:
>
>> Thanks for the feedback.
>>
>> If we strip away all of the fancy stuff, my proposal boils down to
>> exposing the logic used in Spark's ML library. In an ideal world, Spark
>> would possibly have relied on an existing ML implementation rather than
>> reimplement, since there's very little that's Spark specific about using ML
>> models. As Sean says, it may make most sense to have localPipelines live
>> outside of Spark. However it would be really beneficial for Spark ML
>> pipelines adoption if they used non-Spark logic. This would eliminate
>> issues with train-serve skew and close the potential for bugs.
>>
>> I'll leave some more comments in-line to Sean's response:
>>
>> I'm skeptical.  Serving synchronous queries from a model at scale is a
>> fundamentally different activity. As you note, it doesn't logically involve
>> Spark. If it has to happen in milliseconds it's going to be in-core.
>> Scoring even 10qps with a Spark job per request is probably a non-starter;
>> think of the thousands of tasks per second and the overhead of just
>> tracking them.
>>
>> When you say the RDDs support point prediction, I think you mean that
>> those older models expose a method to score a Vector. They are not somehow
>> exposing distributed point prediction. You could add this to the newer
>> models, but it raises the question of how to make the Row to feed it. The
>> .mllib API punts on this and assumes you can construct the Vector.
>> AK: In my mind, punting is exactly the right solution - no overhead, full
>> control to the user
>>
>> I think this sweeps a lot under the rug in assuming that there can just
>> be a "local" version of every Transformer -- but, even if there could be,
>> consider how much extra implementation that is. Lots of them probably could
>> be but I'm not sure that all can.
>> AK: I'm not aware of models for which this is not possible - there are no
>> Spark-only algorithms that I'm aware of. The work to convert Spark to Local
>> models may be more involved for some implementations, sure, but I don't
>> think any would be too bad. However if there is something that's
>> impossible, then that's fine too. I'm not sure we have to commit to having
>> local versions for every single model
>>
>> The bigger problem in my experience is the Pipelines don't generally
>> encapsulate the entire pipeline from source data to score. They encapsulate
>> the part after computing underlying features. That is, if one of your
>> features is "total clicks from this user", that's the product of a
>> DataFrame operation that precedes a Pipeline. This can't be turned into a
>> non-distributed, non-Spark local version.
>> AK: That's a great point, and a really good argument for keeping any
>> local pipeline logic outside of Spark
>>
>> Solving subsets of this problem could still be useful, and you've
>> highlighted some external projects that try. I'd also highlight PMML as an
>> established interchange format for just the model part, and for cases that
>> don't involve much or any pipeline, it's a better fit paired with a library
>> that can score from PMML.
>> AK: The problem with solutions like PMML is that they can tell you WHAT
>> to do, but not HOW EXACTLY to do it. At the end of the day, the best
>> model-description possible would be the metadata plus the code itself. That's
>> the crux of my proposal - expose the implementation so users can use Spark
>> models with the same exact code that was 

Re: Spark Local Pipelines

2017-03-13 Thread Asher Krim
Thanks for the feedback.

If we strip away all of the fancy stuff, my proposal boils down to exposing
the logic used in Spark's ML library. In an ideal world, Spark would
possibly have relied on an existing ML implementation rather than
reimplement, since there's very little that's Spark specific about using ML
models. As Sean says, it may make most sense to have localPipelines live
outside of Spark. However it would be really beneficial for Spark ML
pipelines adoption if they used non-Spark logic. This would eliminate
issues with train-serve skew and close the potential for bugs.

I'll leave some more comments in-line to Sean's response:

I'm skeptical.  Serving synchronous queries from a model at scale is a
fundamentally different activity. As you note, it doesn't logically involve
Spark. If it has to happen in milliseconds it's going to be in-core.
Scoring even 10qps with a Spark job per request is probably a non-starter;
think of the thousands of tasks per second and the overhead of just
tracking them.

When you say the RDDs support point prediction, I think you mean that those
older models expose a method to score a Vector. They are not somehow
exposing distributed point prediction. You could add this to the newer
models, but it raises the question of how to make the Row to feed it. The
.mllib API punts on this and assumes you can construct the Vector.
AK: In my mind, punting is exactly the right solution - no overhead, full
control to the user

I think this sweeps a lot under the rug in assuming that there can just be
a "local" version of every Transformer -- but, even if there could be,
consider how much extra implementation that is. Lots of them probably could
be but I'm not sure that all can.
AK: I'm not aware of models for which this is not possible - there are no
Spark-only algorithms that I'm aware of. The work to convert Spark to Local
models may be more involved for some implementations, sure, but I don't
think any would be too bad. However if there is something that's
impossible, then that's fine too. I'm not sure we have to commit to having
local versions for every single model

The bigger problem in my experience is the Pipelines don't generally
encapsulate the entire pipeline from source data to score. They encapsulate
the part after computing underlying features. That is, if one of your
features is "total clicks from this user", that's the product of a
DataFrame operation that precedes a Pipeline. This can't be turned into a
non-distributed, non-Spark local version.
AK: That's a great point, and a really good argument for keeping any local
pipeline logic outside of Spark

Solving subsets of this problem could still be useful, and you've
highlighted some external projects that try. I'd also highlight PMML as an
established interchange format for just the model part, and for cases that
don't involve much or any pipeline, it's a better fit paired with a library
that can score from PMML.
AK: The problem with solutions like PMML is that they can tell you WHAT to
do, but not HOW EXACTLY to do it. At the end of the day, the best
model-description possible would be the metadata plus the code itself. That's
the crux of my proposal - expose the implementation so users can use Spark
models with the same exact code that was used to train

I think this is one of those things that could live outside the project,
because it's more not-Spark than Spark. Remember too that building a
solution into the project blesses one at the expense of others.

Asher Krim
Senior Software Engineer

On Mon, Mar 13, 2017 at 11:08 AM, Dongjin Lee  wrote:

> Although I love the cool idea of Asher, I'd rather +1 for Sean's view; I
> think it would be much better to live outside of the project.
>
> Best,
> Dongjin
>
> On Mon, Mar 13, 2017 at 5:39 PM, Sean Owen  wrote:
>
>> I'm skeptical.  Serving synchronous queries from a model at scale is a
>> fundamentally different activity. As you note, it doesn't logically involve
>> Spark. If it has to happen in milliseconds it's going to be in-core.
>> Scoring even 10qps with a Spark job per request is probably a non-starter;
>> think of the thousands of tasks per second and the overhead of just
>> tracking them.
>>
>> When you say the RDDs support point prediction, I think you mean that
>> those older models expose a method to score a Vector. They are not somehow
>> exposing distributed point prediction. You could add this to the newer
>> models, but it raises the question of how to make the Row to feed it. The
>> .mllib API punts on this and assumes you can construct the Vector.
>>
>> I think this sweeps a lot under the rug in assuming that there can just
>> be a "local" version of every Transformer -- but, even if there could be,
>> consider how much extra implementation that is. Lots of them probably could
>> be but I'm not sure that all can.
>>
>> The bigger problem in my experience is the Pipelines don't generally
>> encapsulate the entire pipeline 

Re: Spark Local Pipelines

2017-03-13 Thread Dongjin Lee
Although I love the cool idea of Asher, I'd rather +1 for Sean's view; I
think it would be much better to live outside of the project.

Best,
Dongjin

On Mon, Mar 13, 2017 at 5:39 PM, Sean Owen  wrote:

> I'm skeptical.  Serving synchronous queries from a model at scale is a
> fundamentally different activity. As you note, it doesn't logically involve
> Spark. If it has to happen in milliseconds it's going to be in-core.
> Scoring even 10qps with a Spark job per request is probably a non-starter;
> think of the thousands of tasks per second and the overhead of just
> tracking them.
>
> When you say the RDDs support point prediction, I think you mean that
> those older models expose a method to score a Vector. They are not somehow
> exposing distributed point prediction. You could add this to the newer
> models, but it raises the question of how to make the Row to feed it. The
> .mllib API punts on this and assumes you can construct the Vector.
>
> I think this sweeps a lot under the rug in assuming that there can just be
> a "local" version of every Transformer -- but, even if there could be,
> consider how much extra implementation that is. Lots of them probably could
> be but I'm not sure that all can.
>
> The bigger problem in my experience is the Pipelines don't generally
> encapsulate the entire pipeline from source data to score. They encapsulate
> the part after computing underlying features. That is, if one of your
> features is "total clicks from this user", that's the product of a
> DataFrame operation that precedes a Pipeline. This can't be turned into a
> non-distributed, non-Spark local version.
>
> Solving subsets of this problem could still be useful, and you've
> highlighted some external projects that try. I'd also highlight PMML as an
> established interchange format for just the model part, and for cases that
> don't involve much or any pipeline, it's a better fit paired with a library
> that can score from PMML.
>
> I think this is one of those things that could live outside the project,
> because it's more not-Spark than Spark. Remember too that building a
> solution into the project blesses one at the expense of others.
>
>
> On Sun, Mar 12, 2017 at 10:15 PM Asher Krim  wrote:
>
>> Hi All,
>>
>> I spent a lot of time at Spark Summit East this year talking with Spark
>> developers and committers about challenges with productizing Spark. One of
>> the biggest shortcomings I've encountered in Spark ML pipelines is the lack
>> of a way to serve single requests with any reasonable performance.
>> SPARK-10413 explores adding methods for single item prediction, but I'd
>> like to explore a more holistic approach - a separate local api, with
>> models that support transformations without depending on Spark at all.
>>
>> I've written up a doc detailing the approach, and I'm happy to discuss
>> alternatives. If this gains traction, I can create a branch with a minimal
>> example on a simple transformer (probably something like
>> CountVectorizerModel) so we have something concrete to continue the
>> discussion on.
>>
>> Thanks,
>> Asher Krim
>> Senior Software Engineer
>>
>


-- 
*Dongjin Lee*


*Software developer in Line+. So interested in massive-scale machine learning.*
facebook: www.facebook.com/dongjin.lee.kr
linkedin: kr.linkedin.com/in/dongjinleekr
github: github.com/dongjinleekr
twitter: www.twitter.com/dongjinleekr


Re: Spark Local Pipelines

2017-03-13 Thread Sean Owen
I'm skeptical.  Serving synchronous queries from a model at scale is a
fundamentally different activity. As you note, it doesn't logically involve
Spark. If it has to happen in milliseconds it's going to be in-core.
Scoring even 10qps with a Spark job per request is probably a non-starter;
think of the thousands of tasks per second and the overhead of just
tracking them.

When you say the RDDs support point prediction, I think you mean that those
older models expose a method to score a Vector. They are not somehow
exposing distributed point prediction. You could add this to the newer
models, but it raises the question of how to make the Row to feed it. The
.mllib API punts on this and assumes you can construct the Vector.
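
To make the idea of point prediction concrete: scoring a single feature
vector locally needs nothing beyond the fitted parameters. The following is
only a minimal sketch in plain Python, assuming a binary logistic regression.
The class LocalLogisticRegressionModel, its fields, and the example weights
are hypothetical and are not part of Spark; as in the paragraph above, the
caller is assumed to construct the vector itself.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LocalLogisticRegressionModel:
    """Hypothetical local model: holds fitted parameters only, no Spark dependency."""

    weights: Sequence[float]
    intercept: float
    threshold: float = 0.5

    def predict_probability(self, features: Sequence[float]) -> float:
        # The caller is responsible for building the feature vector.
        if len(features) != len(self.weights):
            raise ValueError("feature vector length does not match the model")
        margin = self.intercept + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-margin))

    def predict(self, features: Sequence[float]) -> float:
        # Returns the class label as a float (1.0 or 0.0).
        return 1.0 if self.predict_probability(features) > self.threshold else 0.0


# Made-up parameters, for illustration only.
model = LocalLogisticRegressionModel(weights=[0.5, -0.25], intercept=0.0)
probability = model.predict_probability([2.0, 4.0])  # margin is 0.5*2 - 0.25*4 = 0
```

Because the model is a plain object, a serving process could call predict
once per request without scheduling any Spark job.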

I think this sweeps a lot under the rug in assuming that there can just be
a "local" version of every Transformer -- but, even if there could be,
consider how much extra implementation that is. Lots of them probably could
be but I'm not sure that all can.

The bigger problem in my experience is the Pipelines don't generally
encapsulate the entire pipeline from source data to score. They encapsulate
the part after computing underlying features. That is, if one of your
features is "total clicks from this user", that's the product of a
DataFrame operation that precedes a Pipeline. This can't be turned into a
non-distributed, non-Spark local version.

Solving subsets of this problem could still be useful, and you've
highlighted some external projects that try. I'd also highlight PMML as an
established interchange format for just the model part, and for cases that
don't involve much or any pipeline, it's a better fit paired with a library
that can score from PMML.

I think this is one of those things that could live outside the project,
because it's more not-Spark than Spark. Remember too that building a
solution into the project blesses one at the expense of others.


On Sun, Mar 12, 2017 at 10:15 PM Asher Krim  wrote:

> Hi All,
>
> I spent a lot of time at Spark Summit East this year talking with Spark
> developers and committers about challenges with productizing Spark. One of
> the biggest shortcomings I've encountered in Spark ML pipelines is the lack
> of a way to serve single requests with any reasonable performance.
> SPARK-10413 explores adding methods for single item prediction, but I'd
> like to explore a more holistic approach - a separate local api, with
> models that support transformations without depending on Spark at all.
>
> I've written up a doc detailing the approach, and I'm happy to discuss
> alternatives. If this gains traction, I can create a branch with a minimal
> example on a simple transformer (probably something like
> CountVectorizerModel) so we have something concrete to continue the
> discussion on.
>
> Thanks,
> Asher Krim
> Senior Software Engineer
>


Re: Spark Local Pipelines

2017-03-13 Thread Georg Heiler
Great idea. I see the same problem.
I would suggest checking the following projects as a kick start as well
(not only mleap):
https://github.com/ucbrise/clipper and
https://github.com/Hydrospheredata/mist

Regards Georg
On Sun, Mar 12, 2017 at 23:21, Asher Krim wrote:

> Hi All,
>
> I spent a lot of time at Spark Summit East this year talking with Spark
> developers and committers about challenges with productizing Spark. One of
> the biggest shortcomings I've encountered in Spark ML pipelines is the lack
> of a way to serve single requests with any reasonable performance.
> SPARK-10413 explores adding methods for single item prediction, but I'd
> like to explore a more holistic approach - a separate local api, with
> models that support transformations without depending on Spark at all.
>
> I've written up a doc detailing the approach, and I'm happy to discuss
> alternatives. If this gains traction, I can create a branch with a minimal
> example on a simple transformer (probably something like
> CountVectorizerModel) so we have something concrete to continue the
> discussion on.
>
> Thanks,
> Asher Krim
> Senior Software Engineer
>


Spark Local Pipelines

2017-03-12 Thread Asher Krim
Hi All,

I spent a lot of time at Spark Summit East this year talking with Spark
developers and committers about challenges with productizing Spark. One of
the biggest shortcomings I've encountered in Spark ML pipelines is the lack
of a way to serve single requests with any reasonable performance.
SPARK-10413 explores adding methods for single item prediction, but I'd
like to explore a more holistic approach - a separate local api, with
models that support transformations without depending on Spark at all.

I've written up a doc detailing the approach, and I'm happy to discuss
alternatives. If this gains traction, I can create a branch with a minimal
example on a simple transformer (probably something like
CountVectorizerModel) so we have something concrete to continue the
discussion on.

Thanks,
Asher Krim
Senior Software Engineer
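
As an illustration of what a Spark-free transformer along the lines proposed
above might look like, here is a minimal sketch in plain Python of a fitted
count vectorizer applied to one tokenized document. The class name
LocalCountVectorizerModel, its methods, and the sparse-dict representation
are assumptions made for this example, not an existing Spark API, and the
sketch leaves out options such as a minimum term frequency.

```python
from collections import Counter
from typing import Dict, List, Sequence


class LocalCountVectorizerModel:
    """Hypothetical Spark-free counterpart of a fitted count vectorizer."""

    def __init__(self, vocabulary: Sequence[str]) -> None:
        self.vocabulary: List[str] = list(vocabulary)
        # Map each term to its position in the output vector.
        self._index: Dict[str, int] = {
            term: i for i, term in enumerate(self.vocabulary)
        }

    def transform(self, tokens: Sequence[str]) -> Dict[int, float]:
        # Sparse result as {vocabulary index: count}.
        # Tokens outside the vocabulary are ignored.
        counts = Counter(self._index[t] for t in tokens if t in self._index)
        return {i: float(c) for i, c in sorted(counts.items())}

    def transform_dense(self, tokens: Sequence[str]) -> List[float]:
        # Same counts, expanded to one slot per vocabulary term.
        sparse = self.transform(tokens)
        return [sparse.get(i, 0.0) for i in range(len(self.vocabulary))]


# Made-up vocabulary and document, for illustration only.
model = LocalCountVectorizerModel(["spark", "local", "pipeline"])
vector = model.transform(["local", "spark", "local", "unknown"])
dense = model.transform_dense(["local", "spark", "local", "unknown"])
```

The only state is the vocabulary, so a model trained with Spark could in
principle be exported as a list of terms and served by code like this.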