[jira] [Commented] (SPARK-16431) Add a unified method that accepts single instances to feature transformers and predictors

Joseph K. Bradley (JIRA) Fri, 22 Jul 2016 11:08:48 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-16431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15389954#comment-15389954
 ]


Joseph K. Bradley commented on SPARK-16431:
-------------------------------------------

Single-row prediction is something we are working towards, but supporting it 
for general Pipelines will require much more design and coordination with Spark 
SQL.  The current plan is to:
* make methods like predict() public (short term solution) [SPARK-10413]
* support general Pipelines...to be designed

The current functionality in your PR is most similar to [SPARK-10413].  If you 
would be interested in [SPARK-10413], it would be great to get your help there; 
there are more subtasks to create and solve.

For the long term, we will need to spend some time thinking about the right way 
to support general Pipelines, with multiple columns being passed in and out of 
Transformers.

I'll close this issue, but it would be great to get your help elsewhere!  
Please also check out 
[https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark]; it 
is important to get familiar with the process of contributing with small 
patches before jumping into major changes.  Thanks!

> Add a unified method that accepts single instances to feature transformers 
> and predictors
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-16431
>                 URL: https://issues.apache.org/jira/browse/SPARK-16431
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Hussein Hazimeh
>            Priority: Minor
>
> Current transformers in spark.ml can only operate on DataFrames and don't 
> have a method that accepts single instances. A typical transformer has a 
> User-Defined Function (udf) in its *transform* method which includes a set of 
> operations on the features of a single instance:
> {code}val column_operation = udf {operations on single instance}{code}
> Adding a new method that operates directly on single instances (e.g. called 
> *transformInstance*) and using it in the udf instead can be useful:
> {code}def transformInstance(features: featureType): OutputType = {operations 
> on single instance}
> val column_operation = udf {transformInstance}{code}
> Predictors also don’t have a public method that does predictions on single 
> instances. *transformInstance* can be easily added to predictors by acting as 
> a wrapper for the internal method predict (which takes features as input).
> This simple change has (at least) three benefits.
> # Providing a low-latency transformation/prediction method to support machine 
> learning applications that require real-time predictions. The current 
> *transform* method has a relatively high latency when transforming single 
> instances or small batches due to the overhead introduced by DataFrame 
> operations. I measured the latency required to classify a single instance in 
> the 20 Newsgroups dataset using the current *transform* method and the 
> proposed *transformInstance*.  The ML pipeline contains a tokenizer, stopword 
> remover, TF hasher, IDF, scaler, and Logistic Regression. The table below 
> shows the latency percentiles in milliseconds after measuring the time to 
> classify 700 documents. 
> ||Transformation Method||P50||P90||P99||Max||
> |*transform*|31.44|39.43|67.75|126.97|
> |*transformInstance*|0.16|0.38|1.16|3.2|
> *transformInstance* is 200 times faster on average and can classify a 
> document in less than a millisecond.  By profiling the code of *transform*, 
> it turns out that every transformer in the pipeline wastes 5 milliseconds on 
> average in DataFrame-related operations when transforming a single instance. 
> This implies that the latency increases linearly with the pipeline size which 
> can be problematic. 
> # Increasing code readability and allowing easier debugging as operations on 
> rows are now combined into a function that can be tested independently of the 
> higher-level *transform* method.
> # Adding flexibility to create new models: for example, check this 
> [comment|https://github.com/apache/spark/pull/8883#issuecomment-215559305] on 
> supporting new ensemble methods.
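
For reference, the pattern proposed in the quoted description could be sketched 
roughly as below. This is plain Scala with no Spark dependency: 
InstanceTransformer, SimpleTokenizer, and StopWordFilter are hypothetical names 
made up for illustration, not spark.ml classes, and a Seq stands in for a 
DataFrame column. It only shows the shape of the idea (per-instance logic as a 
public method, with the batch path reusing it), not how spark.ml would actually 
implement it.

```scala
// Hypothetical sketch: none of these types exist in spark.ml.
// A plain Seq stands in for a DataFrame column so this runs without Spark.

/** A stage that exposes its per-row logic as a public method. */
trait InstanceTransformer[In, Out] {
  /** Operates on one instance; the logic that would otherwise live inside a udf. */
  def transformInstance(in: In): Out

  /** Batch path: applies the same per-instance function to every row. */
  def transform(rows: Seq[In]): Seq[Out] = rows.map(transformInstance)
}

/** Lowercases the text and splits it on whitespace. */
object SimpleTokenizer extends InstanceTransformer[String, Seq[String]] {
  def transformInstance(text: String): Seq[String] =
    text.toLowerCase.split("\\s+").filter(_.nonEmpty).toList
}

/** Drops every token that appears in the given stop-word set. */
final class StopWordFilter(stopWords: Set[String])
    extends InstanceTransformer[Seq[String], Seq[String]] {
  def transformInstance(tokens: Seq[String]): Seq[String] =
    tokens.filterNot(stopWords.contains)
}

object SingleInstanceDemo {
  def main(args: Array[String]): Unit = {
    val stopWords = new StopWordFilter(Set("the", "a"))

    // Chain the per-instance methods directly, with no batch structure involved.
    val process: String => Seq[String] =
      text => stopWords.transformInstance(SimpleTokenizer.transformInstance(text))

    println(process("The quick brown fox")) // prints List(quick, brown, fox)
  }
}
```

Because transform is defined in terms of transformInstance, the two paths cannot 
drift apart, and each stage can be unit-tested on a single value.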



