GitHub user husseinhazimeh opened a pull request:

    https://github.com/apache/spark/pull/14101

    [SPARK-16431] [ML] Add a unified method that accepts single instances to 
feature transformers and predictors

    ## What changes were proposed in this pull request?
    Current feature transformers in spark.ml can only operate on DataFrames and 
don't have a method that accepts single instances. A typical transformer has a 
User-Defined Function (udf) in its `transform` method which includes a set of 
operations on the features of a single instance:
    
    ```
    val column_operation = udf {operations on single instance}
    ```
    
    This pull request adds a method called `transformInstance` that operates 
directly on single instances, and the udf now delegates to it:
    
    ```
    def transformInstance(features: featuresType): OutputType = {operations on single instance}
    
    val column_operation = udf {transformInstance}
    ```
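
    As a more concrete illustration of the pattern above, here is a minimal 
sketch. The `LowerCaser` class and its constructor parameters are hypothetical 
and are not part of this patch; only `udf`, `col`, and `withColumn` are 
existing Spark SQL calls. The class is marked `Serializable` because the udf 
closure captures the instance.

    ```scala
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, udf}

    // Hypothetical transformer used for illustration only.
    class LowerCaser(inputCol: String, outputCol: String) extends Serializable {

      // Per-instance logic: callable directly, with no DataFrame involved.
      def transformInstance(text: String): String = text.toLowerCase

      // DataFrame path: wraps the same per-instance logic in a udf.
      def transform(df: DataFrame): DataFrame = {
        val columnOperation = udf(transformInstance _)
        df.withColumn(outputCol, columnOperation(col(inputCol)))
      }
    }
    ```

    With this shape, a caller that holds a single value can invoke 
`transformInstance` directly, while batch callers keep using `transform`.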
    
    Predictors also lack a public method for predicting on single instances. 
`transformInstance` is added to predictors as a thin wrapper around the 
internal `predict` method (which takes features as input).
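
    The wrapper idea can be sketched in plain Scala as follows. 
`ToyLinearModel` is a hypothetical stand-in for a predictor, and the use of 
`Array[Double]` for features is an assumption made to keep the example free of 
Spark dependencies; the real predictors in the patch may differ.

    ```scala
    // Hypothetical stand-in for a predictor; not a class from this patch.
    class ToyLinearModel(weights: Array[Double], intercept: Double) {

      // Mirrors the internal, non-public predict method described above.
      protected def predict(features: Array[Double]): Double = {
        require(features.length == weights.length, "dimension mismatch")
        weights.zip(features).map { case (w, x) => w * x }.sum + intercept
      }

      // Public single-instance entry point that only delegates to predict.
      def transformInstance(features: Array[Double]): Double = predict(features)
    }

    val model = new ToyLinearModel(Array(2.0, -1.0), 0.5)
    val prediction = model.transformInstance(Array(1.0, 3.0))
    ```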
    
    Note: The proposed method is added to all predictors and feature 
transformers except OneHotEncoder, VectorSlicer, and Word2Vec, which might 
require bigger changes because they depend on the dataset's schema. (They can 
be fixed using simple hacks, but this needs to be discussed.)
    
    ## Benefits
    
    1. Providing a low-latency transformation/prediction method to support 
machine learning applications that require real-time predictions. The current 
`transform` method has a relatively high latency when transforming single 
instances or small batches because of the overhead introduced by DataFrame 
operations. I measured the latency of classifying a single instance from the 
20 Newsgroups dataset using the current `transform` method and the proposed 
`transformInstance`. The ML pipeline contains a tokenizer, stop-word remover, 
TF hasher, IDF, scaler, and logistic regression. The table below shows the 
latency percentiles, in milliseconds, measured over the classification of 700 
documents.
    
     Transformation Method | P50 | P90 | P99 | Max
     --------------------- | --- | --- | --- | ---
     transform | 31.44 | 39.43 | 67.75 | 126.97
     transformInstance | 0.16 | 0.38 | 1.16 | 3.2
    
     `transformInstance` is roughly 200 times faster at the median (0.16 ms 
versus 31.44 ms) and classifies a document in less than a millisecond at both 
P50 and P90. Profiling `transform` shows that every transformer in the 
pipeline spends about 5 milliseconds on average in DataFrame-related 
operations when transforming a single instance, which for the six stages above 
accounts for roughly 30 ms of the median latency. This implies that latency 
grows linearly with the number of pipeline stages, which is a problem for 
real-time serving.
     
    2. Increasing code readability and allowing easier debugging as operations 
on rows are now combined into a function that can be tested independently of 
the higher-level `transform` method.
    
    3. Adding flexibility to create new models: for example, check this 
[comment](https://github.com/apache/spark/pull/8883#issuecomment-215559305) on 
supporting new ensemble methods.
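
    For the latency figures in benefit 1, percentiles like those in the table 
can be computed from per-call timings along the following lines. This is only 
a sketch: the helper names `latenciesMs` and `percentile` are hypothetical, 
and the nearest-rank definition of a percentile is an assumption, since the 
exact measurement procedure is not described here.

    ```scala
    // Times each call to f and returns the latencies in milliseconds, sorted.
    def latenciesMs[A](inputs: Seq[A])(f: A => Any): Array[Double] =
      inputs.map { x =>
        val start = System.nanoTime()
        f(x)
        (System.nanoTime() - start) / 1e6
      }.toArray.sorted

    // Nearest-rank percentile over samples that are already sorted ascending.
    def percentile(sorted: Array[Double], p: Double): Double = {
      require(sorted.nonEmpty && p > 0 && p <= 100, "need samples and 0 < p <= 100")
      val rank = math.ceil(p / 100.0 * sorted.length).toInt
      sorted(rank - 1)
    }
    ```

    Passing the per-document call (either `transform` on a one-row DataFrame 
or `transformInstance` on the raw value) as `f` yields comparable samples for 
both methods.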
    
    ## How was this patch tested?
    The current tests for transformers and predictors, which invoke 
`transformInstance` internally, passed. 
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/husseinhazimeh/spark lowlatency

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14101.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14101
    
----
commit e8b3de1e599225fa71fecc17aaa34998863fb38b
Author: Hussein Hazimeh <hazi...@mit.edu>
Date:   2016-07-07T20:50:22Z

    Add transformInstance method to predictors and transformers

commit ca213e338bde7da2e308b2ffd9c3fa1b5d26122e
Author: Hussein Hazimeh <h...@ieee.org>
Date:   2016-07-07T21:03:46Z

    Update LogisticRegression.scala

commit 1fe5b18a0519d324ed53108ddd809a421a811f50
Author: Hussein Hazimeh <h...@ieee.org>
Date:   2016-07-07T21:21:45Z

    Update HashingTF.scala

----

