GitHub user husseinhazimeh opened a pull request: https://github.com/apache/spark/pull/14101
[SPARK-16431] [ML] Add a unified method that accepts single instances to feature transformers and predictors

## What changes were proposed in this pull request?

Current feature transformers in spark.ml can only operate on DataFrames and don't have a method that accepts single instances. A typical transformer defines a user-defined function (udf) in its `transform` method that applies a set of operations to the features of a single instance:

```
val column_operation = udf {operations on single instance}
```

Adding a new method called `transformInstance` that operates directly on single instances, and using it in the udf instead, can be useful:

```
def transformInstance(features: featuresType): OutputType = {operations on single instance}
val column_operation = udf {transformInstance}
```

Predictors also don't have a public method that makes predictions on single instances. `transformInstance` can easily be added to predictors as a wrapper around the internal `predict` method (which takes features as input).

Note: The proposed method is added to all predictors and feature transformers except OneHotEncoder, VectorSlicer, and Word2Vec, which might require bigger changes because they depend on the dataset's schema (they can be fixed using simple hacks, but this needs to be discussed).

## Benefits

1. Providing a low-latency transformation/prediction method to support machine learning applications that require real-time predictions. The current `transform` method has a relatively high latency when transforming single instances or small batches due to the overhead introduced by DataFrame operations. I measured the latency of classifying a single instance from the 20 Newsgroups dataset using the current `transform` method and the proposed `transformInstance`. The ML pipeline contains a tokenizer, stop-word remover, TF hasher, IDF, scaler, and logistic regression.
   The table below shows the latency percentiles in milliseconds, measured over the classification of 700 documents.

   Transformation Method | P50 | P90 | P99 | Max
   --------------------- | --- | --- | --- | ---
   transform | 31.44 | 39.43 | 67.75 | 126.97
   transformInstance | 0.16 | 0.38 | 1.16 | 3.2

   `transformInstance` is 200 times faster on average and can classify a document in less than a millisecond. Profiling the code of `transform` shows that every transformer in the pipeline spends 5 milliseconds on average in DataFrame-related operations when transforming a single instance. This implies that latency increases linearly with the pipeline size, which can be problematic.

2. Increasing code readability and allowing easier debugging, since operations on rows are now combined into a function that can be tested independently of the higher-level `transform` method.

3. Adding flexibility to create new models: for example, see this [comment](https://github.com/apache/spark/pull/8883#issuecomment-215559305) on supporting new ensemble methods.

## How was this patch tested?

The current tests for transformers and predictors, which invoke `transformInstance` internally, passed.
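
For readers who want to see the shape of the `transformInstance` idea without a Spark dependency, here is a minimal sketch in plain Scala. It is an illustration only, not the code in this patch: `InstanceTransformer`, `SimpleTokenizer`, `StopWordFilter`, and `SketchPipeline` are hypothetical names, and the batch `transform` over a `Seq` merely stands in for the udf-based DataFrame path.

```scala
// Hypothetical, Spark-free sketch of the pattern; none of these names are Spark APIs.
trait InstanceTransformer[I, O] {
  // Per-instance logic, callable directly when low latency matters.
  def transformInstance(instance: I): O

  // Batch path reuses the same function, much as the udf would in `transform`.
  def transform(batch: Seq[I]): Seq[O] = batch.map(transformInstance)
}

// Assumed toy stage: lower-cases the text and splits it on whitespace.
object SimpleTokenizer extends InstanceTransformer[String, Seq[String]] {
  def transformInstance(text: String): Seq[String] =
    text.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq
}

// Assumed toy stage: drops tokens found in a caller-supplied stop-word set.
final class StopWordFilter(stopWords: Set[String])
    extends InstanceTransformer[Seq[String], Seq[String]] {
  def transformInstance(tokens: Seq[String]): Seq[String] =
    tokens.filterNot(stopWords.contains)
}

object SketchPipeline {
  private val remover = new StopWordFilter(Set("the", "a"))

  // Chains per-instance calls directly, with no batch structure in between.
  def process(text: String): Seq[String] =
    remover.transformInstance(SimpleTokenizer.transformInstance(text))
}
```

Because each stage exposes its per-instance function, a stage can be unit-tested on a single input, and a pipeline can be applied to one document by composing the calls, which is the readability and latency argument made in the Benefits section.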
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/husseinhazimeh/spark lowlatency

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14101.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14101

----
commit e8b3de1e599225fa71fecc17aaa34998863fb38b
Author: Hussein Hazimeh <hazi...@mit.edu>
Date:   2016-07-07T20:50:22Z

    Add transformInstance method to predictors and transformers

commit ca213e338bde7da2e308b2ffd9c3fa1b5d26122e
Author: Hussein Hazimeh <h...@ieee.org>
Date:   2016-07-07T21:03:46Z

    Update LogisticRegression.scala

commit 1fe5b18a0519d324ed53108ddd809a421a811f50
Author: Hussein Hazimeh <h...@ieee.org>
Date:   2016-07-07T21:21:45Z

    Update HashingTF.scala

----