spark.apache.org/docs/latest/ml-guide.html#pipeline [1]

This would let us plug in any Spark transformation easily and dynamically,
without needing to know the RDD types.
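
A minimal sketch of that idea in Scala (buildPipeline is a hypothetical
helper; only the public spark.ml API is assumed):

    import org.apache.spark.ml.{Pipeline, PipelineStage}

    // Hypothetical helper: any mix of Transformers and Estimators can be
    // plugged in at runtime; the caller never touches RDD element types.
    def buildPipeline(stages: Array[PipelineStage]): Pipeline =
      new Pipeline().setStages(stages)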

The same approach could be used to build ensemble support.

[1] Pipeline

In machine learning, it is common to run a sequence of algorithms to
process and learn from data. E.g., a simple text document processing
workflow might include several stages:

   - Split each document’s text into words.
   - Convert each document’s words into a numerical feature vector.
   - Learn a prediction model using the feature vectors and labels.

Spark ML represents such a workflow as a Pipeline, which consists of a
sequence of PipelineStages (Transformers and Estimators) to be run in a
specific order. We will use this simple workflow as a running example in
this section.
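
For reference, the running example can be wired up roughly like this (a
Scala sketch against the spark.ml API; column names and parameters are
illustrative):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Stage 1: split each document's text into words
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")

    // Stage 2: convert the words into a numerical feature vector
    val hashingTF = new HashingTF()
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")

    // Stage 3: learn a prediction model from features and labels
    val lr = new LogisticRegression().setMaxIter(10)

    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))
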
How it works <http://spark.apache.org/docs/latest/ml-guide.html#how-it-works>

A Pipeline is specified as a sequence of stages, and each stage is either a
Transformer or an Estimator. These stages are run in order, and the input
DataFrame is transformed as it passes through each stage. For Transformer
stages, the transform() method is called on the DataFrame. For Estimator
stages, the fit() method is called to produce a Transformer (which becomes
part of the PipelineModel, or fitted Pipeline), and that Transformer’s
transform() method is called on the DataFrame.
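
Concretely, the two stage contracts look like this (continuing the sketch
above; df and featurized stand in for DataFrames that already hold the
expected columns):

    // Transformer stage: DataFrame in, DataFrame out
    val withWords = tokenizer.transform(df)

    // Estimator stage: fit() returns a Transformer (a Model), and it is
    // that Model's transform() that is applied to the DataFrame
    val lrModel = lr.fit(featurized)
    val predictions = lrModel.transform(featurized)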

We illustrate this for the simple text document workflow. The figure below
is for the *training time* usage of a Pipeline.

[image: Spark ML Pipeline Example]

Above, the top row represents a Pipeline with three stages. The first two (
Tokenizer and HashingTF) are Transformers (blue), and the third (
LogisticRegression) is an Estimator (red). The bottom row represents data
flowing through the pipeline, where cylinders indicate DataFrames. The
Pipeline.fit() method is called on the original DataFrame, which has raw
text documents and labels. The Tokenizer.transform() method splits the raw
text documents into words, adding a new column with words to the DataFrame.
The HashingTF.transform() method converts the words column into feature
vectors, adding a new column with those vectors to the DataFrame. Now,
since LogisticRegression is an Estimator, the Pipeline first calls
LogisticRegression.fit() to produce a LogisticRegressionModel. If the
Pipeline had more stages, it would call the LogisticRegressionModel’s
transform() method on the DataFrame before passing the DataFrame to the
next stage.
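
Continuing the sketch, training is then a single call (assumes a
SparkSession named spark; the toy rows are made up for illustration):

    val training = spark.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    // Runs tokenizer.transform, hashingTF.transform, then lr.fit
    val model = pipeline.fit(training) // model is a PipelineModel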

A Pipeline is an Estimator. Thus, after a Pipeline’s fit() method runs, it
produces a PipelineModel, which is a Transformer. This PipelineModel is
used at *test time*; the figure below illustrates this usage.

[image: Spark ML PipelineModel Example]

In the figure above, the PipelineModel has the same number of stages as the
original Pipeline, but all Estimators in the original Pipeline have become
Transformers. When the PipelineModel’s transform() method is called on a
test dataset, the data are passed through the fitted pipeline in order.
Each stage’s transform() method updates the dataset and passes it to the
next stage.

Pipelines and PipelineModels help to ensure that training and test data go
through identical feature processing steps.
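
At test time, the fitted PipelineModel is used like any other Transformer
(continuing the sketch; the unlabeled test rows are again made up):

    val test = spark.createDataFrame(Seq(
      (4L, "spark i j k"),
      (5L, "l m n"),
      (6L, "mapreduce spark")
    )).toDF("id", "text")

    // The same tokenizer and hashingTF run again; the fitted
    // LogisticRegressionModel then appends the prediction columns
    model.transform(test)
      .select("id", "text", "probability", "prediction")
      .show()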

-- 

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/