I think we can, since what she's building is also a set of Spark
transformations.
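
As a rough illustration of why that works, here is a minimal sketch in plain Python (not Spark code) of the fit()/transform() contract described in the quoted guide below. The class names Transformer, Estimator, Pipeline and PipelineModel mirror the concepts in the guide, but the implementations, the list-of-dicts stand-in for a DataFrame, and the KeywordEstimator/KeywordModel learner are all hypothetical and exist only for illustration.

```python
from typing import Dict, List

Row = Dict[str, object]  # hypothetical stand-in for one DataFrame row


class Transformer:
    def transform(self, rows: List[Row]) -> List[Row]:
        raise NotImplementedError


class Estimator:
    def fit(self, rows: List[Row]) -> Transformer:
        raise NotImplementedError


class Tokenizer(Transformer):
    """Adds a 'words' column by lower-casing and splitting the 'text' column."""

    def transform(self, rows):
        return [{**r, "words": str(r["text"]).lower().split()} for r in rows]


class KeywordModel(Transformer):
    """Toy fitted model: predicts 1.0 if any learned keyword is present."""

    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [
            {**r, "prediction": 1.0 if self.keywords & set(r["words"]) else 0.0}
            for r in rows
        ]


class KeywordEstimator(Estimator):
    """Toy learner: keeps the words that occur only in positively labelled rows."""

    def fit(self, rows):
        pos, neg = set(), set()
        for r in rows:
            (pos if r["label"] == 1.0 else neg).update(r["words"])
        return KeywordModel(pos - neg)


class PipelineModel(Transformer):
    """Fitted pipeline: every stage is a Transformer, applied in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    """Runs stages in order; Estimators are fitted and replaced by their models."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(rows)  # Estimator -> Transformer
            rows = stage.transform(rows)  # later stages see the new columns
            fitted.append(stage)
        return PipelineModel(fitted)


train = [
    {"text": "spark pipelines are great", "label": 1.0},
    {"text": "nothing to see here", "label": 0.0},
]
model = Pipeline([Tokenizer(), KeywordEstimator()]).fit(train)
out = model.transform([{"text": "Spark is fast"}])
print(out[0]["prediction"])
```

The point of the sketch is that Pipeline only relies on the fit()/transform() interface, so any stage that honours it can be plugged in without the caller knowing what the stage does to the data internally.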

On Wed, Sep 30, 2015 at 11:41 AM, Srinath Perera <srin...@wso2.com> wrote:

> Very nice!
>
> Can we build the text analytics pipeline Nethaji is working on using this,
> so it can serve as the PoC?
>
> Thanks
> Srinath
>
> On Wed, Sep 30, 2015 at 10:27 AM, Nirmal Fernando <nir...@wso2.com> wrote:
>
>> spark.apache.org/docs/latest/ml-guide.html#pipeline [1]
>>
>> This would let us plug in any Spark transformation easily and dynamically,
>> without needing to know the RDD types.
>>
>> The same could be used to build ensemble support.
>>
>> [1]
>> Pipeline
>>
>> In machine learning, it is common to run a sequence of algorithms to
>> process and learn from data. E.g., a simple text document processing
>> workflow might include several stages:
>>
>>    - Split each document’s text into words.
>>    - Convert each document’s words into a numerical feature vector.
>>    - Learn a prediction model using the feature vectors and labels.
>>
>> Spark ML represents such a workflow as a Pipeline, which consists of a
>> sequence of PipelineStages (Transformers and Estimators) to be run in a
>> specific order. We will use this simple workflow as a running example in
>> this section.
>>
>> How it works <http://spark.apache.org/docs/latest/ml-guide.html#how-it-works>
>>
>> A Pipeline is specified as a sequence of stages, and each stage is
>> either a Transformer or an Estimator. These stages are run in order, and
>> the input DataFrame is transformed as it passes through each stage. For
>> Transformer stages, the transform() method is called on the DataFrame.
>> For Estimator stages, the fit() method is called to produce a Transformer
>> (which becomes part of the PipelineModel, or fitted Pipeline), and that
>> Transformer’s transform() method is called on the DataFrame.
>>
>> We illustrate this for the simple text document workflow. The figure
>> below is for the *training time* usage of a Pipeline.
>>
>> [image: Spark ML Pipeline Example]
>>
>> Above, the top row represents a Pipeline with three stages. The first
>> two (Tokenizer and HashingTF) are Transformers (blue), and the third
>> (LogisticRegression) is an Estimator (red). The bottom row represents
>> data flowing through the pipeline, where cylinders indicate DataFrames.
>> The Pipeline.fit() method is called on the original DataFrame, which has
>> raw text documents and labels. The Tokenizer.transform() method splits
>> the raw text documents into words, adding a new column with words to the
>> DataFrame. The HashingTF.transform() method converts the words column
>> into feature vectors, adding a new column with those vectors to the
>> DataFrame. Now, since LogisticRegression is an Estimator, the Pipeline
>> first calls LogisticRegression.fit() to produce a LogisticRegressionModel.
>> If the Pipeline had more stages, it would call the
>> LogisticRegressionModel’s transform() method on the DataFrame before
>> passing the DataFrame to the next stage.
>>
>> A Pipeline is an Estimator. Thus, after a Pipeline’s fit() method runs,
>> it produces a PipelineModel, which is a Transformer. This PipelineModel
>> is used at *test time*; the figure below illustrates this usage.
>>
>> [image: Spark ML PipelineModel Example]
>>
>> In the figure above, the PipelineModel has the same number of stages as
>> the original Pipeline, but all Estimators in the original Pipeline have
>> become Transformers. When the PipelineModel’s transform() method is
>> called on a test dataset, the data are passed through the fitted pipeline
>> in order. Each stage’s transform() method updates the dataset and passes
>> it to the next stage.
>>
>> Pipelines and PipelineModels help to ensure that training and test data
>> go through identical feature processing steps.
>>
>> --
>>
>> Thanks & regards,
>> Nirmal
>>
>> Team Lead - WSO2 Machine Learner
>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>> Mobile: +94715779733
>> Blog: http://nirmalfdo.blogspot.com/
>>
>>
>>
>
>
> --
> ============================
> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
> Site: http://people.apache.org/~hemapani/
> Photos: http://www.flickr.com/photos/hemapani/
> Phone: 0772360902
>



-- 

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
_______________________________________________
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
