I think we can, since what she's building is also a set of Spark transformations.
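
To make the idea concrete, here is a rough sketch of the Pipeline pattern described in the quoted guide below, written in plain Python so it runs without a Spark installation. None of this is the Spark API: the class names (ToyTokenizer, ToyHashingTF, ToyCentroidClassifier), the list-of-dicts stand-in for a DataFrame, and the nearest-centroid learner swapped in for LogisticRegression are all assumptions made for illustration. It only shows how Transformers and Estimators could be chained without the caller knowing what each stage produces.

```python
import zlib
from typing import Any, Dict, List

Row = Dict[str, Any]  # hypothetical stand-in for one DataFrame row


class Transformer:
    def transform(self, rows: List[Row]) -> List[Row]:
        raise NotImplementedError


class Estimator:
    def fit(self, rows: List[Row]) -> Transformer:
        raise NotImplementedError


class ToyTokenizer(Transformer):
    """Adds a 'words' column by lower-casing and splitting 'text'."""
    def transform(self, rows):
        return [{**r, "words": r["text"].lower().split()} for r in rows]


class ToyHashingTF(Transformer):
    """Adds a 'features' column: word counts hashed into fixed buckets."""
    def __init__(self, num_features: int = 16):
        self.num_features = num_features

    def transform(self, rows):
        out = []
        for r in rows:
            vec = [0] * self.num_features
            for w in r["words"]:
                # crc32 is deterministic across runs, unlike built-in hash()
                vec[zlib.crc32(w.encode("utf-8")) % self.num_features] += 1
            out.append({**r, "features": vec})
        return out


class ToyCentroidModel(Transformer):
    """Fitted model: adds 'prediction' = label of the nearest centroid."""
    def __init__(self, centroids: Dict[Any, List[float]]):
        self.centroids = centroids

    def transform(self, rows):
        def nearest(vec):
            return min(
                self.centroids,
                key=lambda lbl: sum((a - b) ** 2
                                    for a, b in zip(vec, self.centroids[lbl])),
            )
        return [{**r, "prediction": nearest(r["features"])} for r in rows]


class ToyCentroidClassifier(Estimator):
    """Toy learner (not LogisticRegression): mean feature vector per label."""
    def fit(self, rows):
        sums: Dict[Any, List[float]] = {}
        counts: Dict[Any, int] = {}
        for r in rows:
            acc = sums.setdefault(r["label"], [0.0] * len(r["features"]))
            for i, v in enumerate(r["features"]):
                acc[i] += v
            counts[r["label"]] = counts.get(r["label"], 0) + 1
        return ToyCentroidModel(
            {lbl: [v / counts[lbl] for v in s] for lbl, s in sums.items()})


class PipelineModel(Transformer):
    """Fitted pipeline: every stage is a Transformer, applied in order."""
    def __init__(self, stages: List[Transformer]):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    """Fits Estimator stages in order and collects the fitted Transformers."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for i, stage in enumerate(self.stages):
            if isinstance(stage, Estimator):
                stage = stage.fit(rows)  # Estimator -> Transformer
            fitted.append(stage)
            if i < len(self.stages) - 1:  # only transform if a stage follows
                rows = stage.transform(rows)
        return PipelineModel(fitted)


train = [
    {"text": "Spark is fast", "label": 1},
    {"text": "Spark pipelines rock", "label": 1},
    {"text": "rain is wet", "label": 0},
    {"text": "cold wet rain", "label": 0},
]
model = Pipeline([ToyTokenizer(), ToyHashingTF(), ToyCentroidClassifier()]).fit(train)
scored = model.transform([{"text": "Spark rocks"}])
print(scored[0]["words"], scored[0]["prediction"])
```

The point of the sketch is that Pipeline only relies on fit() and transform(), so a text analytics flow could be plugged in as just another list of stages, and the same fitted PipelineModel is reused at test time.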

On Wed, Sep 30, 2015 at 11:41 AM, Srinath Perera <srin...@wso2.com> wrote:

> Very nice!
>
> Can we do Text analytics pipeline Nethaji is building using this? So it
> can serve as the PoC?
>
> Thanks
> Srinath
>
> On Wed, Sep 30, 2015 at 10:27 AM, Nirmal Fernando <nir...@wso2.com> wrote:
>
>> spark.apache.org/docs/latest/ml-guide.html#pipeline [1]
>>
>> This would make us plug any Spark transformation easily and dynamically
>> without the need of knowing RDD types.
>>
>> Same could be used to build ensemble support.
>>
>> [1]
>> Pipeline
>>
>> In machine learning, it is common to run a sequence of algorithms to
>> process and learn from data. E.g., a simple text document processing
>> workflow might include several stages:
>>
>> - Split each document’s text into words.
>> - Convert each document’s words into a numerical feature vector.
>> - Learn a prediction model using the feature vectors and labels.
>>
>> Spark ML represents such a workflow as a Pipeline, which consists of a
>> sequence of PipelineStages (Transformers and Estimators) to be run in a
>> specific order. We will use this simple workflow as a running example in
>> this section.
>>
>> How it works <http://spark.apache.org/docs/latest/ml-guide.html#how-it-works>
>>
>> A Pipeline is specified as a sequence of stages, and each stage is
>> either a Transformer or an Estimator. These stages are run in order, and
>> the input DataFrame is transformed as it passes through each stage. For
>> Transformer stages, the transform() method is called on the DataFrame.
>> For Estimator stages, the fit() method is called to produce a Transformer
>> (which becomes part of the PipelineModel, or fitted Pipeline), and that
>> Transformer’s transform() method is called on the DataFrame.
>>
>> We illustrate this for the simple text document workflow. The figure
>> below is for the *training time* usage of a Pipeline.
>>
>> [image: Spark ML Pipeline Example]
>>
>> Above, the top row represents a Pipeline with three stages. The first
>> two (Tokenizer and HashingTF) are Transformers (blue), and the third
>> (LogisticRegression) is an Estimator (red). The bottom row represents
>> data flowing through the pipeline, where cylinders indicate DataFrames.
>> The Pipeline.fit() method is called on the original DataFrame, which has
>> raw text documents and labels. The Tokenizer.transform() method splits
>> the raw text documents into words, adding a new column with words to the
>> DataFrame. The HashingTF.transform() method converts the words column
>> into feature vectors, adding a new column with those vectors to the
>> DataFrame. Now, since LogisticRegression is an Estimator, the Pipeline
>> first calls LogisticRegression.fit() to produce a LogisticRegressionModel.
>> If the Pipeline had more stages, it would call the
>> LogisticRegressionModel’s transform() method on the DataFrame before
>> passing the DataFrame to the next stage.
>>
>> A Pipeline is an Estimator. Thus, after a Pipeline’s fit() method runs,
>> it produces a PipelineModel, which is a Transformer. This PipelineModel
>> is used at *test time*; the figure below illustrates this usage.
>>
>> [image: Spark ML PipelineModel Example]
>>
>> In the figure above, the PipelineModel has the same number of stages as
>> the original Pipeline, but all Estimators in the original Pipeline have
>> become Transformers. When the PipelineModel’s transform() method is
>> called on a test dataset, the data are passed through the fitted pipeline
>> in order. Each stage’s transform() method updates the dataset and passes
>> it to the next stage.
>>
>> Pipelines and PipelineModels help to ensure that training and test data
>> go through identical feature processing steps.
>>
>> --
>>
>> Thanks & regards,
>> Nirmal
>>
>> Team Lead - WSO2 Machine Learner
>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>> Mobile: +94715779733
>> Blog: http://nirmalfdo.blogspot.com/
>
> --
> ============================
> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
> Site: http://people.apache.org/~hemapani/
> Photos: http://www.flickr.com/photos/hemapani/
> Phone: 0772360902

--

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
_______________________________________________
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture