[ https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576778#comment-14576778 ]
ASF GitHub Bot commented on FLINK-2072: --------------------------------------- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31896948 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). + +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). + +We can load the data as a `DataSet[String]` first: + +{% highlight scala %} + +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data") + +{% endhighlight %} + +The dataset has some missing values indicated by `?`. We can filter those rows out and +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. + +{% highlight scala %} + +val cancerLV = cancer + .map(_.productIterator.toList) + .filter(!_.contains("?")) + .map{list => + val numList = list.map(_.asInstanceOf[String].toDouble) + LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) + } + +{% endhighlight %} + +We can then use this data to train a learner. + +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. +You can also save datasets in the LibSVM format using the `writeLibSVM` function. +Let's import the Adult (a9a) dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). + +We can simply import the dataset then using: + +{% highlight scala %} + +val adultTrain = MLUtils.readLibSVM("path/to/a8a") +val adultTest = MLUtils.readLibSVM("path/to/a8a.t") + +{% endhighlight %} + +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier. + +Due to an error in the test dataset we have to adjust the test data using the following code, to +ensure that the dimensionality of all test examples is 123, as with the training set: + +{% highlight scala %} + +val adjustedTest = adultTest.map{lv => + val vec = lv.vector.asBreeze + val padded = vec.padTo(123, 0.0).toDenseVector + val fvec = padded.fromBreeze + LabeledVector(lv.label, fvec) + } + +{% endhighlight %} + +## Classification + +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier. + +{% highlight scala %} + +val svm = SVM() + .setBlocks(env.getParallelism) + .setIterations(100) + .setRegularization(0.001) + .setStepsize(0.1) + .setSeed(42) + +svm.fit(adultTrain) + +{% endhighlight %} + +Let's now make predictions on the test set and see how well we do in terms of absolute error +We will also create a function that thresholds the predictions to the {-1, 1} scale that the +dataset uses. + +{% highlight scala %} + +def thresholdPredictions(predictions: DataSet[(Double, Double)]) +: DataSet[(Double, Double)] = { + predictions.map { + truthPrediction => + val truth = truthPrediction._1 + val prediction = truthPrediction._2 + val thresholdedPrediction = if (prediction > 0.0) 1.0 else -1.0 + (truth, thresholdedPrediction) + } +} + +val predictionPairs = thresholdPredictions(svm.predict(adjustedTest)) + +val absoluteErrorSum = predictionPairs.collect().map{ + case (truth, prediction) => Math.abs(truth - prediction)}.sum + +println(s"Absolute error: $absoluteErrorSum") + +{% endhighlight %} + +Next we will see if we can improve the performance by pre-processing our data. + +## Data pre-processing and pipelines + +A pre-processing step that is often encouraged when using SVM classification is scaling +the input features to the [0, 1] range, in order to avoid features with extreme values dominating the rest. +FlinkML has a number of `Transformers` such as `StandardScaler` that are used to pre-process data, and a key feature is the ability to +chain `Transformers` and `Predictors` together. This allows us to run the same pipeline of transformations and make predictions +on the train and test data in a straight-forward and type-safe manner. You can read more on the pipeline system of FlinkML, +[here](pipelines.html). + +Let first create a scaling transformer for the features in our dataset, and chain it to a new SVM classifier. + +{% highlight scala %} + +import org.apache.flink.ml.preprocessing.StandardScaler + +val scaler = StandardScaler() +scaler.fit(adultTrain) --- End diff -- We call `fit` later, so this is not needed here. > Add a quickstart guide for FlinkML > ---------------------------------- > > Key: FLINK-2072 > URL: https://issues.apache.org/jira/browse/FLINK-2072 > Project: Flink > Issue Type: New Feature > Components: Documentation, Machine Learning Library > Reporter: Theodore Vasiloudis > Assignee: Theodore Vasiloudis > Fix For: 0.9 > > > We need a quickstart guide that introduces users to the core concepts of > FlinkML to get them up and running quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)