[FLINK-2034] [ml] [docs] Adds FlinkML web documentation (introduction, vision, roadmap)
Also added attribution for some of the Latex in optimization framework.

This closes #688.

Project: http://git-wip-us.apache.org/repos/asf/flink/repo
Commit: http://git-wip-us.apache.org/repos/asf/flink/commit/b602b2ee
Tree: http://git-wip-us.apache.org/repos/asf/flink/tree/b602b2ee
Diff: http://git-wip-us.apache.org/repos/asf/flink/diff/b602b2ee

Branch: refs/heads/master
Commit: b602b2ee1c9d130e97e844572f9827b29fbd9cf8
Parents: b3b6a9d
Author: Theodore Vasiloudis <t...@sics.se>
Authored: Mon May 18 15:52:56 2015 +0200
Committer: Till Rohrmann <trohrm...@apache.org>
Committed: Fri May 22 09:41:00 2015 +0200

----------------------------------------------------------------------
 docs/libs/ml/contribution_guide.md | 26 +++++++++
 docs/libs/ml/index.md              | 58 +++++++++++++++++--
 docs/libs/ml/optimization.md       |  2 +
 docs/libs/ml/quickstart.md         | 26 +++++++++
 docs/libs/ml/vision_roadmap.md     | 98 +++++++++++++++++++++++++++++++++
 5 files changed, 204 insertions(+), 6 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/flink/blob/b602b2ee/docs/libs/ml/contribution_guide.md
----------------------------------------------------------------------
diff --git a/docs/libs/ml/contribution_guide.md b/docs/libs/ml/contribution_guide.md
new file mode 100644
index 0000000..e0db10a
--- /dev/null
+++ b/docs/libs/ml/contribution_guide.md
@@ -0,0 +1,26 @@
+---
+title: "FlinkML - Contribution guide"
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+Coming soon. In the meantime, check our list of [open issues on JIRA](https://issues.apache.org/jira/browse/FLINK-1748?jql=component%20%3D%20%22Machine%20Learning%20Library%22%20AND%20project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC)

http://git-wip-us.apache.org/repos/asf/flink/blob/b602b2ee/docs/libs/ml/index.md
----------------------------------------------------------------------
diff --git a/docs/libs/ml/index.md b/docs/libs/ml/index.md
index d36ce20..f774fcf 100644
--- a/docs/libs/ml/index.md
+++ b/docs/libs/ml/index.md
@@ -1,5 +1,5 @@
 ---
-title: "Machine Learning Library"
+title: "FlinkML - Machine Learning for Flink"
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
@@ -20,7 +20,18 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-## Link
+FlinkML is the Machine Learning (ML) library for Flink. It is a new effort in the Flink community,
+with a growing list of algorithms and contributors. With FlinkML we aim to provide
+scalable ML algorithms, an intuitive API, and tools that help minimize glue code in end-to-end ML
+systems. You can see more details about our goals and where the library is headed in our [vision
+and roadmap here](vision_roadmap.html).
+
+* This will be replaced by the TOC
+{:toc}
+
+## Getting Started
+
+You can use FlinkML in your project by adding the following dependency to your pom.xml
 
 {% highlight bash %}
 <dependency>
@@ -30,16 +41,51 @@ under the License.
 </dependency>
 {% endhighlight %}
 
-## Algorithms
+## Supported Algorithms
+
+### Supervised Learning
 
-* [Alternating Least Squares (ALS)](als.html)
 * [Communication efficient distributed dual coordinate ascent (CoCoA)](cocoa.html)
 * [Multiple linear regression](multiple_linear_regression.html)
+* [Optimization Framework](optimization.html)
+
+### Data Preprocessing
+
 * [Polynomial Base Feature Mapper](polynomial_base_feature_mapper.html)
 * [Standard Scaler](standard_scaler.html)
-* [Optimization Framework](optimization.html)
+### Recommendation
+
+* [Alternating Least Squares (ALS)](als.html)
 
-## Metrics
+### Utilities
 
 * [Distance Metrics](distance_metrics.html)
+
+## Example & Quickstart guide
+
+We already have some of the building blocks for FlinkML in place, and will continue to extend the
+library with more algorithms. An example of how simple it is to create a learning model in
+FlinkML is given below:
+
+{% highlight scala %}
+// LabeledVector is a feature vector with a label (class or real value)
+val data: DataSet[LabeledVector] = ...
+
+val learner = MultipleLinearRegression()
+  .setStepsize(1.0)
+  .setIterations(100)
+  .setConvergenceThreshold(0.001)
+
+learner.fit(data, parameters)
+
+// The learner can now be used to make predictions using learner.predict()
+{% endhighlight %}
+
+For a more comprehensive guide, you can check out our [quickstart guide](quickstart.html)
+
+## How to contribute
+
+Please check our [roadmap](vision_roadmap.html#roadmap) and [contribution guide](contribution_guide.html).
+You can also check out our list of
+[unresolved issues on JIRA](https://issues.apache.org/jira/browse/FLINK-1748?jql=component%20%3D%20%22Machine%20Learning%20Library%22%20AND%20project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC)

http://git-wip-us.apache.org/repos/asf/flink/blob/b602b2ee/docs/libs/ml/optimization.md
----------------------------------------------------------------------
diff --git a/docs/libs/ml/optimization.md b/docs/libs/ml/optimization.md
index 5d1f3a7..b30e0d0 100644
--- a/docs/libs/ml/optimization.md
+++ b/docs/libs/ml/optimization.md
@@ -231,3 +231,5 @@ val weightVector = weightDS
 // We can now use the weightVector to make predictions
 
 {% endhighlight %}
+
+Note: Some of the Latex math notation was adapted from Apache Spark MLlib's documentation

http://git-wip-us.apache.org/repos/asf/flink/blob/b602b2ee/docs/libs/ml/quickstart.md
----------------------------------------------------------------------
diff --git a/docs/libs/ml/quickstart.md b/docs/libs/ml/quickstart.md
new file mode 100644
index 0000000..43a3144
--- /dev/null
+++ b/docs/libs/ml/quickstart.md
@@ -0,0 +1,26 @@
+---
+title: "FlinkML - Quickstart guide"
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+Coming soon.

http://git-wip-us.apache.org/repos/asf/flink/blob/b602b2ee/docs/libs/ml/vision_roadmap.md
----------------------------------------------------------------------
diff --git a/docs/libs/ml/vision_roadmap.md b/docs/libs/ml/vision_roadmap.md
new file mode 100644
index 0000000..1e319b6
--- /dev/null
+++ b/docs/libs/ml/vision_roadmap.md
@@ -0,0 +1,98 @@
+---
+title: "FlinkML - Vision and Roadmap"
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Vision
+
+The Machine Learning (ML) library for Flink is a new effort to bring scalable ML tools to the Flink
+community. Our goal is to design and implement a system that is scalable and can deal with
+problems of various sizes, whether your data size is measured in megabytes or terabytes and beyond.
+We call this library FlinkML.
+
+An important concern for developers of ML systems is the amount of glue code that developers are
+forced to write [1] in the process of implementing an end-to-end ML system. Our goal with FlinkML
+is to help developers keep glue code to a minimum. The Flink ecosystem provides a great setting to
+tackle this problem, with its scalable ETL capabilities that can be easily combined inside the same
+program with FlinkML, allowing the development of robust pipelines without the need to use yet
+another technology for data ingestion and data munging.
+
+Another goal for FlinkML is to make the library easy to use. To that end we will be providing
+detailed documentation along with examples for every part of the system. Our aim is that developers
+will be able to get started with writing their ML pipelines quickly, using familiar programming
+concepts and terminology.
+
+Contrary to other data-processing systems, Flink exploits in-memory data streaming, and natively
+executes iterative processing algorithms which are common in ML. We plan to exploit the streaming
+nature of Flink, and provide functionality designed specifically for data streams.
+
+FlinkML will allow data scientists to test their models locally and using subsets of data, and then
+use the same code to run their algorithms at a much larger scale in a cluster setting.
+
+We are inspired by other open source efforts to provide ML systems, in particular
+[scikit-learn](http://scikit-learn.org/) for cleanly specifying ML pipelines, and Spark's
+[MLLib](https://spark.apache.org/mllib/) for providing ML algorithms that scale with problem and
+cluster sizes.
+
+## Roadmap
+
+The roadmap below can provide an indication of the algorithms we aim to implement in the coming
+months. If you are interested in helping out, please check our [contribution guide](contribution_guide.html).
+Items in **bold** have already been implemented:
+
+* Pipelines of transformers and learners
+* Data pre-processing
+  * **Feature scaling**
+  * **Polynomial feature base mapper**
+  * Feature hashing
+  * Feature extraction for text
+  * Dimensionality reduction
+* Model selection and performance evaluation
+  * Cross-validation for model selection and evaluation
+* Supervised learning
+  * Optimization framework
+    * **Stochastic Gradient Descent**
+    * L-BFGS
+  * Generalized Linear Models
+    * **Multiple linear regression**
+    * LASSO, Ridge regression
+    * Multi-class Logistic regression
+  * Random forests
+  * **Support Vector Machines**
+* Unsupervised learning
+  * Clustering
+    * K-means clustering
+  * PCA
+* Recommendation
+  * **ALS**
+* Text analytics
+  * LDA
+* Statistical estimation tools
+* Distributed linear algebra
+* Streaming ML
+
+**References:**
+
+[1] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary,
+and M. Young. _Machine learning: The high interest credit card of technical debt._ In SE4ML:
+Software Engineering for Machine Learning (NIPS 2014 Workshop), 2014.