[jira] [Commented] (FLINK-2073) Add contribution guide for FlinkML

ASF GitHub Bot (JIRA) Tue, 26 May 2015 11:23:06 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14559556#comment-14559556
 ]


ASF GitHub Bot commented on FLINK-2073:
---------------------------------------

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/727#discussion_r31063481
  
    --- Diff: docs/libs/ml/contribution_guide.md ---
    @@ -20,7 +21,329 @@ specific language governing permissions and limitations
     under the License.
     -->
     
    +The Flink community highly appreciates all sorts of contributions to 
FlinkML.
    +FlinkML offers people interested in machine learning the opportunity to 
work on a highly active open source project which makes scalable ML a reality.
    +The following document describes how to contribute to FlinkML.
    +
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon. In the meantime, check our list of [open issues on 
JIRA](https://issues.apache.org/jira/browse/FLINK-1748?jql=component%20%3D%20%22Machine%20Learning%20Library%22%20AND%20project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC)
    +## Getting Started
    +
    +To get started, first read Flink's [contribution 
guide](http://flink.apache.org/how-to-contribute.html).
    +Everything from this guide also applies to FlinkML.
    +
    +## Pick a Topic
    +
    +If you are looking for some new ideas, then you should check out the list 
of [unresolved issues on 
JIRA](https://issues.apache.org/jira/issues/?jql=component%20%3D%20%22Machine%20Learning%20Library%22%20AND%20project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC).
    +Once you decide to contribute to one of these issues, you should take 
ownership of it and track your progress with this issue.
    +That way, the other contributors know the state of the different issues 
and redundant work is avoided.
    +
    +If you already know what you want to contribute to FlinkML, all the better.
    +It is still advisable to create a JIRA issue for your idea to tell the 
Flink community what you want to do, though.
    +
    +## Testing
    +
    +New contributions should come with tests to verify the correct behavior of 
the algorithm.
    +The tests help to maintain the algorithm's correctness throughout code 
changes, e.g. refactorings.
    +
    +We distinguish between unit tests, which are executed during maven's test 
phase, and integration tests, which are executed during maven's verify phase.
    +Maven automatically makes this distinction by using the following naming 
rules:
    +All test cases whose class name ends with a suffix matching the regular 
expression `(IT|Integration)(Test|Suite|Case)` are considered integration 
tests.
    +The rest are considered unit tests and should only test behavior which is 
local to the component under test.
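
The naming rule above can be checked with a few lines of Scala. This is only a sketch of the suffix match itself: the object `TestNaming` and the method `isIntegrationTest` are hypothetical names, and the actual Maven plugin configuration is not shown in this guide.

```scala
object TestNaming {
  // Suffix pattern quoted in the guide, anchored at the end of the class name.
  private val IntegrationSuffix = "(IT|Integration)(Test|Suite|Case)$".r

  /** True if the class name would count as an integration test under the rule. */
  def isIntegrationTest(className: String): Boolean =
    IntegrationSuffix.findFirstIn(className).isDefined
}
```

Under this sketch, `TestNaming.isIntegrationTest("ExampleITSuite")` is true, while a name such as `MeanTransformerSuite` does not match and would be treated as a unit test.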
    +
    +An integration test is a test which requires the full Flink system to be 
started.
    +In order to do that properly, all integration test cases have to mix in 
the trait `FlinkTestBase`.
    +This trait will set the right `ExecutionEnvironment` so that the test will 
be executed on a special `FlinkMiniCluster` designated for testing purposes.
    +Thus, an integration test could look as follows:
    +
    +{% highlight scala %}
    +class ExampleITSuite extends FlatSpec with FlinkTestBase {
    +  behavior of "An example algorithm"
    +  
    +  it should "do something" in {
    +    ...
    +  }
    +}
    +{% endhighlight %}
    +
    +The test style does not have to be `FlatSpec` but can be any other 
scalatest `Suite` subclass. 
    +
    +## Documentation
    +
    +When contributing new algorithms, it is required to add code comments 
describing the functioning of the algorithm and its parameters with which the 
user can control its behavior.
    +Additionally, we would like to encourage contributors to add this 
information to the online documentation.
    +The online documentation for FlinkML's components can be found in the 
directory `docs/libs/ml`.
    +
    +Every new algorithm is described by a single markdown file.
    +This file should contain at least the following points:
    +
    +1. What does the algorithm do
    +2. How does the algorithm work (or reference to description) 
    +3. Parameter description with default values
    +4. Code snippet showing how the algorithm is used
    +
    +In order to use latex syntax in the markdown file, you have to include 
`mathjax: include` in the YAML front matter.
    + 
    +{% highlight yaml %}
    +---
    +mathjax: include
    +title: Example title
    +---
    +{% endhighlight %}
    +
    +In order to use displayed mathematics, you have to put your latex code in 
`$$ ... $$`.
    +For in-line mathematics, use `$ ... $`.
    +Additionally, some predefined latex commands are available in the scope of 
your markdown file.
    +See `docs/_include/latex_commands.html` for the complete list of 
predefined latex commands.
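
As a small illustration of the two forms, a markdown file with `mathjax: include` in its front matter might contain the following. The formula is just an example (the sample mean) and uses only standard latex commands, none of the predefined ones.

```latex
The sample mean $\bar{x}$ of the values $x_1, \dots, x_n$ is

$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i $$
```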
    +
    +## Contributing
    +
    +Once you have implemented the algorithm with adequate test coverage and 
added documentation, you are ready to open a pull request.
    +Details of how to open a pull request can be found 
[here](http://flink.apache.org/how-to-contribute.html#contributing-code--documentation).
 
    +
    +## How to Implement a Pipeline Operator
    +
    +FlinkML follows the principle of making machine learning as easy and 
accessible as possible.
    +Therefore, it supports a flexible pipelining mechanism which allows users 
to quickly define their analysis pipelines consisting of a multitude of 
different components.
    +A pipeline operator is either a `Transformer` or a `Predictor`.
    +A `Transformer` can be fitted to training data and transforms data from 
one format into another format.
    +A scaler which changes the mean and variance of its input data according 
to the mean and variance of some training data is an example of a 
`Transformer`.
    +In contrast, a `Predictor` encapsulates a data model and the corresponding 
logic to train it.
    +Once a `Predictor` has trained the model, it can be used to make new 
predictions.
    +A support vector machine which is first trained to obtain the support 
vectors and then used to classify data points is an example of a `Predictor`.
    +A general description of FlinkML's pipelining can be found 
[here]({{site.baseurl}}/libs/ml/pipelines.html).
    +In order to support the pipelining, algorithms have to adhere to a certain 
design pattern, which we will describe next.
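
To make the fit-then-transform idea concrete, here is a minimal plain-Scala sketch. It deliberately does not use FlinkML's `Transformer` API: the class `MeanCentering` and its methods are hypothetical, it works on an in-memory `Seq[Double]` instead of a distributed data set, and it only shifts the mean (it does not rescale the variance).

```scala
// Hypothetical stand-in for the fit/transform idea; not FlinkML's Transformer API.
final class MeanCentering {
  // Mean learned from the training data; empty until fit has been called.
  private var mean: Option[Double] = None

  /** Learn the mean from the training data. */
  def fit(training: Seq[Double]): Unit = {
    require(training.nonEmpty, "training data must not be empty")
    mean = Some(training.sum / training.size)
  }

  /** Subtract the learned mean from every value. */
  def transform(data: Seq[Double]): Seq[Double] = {
    val m = mean.getOrElse(
      throw new IllegalStateException("call fit before transform"))
    data.map(_ - m)
  }
}
```

After `fit(Seq(1.0, 2.0, 3.0))` the learned mean is 2.0, so `transform(Seq(2.0, 4.0))` yields `Seq(0.0, 2.0)`. The state learned in `fit` is what distinguishes this from a plain `map`.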
    +
    +Let's assume that we want to implement a pipeline operator which changes 
the mean of your data.
    +First, we have to consider which type of pipeline operator it is.
    +Since centering data is a common preprocessing step in any analysis 
pipeline, we will implement it as a `Transformer`.
    +Therefore, we first create a `MeanTransformer` class which inherits from 
`Transformer`:
    +
    +{% highlight scala %}
    +class MeanTransformer extends Transformer[Centering] {}
    --- End diff --
    
    good catch


> Add contribution guide for FlinkML
> ----------------------------------
>
>                 Key: FLINK-2073
>                 URL: https://issues.apache.org/jira/browse/FLINK-2073
>             Project: Flink
>          Issue Type: New Feature
>          Components: Documentation, Machine Learning Library
>            Reporter: Theodore Vasiloudis
>            Assignee: Till Rohrmann
>             Fix For: 0.9
>
>
> We need a guide for contributions to FlinkML in order to encourage the 
> extension of the library, and provide guidelines for developers.
> One thing that should be included is a step-by-step guide to create a 
> transformer, or other Estimator



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
