[GitHub] [flink-ml] lindong28 commented on a change in pull request #61: [FLINK-26100][docs] Add doc for ops & key concepts (release-2.0)

GitBox Fri, 18 Feb 2022 01:14:21 -0800


lindong28 commented on a change in pull request #61:
URL: https://github.com/apache/flink-ml/pull/61#discussion_r809779373




##########
File path: docs/content/docs/development/iteration.md
##########
@@ -0,0 +1,230 @@
+---
+title: "Iteration"
+weight: 2
+type: docs
+aliases:
+- /development/iteration.html
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Iteration
+
+Iteration is a basic building block for a ML library. In machine learning
+algorithms, iteration might be used in offline or online training process. In
+general, two types of iterations are required and Flink ML supports both of them
+in order to provide the infrastructure for a variety of algorithms.
+
+1. **Bounded Iteration**: Usually used in the offline case. In this case the
+   algorithm usually train on a bounded dataset, it updates the parameters for
+   multiple rounds until convergence.
+2. **Unbounded Iteration**: Usually used in the online case, in this case the
+   algorithm usually train on an unbounded dataset. It accumulates a mini-batch
+   of data and then do one update to the parameters. 
+
+## Iteration Paradigm
+
+An iterative algorithm has the following behavior pattern:
+
+- The iterative algorithm has an ***iteration body*** that is repeatedly invoked
+  until some termination criteria is reached (e.g. after a user-specified number
+  of epochs has been reached). An iteration body is a subgraph of operators that
+  implements the computation logic of e.g. an iterative machine learning
+  algorithm, whose outputs might be be fed back as the inputs of this subgraph.

Review comment:
       `be be` -> `be`

##########
File path: docs/content/docs/development/iteration.md
##########
@@ -0,0 +1,230 @@
+---
+title: "Iteration"
+weight: 2
+type: docs
+aliases:
+- /development/iteration.html
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Iteration
+
+Iteration is a basic building block for a ML library. In machine learning
+algorithms, iteration might be used in offline or online training process. In
+general, two types of iterations are required and Flink ML supports both of them
+in order to provide the infrastructure for a variety of algorithms.
+
+1. **Bounded Iteration**: Usually used in the offline case. In this case the
+   algorithm usually train on a bounded dataset, it updates the parameters for
+   multiple rounds until convergence.
+2. **Unbounded Iteration**: Usually used in the online case, in this case the
+   algorithm usually train on an unbounded dataset. It accumulates a mini-batch
+   of data and then do one update to the parameters. 
+
+## Iteration Paradigm
+
+An iterative algorithm has the following behavior pattern:
+
+- The iterative algorithm has an ***iteration body*** that is repeatedly invoked
+  until some termination criteria is reached (e.g. after a user-specified number
+  of epochs has been reached). An iteration body is a subgraph of operators that
+  implements the computation logic of e.g. an iterative machine learning
+  algorithm, whose outputs might be be fed back as the inputs of this subgraph.
+- In each invocation, the iteration body updates the model parameters based on
+  the user-provided data as well as the most recent model parameters.
+- The iterative algorithm takes as inputs the user-provided data and the initial
+  model parameters.
+- The iterative algorithm could output arbitrary user-defined information, such
+  as the loss after each epoch, or the final model parameters. 
+
+Therefore, the behavior of an iterative algorithm could be characterized with
+the following iteration paradigm (w.r.t. Flink concepts):
+
+- An iteration-body is a Flink subgraph with the following inputs and outputs:
+  - Inputs: **model-variables** (as a list of data streams) and
+    **user-provided-data** (as another list of data streams)
+  - Outputs: **feedback-model-variables** (as a list of data streams) and
+    **user-observed-outputs** (as a list of data streams)
+- A **termination-condition** that specifies when the iterative execution of the
+  iteration body should terminate.
+- In order to execute an iteration body, a user needs to execute an iteration
+  body the following inputs, and gets the following outputs.

Review comment:
       body the following -> body with the following

##########
File path: docs/content/docs/development/iteration.md
##########
@@ -0,0 +1,230 @@
+---
+title: "Iteration"
+weight: 2
+type: docs
+aliases:
+- /development/iteration.html
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Iteration
+
+Iteration is a basic building block for a ML library. In machine learning
+algorithms, iteration might be used in offline or online training process. In
+general, two types of iterations are required and Flink ML supports both of them
+in order to provide the infrastructure for a variety of algorithms.
+
+1. **Bounded Iteration**: Usually used in the offline case. In this case the
+   algorithm usually train on a bounded dataset, it updates the parameters for
+   multiple rounds until convergence.
+2. **Unbounded Iteration**: Usually used in the online case, in this case the
+   algorithm usually train on an unbounded dataset. It accumulates a mini-batch

Review comment:
       train -> trains

##########
File path: docs/content/docs/development/overview.md
##########
@@ -0,0 +1,251 @@
+---
+title: "Overview"
+weight: 1
+type: docs
+aliases:
+- /development/overview.html
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Overview
+
+This document provides a brief introduction to the basic concepts in Flink ML. 
+
+## Table API
+
+Flink ML's API is based on Flink's Table API. The Table API is a
+language-integrated query API for Java, Scala, and Python that allows the
+composition of queries from relational operators such as selection, filter, and
+join in a very intuitive way.
+
+Table API allows the usage of a wide range of data types. [Flink Document Data
+Types](https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/table/types/)
+page provides a list of supported types. In addition to these types, Flink ML
+also provides support for `Vector` Type.
+
+The Table API integrates seamlessly with Flink’s DataStream API. You can easily
+switch between all APIs and libraries which build upon them. Please refer to
+Flink's document for how to convert between `Table` and `DataStream`, as well as
+other usage of Flink Table API.
+
+## Stage
+
+A `Stage` is a node in a `Pipeline` or `Graph`. It is the fundamental component
+in Flink ML. This interface is only a concept, and does not have any actual
+functionality. Its subclasses include the follows.
+
+- `Estimator`: An `Estimator` is a `Stage` that is reponsible for the training
+  process in machine learning algorithms. It implements a `fit()` method that
+  takes a list of tables and produces a `Model`.
+
+- `AlgoOperator`: An `AlgoOperator` is a `Stage` that is used to encode generic
+  multi-input multi-output computation logic. It implements a `transform()`
+  method, which applies certain computation logic on the given input tables and
+  returns a list of result tables.
+
+- `Transformer`: A `Transformer` is an `AlgoOperator` with the semantic
+  difference that it encodes the Transformation logic, such that a record in the
+  output typically corresponds to one record in the input. In contrast, an
+  `AlgoOperator` is a better fit to express aggregation logic where a record in
+  the output could be computed from an arbitrary number of records in the input.
+
+- `Model`: A `Model` is a `Transformer` with the extra APIs to set and get model
+  data. It is typically generated by fitting an `Estimator` on a list of tables.
+  It provides `getModelData()` and `setModelData()`, which allows users to
+  explicitly read or write model data tables to the transformer. Each table
+  could be an unbounded stream of model data changes.
+
+A typical usage of `Stage` is to create an `Estimator` instance first, trigger
+its training process by invoking its `fit()` method, and to perform predictions
+with the resulting `Model` instance. This example usage is shown in the code
+below.
+
+```java
+// Suppose SumModel is a concrete subclass of Model, SumEstimator is a concrete subclass of Estimator.
+
+Table trainData = ...;
+Table predictData = ...;
+
+SumEstimator estimator = new SumEstimator();
+SumModel model = estimator.fit(trainData);
+Table predictResult = model.transform(predictData)[0];
+```
+
+## Builders
+
+In order to organize Flink ML stages into more complexed format so as to achieve
+advanced functionalities, like chaining data processing and machine learning
+algorithms together, Flink ML provides APIs that help to manage the relationship
+and structure of stages in Flink jobs. The entry of these APIs includes
+`Pipeline` and `Graph`.
+
+### Pipeline
+
+A `Pipeline` acts as an `Estimator`. It consists of an ordered list of stages,
+each of which could be an `Estimator`, `Model`, `Transformer` or `AlgoOperator`.
+Its `fit()` method goes through all stages of this pipeline in order and does
+the following on each stage until the last `Estimator` (inclusive).
+
+- If a stage is an `Estimator`, it would invoke the stage's `fit()` method with
+  the input tables to generage a `Model`. And if there is `Estimator` after this
+  stage, it would transform the input tables using the generated `Model` to get
+  result tables, then pass the result tables to the next stage as inputs.
+- If a stage is an `AlgoOperator` AND there is `Estimator` after this stage, it
+  would transform the input tables using this stage to get result tables, then
+  pass the result tables to the next stage as inputs.
+
+After all the `Estimators` are trained to fit their input tables, a new
+`PipelineModel` will be created with the same stages in this pipeline, except
+that all the `Estimator`s in the `PipelineModel` are replaced with the models
+generated in the above process.
+
+A `PipelineModel` acts as a `Model`. It consists of an ordered list of stages,
+each of which could be a `Model`, `Transformer` or `AlgoOperator`. Its
+`transform()` method applies all stages in this `PipelineModel` on the input
+tables in order. The output of one stage is used as the input of the next stage
+(if any). The output of the last stage is returned as the result of this method.
+
+A `Pipeline` can be created by passing a list of `Stage`s to Pipeline's
+constructor. For example,
+
+```java
+// Suppose SumModel is a concrete subclass of Model, SumEstimator is a concrete subclass of Estimator.
+
+Model modelA = new SumModel().setModelData(tEnv.fromValues(10));
+Estimator estimatorA = new SumEstimator();
+Model modelB = new SumModel().setModelData(tEnv.fromValues(30));
+
+List<Stage<?>> stages = Arrays.asList(modelA, estimatorA, modelB);
+Estimator<?, ?> estimator = new Pipeline(stages);
+```
+
+The commands above creates a Pipeline like follows.
+
+{{< mermaid >}}
+
+graph LR
+
+empty0[ ] --> modelA --> estimatorA --> modelB --> empty1[ ]
+
+style empty0 fill:#FFFFFF, stroke:#FFFFFF;
+style empty1 fill:#FFFFFF, stroke:#FFFFFF;
+
+{{< /mermaid >}}
+
+### Graph
+
+A `Graph` acts as an `Estimator`. A `Graph` consists of a DAG of stages, each of
+which could be an `Estimator`, `Model`, `Transformer` or `AlgoOperator`. When
+`Graph::fit` is called, the stages are executed in a topologically-sorted order.
+If a stage is an `Estimator`, its `Estimator::fit` method will be called on the
+input tables (from the input edges) to fit a `Model`. Then the `Model` will be
+used to transform the input tables and produce output tables to the output
+edges. If a stage is an `AlgoOperator`, its `AlgoOperator::transform` method
+will be called on the input tables and produce output tables to the output
+edges. The `GraphModel` fitted from a `Graph` consists of the fitted `Models`
+and `AlgoOperators`, corresponding to the `Graph`'s stages.
+
+A `GraphModel` acts as a `Model`. A `GraphModel` consists of a DAG of stages,
+each of which could be an `Estimator`, `Model`, `Transformer` or `AlgoOperator`.
+When `GraphModel::transform` is called, the stages are executed in a
+topologically-sorted order. When a stage is executed, its
+`AlgoOperator::transform` method will be called on the input tables (from the
+input edges) and produce output tables to the output edges.
+
+A `Graph` can be constructed via the `GraphBuilder` class, which provides
+methods like `addAlgoOperator` or `addEstimator` to help adding stages to a
+graph. Flink ML also introduces `TableId` class to represent the input/output of
+a stage and to help express the relationship between stages in a graph, thus
+allowing users to construct the DAG before they have the concrete tables
+available.
+
+The example codes below shows how to build a `Graph`.
+
+```java
+// Suppose SumModel is a concrete subclass of Model.
+
+GraphBuilder builder = new GraphBuilder();
+// Creates nodes.
+SumModel stage1 = new SumModel().setModelData(tEnv.fromValues(1));
+SumModel stage2 = new SumModel();
+SumModel stage3 = new SumModel().setModelData(tEnv.fromValues(3));
+// Creates inputs and modelDataInputs.
+TableId input = builder.createTableId();
+TableId modelDataInput = builder.createTableId();
+// Feeds inputs and gets outputs.
+TableId output1 = builder.addAlgoOperator(stage1, input)[0];
+TableId output2 = builder.addAlgoOperator(stage2, output1)[0];
+builder.setModelDataOnModel(stage2, modelDataInput);
+TableId output3 = builder.addAlgoOperator(stage3, output2)[0];
+TableId modelDataOutput = builder.getModelDataFromModel(stage3)[0];
+
+// Builds a Model from the graph.
+TableId[] inputs = new TableId[] {input};
+TableId[] outputs = new TableId[] {output3};
+TableId[] modelDataInputs = new TableId[] {modelDataInput};
+TableId[] modelDataOutputs = new TableId[] {modelDataOutput};
+Model<?> model = builder.buildModel(inputs, outputs, modelDataInputs, modelDataOutputs);
+```
+
+The code above constructs a `Graph` like follows.
+
+{{< mermaid >}}
+
+graph LR
+
+empty0[ ] --> |input| stage1
+stage1 --> |output1| stage2
+empty1[ ] --> |modelDataInput| stage2
+stage2 --> |output2| stage3
+stage3 --> |output3| empty3[ ]
+stage3 --> |modelDataOutput| empty2[ ]
+
+style empty0 fill:#FFFFFF, stroke:#FFFFFF;
+style empty1 fill:#FFFFFF, stroke:#FFFFFF;
+style empty2 fill:#FFFFFF, stroke:#FFFFFF;
+style empty3 fill:#FFFFFF, stroke:#FFFFFF;
+
+{{< /mermaid >}}
+
+## Parameter
+
+Flink ML `Stage` is a subclass of `WithParams`, which provides a uniform API to
+get and set parameters.
+
+A `Param` is the definition of a parameter, including name, class, description,
+default value and the validator.
+
+In order to set the parameter of an algorithm, users can use any of the
+following ways.
+
+- Invoke the parameter's specific set method. For example, in order to set `K`,
+  the number of clusters, of a K-means algorithm, users can directly invoke
+  `setK()` method on that `KMeans` instance.
+- Pass a parameter map containing new values to the stage through
+  `ReadWriteUtils.updateExistingParams()` method.
+
+If a `Model` is generated through an `Estimator`'s `fit()` method, the `Model`
+would inherit the `Estimator` object's parameters. Thus there is no need to set
+the parameters for a second time of the parameters are not changed.
+
+Parameters belong to specific instances of `Stage`s. For example, if we have two

Review comment:
       This seems pretty intuitive, right? Would it be simpler to remove this statement instead of explicitly explaining it?
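
For readers of this thread, the `Pipeline.fit()` procedure described in the hunk above can be sketched in plain Java. This is only an illustration under stated assumptions: `PipelineSketch`, its nested `Stage`, `AlgoOperator` and `Estimator` interfaces, `AddModel`, and the toy `SumEstimator` are hypothetical stand-ins that use `List<Integer>` in place of Flink tables. They are not Flink ML's actual classes or signatures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-ins; NOT the Flink ML API.
final class PipelineSketch {
    interface Stage {}

    interface AlgoOperator extends Stage {
        List<Integer> transform(List<Integer> input);
    }

    interface Estimator extends Stage {
        AlgoOperator fit(List<Integer> input);
    }

    // Toy model: adds a constant to every record.
    static final class AddModel implements AlgoOperator {
        final int delta;

        AddModel(int delta) {
            this.delta = delta;
        }

        @Override
        public List<Integer> transform(List<Integer> input) {
            List<Integer> out = new ArrayList<>();
            for (int v : input) {
                out.add(v + delta);
            }
            return out;
        }
    }

    // Toy estimator: "learns" a model whose constant is the sum of the training records.
    static final class SumEstimator implements Estimator {
        @Override
        public AlgoOperator fit(List<Integer> input) {
            int sum = 0;
            for (int v : input) {
                sum += v;
            }
            return new AddModel(sum);
        }
    }

    // Fits every Estimator in order; data is only transformed while a later Estimator still needs it.
    static List<AlgoOperator> fit(List<Stage> stages, List<Integer> input) {
        int lastEstimator = -1;
        for (int i = 0; i < stages.size(); i++) {
            if (stages.get(i) instanceof Estimator) {
                lastEstimator = i;
            }
        }
        List<AlgoOperator> fitted = new ArrayList<>();
        List<Integer> current = input;
        for (int i = 0; i < stages.size(); i++) {
            Stage stage = stages.get(i);
            AlgoOperator op = (stage instanceof Estimator)
                    ? ((Estimator) stage).fit(current)
                    : (AlgoOperator) stage;
            fitted.add(op);
            if (i < lastEstimator) {
                current = op.transform(current);
            }
        }
        return fitted;
    }

    // Mirrors the described PipelineModel.transform(): applies all fitted stages in order.
    static List<Integer> transform(List<AlgoOperator> fitted, List<Integer> input) {
        List<Integer> current = input;
        for (AlgoOperator op : fitted) {
            current = op.transform(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<Stage> stages = Arrays.asList(new AddModel(10), new SumEstimator(), new AddModel(30));
        List<AlgoOperator> fitted = fit(stages, Arrays.asList(1, 2));
        System.out.println(transform(fitted, Arrays.asList(1, 2)));
    }
}
```

The stage order in `main` mirrors the `modelA --> estimatorA --> modelB` example from the hunk; the fitted list plays the role of the `PipelineModel`.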

##########
File path: docs/content/docs/development/iteration.md
##########
@@ -0,0 +1,230 @@
+---
+title: "Iteration"
+weight: 2
+type: docs
+aliases:
+- /development/iteration.html
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Iteration
+
+Iteration is a basic building block for a ML library. In machine learning
+algorithms, iteration might be used in offline or online training process. In
+general, two types of iterations are required and Flink ML supports both of them
+in order to provide the infrastructure for a variety of algorithms.
+
+1. **Bounded Iteration**: Usually used in the offline case. In this case the
+   algorithm usually train on a bounded dataset, it updates the parameters for
+   multiple rounds until convergence.
+2. **Unbounded Iteration**: Usually used in the online case, in this case the
+   algorithm usually train on an unbounded dataset. It accumulates a mini-batch
+   of data and then do one update to the parameters. 
+
+## Iteration Paradigm
+
+An iterative algorithm has the following behavior pattern:
+
+- The iterative algorithm has an ***iteration body*** that is repeatedly invoked
+  until some termination criteria is reached (e.g. after a user-specified number
+  of epochs has been reached). An iteration body is a subgraph of operators that
+  implements the computation logic of e.g. an iterative machine learning
+  algorithm, whose outputs might be be fed back as the inputs of this subgraph.
+- In each invocation, the iteration body updates the model parameters based on
+  the user-provided data as well as the most recent model parameters.
+- The iterative algorithm takes as inputs the user-provided data and the initial
+  model parameters.
+- The iterative algorithm could output arbitrary user-defined information, such
+  as the loss after each epoch, or the final model parameters. 
+
+Therefore, the behavior of an iterative algorithm could be characterized with
+the following iteration paradigm (w.r.t. Flink concepts):
+
+- An iteration-body is a Flink subgraph with the following inputs and outputs:
+  - Inputs: **model-variables** (as a list of data streams) and
+    **user-provided-data** (as another list of data streams)
+  - Outputs: **feedback-model-variables** (as a list of data streams) and
+    **user-observed-outputs** (as a list of data streams)
+- A **termination-condition** that specifies when the iterative execution of the
+  iteration body should terminate.
+- In order to execute an iteration body, a user needs to execute an iteration
+  body the following inputs, and gets the following outputs.
+  - Inputs: **initial-model-variables** (as a list of bounded data streams) and
+    **user-provided-data** (as a list of data streams)
+  - Outputs: the **user-observed-output** emitted by the iteration body.
+
+It is important to note that the **model-variables** expected by the iteration
+body is not the same as the **initial-model-variables** provided by the user.
+Instead, **model-variables** are computed as the union of the
+**feedback-model-variables** (emitted by the iteration body) and the
+**initial-model-variables** (provided by the caller of the iteration body).
+Flink ML provides utility class (see Iterations) to run an iteration-body with
+the user-provided inputs.
+
+The figure below summarizes the iteration paradigm described above. 
+
+{{<  mermaid >}}
+flowchart LR
+
+subgraph Iteration Body
+union1
+union2
+node11
+node12
+node21
+node22
+nodeX
+end
+
+input0 --> node11
+
+union1 -. feedback .-  node12
+input1 --> union1
+union1 --> node11
+node11 --> nodeX
+nodeX --> node12
+node12 --> output1
+
+input2 --> union2
+union2 --> node21
+node21 --> nodeX
+nodeX --> node22
+node22 --> output2
+union2 -. feedback .-  node22
+
+input0[non-iterate input]
+input1[iterate input]
+input2[iterate input]
+union1[union]
+union2[union]
+node11( )
+node12( )
+nodeX( )
+node21( )
+node22( )
+output1[output]
+output2[output]
+
+{{<  /mermaid >}}
+
+## API
+
+The main entry of Flink ML's iteration lies in `Iterations` class. It mainly
+provides two public methods and users may choose to use either of them based on
+whether the input data is bounded or unbounded.
+
+```java
+public class Iterations {
+  public static DataStreamList iterateUnboundedStreams(
+    DataStreamList initVariableStreams, DataStreamList dataStreams, IterationBody body) {...}
+  ...
+  public static DataStreamList iterateBoundedStreamsUntilTermination(
+    DataStreamList initVariableStreams,
+    ReplayableDataStreamList dataStreams,
+    IterationConfig config,
+    IterationBody body){...}
+}
+```
+
+To construct an iteration, Users are required to provide
+
+- `initVariableStreams`: the initial values of the variable data streams which
+  would be updated in each round.
+- `dataStreams`: the other data streams used inside the iteration, but would not
+  be updated.
+- `iterationBody`: specifies the subgraph to update the variable streams and the
+  outputs.
+
+The `IterationBody` will be invoked with two parameters: The first parameter is
+a list of input variable streams, which are created as the union of the initial
+variable streams and the corresponding feedback variable streams (returned by
+the iteration body); The second parameter is the data streams given to this
+method. 
+
+```java
+public interface IterationBody extends Serializable {
+  ...
+  IterationBodyResult process(DataStreamList variableStreams, DataStreamList dataStreams);
+  ...
+}
+```
+
+Notes that inside the iteration body, users could only create the subgraph from

Review comment:
       Hmm.. I am not sure this is the case. Why can we not create a sink inside the body?
   
   If we are not sure about this statement, it is probably simpler to just remove it.
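
For readers of this thread, the iteration paradigm described in the hunk above (the body sees the initial variables in the first round and its own feedback afterwards, until a termination condition holds) can be sketched in plain Java without Flink. Everything here is an assumption made for illustration: `IterationSketch`, `iterate`, `body`, the scalar "model variable", and the convergence tolerance are hypothetical and do not reflect the `Iterations` API quoted in the diff.

```java
// Hypothetical, Flink-free illustration of the bounded iteration paradigm.
final class IterationSketch {

    // Iteration body: one gradient step that moves the model toward the mean of the data.
    static double body(double model, double[] data) {
        double grad = 0.0;
        for (double x : data) {
            grad += model - x;
        }
        grad /= data.length;
        return model - 0.5 * grad;
    }

    // Runs the body until the update is smaller than tol or maxRounds is reached.
    // Returns {final model variable, number of rounds executed}.
    static double[] iterate(double initial, double[] data, int maxRounds, double tol) {
        double model = initial; // first round: model variable = initial variable
        int rounds = 0;
        while (rounds < maxRounds) {
            double feedback = body(model, data);
            rounds++;
            boolean converged = Math.abs(feedback - model) < tol;
            model = feedback; // later rounds: model variable = feedback variable
            if (converged) {
                break; // termination condition reached
            }
        }
        return new double[] {model, rounds};
    }

    public static void main(String[] args) {
        double[] result = iterate(0.0, new double[] {1.0, 2.0, 3.0}, 100, 1e-6);
        System.out.println("model=" + result[0] + ", rounds=" + (int) result[1]);
    }
}
```

The user-provided data (`data`) is passed unchanged to every round, while only the model variable is fed back, which is the distinction the quoted documentation draws between `dataStreams` and `initVariableStreams`.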

##########
File path: docs/content/docs/development/overview.md
##########
@@ -0,0 +1,251 @@
+---
+title: "Overview"
+weight: 1
+type: docs
+aliases:
+- /development/overview.html
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Overview
+
+This document provides a brief introduction to the basic concepts in Flink ML. 
+
+## Table API
+
+Flink ML's API is based on Flink's Table API. The Table API is a
+language-integrated query API for Java, Scala, and Python that allows the
+composition of queries from relational operators such as selection, filter, and
+join in a very intuitive way.
+
+Table API allows the usage of a wide range of data types. [Flink Document Data
+Types](https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/table/types/)
+page provides a list of supported types. In addition to these types, Flink ML
+also provides support for `Vector` Type.
+
+The Table API integrates seamlessly with Flink’s DataStream API. You can easily
+switch between all APIs and libraries which build upon them. Please refer to
+Flink's document for how to convert between `Table` and `DataStream`, as well as
+other usage of Flink Table API.
+
+## Stage
+
+A `Stage` is a node in a `Pipeline` or `Graph`. It is the fundamental component
+in Flink ML. This interface is only a concept, and does not have any actual
+functionality. Its subclasses include the follows.
+
+- `Estimator`: An `Estimator` is a `Stage` that is reponsible for the training
+  process in machine learning algorithms. It implements a `fit()` method that
+  takes a list of tables and produces a `Model`.
+
+- `AlgoOperator`: An `AlgoOperator` is a `Stage` that is used to encode generic
+  multi-input multi-output computation logic. It implements a `transform()`
+  method, which applies certain computation logic on the given input tables and
+  returns a list of result tables.
+
+- `Transformer`: A `Transformer` is an `AlgoOperator` with the semantic
+  difference that it encodes the Transformation logic, such that a record in the
+  output typically corresponds to one record in the input. In contrast, an
+  `AlgoOperator` is a better fit to express aggregation logic where a record in
+  the output could be computed from an arbitrary number of records in the input.
+
+- `Model`: A `Model` is a `Transformer` with the extra APIs to set and get model
+  data. It is typically generated by fitting an `Estimator` on a list of tables.
+  It provides `getModelData()` and `setModelData()`, which allows users to
+  explicitly read or write model data tables to the transformer. Each table
+  could be an unbounded stream of model data changes.
+
+A typical usage of `Stage` is to create an `Estimator` instance first, trigger
+its training process by invoking its `fit()` method, and to perform predictions
+with the resulting `Model` instance. This example usage is shown in the code
+below.
+
+```java
+// Suppose SumModel is a concrete subclass of Model, SumEstimator is a concrete subclass of Estimator.
+
+Table trainData = ...;
+Table predictData = ...;
+
+SumEstimator estimator = new SumEstimator();
+SumModel model = estimator.fit(trainData);
+Table predictResult = model.transform(predictData)[0];
+```
+
+## Builders
+
+In order to organize Flink ML stages into more complexed format so as to achieve
+advanced functionalities, like chaining data processing and machine learning
+algorithms together, Flink ML provides APIs that help to manage the relationship
+and structure of stages in Flink jobs. The entry of these APIs includes
+`Pipeline` and `Graph`.
+
+### Pipeline
+
+A `Pipeline` acts as an `Estimator`. It consists of an ordered list of stages,
+each of which could be an `Estimator`, `Model`, `Transformer` or `AlgoOperator`.
+Its `fit()` method goes through all stages of this pipeline in order and does
+the following on each stage until the last `Estimator` (inclusive).
+
+- If a stage is an `Estimator`, it would invoke the stage's `fit()` method with
+  the input tables to generage a `Model`. And if there is `Estimator` after this
+  stage, it would transform the input tables using the generated `Model` to get
+  result tables, then pass the result tables to the next stage as inputs.
+- If a stage is an `AlgoOperator` AND there is `Estimator` after this stage, it
+  would transform the input tables using this stage to get result tables, then
+  pass the result tables to the next stage as inputs.
+
+After all the `Estimators` are trained to fit their input tables, a new
+`PipelineModel` will be created with the same stages in this pipeline, except
+that all the `Estimator`s in the `PipelineModel` are replaced with the models
+generated in the above process.
+
+A `PipelineModel` acts as a `Model`. It consists of an ordered list of stages,
+each of which could be a `Model`, `Transformer` or `AlgoOperator`. Its
+`transform()` method applies all stages in this `PipelineModel` on the input
+tables in order. The output of one stage is used as the input of the next stage
+(if any). The output of the last stage is returned as the result of this method.
+
+A `Pipeline` can be created by passing a list of `Stage`s to Pipeline's
+constructor. For example,
+
+```java
+// Suppose SumModel is a concrete subclass of Model, SumEstimator is a concrete subclass of Estimator.
+
+Model modelA = new SumModel().setModelData(tEnv.fromValues(10));
+Estimator estimatorA = new SumEstimator();
+Model modelB = new SumModel().setModelData(tEnv.fromValues(30));
+
+List<Stage<?>> stages = Arrays.asList(modelA, estimatorA, modelB);
+Estimator<?, ?> estimator = new Pipeline(stages);
+```
+
+The commands above creates a Pipeline like follows.
+
+{{< mermaid >}}
+
+graph LR
+
+empty0[ ] --> modelA --> estimatorA --> modelB --> empty1[ ]
+
+style empty0 fill:#FFFFFF, stroke:#FFFFFF;
+style empty1 fill:#FFFFFF, stroke:#FFFFFF;
+
+{{< /mermaid >}}
+
+### Graph
+
+A `Graph` acts as an `Estimator`. A `Graph` consists of a DAG of stages, each of
+which could be an `Estimator`, `Model`, `Transformer` or `AlgoOperator`. When
+`Graph::fit` is called, the stages are executed in a topologically-sorted order.
+If a stage is an `Estimator`, its `Estimator::fit` method will be called on the
+input tables (from the input edges) to fit a `Model`. Then the `Model` will be
+used to transform the input tables and produce output tables to the output
+edges. If a stage is an `AlgoOperator`, its `AlgoOperator::transform` method
+will be called on the input tables and produce output tables to the output
+edges. The `GraphModel` fitted from a `Graph` consists of the fitted `Models`
+and `AlgoOperators`, corresponding to the `Graph`'s stages.
+
+A `GraphModel` acts as a `Model`. A `GraphModel` consists of a DAG of stages,
+each of which could be an `Estimator`, `Model`, `Transformer` or `AlgoOperator`.
+When `GraphModel::transform` is called, the stages are executed in a
+topologically-sorted order. When a stage is executed, its
+`AlgoOperator::transform` method will be called on the input tables (from the
+input edges) and produce output tables to the output edges.
+
+A `Graph` can be constructed via the `GraphBuilder` class, which provides
+methods like `addAlgoOperator` or `addEstimator` to help add stages to a
+graph. Flink ML also introduces the `TableId` class to represent the input/output of
+a stage and to help express the relationship between stages in a graph, thus
+allowing users to construct the DAG before they have the concrete tables
+available.
+
+The example code below shows how to build a `Graph`.
+
+```java
+// Suppose SumModel is a concrete subclass of Model.
+
+GraphBuilder builder = new GraphBuilder();
+// Creates nodes.
+SumModel stage1 = new SumModel().setModelData(tEnv.fromValues(1));
+SumModel stage2 = new SumModel();
+SumModel stage3 = new SumModel().setModelData(tEnv.fromValues(3));
+// Creates inputs and modelDataInputs.
+TableId input = builder.createTableId();
+TableId modelDataInput = builder.createTableId();
+// Feeds inputs and gets outputs.
+TableId output1 = builder.addAlgoOperator(stage1, input)[0];
+TableId output2 = builder.addAlgoOperator(stage2, output1)[0];
+builder.setModelDataOnModel(stage2, modelDataInput);
+TableId output3 = builder.addAlgoOperator(stage3, output2)[0];
+TableId modelDataOutput = builder.getModelDataFromModel(stage3)[0];
+
+// Builds a Model from the graph.
+TableId[] inputs = new TableId[] {input};
+TableId[] outputs = new TableId[] {output3};
+TableId[] modelDataInputs = new TableId[] {modelDataInput};
+TableId[] modelDataOutputs = new TableId[] {modelDataOutput};
+Model<?> model = builder.buildModel(inputs, outputs, modelDataInputs, modelDataOutputs);
+```
+
+The code above constructs a `Graph` as follows.
+
+{{< mermaid >}}
+
+graph LR
+
+empty0[ ] --> |input| stage1
+stage1 --> |output1| stage2
+empty1[ ] --> |modelDataInput| stage2
+stage2 --> |output2| stage3
+stage3 --> |output3| empty3[ ]
+stage3 --> |modelDataOutput| empty2[ ]
+
+style empty0 fill:#FFFFFF, stroke:#FFFFFF;
+style empty1 fill:#FFFFFF, stroke:#FFFFFF;
+style empty2 fill:#FFFFFF, stroke:#FFFFFF;
+style empty3 fill:#FFFFFF, stroke:#FFFFFF;
+
+{{< /mermaid >}}
+
+## Parameter
+
+Flink ML `Stage` is a subclass of `WithParams`, which provides a uniform API to
+get and set parameters.
+
+A `Param` is the definition of a parameter, including name, class, description,
+default value and the validator.
+
+In order to set the parameter of an algorithm, users can use any of the
+following ways.
+
+- Invoke the parameter's specific set method. For example, in order to set `K`,
+  the number of clusters, of a K-means algorithm, users can directly invoke
+  `setK()` method on that `KMeans` instance.
+- Pass a parameter map containing new values to the stage through
+  `ReadWriteUtils.updateExistingParams()` method.
+
+If a `Model` is generated through an `Estimator`'s `fit()` method, the `Model`
+would inherit the `Estimator` object's parameters. Thus there is no need to set
+the parameters for a second time of the parameters are not changed.

Review comment:
       of -> if
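
The `Pipeline` fit rules quoted at the top of this hunk can be sketched in plain Java. This is a minimal sketch under stated assumptions, not Flink ML code: `Stage`, `AlgoOperator`, `Estimator`, `AddModel` and `SumEstimator` below are hypothetical stand-ins, and a "table" is simplified to a `List<Integer>`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT the Flink ML interfaces.
final class PipelineFitSketch {
    interface Stage {}

    interface AlgoOperator extends Stage {
        List<Integer> transform(List<Integer> input);
    }

    interface Estimator extends Stage {
        AlgoOperator fit(List<Integer> input);
    }

    /** Toy model: adds a fixed delta to every value. */
    static final class AddModel implements AlgoOperator {
        final int delta;

        AddModel(int delta) {
            this.delta = delta;
        }

        @Override
        public List<Integer> transform(List<Integer> input) {
            List<Integer> out = new ArrayList<>();
            for (int v : input) {
                out.add(v + delta);
            }
            return out;
        }
    }

    /** Toy estimator: fits an AddModel whose delta is the sum of the input. */
    static final class SumEstimator implements Estimator {
        @Override
        public AlgoOperator fit(List<Integer> input) {
            int sum = 0;
            for (int v : input) {
                sum += v;
            }
            return new AddModel(sum);
        }
    }

    /** Fits every Estimator in order and returns the stages of the resulting pipeline model. */
    static List<AlgoOperator> fit(List<Stage> stages, List<Integer> input) {
        int lastEstimator = -1;
        for (int i = 0; i < stages.size(); i++) {
            if (stages.get(i) instanceof Estimator) {
                lastEstimator = i;
            }
        }
        List<AlgoOperator> fitted = new ArrayList<>();
        List<Integer> current = input;
        for (int i = 0; i < stages.size(); i++) {
            Stage stage = stages.get(i);
            AlgoOperator op = stage instanceof Estimator
                    ? ((Estimator) stage).fit(current)
                    : (AlgoOperator) stage;
            fitted.add(op);
            // Transform only while a later Estimator still needs the result as its input.
            if (i < lastEstimator) {
                current = op.transform(current);
            }
        }
        return fitted;
    }

    /** Applies the fitted stages in order, mirroring the described PipelineModel transform(). */
    static List<Integer> transform(List<AlgoOperator> fitted, List<Integer> input) {
        List<Integer> current = input;
        for (AlgoOperator op : fitted) {
            current = op.transform(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<Stage> stages = Arrays.asList(new AddModel(10), new SumEstimator(), new AddModel(30));
        List<AlgoOperator> fitted = fit(stages, Arrays.asList(1, 2, 3));
        System.out.println(transform(fitted, Arrays.asList(1))); // prints [77]
    }
}
```

With the stages `[AddModel(10), SumEstimator, AddModel(30)]`, only the first stage transforms its input during fitting, because it is the only stage with an `Estimator` after it. The last stage is kept in the fitted list but is never applied while fitting.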

##########
File path: docs/content/docs/development/iteration.md
##########
@@ -0,0 +1,230 @@
+---
+title: "Iteration"
+weight: 2
+type: docs
+aliases:
+- /development/iteration.html
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Iteration
+
+Iteration is a basic building block for a ML library. In machine learning
+algorithms, iteration might be used in offline or online training process. In
+general, two types of iterations are required and Flink ML supports both of them
+in order to provide the infrastructure for a variety of algorithms.
+
+1. **Bounded Iteration**: Usually used in the offline case. In this case the
+   algorithm usually train on a bounded dataset, it updates the parameters for

Review comment:
       train -> trains
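
The bounded iteration in the quoted text (train on a bounded dataset, updating the parameters over multiple rounds) can be illustrated with a plain Java loop. This is a minimal sketch that does not use the Flink ML iteration API. The method name `fitMean`, the squared-error objective, the learning rate, the tolerance and the round limit are all assumptions made for illustration.

```java
// Hypothetical illustration only; this does NOT use the Flink ML iteration API.
final class BoundedIterationSketch {
    /**
     * Estimates the mean of a bounded dataset by gradient descent on the
     * mean squared error, stopping when the update becomes negligible.
     */
    static double fitMean(double[] data, double learningRate, double tolerance, int maxRounds) {
        double param = 0.0;
        for (int round = 0; round < maxRounds; round++) {
            // One round: a full pass over the bounded dataset.
            double gradient = 0.0;
            for (double x : data) {
                gradient += param - x;
            }
            gradient /= data.length;

            double update = learningRate * gradient;
            param -= update;

            // Convergence check: stop once the parameter barely moves.
            if (Math.abs(update) < tolerance) {
                break;
            }
        }
        return param;
    }

    public static void main(String[] args) {
        double[] data = {1.0, 2.0, 3.0, 4.0};
        System.out.println(fitMean(data, 0.5, 1e-9, 1000));
    }
}
```

Each pass of the outer loop corresponds to one round over the whole dataset, and the loop ends either on convergence or at the round limit.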




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
