[GitHub] [flink] morsapaes commented on a change in pull request #13203: [FLINK-18984][python][docs] Add tutorial documentation for Python DataStream API

GitBox Tue, 25 Aug 2020 12:14:11 -0700


morsapaes commented on a change in pull request #13203:
URL: https://github.com/apache/flink/pull/13203#discussion_r476650361




##########
File path: docs/dev/python/getting-started/tutorial/datastream_tutorial.md
##########
@@ -0,0 +1,126 @@
+---
+title: "Python DataStream API Tutorial"
+nav-parent_id: python_tutorial
+nav-pos: 30
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+This walkthrough will quickly get you started building a pure Python Flink 
DataStream project.
+
+Please refer to the PyFlink [installation guide]({{ site.baseurl 
}}/dev/python/getting-started/installation.html) on how to set up the Python 
execution environments.
+
+* This will be replaced by the TOC
+{:toc}
+
+## Setting up a Python Project
+
+You can begin by creating a Python project and installing the PyFlink package 
following the [installation guide]({{ site.baseurl 
}}/dev/python/getting-started/installation.html#installation-of-pyflink).
+
+## Writing a Flink Python DataStream API Program
+
+DataStream API applications begin by declaring a `StreamExecutionEnvironment`.
+This is the context in which a streaming program is executed.
+It can be used for setting execution parameters such as restart strategy, 
default parallelism, etc.
+
+{% highlight python %}
+env = StreamExecutionEnvironment.get_execution_environment()
+env.set_parallelism(1)
+{% endhighlight %}
+
+Once a `StreamExecutionEnvironment` created, you can declare your source with 
it.
+
+{% highlight python %}
+ds = env.from_collection(
+    collection=[(1, 'aaa'), (2, 'bbb')],
+    type_info=Types.ROW([Types.INT(), Types.STRING()]))
+{% endhighlight %}
+
+This creates a data stream from the given collection. The type is that of the 
elements in the collection. In this example, the type is a Row type with two 
fields. The type of the first field is integer type while the second is string 
type.
+
+You can now perform transformations on the datastream or writes the data into 
external system with sink.

Review comment:
       ```suggestion
   You can now perform transformations on this data stream, or just write the 
data to an external system using a _sink_. This walkthrough uses the 
`StreamingFileSink` sink connector to write the data into a file in the 
`/tmp/output` directory.
   ```
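For checking what the sink actually wrote, a small helper along these lines may be handy. It is only a sketch: the function name `read_sink_output` is hypothetical, and it assumes the sink leaves plain-text files somewhere under the output directory (possibly in subdirectories), without relying on any particular file naming.

```python
from pathlib import Path


def read_sink_output(output_dir):
    """Return every text line found in files under output_dir, recursively."""
    lines = []
    # Sort the paths so the result is stable across runs.
    for path in sorted(Path(output_dir).rglob("*")):
        if path.is_file():
            lines.extend(path.read_text(encoding="utf-8").splitlines())
    return lines
```

After a run, `read_sink_output("/tmp/output")` would return the written records as strings, in whatever textual form the encoder produced.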

##########
File path: docs/dev/python/getting-started/tutorial/datastream_tutorial.md
##########
@@ -0,0 +1,126 @@
+---
+title: "Python DataStream API Tutorial"
+nav-parent_id: python_tutorial
+nav-pos: 30
+---
+
+This walkthrough will quickly get you started building a pure Python Flink 
DataStream project.
+
+Please refer to the PyFlink [installation guide]({{ site.baseurl 
}}/dev/python/getting-started/installation.html) on how to set up the Python 
execution environments.

Review comment:
       Installation with `pip` is pretty straightforward, so why not just add 
this to the tutorial instead of making the user go to a different page?
   
   If we are restructuring these anyway, I'd suggest following the same 
structure as the existing tutorials: 
https://ci.apache.org/projects/flink/flink-docs-release-1.11/try-flink/datastream_api.html
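If the install step is inlined, the tutorial could also show a quick check that the package is importable. A minimal sketch, assuming the PyPI package `apache-flink` (which provides the `pyflink` module); the helper name `is_pyflink_installed` is made up for illustration:

```python
import importlib.util


def is_pyflink_installed():
    """Return True if the `pyflink` package can be found by this interpreter."""
    return importlib.util.find_spec("pyflink") is not None


if __name__ == "__main__":
    if is_pyflink_installed():
        print("PyFlink is available.")
    else:
        print("PyFlink not found; try: python -m pip install apache-flink")
```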

##########
File path: docs/dev/python/getting-started/tutorial/datastream_tutorial.md
##########
@@ -0,0 +1,126 @@
+---
+title: "Python DataStream API Tutorial"
+nav-parent_id: python_tutorial
+nav-pos: 30
+---
+
+This walkthrough will quickly get you started building a pure Python Flink 
DataStream project.
+
+Please refer to the PyFlink [installation guide]({{ site.baseurl 
}}/dev/python/getting-started/installation.html) on how to set up the Python 
execution environments.
+
+* This will be replaced by the TOC
+{:toc}
+
+## Setting up a Python Project
+
+You can begin by creating a Python project and installing the PyFlink package 
following the [installation guide]({{ site.baseurl 
}}/dev/python/getting-started/installation.html#installation-of-pyflink).
+
+## Writing a Flink Python DataStream API Program
+
+DataStream API applications begin by declaring a `StreamExecutionEnvironment`.
+This is the context in which a streaming program is executed.
+It can be used for setting execution parameters such as restart strategy, 
default parallelism, etc.
+
+{% highlight python %}
+env = StreamExecutionEnvironment.get_execution_environment()
+env.set_parallelism(1)
+{% endhighlight %}
+
+Once a `StreamExecutionEnvironment` created, you can declare your source with 
it.
+
+{% highlight python %}
+ds = env.from_collection(
+    collection=[(1, 'aaa'), (2, 'bbb')],
+    type_info=Types.ROW([Types.INT(), Types.STRING()]))
+{% endhighlight %}
+
+This creates a data stream from the given collection. The type is that of the 
elements in the collection. In this example, the type is a Row type with two 
fields. The type of the first field is integer type while the second is string 
type.
+
+You can now perform transformations on the datastream or writes the data into 
external system with sink.
+
+{% highlight python %}
+ds.add_sink(StreamingFileSink
+    .for_row_format('/tmp/output', SimpleStringEncoder())
+    .build())
+{% endhighlight %}
+
+Finally you must execute the actual Flink Python DataStream API job.
+All operations, such as creating sources, transformations and sinks are lazy.
+Only when `env.execute(job_name)` is called will runs the job.
+
+{% highlight python %}
+env.execute("tutorial_job")
+{% endhighlight %}
+
+The complete code so far:
+
+{% highlight python %}
+from pyflink.common.serialization import SimpleStringEncoder
+from pyflink.common.typeinfo import Types
+from pyflink.datastream import StreamExecutionEnvironment
+from pyflink.datastream.connectors import StreamingFileSink
+
+
+def tutorial():
+    env = StreamExecutionEnvironment.get_execution_environment()
+    env.set_parallelism(1)
+    ds = env.from_collection(
+        collection=[(1, 'aaa'), (2, 'bbb')],
+        type_info=Types.ROW([Types.INT(), Types.STRING()]))
+    ds.add_sink(StreamingFileSink
+                .for_row_format('/tmp/output', SimpleStringEncoder())
+                .build())
+    env.execute("tutorial_job")
+
+
+if __name__ == '__main__':
+    tutorial()
+{% endhighlight %}
+
+## Executing a Flink Python DataStream API Program
+Firstly, make sure the output directory is not existed:

Review comment:
       ```suggestion
   Now that you've defined your PyFlink program, you can run it! First, make sure 
that the output directory doesn't exist:
   ```
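For readers who prefer to do the cleanup from Python rather than with `rm -rf`, something like the following would work. This is a sketch only; `clear_output_dir` is a hypothetical helper, and it assumes the path, if present, is a directory.

```python
import shutil
from pathlib import Path


def clear_output_dir(path):
    """Delete the directory at path and its contents; do nothing if it is absent."""
    target = Path(path)
    if target.exists():
        shutil.rmtree(target)
    # Report whether the path is now gone.
    return not target.exists()
```

Calling `clear_output_dir("/tmp/output")` before running the job has the same effect as the shell command in the tutorial.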

##########
File path: docs/dev/python/getting-started/tutorial/datastream_tutorial.md
##########
@@ -0,0 +1,126 @@
+---
+title: "Python DataStream API Tutorial"
+nav-parent_id: python_tutorial
+nav-pos: 30
+---
+
+This walkthrough will quickly get you started building a pure Python Flink 
DataStream project.
+
+Please refer to the PyFlink [installation guide]({{ site.baseurl 
}}/dev/python/getting-started/installation.html) on how to set up the Python 
execution environments.
+
+* This will be replaced by the TOC
+{:toc}
+
+## Setting up a Python Project
+
+You can begin by creating a Python project and installing the PyFlink package 
following the [installation guide]({{ site.baseurl 
}}/dev/python/getting-started/installation.html#installation-of-pyflink).
+
+## Writing a Flink Python DataStream API Program
+
+DataStream API applications begin by declaring a `StreamExecutionEnvironment`.
+This is the context in which a streaming program is executed.
+It can be used for setting execution parameters such as restart strategy, 
default parallelism, etc.
+
+{% highlight python %}
+env = StreamExecutionEnvironment.get_execution_environment()
+env.set_parallelism(1)
+{% endhighlight %}
+
+Once a `StreamExecutionEnvironment` created, you can declare your source with 
it.
+
+{% highlight python %}
+ds = env.from_collection(
+    collection=[(1, 'aaa'), (2, 'bbb')],
+    type_info=Types.ROW([Types.INT(), Types.STRING()]))
+{% endhighlight %}
+
+This creates a data stream from the given collection. The type is that of the 
elements in the collection. In this example, the type is a Row type with two 
fields. The type of the first field is integer type while the second is string 
type.
+
+You can now perform transformations on the datastream or writes the data into 
external system with sink.
+
+{% highlight python %}
+ds.add_sink(StreamingFileSink
+    .for_row_format('/tmp/output', SimpleStringEncoder())
+    .build())
+{% endhighlight %}
+
+Finally you must execute the actual Flink Python DataStream API job.

Review comment:
       ```suggestion
   The last step is to execute the actual PyFlink DataStream API job. PyFlink 
applications are built lazily and shipped to the cluster for execution only 
once fully formed. To execute an application, you simply call 
`env.execute(job_name)`.
   ```
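The "built lazily" point can be illustrated without Flink at all. The toy model below is plain Python, not the PyFlink API: `ToyEnvironment` and `ToyStream` are hypothetical names, and the only thing they show is that declaring a source, a transformation, and a sink records steps, while nothing runs until `execute` is called.

```python
class ToyStream:
    """Records transformations and sinks instead of applying them right away."""

    def __init__(self, items):
        self._items = items
        self._steps = []  # recorded, not yet applied
        self._sinks = []

    def map(self, func):
        self._steps.append(func)
        return self

    def add_sink(self, sink):
        self._sinks.append(sink)
        return self

    def _run(self):
        for item in self._items:
            for step in self._steps:
                item = step(item)
            for sink in self._sinks:
                sink(item)


class ToyEnvironment:
    """Hypothetical stand-in for an execution environment."""

    def __init__(self):
        self._streams = []

    def from_collection(self, items):
        stream = ToyStream(list(items))
        self._streams.append(stream)
        return stream

    def execute(self, job_name):
        print(f"running {job_name}")
        for stream in self._streams:
            stream._run()


results = []
env = ToyEnvironment()
ds = env.from_collection([(1, "aaa"), (2, "bbb")])
ds.map(lambda row: (row[0], row[1].upper())).add_sink(results.append)
pending = list(results)  # still empty: the pipeline has only been declared
env.execute("toy_job")   # now the recorded steps actually run
```

Here `pending` stays empty while `results` is filled only after `execute`, which mirrors the behaviour the suggested wording describes.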

##########
File path: docs/dev/python/getting-started/tutorial/datastream_tutorial.md
##########
@@ -0,0 +1,126 @@
+---
+title: "Python DataStream API Tutorial"
+nav-parent_id: python_tutorial
+nav-pos: 30
+---
+
+This walkthrough will quickly get you started building a pure Python Flink 
DataStream project.
+
+Please refer to the PyFlink [installation guide]({{ site.baseurl 
}}/dev/python/getting-started/installation.html) on how to set up the Python 
execution environments.
+
+* This will be replaced by the TOC
+{:toc}
+
+## Setting up a Python Project
+
+You can begin by creating a Python project and installing the PyFlink package 
following the [installation guide]({{ site.baseurl 
}}/dev/python/getting-started/installation.html#installation-of-pyflink).
+
+## Writing a Flink Python DataStream API Program
+
+DataStream API applications begin by declaring a `StreamExecutionEnvironment`.
+This is the context in which a streaming program is executed.
+It can be used for setting execution parameters such as restart strategy, 
default parallelism, etc.
+
+{% highlight python %}
+env = StreamExecutionEnvironment.get_execution_environment()
+env.set_parallelism(1)
+{% endhighlight %}
+
+Once a `StreamExecutionEnvironment` created, you can declare your source with 
it.
+
+{% highlight python %}
+ds = env.from_collection(
+    collection=[(1, 'aaa'), (2, 'bbb')],
+    type_info=Types.ROW([Types.INT(), Types.STRING()]))
+{% endhighlight %}
+
+This creates a data stream from the given collection. The type is that of the 
elements in the collection. In this example, the type is a Row type with two 
fields. The type of the first field is integer type while the second is string 
type.
+
+You can now perform transformations on the datastream or writes the data into 
external system with sink.
+
+{% highlight python %}
+ds.add_sink(StreamingFileSink
+    .for_row_format('/tmp/output', SimpleStringEncoder())
+    .build())
+{% endhighlight %}
+
+Finally you must execute the actual Flink Python DataStream API job.
+All operations, such as creating sources, transformations and sinks are lazy.
+Only when `env.execute(job_name)` is called will runs the job.
+
+{% highlight python %}
+env.execute("tutorial_job")
+{% endhighlight %}
+
+The complete code so far:
+
+{% highlight python %}
+from pyflink.common.serialization import SimpleStringEncoder
+from pyflink.common.typeinfo import Types
+from pyflink.datastream import StreamExecutionEnvironment
+from pyflink.datastream.connectors import StreamingFileSink
+
+
+def tutorial():
+    env = StreamExecutionEnvironment.get_execution_environment()
+    env.set_parallelism(1)
+    ds = env.from_collection(
+        collection=[(1, 'aaa'), (2, 'bbb')],
+        type_info=Types.ROW([Types.INT(), Types.STRING()]))
+    ds.add_sink(StreamingFileSink
+                .for_row_format('/tmp/output', SimpleStringEncoder())
+                .build())
+    env.execute("tutorial_job")
+
+
+if __name__ == '__main__':
+    tutorial()
+{% endhighlight %}
+
+## Executing a Flink Python DataStream API Program

Review comment:
       Is there a reason to use "Flink Python" instead of PyFlink (the question 
applies to the whole walkthrough)?

##########
File path: docs/dev/python/getting-started/tutorial/datastream_tutorial.md
##########
@@ -0,0 +1,126 @@
+---
+title: "Python DataStream API Tutorial"
+nav-parent_id: python_tutorial
+nav-pos: 30
+---
+
+This walkthrough will quickly get you started building a pure Python Flink 
DataStream project.
+
+Please refer to the PyFlink [installation guide]({{ site.baseurl 
}}/dev/python/getting-started/installation.html) on how to set up the Python 
execution environments.
+
+* This will be replaced by the TOC
+{:toc}
+
+## Setting up a Python Project
+
+You can begin by creating a Python project and installing the PyFlink package 
following the [installation guide]({{ site.baseurl 
}}/dev/python/getting-started/installation.html#installation-of-pyflink).
+
+## Writing a Flink Python DataStream API Program
+
+DataStream API applications begin by declaring a `StreamExecutionEnvironment`.
+This is the context in which a streaming program is executed.
+It can be used for setting execution parameters such as restart strategy, 
default parallelism, etc.
+
+{% highlight python %}
+env = StreamExecutionEnvironment.get_execution_environment()
+env.set_parallelism(1)
+{% endhighlight %}
+
+Once a `StreamExecutionEnvironment` created, you can declare your source with 
it.
+
+{% highlight python %}
+ds = env.from_collection(
+    collection=[(1, 'aaa'), (2, 'bbb')],
+    type_info=Types.ROW([Types.INT(), Types.STRING()]))
+{% endhighlight %}
+
+This creates a data stream from the given collection. The type is that of the 
elements in the collection. In this example, the type is a Row type with two 
fields. The type of the first field is integer type while the second is string 
type.
+
+You can now perform transformations on the datastream or writes the data into 
external system with sink.
+
+{% highlight python %}
+ds.add_sink(StreamingFileSink
+    .for_row_format('/tmp/output', SimpleStringEncoder())
+    .build())
+{% endhighlight %}
+
+Finally you must execute the actual Flink Python DataStream API job.
+All operations, such as creating sources, transformations and sinks are lazy.
+Only when `env.execute(job_name)` is called will runs the job.
+
+{% highlight python %}
+env.execute("tutorial_job")
+{% endhighlight %}
+
+The complete code so far:
+
+{% highlight python %}
+from pyflink.common.serialization import SimpleStringEncoder
+from pyflink.common.typeinfo import Types
+from pyflink.datastream import StreamExecutionEnvironment
+from pyflink.datastream.connectors import StreamingFileSink
+
+
+def tutorial():
+    env = StreamExecutionEnvironment.get_execution_environment()
+    env.set_parallelism(1)
+    ds = env.from_collection(
+        collection=[(1, 'aaa'), (2, 'bbb')],
+        type_info=Types.ROW([Types.INT(), Types.STRING()]))
+    ds.add_sink(StreamingFileSink
+                .for_row_format('/tmp/output', SimpleStringEncoder())
+                .build())
+    env.execute("tutorial_job")
+
+
+if __name__ == '__main__':
+    tutorial()
+{% endhighlight %}
+
+## Executing a Flink Python DataStream API Program
+Firstly, make sure the output directory is not existed:
+
+{% highlight bash %}
+rm -rf /tmp/output
+{% endhighlight %}
+
+Next, you can run this example on the command line:
+
+{% highlight bash %}
+$ python datastream_tutorial.py
+{% endhighlight %}
+
+The command builds and runs the Python DataStream API program in a local mini 
cluster.
+You can also submit the Python DataStream API program to a remote cluster, you 
can refer
+[Job Submission Examples]({{ site.baseurl 
}}/ops/cli.html#job-submission-examples)
+for more details.

Review comment:
       ```suggestion
   The command builds and runs your PyFlink program in a local mini cluster. 
You can alternatively submit it to a remote cluster using the instructions 
detailed in [Job Submission Examples]({{ site.baseurl 
}}/ops/cli.html#job-submission-examples).
   ```

##########
File path: docs/dev/python/getting-started/tutorial/datastream_tutorial.md
##########
@@ -0,0 +1,126 @@
+---
+title: "Python DataStream API Tutorial"
+nav-parent_id: python_tutorial
+nav-pos: 30
+---
+
+This walkthrough will quickly get you started building a pure Python Flink 
DataStream project.

Review comment:
       Some context on what the DataStream API is (and the same for the Table 
API tutorial) would be helpful. There is a nice snippet from the "Fraud Detection 
with the DataStream API" tutorial that could be used here, like:
   
   _"Apache Flink offers a DataStream API for building robust, stateful 
streaming applications. It provides fine-grained control over state and time, 
which allows for the implementation of advanced event-driven systems. In this 
step-by-step guide, you’ll learn how to build a stateful streaming application 
with PyFlink and the DataStream API."_
   
   (In the same way, the Table API tutorial can use the introduction from "Real 
Time Reporting with the Table API".)

##########
File path: docs/dev/python/getting-started/tutorial/datastream_tutorial.md
##########
@@ -0,0 +1,126 @@
+---
+title: "Python DataStream API Tutorial"
+nav-parent_id: python_tutorial
+nav-pos: 30
+---
+
+This walkthrough will quickly get you started building a pure Python Flink 
DataStream project.
+
+Please refer to the PyFlink [installation guide]({{ site.baseurl 
}}/dev/python/getting-started/installation.html) on how to set up the Python 
execution environments.
+
+* This will be replaced by the TOC
+{:toc}
+
+## Setting up a Python Project
+
+You can begin by creating a Python project and installing the PyFlink package 
following the [installation guide]({{ site.baseurl 
}}/dev/python/getting-started/installation.html#installation-of-pyflink).
+
+## Writing a Flink Python DataStream API Program
+
+DataStream API applications begin by declaring a `StreamExecutionEnvironment`.
+This is the context in which a streaming program is executed.
+It can be used for setting execution parameters such as restart strategy, 
default parallelism, etc.

Review comment:
       ```suggestion
   DataStream API applications begin by declaring an execution environment 
(`StreamExecutionEnvironment`), the context in which a streaming program is 
executed. This is what you will use to set the properties of your job (e.g. 
default parallelism, restart strategy), create your sources and finally trigger 
the execution of the job.
   ```

##########
File path: docs/dev/python/getting-started/tutorial/datastream_tutorial.md
##########
@@ -0,0 +1,126 @@
+---
+title: "Python DataStream API Tutorial"
+nav-parent_id: python_tutorial
+nav-pos: 30
+---
+
+This walkthrough will quickly get you started building a pure Python Flink 
DataStream project.
+
+Please refer to the PyFlink [installation guide]({{ site.baseurl 
}}/dev/python/getting-started/installation.html) on how to set up the Python 
execution environments.
+
+* This will be replaced by the TOC
+{:toc}
+
+## Setting up a Python Project
+
+You can begin by creating a Python project and installing the PyFlink package 
following the [installation guide]({{ site.baseurl 
}}/dev/python/getting-started/installation.html#installation-of-pyflink).
+
+## Writing a Flink Python DataStream API Program
+
+DataStream API applications begin by declaring a `StreamExecutionEnvironment`.
+This is the context in which a streaming program is executed.
+It can be used for setting execution parameters such as restart strategy, 
default parallelism, etc.
+
+{% highlight python %}
+env = StreamExecutionEnvironment.get_execution_environment()
+env.set_parallelism(1)
+{% endhighlight %}
+
+Once a `StreamExecutionEnvironment` created, you can declare your source with 
it.

Review comment:
       ```suggestion
   Once a `StreamExecutionEnvironment` is created, you can use it to declare 
your _source_. Sources ingest data from external systems, such as Apache Kafka, 
RabbitMQ, or Apache Pulsar, into Flink jobs.
   
   To keep things simple, this walkthrough uses a source that is backed by a 
collection of elements.
   ```

##########
File path: docs/dev/python/getting-started/tutorial/datastream_tutorial.md
##########
@@ -0,0 +1,126 @@
+---
+title: "Python DataStream API Tutorial"
+nav-parent_id: python_tutorial
+nav-pos: 30
+---
+
+This walkthrough will quickly get you started building a pure Python Flink 
DataStream project.
+
+Please refer to the PyFlink [installation guide]({{ site.baseurl 
}}/dev/python/getting-started/installation.html) on how to set up the Python 
execution environments.
+
+* This will be replaced by the TOC
+{:toc}
+
+## Setting up a Python Project
+
+You can begin by creating a Python project and installing the PyFlink package 
following the [installation guide]({{ site.baseurl 
}}/dev/python/getting-started/installation.html#installation-of-pyflink).
+
+## Writing a Flink Python DataStream API Program
+
+DataStream API applications begin by declaring a `StreamExecutionEnvironment`.
+This is the context in which a streaming program is executed.
+It can be used for setting execution parameters such as restart strategy, 
default parallelism, etc.
+
+{% highlight python %}
+env = StreamExecutionEnvironment.get_execution_environment()
+env.set_parallelism(1)
+{% endhighlight %}
+
+Once a `StreamExecutionEnvironment` created, you can declare your source with 
it.
+
+{% highlight python %}
+ds = env.from_collection(
+    collection=[(1, 'aaa'), (2, 'bbb')],
+    type_info=Types.ROW([Types.INT(), Types.STRING()]))
+{% endhighlight %}
+
+This creates a data stream from the given collection. The type is that of the 
elements in the collection. In this example, the type is a Row type with two 
fields. The type of the first field is integer type while the second is string 
type.

Review comment:
       ```suggestion
   This creates a data stream from the given collection, with the same type as 
that of the elements in it (here, a `ROW` type with an `INT` field and a `STRING` 
field).
   ```
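To make the "same type as the elements" point concrete, here is a plain-Python sketch that checks whether each tuple lines up with a two-field row of an integer and a string. `ROW_SCHEMA` and `matches_schema` are hypothetical names used only for illustration; this is not how PyFlink validates types.

```python
ROW_SCHEMA = (int, str)  # hypothetical stand-in for a ROW of (INT, STRING)


def matches_schema(row, schema=ROW_SCHEMA):
    """Return True if row has one value per field and each value has that field's type."""
    return len(row) == len(schema) and all(
        isinstance(value, field_type) for value, field_type in zip(row, schema)
    )


collection = [(1, "aaa"), (2, "bbb")]
all_rows_match = all(matches_schema(row) for row in collection)
```

With the tutorial's collection, `all_rows_match` is true; a tuple such as `("aaa", 1)` would fail the check because the field order differs.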

##########
File path: docs/dev/python/getting-started/tutorial/datastream_tutorial.md
##########
@@ -0,0 +1,126 @@
+---
+title: "Python DataStream API Tutorial"
+nav-parent_id: python_tutorial
+nav-pos: 30
+---
+
+This walkthrough will quickly get you started building a pure Python Flink 
DataStream project.
+
+Please refer to the PyFlink [installation guide]({{ site.baseurl 
}}/dev/python/getting-started/installation.html) on how to set up the Python 
execution environments.
+
+* This will be replaced by the TOC
+{:toc}
+
+## Setting up a Python Project
+
+You can begin by creating a Python project and installing the PyFlink package 
following the [installation guide]({{ site.baseurl 
}}/dev/python/getting-started/installation.html#installation-of-pyflink).
+
+## Writing a Flink Python DataStream API Program
+
+DataStream API applications begin by declaring a `StreamExecutionEnvironment`.
+This is the context in which a streaming program is executed.
+It can be used for setting execution parameters such as restart strategy, 
default parallelism, etc.
+
+{% highlight python %}
+env = StreamExecutionEnvironment.get_execution_environment()
+env.set_parallelism(1)
+{% endhighlight %}
+
+Once a `StreamExecutionEnvironment` created, you can declare your source with 
it.
+
+{% highlight python %}
+ds = env.from_collection(
+    collection=[(1, 'aaa'), (2, 'bbb')],

Review comment:
       I get that this is reusing existing sample code, but it'd be nice to 
evolve the example to use a more relevant use case in the future. 
   
   (This is actually a reminder to myself, as I get my hands in PyFlink. 🙃 )

##########
File path: docs/dev/python/getting-started/tutorial/datastream_tutorial.md
##########
@@ -0,0 +1,126 @@
+---
+title: "Python DataStream API Tutorial"
+nav-parent_id: python_tutorial
+nav-pos: 30
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+This walkthrough will quickly get you started building a pure Python Flink 
DataStream project.
+
+Please refer to the PyFlink [installation guide]({{ site.baseurl 
}}/dev/python/getting-started/installation.html) on how to set up the Python 
execution environments.
+
+* This will be replaced by the TOC
+{:toc}
+
+## Setting up a Python Project
+
+You can begin by creating a Python project and installing the PyFlink package 
following the [installation guide]({{ site.baseurl 
}}/dev/python/getting-started/installation.html#installation-of-pyflink).
+
+## Writing a Flink Python DataStream API Program
+
+DataStream API applications begin by declaring a `StreamExecutionEnvironment`.
+This is the context in which a streaming program is executed.
+It can be used for setting execution parameters such as restart strategy, 
default parallelism, etc.
+
+{% highlight python %}
+env = StreamExecutionEnvironment.get_execution_environment()
+env.set_parallelism(1)
+{% endhighlight %}
+
+Once a `StreamExecutionEnvironment` is created, you can declare your source with 
it.
+
+{% highlight python %}
+ds = env.from_collection(
+    collection=[(1, 'aaa'), (2, 'bbb')],
+    type_info=Types.ROW([Types.INT(), Types.STRING()]))
+{% endhighlight %}
+
+This creates a data stream from the given collection. The type is that of the 
elements in the collection. In this example, the type is a Row type with two 
fields. The type of the first field is integer type while the second is string 
type.
+
+You can now perform transformations on the data stream or write the data to an 
external system with a sink.
+
+{% highlight python %}
+ds.add_sink(StreamingFileSink
+    .for_row_format('/tmp/output', SimpleStringEncoder())
+    .build())
+{% endhighlight %}
+
+Finally, you must execute the actual Flink Python DataStream API job.
+All operations, such as creating sources, transformations, and sinks, are lazy.
+The job runs only when `env.execute(job_name)` is called.
+
+{% highlight python %}
+env.execute("tutorial_job")
+{% endhighlight %}
+
+The complete code so far:
+
+{% highlight python %}
+from pyflink.common.serialization import SimpleStringEncoder
+from pyflink.common.typeinfo import Types
+from pyflink.datastream import StreamExecutionEnvironment
+from pyflink.datastream.connectors import StreamingFileSink
+
+
+def tutorial():
+    env = StreamExecutionEnvironment.get_execution_environment()
+    env.set_parallelism(1)
+    ds = env.from_collection(
+        collection=[(1, 'aaa'), (2, 'bbb')],
+        type_info=Types.ROW([Types.INT(), Types.STRING()]))
+    ds.add_sink(StreamingFileSink
+                .for_row_format('/tmp/output', SimpleStringEncoder())
+                .build())
+    env.execute("tutorial_job")
+
+
+if __name__ == '__main__':
+    tutorial()
+{% endhighlight %}
+
+## Executing a Flink Python DataStream API Program
+First, make sure the output directory does not exist:
+
+{% highlight bash %}
+rm -rf /tmp/output
+{% endhighlight %}
+
+Next, you can run this example on the command line:
+
+{% highlight bash %}
+$ python datastream_tutorial.py
+{% endhighlight %}
+
+The command builds and runs the Python DataStream API program in a local mini 
cluster.
+You can also submit the Python DataStream API program to a remote cluster; 
refer to
+[Job Submission Examples]({{ site.baseurl 
}}/ops/cli.html#job-submission-examples)
+for more details.
+
+Finally, you can see the execution result on the command line:
+
+{% highlight bash %}
+$ find /tmp/output -type f -exec cat {} \;
+1,aaa
+2,bbb
+{% endhighlight %}
+
+This should get you started with writing your own Flink Python DataStream API 
programs.
+To learn more about the Python DataStream API, you can refer to the
+[Flink Python API Docs]({{ site.pythondocs_baseurl }}/api/python) for more 
details.

Review comment:
       ```suggestion
   This walkthrough gives you the foundations to get started writing your own 
PyFlink DataStream API programs. To learn more about the Python DataStream API, 
please refer to [Flink Python API Docs]({{ site.pythondocs_baseurl 
}}/api/python) for more details.
   ```
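
A side note on the result check in the quoted hunk (`find /tmp/output -type f -exec cat {} \;`): the same check can be done from Python. This is only a minimal sketch, assuming the sink writes plain-text part files somewhere under the output directory; `read_sink_output` is a hypothetical helper name and not part of PyFlink.

```python
import os


def read_sink_output(root):
    """Return every line of every regular file under `root`, in path order.

    Hypothetical helper for illustration only; it assumes the files are
    UTF-8 text with one record per line.
    """
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a deterministic order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8") as f:
                # Strip only the trailing newline so field contents stay intact.
                lines.extend(line.rstrip("\n") for line in f)
    return lines


if __name__ == "__main__":
    # Path taken from the tutorial; adjust it if the sink path differs.
    for line in read_sink_output("/tmp/output"):
        print(line)
```

Like `find -type f`, the walk descends into subdirectories and does not skip dot-files, which matters if part files are hidden or nested in bucket directories.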

##########
File path: docs/dev/python/getting-started/tutorial/datastream_tutorial.md
##########
@@ -0,0 +1,126 @@
+---
+title: "Python DataStream API Tutorial"
+nav-parent_id: python_tutorial
+nav-pos: 30
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+This walkthrough will quickly get you started building a pure Python Flink 
DataStream project.
+
+Please refer to the PyFlink [installation guide]({{ site.baseurl 
}}/dev/python/getting-started/installation.html) on how to set up the Python 
execution environments.
+
+* This will be replaced by the TOC
+{:toc}
+
+## Setting up a Python Project
+
+You can begin by creating a Python project and installing the PyFlink package 
following the [installation guide]({{ site.baseurl 
}}/dev/python/getting-started/installation.html#installation-of-pyflink).
+
+## Writing a Flink Python DataStream API Program
+
+DataStream API applications begin by declaring a `StreamExecutionEnvironment`.
+This is the context in which a streaming program is executed.
+It can be used for setting execution parameters such as restart strategy, 
default parallelism, etc.
+
+{% highlight python %}
+env = StreamExecutionEnvironment.get_execution_environment()
+env.set_parallelism(1)
+{% endhighlight %}
+
+Once a `StreamExecutionEnvironment` is created, you can declare your source with 
it.
+
+{% highlight python %}
+ds = env.from_collection(
+    collection=[(1, 'aaa'), (2, 'bbb')],
+    type_info=Types.ROW([Types.INT(), Types.STRING()]))
+{% endhighlight %}
+
+This creates a data stream from the given collection. The type is that of the 
elements in the collection. In this example, the type is a Row type with two 
fields. The type of the first field is integer type while the second is string 
type.
+
+You can now perform transformations on the data stream or write the data to an 
external system with a sink.
+
+{% highlight python %}
+ds.add_sink(StreamingFileSink
+    .for_row_format('/tmp/output', SimpleStringEncoder())
+    .build())
+{% endhighlight %}
+
+Finally, you must execute the actual Flink Python DataStream API job.
+All operations, such as creating sources, transformations, and sinks, are lazy.
+The job runs only when `env.execute(job_name)` is called.
+
+{% highlight python %}
+env.execute("tutorial_job")
+{% endhighlight %}
+
+The complete code so far:
+
+{% highlight python %}
+from pyflink.common.serialization import SimpleStringEncoder
+from pyflink.common.typeinfo import Types
+from pyflink.datastream import StreamExecutionEnvironment
+from pyflink.datastream.connectors import StreamingFileSink
+
+
+def tutorial():
+    env = StreamExecutionEnvironment.get_execution_environment()
+    env.set_parallelism(1)
+    ds = env.from_collection(
+        collection=[(1, 'aaa'), (2, 'bbb')],
+        type_info=Types.ROW([Types.INT(), Types.STRING()]))
+    ds.add_sink(StreamingFileSink
+                .for_row_format('/tmp/output', SimpleStringEncoder())
+                .build())
+    env.execute("tutorial_job")
+
+
+if __name__ == '__main__':
+    tutorial()
+{% endhighlight %}
+
+## Executing a Flink Python DataStream API Program
+First, make sure the output directory does not exist:
+
+{% highlight bash %}
+rm -rf /tmp/output
+{% endhighlight %}
+
+Next, you can run this example on the command line:

Review comment:
       ```suggestion
   Next, you can run the example you just created on the command line:
   ```
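
On the laziness note in the quoted tutorial text (sources, transformations and sinks are only declared, and nothing runs until `env.execute(...)` is called): a toy model may help readers picture it. This is a rough sketch in plain Python, not the PyFlink API; `LazyPipeline` and `demo` are made-up names, and the method names merely echo the tutorial.

```python
class LazyPipeline:
    """Toy stand-in for an execution environment. NOT the PyFlink API."""

    def __init__(self):
        self._source = []
        self._transforms = []
        self._sink = None

    def from_collection(self, collection):
        # Only record the source; no element is processed here.
        self._source = list(collection)
        return self

    def map(self, func):
        # Only record the transformation.
        self._transforms.append(func)
        return self

    def add_sink(self, sink):
        # Only record where results should go.
        self._sink = sink
        return self

    def execute(self):
        # The recorded plan runs here, and only here.
        for item in self._source:
            for func in self._transforms:
                item = func(item)
            self._sink(item)


def demo():
    """Build a pipeline, then return the sink contents before and after execute()."""
    results = []
    pipeline = (LazyPipeline()
                .from_collection([(1, 'aaa'), (2, 'bbb')])
                .map(lambda row: f"{row[0]},{row[1]}")
                .add_sink(results.append))
    before = list(results)  # snapshot taken before anything has run
    pipeline.execute()
    return before, results


if __name__ == "__main__":
    before, after = demo()
    print(before)
    print(after)
```

The sink stays empty while the pipeline is being declared and is filled only once `execute()` is called, which is the behaviour the tutorial describes for `env.execute(job_name)`.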




