[GitHub] [flink] morsapaes commented on a change in pull request #12057: [FLINK-16210][docs] Extending the "Flink Architecture" section.

GitBox Thu, 14 May 2020 06:43:26 -0700


morsapaes commented on a change in pull request #12057:
URL: https://github.com/apache/flink/pull/12057#discussion_r425146659




##########
File path: docs/concepts/flink-architecture.md
##########
@@ -129,4 +167,108 @@ two main benefits:
 
 <img src="{{ site.baseurl }}/fig/slot_sharing.svg" alt="TaskManagers with 
shared Task Slots" class="offset" width="80%" />
 
+## Flink Application Execution
+
+A _Flink Application_ is any user program that spawns one or multiple Flink
+jobs from its ``main()`` method. The execution of these jobs can happen in a
+local JVM (``LocalEnvironment``) or on a remote setup of clusters with multiple
+machines (``RemoteEnvironment``). For each program, the
+[``ExecutionEnvironment``]({{ site.baseurl }}/api/java/) provides methods to
+control the job execution (e.g. setting the parallelism) and to interact with
+the outside world (see [Anatomy of a Flink Program]({{ site.baseurl
+}}/dev/api_concepts.html#anatomy-of-a-flink-program)).
+
+The jobs of a Flink Application can either be submitted to a long-running
+[Flink Session Cluster]({{ site.baseurl
+}}/concepts/glossary.html#flink-session-cluster), a dedicated [Flink Job
+Cluster]({{ site.baseurl }}/concepts/glossary.html#flink-job-cluster) or a
+[Flink Application Cluster]({{ site.baseurl
+}}/concepts/glossary.html#flink-application-cluster). The difference between
+these options is mainly related to the cluster’s lifecycle and to resource
+isolation guarantees.
+
+### Flink Session Cluster
+
+* **Cluster Lifecycle**: in a Flink Session Cluster, the client connects to a
+  pre-existing, long-running cluster that can accept multiple job submissions.
+  Even after all jobs are finished, the cluster (and the Flink Master) will
+  keep running until the session is manually stopped. The lifetime of a Flink
+  Session Cluster is therefore not bound to the lifetime of any Flink Job.
+
+* **Resource Isolation**: TaskManager slots are allocated by the
+  ResourceManager on job submission and released once the job is finished.
+  Because all jobs are sharing the same cluster, there is some competition for
+  cluster resources — like network bandwidth in the submit-job phase. One
+  limitation of this shared setup is that if one TaskManager crashes, then all
+  jobs that have tasks running on this worker will fail; in a similar way, if
+  some fatal error occurs on the Flink Master, it will affect all jobs running
+  in the cluster.
+
+* **Other considerations**: having a pre-existing cluster saves a considerable
+  amount of time applying for resources and starting TaskManagers. This is
+  important in scenarios where the execution time of jobs is very short and a
+  high startup time would negatively impact the end-to-end user experience — as
+  is the case with interactive analysis of short queries, where it is desirable
+  that jobs can quickly perform computations using existing resources.
+
+<div class="alert alert-info"> <strong>Note:</strong> Formerly, a Flink 
Session Cluster was also known as a Flink Cluster in <i>session mode</i>. </div>
+
+### Flink Job Cluster
+
+* **Cluster Lifecycle**: in a Flink Job Cluster, the available cluster manager
+  (like YARN or Kubernetes) is used to spin up a cluster for each submitted job
+  and this cluster is available to that job only. Here, the client first
+  requests resources from the cluster manager to start the Flink Master and
+  submits the job to the Dispatcher running inside this process. TaskManagers
+  are then lazily allocated based on the resource requirements of the job. Once
+  the job is finished, the Flink Job Cluster is torn down.
+
+* **Resource Isolation**: a fatal error in the Flink Master only ever affects
+  one job in a Flink Job Cluster.
+
+* **Other considerations**: because the ResourceManager has to apply and wait
+  for external resource management components to start the TaskManager
+  processes and allocate resources, Flink Job Clusters are more suited to large
+  jobs that are long-running, have high-stability requirements and are not
+  sensitive to higher startup times.
+
+<div class="alert alert-info"> <strong>Note:</strong> Formerly, a Flink Job 
Cluster was also known as a Flink Cluster in <i>job (or per-job) mode</i>. 
</div>
+
+### Flink Application Cluster
+
+* **Cluster Lifecycle**: a Flink Application Cluster is a dedicated Flink
+  cluster that only executes jobs from one Flink Application and where the
+  ``main()`` method runs on the cluster rather than the client. The job
+  submission is a one-step process: you don’t need to start a Flink cluster
+  first and then submit a job to the existing cluster session; instead, you
+  package your application logic and dependencies into a executable job JAR and
+  the cluster entrypoint ([ApplicationClusterEntryPoint]({{ site.baseurl
+  
}}/api/java/index.html?org/apache/flink/container/entrypoint/StandaloneJobClusterEntryPoint.html))
+  is responsible for calling the ``main()`` method to extract the JobGraph.
+  This allows you to deploy a Flink Application like any other application on
+  Kubernetes, for example. The lifetime of a Flink Application Cluster is
+  therefore bound to the lifetime of the Flink Application.

Review comment:
       There is definitely room for confusion. In this case, I was trying to 
reduce that by being consistent with the definition in the Glossary: 
https://ci.apache.org/projects/flink/flink-docs-master/concepts/glossary.html#flink-application
   
   In the (near) future, it would probably be good to make this clearer by e.g. 
dropping the job cluster terminology and use just the new one — in the end:
   
   "The "per job mode" is a bit of a (rarer) special case of the application, 
pulling out dataflow graph assembly, but otherwise is the same. To avoid 
overcomplicating this early, we were thinking to not have this in the concepts, 
but only in the detailed ops docs."
   
   Users might be be too familiar with job clusters/per-job mode to just wipe 
it off sight and replace it, though.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [flink] morsapaes commented on a change in pull request #12057: [FLINK-16210][docs] Extending the "Flink Architecture" section.

Reply via email to