morsapaes commented on a change in pull request #12057:
URL: https://github.com/apache/flink/pull/12057#discussion_r424955104



##########
File path: docs/concepts/flink-architecture.md
##########
@@ -129,4 +167,108 @@ two main benefits:
 
 <img src="{{ site.baseurl }}/fig/slot_sharing.svg" alt="TaskManagers with shared Task Slots" class="offset" width="80%" />
 
+## Flink Application Execution
+
+A _Flink Application_ is any user program that spawns one or more Flink jobs
+from its ``main()`` method. The execution of these jobs can happen in a
+local JVM (``LocalEnvironment``) or on a remote setup of clusters with multiple
+machines (``RemoteEnvironment``). For each program, the
+[``ExecutionEnvironment``]({{ site.baseurl }}/api/java/) provides methods to
+control the job execution (e.g. setting the parallelism) and to interact with
+the outside world (see [Anatomy of a Flink Program]({{ site.baseurl
+}}/dev/api_concepts.html#anatomy-of-a-flink-program)).
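+
+As a minimal, illustrative sketch (the class and job names are made up for
+this example), a Flink Application with a single job might look like this:
+
+{% highlight java %}
+import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
+
+public class MyFlinkApplication {
+
+    public static void main(String[] args) throws Exception {
+        // Returns a local environment when run from the IDE, or the
+        // environment of the cluster the program is submitted to.
+        StreamExecutionEnvironment env =
+            StreamExecutionEnvironment.getExecutionEnvironment();
+
+        // Control the job execution, e.g. set the parallelism.
+        env.setParallelism(4);
+
+        // A trivial pipeline that interacts with the outside world.
+        env.fromElements(1, 2, 3, 4)
+            .map(i -> i * i)
+            .print();
+
+        // Each call to execute() spawns one Flink job.
+        env.execute("my-flink-job");
+    }
+}
+{% endhighlight %}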
+
+The jobs of a Flink Application can either be submitted to a long-running
+[Flink Session Cluster]({{ site.baseurl
+}}/concepts/glossary.html#flink-session-cluster), a dedicated [Flink Job
+Cluster]({{ site.baseurl }}/concepts/glossary.html#flink-job-cluster) or a
+[Flink Application Cluster]({{ site.baseurl
+}}/concepts/glossary.html#flink-application-cluster). The difference between
+these options is mainly related to the cluster’s lifecycle and to resource
+isolation guarantees.
+
+### Flink Session Cluster
+
+* **Cluster Lifecycle**: in a Flink Session Cluster, the client connects to a
+  pre-existing, long-running cluster that can accept multiple job submissions
+  (one way a program can target such a cluster is sketched at the end of this
+  section). Even after all jobs are finished, the cluster (and the Flink
+  Master) will keep running until the session is manually stopped. The
+  lifetime of a Flink Session Cluster is therefore not bound to the lifetime
+  of any Flink Job.
+
+* **Resource Isolation**: TaskManager slots are allocated by the
+  ResourceManager on job submission and released once the job is finished.
+  Because all jobs share the same cluster, there is some competition for
+  cluster resources, such as network bandwidth during the job-submission
+  phase. One limitation of this shared setup is that if one TaskManager
+  crashes, then all jobs that have tasks running on this worker will fail;
+  similarly, if a fatal error occurs on the Flink Master, it will affect all
+  jobs running in the cluster.
+
+* **Other considerations**: having a pre-existing cluster saves a considerable
+  amount of time applying for resources and starting TaskManagers. This is
+  important in scenarios where the execution time of jobs is very short and a
+  high startup time would negatively impact the end-to-end user experience — as
+  is the case with interactive analysis of short queries, where it is desirable
+  that jobs can quickly perform computations using existing resources.
+
+<div class="alert alert-info"> <strong>Note:</strong> Formerly, a Flink Session Cluster was also known as a Flink Cluster in <i>session mode</i>. </div>
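+
+As a sketch of the "client connects to a pre-existing cluster" model described
+above, a program can also target a running Session Cluster directly via
+``createRemoteEnvironment()``; the host name, REST port and JAR path below are
+placeholders for your own setup:
+
+{% highlight java %}
+import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
+
+public class SessionClusterClient {
+
+    public static void main(String[] args) throws Exception {
+        // Connect to an already running Flink Session Cluster. The host name,
+        // REST port (8081 by default) and user-code JAR are placeholders.
+        StreamExecutionEnvironment env =
+            StreamExecutionEnvironment.createRemoteEnvironment(
+                "session-cluster-host", 8081, "/path/to/user-code.jar");
+
+        env.fromElements("a", "b", "c").print();
+
+        // The job runs on the session cluster, which keeps running (and can
+        // accept further jobs) after this job finishes.
+        env.execute("session-cluster-job");
+    }
+}
+{% endhighlight %}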
+
+### Flink Job Cluster
+
+* **Cluster Lifecycle**: in a Flink Job Cluster, the available cluster manager
+  (like YARN or Kubernetes) is used to spin up a cluster for each submitted job
+  and this cluster is available to that job only. Here, the client first
+  requests resources from the cluster manager to start the Flink Master and
+  submits the job to the Dispatcher running inside this process. TaskManagers
+  are then lazily allocated based on the resource requirements of the job. Once
+  the job is finished, the Flink Job Cluster is torn down.
+
+* **Resource Isolation**: a fatal error in the Flink Master only ever affects
+  one job in a Flink Job Cluster.
+
+* **Other considerations**: because the ResourceManager has to apply for
+  resources and wait for the external resource management components to start
+  the TaskManager processes and allocate resources, Flink Job Clusters are
+  better suited to large jobs that are long-running, have high-stability
+  requirements and are not sensitive to longer startup times.
+
+<div class="alert alert-info"> <strong>Note:</strong> Formerly, a Flink Job Cluster was also known as a Flink Cluster in <i>job (or per-job) mode</i>. </div>
+
+### Flink Application Cluster
+
+* **Cluster Lifecycle**: a Flink Application Cluster is a dedicated Flink
+  cluster that only executes jobs from one Flink Application and where the
+  ``main()`` method runs on the cluster rather than the client. The job
+  submission is a one-step process: you don’t need to start a Flink cluster
+  first and then submit a job to the existing cluster session; instead, you
+package your application logic and dependencies into an executable job JAR and
+  the cluster entrypoint ([ApplicationClusterEntryPoint]({{ site.baseurl
+  }}/api/java/index.html?org/apache/flink/container/entrypoint/StandaloneJobClusterEntryPoint.html))
+  is responsible for calling the ``main()`` method to extract the JobGraph.
+  This allows you to deploy a Flink Application like any other application on
+  Kubernetes, for example. The lifetime of a Flink Application Cluster is
+  therefore bound to the lifetime of the Flink Application.
+
+* **Resource Isolation**: in a Flink Application Cluster, the ResourceManager
+  and Dispatcher are scoped to a single Flink Application, which provides a
+  better separation of concerns than the Flink Session Cluster.
+
+<div class="alert alert-info"> <strong>Note:</strong> A Flink Job Cluster can be seen as a “run-on-client” alternative to Flink Application Clusters. </div>
+
+{% top %}
+
+## Self-contained Flink Applications
+
+When you build something like an event-driven application, it should not be
+necessary to think about and manage clusters at all. For this reason, there
+are efforts in the community towards enabling _Flink-as-a-Library_ in the
+future.
+
+The idea is that deploying a Flink Application becomes as easy as starting a
+process: Flink would be like any other library that you add to your
+application, without affecting how you deploy it. Deploying such an
+application then simply means starting a set of processes that connect to
+each other, figure out their roles (e.g. JobManager, TaskManager) and execute
+the application in a distributed, parallel way. If the application cannot
+keep up with the workload, more processes can be started to rescale it.

Review comment:
   From what I understood in conversation with Stephan and Aljoscha, library mode is still "not there" yet. The idea here would mainly be to dispel the myth that you always need a cluster to run Flink; the progress of this effort can then be discussed in the "Operations" docs.

   I do like your suggestion to hint at auto-scaling rather than implying that the processes need to be scaled manually, and will rephrase some sentences!



