This is an automated email from the ASF dual-hosted git repository. gyfora pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/flink-kubernetes-operator.git
The following commit(s) were added to refs/heads/main by this push: new 3233206 [FLINK-26510] Add FlinkDeployment CR overview + management docs 3233206 is described below commit 32332066e80d04912989f71d6b1313a001f6b3f3 Author: Gyula Fora <g_f...@apple.com> AuthorDate: Tue Mar 15 16:19:42 2022 +0100 [FLINK-26510] Add FlinkDeployment CR overview + management docs --- .../content/docs/custom-resource/job-management.md | 130 ++++++++++++++++++++- docs/content/docs/custom-resource/overview.md | 128 +++++++++++++++++++- docs/content/docs/custom-resource/reference.md | 13 +++ .../partials/docs/inject/content-before.html | 5 +- docs/template/crd-ref.template | 14 ++- .../kubernetes/operator/crd/FlinkDeployment.java | 2 +- 6 files changed, 285 insertions(+), 7 deletions(-) diff --git a/docs/content/docs/custom-resource/job-management.md b/docs/content/docs/custom-resource/job-management.md index 0d99555..a0120e0 100644 --- a/docs/content/docs/custom-resource/job-management.md +++ b/docs/content/docs/custom-resource/job-management.md @@ -24,4 +24,132 @@ specific language governing permissions and limitations under the License. --> -# Job Management +# Job Lifecycle Management + +The core responsibility of the Flink operator is to manage the full production lifecycle of Flink jobs. + +What is covered: + - Running, suspending and deleting applications + - Stateful and stateless application upgrades + - Triggering savepoints + +The behaviour is always controlled by the respective configuration fields of the `JobSpec` object as introduced in the [FlinkDeployment overview]({{< ref "docs/custom-resource/overview" >}}). + +Let's take a look at these operations in detail. + +## Running, suspending and deleting applications + +By controlling the `state` field of the `JobSpec` users can define the desired state of the application. + +Supported application states: + - `running` : The job is expected to be running and processing data. 
+ - `suspended` : Data processing should be temporarily suspended, with the intention of continuing later. + +**Job State transitions** + +There are 4 possible state change scenarios when updating the current FlinkDeployment. + + - `running` -> `running` : Job upgrade operation. In practice, a suspend followed by a restore operation. + - `running` -> `suspended` : Suspend operation. Stops the job while maintaining state information for stateful applications. + - `suspended` -> `running` : Restore operation. Starts the application from the current state using the latest spec. + - `suspended` -> `suspended` : Update spec, job is not started. + +The way state is handled for suspend and restore operations is described in detail in the next section. + +**Cancelling/Deleting applications** + +As you can see, there is no cancelled or deleted state among the possible desired states. When users no longer wish to process data with a given FlinkDeployment, they can simply delete the deployment object using the Kubernetes API: + +``` +kubectl delete flinkdeployment my-deployment +``` + +{{< hint danger >}} +Deleting a deployment will remove all checkpoint and status information. Future deployments will start from an empty state unless manually overridden by the user. +{{< /hint >}} + +## Stateful and stateless application upgrades + +When the spec changes for a FlinkDeployment, the running Application or Session cluster must be upgraded. +In order to do this, the operator will stop the currently running job (unless already suspended) and redeploy it using the latest spec and state carried over from the previous run for stateful applications. + +Users have full control over how state should be managed when stopping and restoring stateful applications using the `upgradeMode` setting of the JobSpec.
+ +Supported values: `stateless`, `savepoint`, `last-state` + +The `upgradeMode` setting controls both the stop and restore mechanisms as detailed in the following table: + +| | Stateless | Last State | Savepoint | +| ---- | ---------- | ---- | ---- | +| Config Requirement | None | Checkpointing & Kubernetes HA Enabled | Savepoint directory defined | +| Job Status Requirement | None* | None* | Job Running | +| Suspend Mechanism | Cancel / Delete | Delete Flink deployment (keep HA metadata) | Cancel with savepoint | +| Restore Mechanism | Deploy from empty state | Recover last state using HA metadata | Restore from savepoint | + +*\*In general, no update can be executed while a savepoint operation is in progress* + +The three different upgrade modes are intended to support different use cases: + - stateless: Stateless applications, prototyping + - last-state: Suitable for most stateful production applications. Allows quick upgrades in any application state (even for failing jobs) and does not require a healthy job. Requires Flink Kubernetes HA configuration (see example below). + - savepoint: Suitable for forking and migrating applications. Requires a healthy running job, as a savepoint operation is performed before shutdown.
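Once an `upgradeMode` is set, a suspend or resume can be requested by patching the desired job state in the spec (a sketch; the deployment name `my-deployment` is illustrative and a cluster with the operator installed is assumed):

```shell
# Suspend the job; with savepoint/last-state upgrade mode the state is retained
kubectl patch flinkdeployment my-deployment --type merge \
  -p '{"spec": {"job": {"state": "suspended"}}}'

# Resume the job later from the retained state
kubectl patch flinkdeployment my-deployment --type merge \
  -p '{"spec": {"job": {"state": "running"}}}'
```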
+ +Full example using the `last-state` strategy: + +```yaml +apiVersion: flink.apache.org/v1alpha1 +kind: FlinkDeployment +metadata: + namespace: default + name: basic-checkpoint-ha-example +spec: + image: flink:1.14.3 + flinkConfiguration: + taskmanager.numberOfTaskSlots: "2" + state.savepoints.dir: file:///flink-data/savepoints + high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory + high-availability.storageDir: file:///flink-data/ha + jobManager: + replicas: 1 + resource: + memory: "2048m" + cpu: 1 + taskManager: + resource: + memory: "2048m" + cpu: 1 + podTemplate: + spec: + serviceAccount: flink + containers: + - name: flink-main-container + volumeMounts: + - mountPath: /flink-data + name: flink-volume + volumes: + - name: flink-volume + hostPath: + # directory location on host + path: /tmp/flink + # this field is optional + type: Directory + job: + jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar + parallelism: 2 + upgradeMode: last-state + state: running +``` + +## Savepoint management + +Savepoints can be triggered manually by assigning a new, random (nonce) value to the `savepointTriggerNonce` field of the job specification: + +```yaml + job: + jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar + parallelism: 2 + upgradeMode: savepoint + state: running + savepointTriggerNonce: 123 +``` + +Changing the nonce value will trigger a new savepoint. Information about the pending and most recent savepoint is stored in the FlinkDeployment status. diff --git a/docs/content/docs/custom-resource/overview.md index 567a729..2cf25b3 100644 --- a/docs/content/docs/custom-resource/overview.md +++ b/docs/content/docs/custom-resource/overview.md @@ -24,4 +24,130 @@ specific language governing permissions and limitations under the License.
--> -# Architecture +# FlinkDeployment Overview + +The core user-facing API of the Flink Kubernetes Operator is the FlinkDeployment Custom Resource (CR). + +Custom Resources are extensions of the Kubernetes API and define new object types. In our case, the FlinkDeployment CR defines Flink Application and Session cluster deployments. + +Once the Flink Kubernetes Operator is installed and running in your Kubernetes environment, it will continuously watch FlinkDeployment objects submitted by the user to detect new deployments and changes to existing ones. In case you haven't deployed the operator yet, please check out the [quickstart]({{< ref "docs/try-flink-kubernetes-operator/quick-start" >}}) for detailed instructions on how to get started. + +FlinkDeployment objects are defined in YAML format by the user and must contain the following required fields: + +``` +apiVersion: flink.apache.org/v1alpha1 +kind: FlinkDeployment +metadata: + namespace: namespace-of-my-deployment + name: my-deployment +spec: + # Deployment specs of your Flink Session/Application +``` + +The `apiVersion` and `kind` fields have fixed values, while `metadata` and `spec` control the actual Flink deployment. + +The Flink operator will subsequently add status information to your FlinkDeployment object based on the observed deployment state: + +``` +kubectl get flinkdeployment my-deployment -o yaml +``` + +```yaml +apiVersion: flink.apache.org/v1alpha1 +kind: FlinkDeployment +metadata: + ... +spec: + ... +status: + jobManagerDeploymentStatus: READY + jobStatus: + jobId: 93dfe8199a35d5503f4048a1a999c704 + jobName: State machine job + savepointInfo: {} + state: RUNNING + updateTime: "1647351134601" + reconciliationStatus: + lastReconciledSpec: + ... + success: true +``` + +Users can use the status of the FlinkDeployment to gauge the health of their deployments and any executed operation.
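For scripting and monitoring, individual status fields can be extracted with a JSONPath query (a sketch; the field paths follow the status example above, and a deployment named `my-deployment` on a live cluster is assumed):

```shell
# Print the observed job manager deployment status (e.g. READY)
kubectl get flinkdeployment my-deployment \
  -o jsonpath='{.status.jobManagerDeploymentStatus}'

# Print the observed job state (e.g. RUNNING)
kubectl get flinkdeployment my-deployment \
  -o jsonpath='{.status.jobStatus.state}'
```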
+ +## FlinkDeployment spec overview + +The `spec` is the most important part of the `FlinkDeployment` as it describes the desired Flink Application or Session cluster. +The spec contains all the information the operator needs to deploy and manage your Flink deployments, including Docker images, configurations, desired state, etc. + +Most deployments will define at least the following fields: + - `image` : Docker image used to run Flink job and task manager processes + - `serviceAccount` : Kubernetes service account used by the Flink pods + - `taskManager, jobManager` : Job and Task manager pod resource specs (cpu, memory, etc.) + - `flinkConfiguration` : Map of Flink configuration overrides such as HA and checkpointing configs + - `job` : Job Spec for Application deployments + +The Flink Kubernetes Operator supports two main types of deployments: **Application** and **Session**. + +Application deployments manage a single job deployment in Application mode, while Session deployments manage Flink Session clusters without providing any job management for them. The type of cluster created depends on the `spec` provided by the user, as we will see in the next sections. + +### Application Deployments + +To create an Application deployment, users must define the `job` (JobSpec) field in their deployment spec.
+ +Required fields: + - `jarURI` : URI of the job jar + - `parallelism` : Parallelism of the job + - `upgradeMode` : Upgrade mode of the job (stateless/savepoint/last-state) + - `state` : Desired state of the job (running/suspended) + +Minimal example: + +```yaml +apiVersion: flink.apache.org/v1alpha1 +kind: FlinkDeployment +metadata: + namespace: default + name: basic-example +spec: + image: flink:1.14.3 + flinkConfiguration: + taskmanager.numberOfTaskSlots: "2" + serviceAccount: flink + jobManager: + replicas: 1 + resource: + memory: "2048m" + cpu: 1 + taskManager: + resource: + memory: "2048m" + cpu: 1 + job: + jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar + parallelism: 2 + upgradeMode: stateless + state: running +``` + +Once created, FlinkDeployment YAMLs can be submitted through kubectl: + +```bash +kubectl apply -f your-deployment.yaml +``` + +### Session Cluster Deployments + +Session clusters use a similar spec to application clusters, with the only difference being that `job` is not defined. + +For Session clusters, the operator only provides very basic management and monitoring, covering: + - Start Session cluster + - Monitor overall cluster health + - Stop / Delete Session cluster + +## Further information + + - [Job Management and Stateful upgrades]({{< ref "docs/custom-resource/job-management" >}}) + - [Deployment customization and pod templates]({{< ref "docs/custom-resource/pod-template" >}}) + - [Full Reference]({{< ref "docs/custom-resource/reference" >}}) + - [Examples](https://github.com/apache/flink-kubernetes-operator/tree/main/examples) diff --git a/docs/content/docs/custom-resource/reference.md index d04c79e..614b959 100644 --- a/docs/content/docs/custom-resource/reference.md +++ b/docs/content/docs/custom-resource/reference.md @@ -23,6 +23,19 @@ under the License.
--> # FlinkDeployment Reference + +This page serves as a full reference for the FlinkDeployment custom resource definition, including all possible configuration parameters. + +## FlinkDeployment +**Class**: org.apache.flink.kubernetes.operator.crd.FlinkDeployment + +**Description**: Custom resource that represents both Application and Session deployments. + +| Parameter | Type | Docs | +| ----------| ---- | ---- | +| spec | org.apache.flink.kubernetes.operator.crd.spec.FlinkDeploymentSpec | Spec that describes a Flink application or session cluster deployment. | +| status | org.apache.flink.kubernetes.operator.crd.status.FlinkDeploymentStatus | Last observed status of the Flink deployment. | + ## Spec ### FlinkDeploymentSpec diff --git a/docs/layouts/partials/docs/inject/content-before.html index 0be58be..7304b9c 100644 --- a/docs/layouts/partials/docs/inject/content-before.html +++ b/docs/layouts/partials/docs/inject/content-before.html @@ -23,15 +23,14 @@ under the License. {{ if $.Site.Params.ShowOutDatedWarning }} <article class="markdown"> <blockquote style="border-color:#f66"> - {{ markdownify "This documentation is for an out-of-date version of Apache Flink Kubernetes Operator. We recommend you use the latest [stable version](https://ci.apache.org/projects/flink/flink-ml-docs-stable/)."}} + {{ markdownify "This documentation is for an out-of-date version of Apache Flink Kubernetes Operator. We recommend you use the latest [stable version](https://ci.apache.org/projects/flink/flink-kubernetes-operator-docs-stable/)."}} </blockquote> </article> {{ end }} {{ if (not $.Site.Params.IsStable) }} <article class="markdown"> <blockquote style="border-color:#f66"> - {{ markdownify "This documentation is for an unreleased version of the Apache Flink Kubernetes Operator.
We recommend you use the latest [stable version](https://ci.apache.org/projects/flink/flink-ml-docs-stable/)."}} + {{ markdownify "This documentation is for an unreleased version of the Apache Flink Kubernetes Operator. We recommend you use the latest [stable version](https://ci.apache.org/projects/flink/flink-kubernetes-operator-docs-stable/)."}} </blockquote> </article> {{ end }} - diff --git a/docs/template/crd-ref.template index c7aa433..835fd90 100644 --- a/docs/template/crd-ref.template +++ b/docs/template/crd-ref.template @@ -22,4 +22,16 @@ specific language governing permissions and limitations under the License. --> -# FlinkDeployment Reference \ No newline at end of file +# FlinkDeployment Reference + +This page serves as a full reference for the FlinkDeployment custom resource definition, including all possible configuration parameters. + +## FlinkDeployment +**Class**: org.apache.flink.kubernetes.operator.crd.FlinkDeployment + +**Description**: Custom resource that represents both Application and Session deployments. + +| Parameter | Type | Docs | +| ----------| ---- | ---- | +| spec | org.apache.flink.kubernetes.operator.crd.spec.FlinkDeploymentSpec | Spec that describes a Flink application or session cluster deployment. | +| status | org.apache.flink.kubernetes.operator.crd.status.FlinkDeploymentStatus | Last observed status of the Flink deployment.
| diff --git a/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/crd/FlinkDeployment.java b/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/crd/FlinkDeployment.java index cc5f3fb..b0a43e6 100644 --- a/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/crd/FlinkDeployment.java +++ b/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/crd/FlinkDeployment.java @@ -29,7 +29,7 @@ import io.fabric8.kubernetes.model.annotation.Group; import io.fabric8.kubernetes.model.annotation.ShortNames; import io.fabric8.kubernetes.model.annotation.Version; -/** Flink deployment object (spec + status). */ +/** Custom resource definition that represents both Application and Session deployments. */ @Experimental @JsonInclude(JsonInclude.Include.NON_NULL) @JsonDeserialize()