Márton Balassi created FLINK-26804:
--------------------------------------

Summary: Operator e2e tests sporadically fail: DEPLOYED_NOT_READY
Key: FLINK-26804
URL: https://issues.apache.org/jira/browse/FLINK-26804
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Reporter: Márton Balassi
Assignee: Márton Balassi

I introduced a sporadic failure scenario in the e2e tests with my solution for FLINK-26715. Since the operator only checks on the job every couple of seconds, the job might still be observed in the DEPLOYED_NOT_READY state even after it has successfully completed checkpoints.

{code:bash}
Run ls e2e-tests/test_*.sh | while read script_test;do \
Running e2e-tests/test_kubernetes_application_ha.sh
persistentvolumeclaim/flink-example-statemachine created
Error from server (InternalError): error when creating "e2e-tests/data/cr.yaml": Internal error occurred: failed calling webhook "vflinkdeployments.flink.apache.org": failed to call webhook: Post "https://flink-operator-webhook-service.default.svc:443/validate?timeout=10s": dial tcp 10.106.63.26:443: connect: connection refused
Command: kubectl apply -f e2e-tests/data/cr.yaml failed. Retrying...
flinkdeployment.flink.apache.org/flink-example-statemachine created
persistentvolumeclaim/flink-example-statemachine unchanged
Error from server (NotFound): deployments.apps "flink-example-statemachine" not found
Command: kubectl get deploy/flink-example-statemachine failed. Retrying...
NAME READY UP-TO-DATE AVAILABLE AGE
flink-example-statemachine 0/1 1 0 1s
deployment.apps/flink-example-statemachine condition met
Waiting for jobmanager pod flink-example-statemachine-7fcf55c88b-h5r7r ready.
pod/flink-example-statemachine-7fcf55c88b-h5r7r condition met
Waiting for log "Rest endpoint listening at"...
Log "Rest endpoint listening at" shows up.
Waiting for log "Completed checkpoint [0-9]+ for job"...
Log "Completed checkpoint [0-9]+ for job" shows up.
Successfully verified that flinkdep/flink-example-statemachine.status.jobManagerDeploymentStatus is in READY state.
Successfully verified that flinkdep/flink-example-statemachine.status.jobStatus.state is in RUNNING state.
Kill the flink-example-statemachine-7fcf55c88b-h5r7r
Defaulted container "flink-main-container" out of: flink-main-container, artifacts-fetcher (init)
Waiting for log "Restoring job 00000000000000000000000000000000 from Checkpoint"...
Log "Restoring job 00000000000000000000000000000000 from Checkpoint" shows up.
Waiting for log "Completed checkpoint [0-9]+ for job"...
Log "Completed checkpoint [0-9]+ for job" shows up.
Status verification for flinkdep/flink-example-statemachine.status.jobManagerDeploymentStatus failed. It is DEPLOYED_NOT_READY instead of READY.
Debugging failed e2e test:
{code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
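
Because the operator only refreshes the status every couple of seconds, a test could poll the status until it converges instead of checking it once. The following is only a sketch under assumptions: `wait_for_status` is a hypothetical helper, not a function taken from the operator's e2e scripts, and the timeout and one-second poll interval are illustrative values.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the operator's e2e scripts):
# run a status command repeatedly until it prints the expected value
# or the timeout (in seconds) expires.
# Usage: wait_for_status <expected> <timeout_seconds> <command> [args...]
wait_for_status() {
  local expected timeout deadline actual
  expected=$1
  timeout=$2
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  while true; do
    # A failing status command is treated as "not there yet".
    actual=$("$@" 2>/dev/null) || actual=""
    if [ "$actual" = "$expected" ]; then
      echo "Status is ${expected}."
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "Timed out after ${timeout}s: status is '${actual}' instead of '${expected}'." >&2
      return 1
    fi
    sleep 1  # illustrative poll interval
  done
}
```

Assuming the resource name and status path shown in the log, it might be invoked as `wait_for_status READY 60 kubectl get flinkdep/flink-example-statemachine -o jsonpath='{.status.jobManagerDeploymentStatus}'`. The 60-second timeout is a guess and would need to exceed the operator's reconciliation interval.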