[
https://issues.apache.org/jira/browse/FLINK-30150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643305#comment-17643305
]
Peter Vary commented on FLINK-30150:
------------------------------------
This is the exception in the logs:
{code:java}
2022-12-05T11:40:59.2665289Z 2022-12-05 11:40:26,746 o.a.f.k.o.o.d.SessionObserver [ERROR][default/session-cluster-1] REST service in session cluster is bad now
2022-12-05T11:40:59.2665851Z java.util.concurrent.TimeoutException
2022-12-05T11:40:59.2666258Z at java.base/java.util.concurrent.CompletableFuture.timedGet(Unknown Source)
2022-12-05T11:40:59.2666841Z at java.base/java.util.concurrent.CompletableFuture.get(Unknown Source)
2022-12-05T11:40:59.2667549Z at org.apache.flink.kubernetes.operator.service.AbstractFlinkService.listJobs(AbstractFlinkService.java:231)
2022-12-05T11:40:59.2668462Z at org.apache.flink.kubernetes.operator.observer.deployment.SessionObserver.observeFlinkCluster(SessionObserver.java:48)
2022-12-05T11:40:59.2669809Z at org.apache.flink.kubernetes.operator.observer.deployment.AbstractFlinkDeploymentObserver.observeInternal(AbstractFlinkDeploymentObserver.java:89)
2022-12-05T11:40:59.2671385Z at org.apache.flink.kubernetes.operator.observer.deployment.AbstractFlinkDeploymentObserver.observeInternal(AbstractFlinkDeploymentObserver.java:55)
2022-12-05T11:40:59.2672514Z at org.apache.flink.kubernetes.operator.observer.AbstractFlinkResourceObserver.observe(AbstractFlinkResourceObserver.java:56)
2022-12-05T11:40:59.2673507Z at org.apache.flink.kubernetes.operator.observer.AbstractFlinkResourceObserver.observe(AbstractFlinkResourceObserver.java:32)
2022-12-05T11:40:59.2674466Z at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:113)
2022-12-05T11:40:59.2675692Z at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:54)
2022-12-05T11:40:59.2676509Z at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:136)
2022-12-05T11:40:59.2677043Z at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:94)
2022-12-05T11:40:59.2677741Z at org.apache.flink.kubernetes.operator.metrics.OperatorJosdkMetrics.timeControllerExecution(OperatorJosdkMetrics.java:80)
2022-12-05T11:40:59.2678451Z at io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:93)
2022-12-05T11:40:59.2679180Z at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:130)
2022-12-05T11:40:59.2680055Z at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:110)
2022-12-05T11:40:59.2681621Z at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:81)
2022-12-05T11:40:59.2682478Z at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:54)
2022-12-05T11:40:59.2683241Z at io.javaoperatorsdk.operator.processing.event.EventProcessor$ReconcilerExecutor.run(EventProcessor.java:406)
2022-12-05T11:40:59.2683817Z at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
2022-12-05T11:40:59.2684294Z at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
2022-12-05T11:40:59.2684676Z at java.base/java.lang.Thread.run(Unknown Source)
{code}
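The top frames of the trace are a timed CompletableFuture.get called from listJobs. As a minimal sketch of how that call produces a TimeoutException when nothing answers in time (this is not the operator's code; the class name, the helper timesOut and the 100 ms bound are made up for illustration):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedGetSketch {

    /**
     * Waits for the future with an upper bound.
     * Returns true if the wait ended in a TimeoutException, false if the future completed in time.
     */
    public static boolean timesOut(CompletableFuture<?> future, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for a REST call whose server is down: never completed.
        CompletableFuture<String> neverAnswered = new CompletableFuture<>();
        System.out.println("unanswered call timed out: " + timesOut(neverAnswered, 100));

        // Hypothetical stand-in for a REST call that was answered.
        CompletableFuture<String> answered = CompletableFuture.completedFuture("jobs");
        System.out.println("answered call timed out: " + timesOut(answered, 100));
    }
}
```

A future that is never completed is the simplest way to imitate an endpoint that does not respond; the real cause in the operator may differ.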
The ERROR line in the operator log carries the timestamp 2022-12-05 11:40:26,746.
This happens while we manually kill the job to test the recovery:
{code:java}
2022-12-05T11:40:12.8330378Z Successfully verified that sessionjob/flink-example-statemachine.status.jobStatus.state is in RUNNING state.
2022-12-05T11:40:12.9711940Z Kill the session-cluster-1-7bc5b4d7cb-t5hgq
2022-12-05T11:40:13.3083721Z Waiting for log "Restoring job ffffffff9b85cb750000000000000001 from Checkpoint"...
2022-12-05T11:40:35.8208688Z Log "Restoring job ffffffff9b85cb750000000000000001 from Checkpoint" shows up.
{code}
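The error at 11:40:26 falls between the kill (11:40:12) and the appearance of the restore log (11:40:35), so I would say that this is expected.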
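For anyone who wants to double-check the timing, here is a small sketch that compares the timestamps quoted in the two logs. It assumes the operator's log timestamp and the CI timestamps are both in UTC, which the logs do not state; the class and method names are made up for illustration.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class RecoveryWindowCheck {

    // Format of the operator's own log timestamp, e.g. "2022-12-05 11:40:26,746".
    private static final DateTimeFormatter OPERATOR_LOG =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /**
     * Returns true if the operator log timestamp lies within [windowStart, windowEnd].
     * windowStart and windowEnd are ISO-8601 instants as printed by the CI log.
     * Assumption: the operator timestamp is in UTC.
     */
    public static boolean inWindow(String operatorTs, String windowStart, String windowEnd) {
        Instant error = LocalDateTime.parse(operatorTs, OPERATOR_LOG).toInstant(ZoneOffset.UTC);
        Instant start = Instant.parse(windowStart);
        Instant end = Instant.parse(windowEnd);
        return !error.isBefore(start) && !error.isAfter(end);
    }

    public static void main(String[] args) {
        boolean expected = inWindow(
                "2022-12-05 11:40:26,746",        // ERROR line in the operator log
                "2022-12-05T11:40:12.9711940Z",   // "Kill the session-cluster-1-..."
                "2022-12-05T11:40:35.8208688Z");  // restore log "shows up"
        System.out.println("error inside kill/restore window: " + expected);
    }
}
```

With the values from the logs above the check returns true, which matches the reading that the error is a side effect of the deliberate kill.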
> Evaluate operator error log whitelist entry: REST service in session cluster
> is bad now
> ---------------------------------------------------------------------------------------
>
> Key: FLINK-30150
> URL: https://issues.apache.org/jira/browse/FLINK-30150
> Project: Flink
> Issue Type: Sub-task
> Components: Kubernetes Operator
> Reporter: Gabor Somogyi
> Priority: Major
>