Re: Kubernetes operator listing jobs TimeoutException

2023-09-07 Thread Evgeniy Lyutikov




From: Evgeniy Lyutikov
Sent: June 8, 2023 13:43:18
To: Shammon FY
Cc: user@flink.apache.org
Subject: Re: Kubernetes operator listing jobs TimeoutException


Hi, thanks for the reply.
These errors occur on jobs that have already been deployed successfully and are
running.

When such an error occurs, the operator starts treating the job as being in the
DEPLOYING or DEPLOYED_NOT_READY status, although the job stays in the RUNNING
state the whole time and no actions are performed on it.

The problem seems to have appeared after we updated the FlinkDeployment resource
to change the version of the running job.


2023-06-08 06:31:02,741 o.a.f.k.o.o.JobStatusObserver  [WARN ][job-name/job-name] Exception while listing jobs
2023-06-08 06:31:02,741 o.a.f.k.o.o.d.ApplicationObserver [INFO ][job-name/job-name] Observing JobManager deployment. Previous status: READY
2023-06-08 06:31:03,758 o.a.f.k.o.o.d.ApplicationObserver [INFO ][job-name/job-name] JobManager is being deployed
2023-06-08 06:31:03,824 o.a.f.k.o.l.AuditUtils [INFO ][job-name/job-name] >>> Status | Info| STABLE  | The resource deployment is considered to be stable and won’t be rolled back
2023-06-08 06:31:03,825 o.a.f.k.o.a.JobAutoScalerImpl  [INFO ][job-name/job-name] Job autoscaler is disabled
2023-06-08 06:31:03,825 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO ][job-name/job-name] Resource fully reconciled, nothing to do...
2023-06-08 06:31:03,825 o.a.f.k.o.c.FlinkDeploymentController [INFO ][job-name/job-name] End of reconciliation
2023-06-08 06:31:13,828 o.a.f.k.o.c.FlinkDeploymentController [INFO ][job-name/job-name] Starting reconciliation
2023-06-08 06:31:13,829 o.a.f.k.o.s.FlinkResourceContextFactory [INFO ][job-name/job-name] Getting service for job-name
2023-06-08 06:31:13,829 o.a.f.k.o.o.d.ApplicationObserver [INFO ][job-name/job-name] Observing JobManager deployment. Previous status: DEPLOYING
2023-06-08 06:31:14,849 o.a.f.k.o.o.d.ApplicationObserver [INFO ][job-name/job-name] JobManager is being deployed
2023-06-08 06:31:14,850 o.a.f.k.o.a.JobAutoScalerImpl  [INFO ][job-name/job-name] Job autoscaler is disabled
2023-06-08 06:31:14,850 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO ][job-name/job-name] Resource fully reconciled, nothing to do...
2023-06-08 06:31:14,850 o.a.f.k.o.c.FlinkDeploymentController [INFO ][job-name/job-name] End of reconciliation
2023-06-08 06:31:24,853 o.a.f.k.o.c.FlinkDeploymentController [INFO ][job-name/job-name] Starting reconciliation
2023-06-08 06:31:24,854 o.a.f.k.o.s.FlinkResourceContextFactory [INFO ][job-name/job-name] Getting service for job-name
2023-06-08 06:31:24,854 o.a.f.k.o.o.d.ApplicationObserver [INFO ][job-name/job-name] Observing JobManager deployment. Previous status: DEPLOYING
2023-06-08 06:31:24,858 o.a.f.k.o.o.d.ApplicationObserver [INFO ][job-name/job-name] JobManager deployment port is ready, waiting for the Flink REST API...
2023-06-08 06:31:24,926 o.a.f.k.o.l.AuditUtils [INFO ][job-name/job-name] >>> Status | Info| STABLE  | The resource deployment is considered to be stable and won’t be rolled back
2023-06-08 06:31:24,927 o.a.f.k.o.a.JobAutoScalerImpl  [INFO ][job-name/job-name] Job autoscaler is disabled
2023-06-08 06:31:24,927 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO ][job-name/job-name] Resource fully reconciled, nothing to do...
2023-06-08 06:31:24,927 o.a.f.k.o.c.FlinkDeploymentController [INFO ][job-name/job-name] End of reconciliation
2023-06-08 06:31:34,930 o.a.f.k.o.c.FlinkDeploymentController [INFO ][job-name/job-name] Starting reconciliation
2023-06-08 06:31:34,931 o.a.f.k.o.s.FlinkResourceContextFactory [INFO ][job-name/job-name] Getting service for job-name
2023-06-08 06:31:34,931 o.a.f.k.o.o.d.ApplicationObserver [INFO ][job-name/job-name] Observing JobManager deployment. Previous status: DEPLOYED_NOT_READY
2023-06-08 06:31:34,931 o.a.f.k.o.o.d.ApplicationObserver [INFO ][job-name/job-name] JobManager deployment is ready
2023-06-08 06:31:34,931 o.a.f.k.o.o.JobStatusObserver  [INFO ][job-name/job-name] Observing job status
2023-06-08 06:31:34,936 o.a.f.k.o.o.JobStatusObserver  [INFO ][job-name/job-name] Job status changed from RECONCILING to RUNNING
2023-06-08 06:31:34,960 o.a.f.k.o.l.AuditUtils [INFO ][job-name/job-name] >>> Event  | Info| JOBSTATUSCHANGED | Job status changed from RECONCILING to RUNNING
2023-06-08 06:31:35,031 o.a.f.k.o.l.AuditUtils [INFO ][job-name/job-name] >>> Status | Info| STABLE  | The resource deployment is considered to be stable and won’t be rolled back
2023-06-08 06:31:35,032 o.a.f.k.o.a.JobAutoScalerImpl  [INFO ][job-name/job-name] Job autoscaler is disabled
2023-06-08 06:31:35,032 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO ][job-name/job-name] Resource fully reconciled, nothing to do...
2023-06-08 06:31:35,032 o.a.f.k.o.c.Fl

Re: Kubernetes operator listing jobs TimeoutException

2023-06-08 Thread Evgeniy Lyutikov
conciliation
2023-06-08 06:32:35,035 o.a.f.k.o.s.FlinkResourceContextFactory [INFO ][job-name/job-name] Getting service for job-name
2023-06-08 06:32:35,036 o.a.f.k.o.o.JobStatusObserver  [INFO ][job-name/job-name] Observing job status
2023-06-08 06:32:35,044 o.a.f.k.o.o.JobStatusObserver  [INFO ][job-name/job-name] Job status (RUNNING) unchanged
2023-06-08 06:32:35,049 o.a.f.k.o.a.JobAutoScalerImpl  [INFO ][job-name/job-name] Job autoscaler is disabled
2023-06-08 06:32:35,049 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO ][job-name/job-name] Resource fully reconciled, nothing to do...
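
To see when the operator's view of the JobManager deployment left READY and when it came back, the "Previous status" messages in an operator log like the one above can be pulled out with a short script. This is only a rough sketch: it assumes each log entry sits on a single line (mail clients may wrap them), and the function name status_changes is made up for illustration, not part of the operator.

```python
import re

# Matches the ApplicationObserver message as it appears in the log excerpt:
# "<timestamp> <logger> [INFO ][<namespace>/<name>] Observing JobManager
# deployment. Previous status: <STATUS>"
PREVIOUS_STATUS = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*?\]\[([^\]]+)\] "
    r"Observing JobManager deployment\. Previous status: (\w+)"
)


def status_changes(lines):
    """Return (timestamp, status) pairs, keeping only entries where the
    reported previous status differs from the one seen just before it."""
    changes = []
    last = None
    for line in lines:
        match = PREVIOUS_STATUS.search(line)
        if not match:
            continue  # not an "Observing JobManager deployment" entry
        timestamp, _resource, status = match.groups()
        if status != last:
            changes.append((timestamp, status))
            last = status
    return changes
```

Feeding it the unwrapped operator log of one resource gives a compact timeline of the READY, DEPLOYING and DEPLOYED_NOT_READY phases that can be compared with the JmDeploymentStatus metric.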

Kubernetes configuration of Flink:
kubernetes.cluster-id: job-name
kubernetes.container.image.pull-policy: Always
kubernetes.container.image: flink:1.14.4-java11
kubernetes.internal.jobmanager.entrypoint.class: org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint
kubernetes.jobmanager.annotations: flinkdeployment.flink.apache.org/generation:3
kubernetes.jobmanager.cpu: 4.0
kubernetes.jobmanager.labels: job_version:0.3.0
kubernetes.jobmanager.memory.limit-factor: 1.3
kubernetes.jobmanager.owner.reference: blockOwnerDeletion:false,controller:false,name:job-name,uid:b118a60b-80a2-43f9-933c-d1510e63bf6c,kind:FlinkDeployment,apiVersion:flink.apache.org/v1beta1
kubernetes.jobmanager.replicas: 1
kubernetes.namespace: job-name
kubernetes.pod-template-file.jobmanager: /tmp/flink_op_generated_podTemplate_8388768779635722075.yaml
kubernetes.pod-template-file.taskmanager: /tmp/flink_op_generated_podTemplate_8986511200228142287.yaml
kubernetes.pod-template-file: /tmp/flink_op_generated_podTemplate_11143683886521703748.yaml
kubernetes.rest-service.exposed.type: Headless_ClusterIP
kubernetes.service-account: flink
kubernetes.taskmanager.cpu: 12.0
kubernetes.taskmanager.labels: job_version:0.3.0
kubernetes.taskmanager.memory.limit-factor: 1.1
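
Since the failing call is a job listing over the JobManager REST API, one way to check the endpoint independently of the operator is to request /jobs/overview with an explicit timeout from a pod inside the cluster. The sketch below is an illustration under assumptions, not what the operator does internally: it assumes the REST service is named "<cluster-id>-rest" in the job's namespace and listens on port 8081, and the helper names are hypothetical.

```python
import json
import urllib.request


def jobs_overview_url(cluster_id, namespace, port=8081):
    # Assumption: the REST service is "<cluster-id>-rest" in the given
    # namespace and uses port 8081; adjust if the deployment differs.
    return f"http://{cluster_id}-rest.{namespace}:{port}/jobs/overview"


def job_states(payload):
    """Map job name to job state from a /jobs/overview response body."""
    return {job["name"]: job["state"] for job in json.loads(payload)["jobs"]}


def list_jobs(cluster_id, namespace, timeout_s=10.0):
    """Fetch the job overview; raises an OSError subclass (URLError or a
    timeout) if the endpoint is unreachable or too slow to answer."""
    url = jobs_overview_url(cluster_id, namespace)
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return job_states(response.read())
```

With the configuration above this would be called as list_jobs("job-name", "job-name"). If it answers quickly while the operator still logs "Exception while listing jobs", the problem is more likely on the operator's client side than in the JobManager.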

This is what it looks like in the metrics:
[inline image attachment, not preserved in the archive]


From: Shammon FY
Sent: June 8, 2023 12:55:38
To: Evgeniy Lyutikov
Cc: user@flink.apache.org
Subject: Re: Kubernetes operator listing jobs TimeoutException

Hi Evgeniy,

From the following exception message:

at org.apache.flink.shaded.netty4.io.netty.bootstrap.Bootstrap.connect(Bootstrap.java:123)
at org.apache.flink.runtime.rest.RestClient.submitRequest(RestClient.java:469)
at org.apache.flink.runtime.rest.RestClient.sendRequest(RestClient.java:392)
at org.apache.flink.runtime.rest.RestClient.sendRequest(RestClient.java:306)
at org.apache.flink.client.program.rest.RestClusterClient.lambda$null$37(RestClusterClient.java:931)
at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)

It seems that the client's attempt to submit a job to the Flink cluster through
the REST API failed. Could you provide more information, such as the Kubernetes
configuration of the job, so the community can better help analyze the problem?


Best,
Shammon FY

On Wed, Jun 7, 2023 at 11:35 PM Evgeniy Lyutikov <eblyuti...@avito.ru> wrote:

Hello.
We use Kubernetes operator 1.4.0. The operator serves about 50 jobs, and from
time to time errors appear in the logs that are also reflected in the metrics
(FlinkDeployment.JmDeploymentStatus.READY.Count). What causes these errors?


2023-06-07 15:28:27,601 o.a.f.k.o.c.FlinkDeploymentController [INFO ][job-name/job-name] Starting reconciliation
2023-06-07 15:28:27,602 o.a.f.k.o.s.FlinkResourceContextFactory [INFO ][job-name/job-name] Getting service for job-name
2023-06-07 15:28:27,602 o.a.f.k.o.o.JobStatusObserver  [INFO ][job-name/job-name] Observing job status
2023-06-07 15:28:39,623 o.a.f.s.n.i.n.c.AbstractChannel [WARN ] Force-closing a channel whose registration task was not accepted by an event loop: [id: 0xd494f516]
java.util.concurrent.RejectedExecutionException: event executor terminated
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:923)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:350)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:343)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:825)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:815)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.register(AbstractChannel.java:483)
at org.apache.flink.shaded.netty4.io.netty.channel.SingleThreadEventLoop.register(SingleThreadEventLoop.java:87)
at org.apache.flink.shaded.netty4.io.netty.channel.SingleThreadEventLoop.register(SingleThreadEventLoop.java:81)

Re: Kubernetes operator listing jobs TimeoutException

2023-06-07 Thread Shammon FY
Hi Evgeniy,

From the following exception message:

at org.apache.flink.shaded.netty4.io.netty.bootstrap.Bootstrap.connect(Bootstrap.java:123)
at org.apache.flink.runtime.rest.RestClient.submitRequest(RestClient.java:469)
at org.apache.flink.runtime.rest.RestClient.sendRequest(RestClient.java:392)
at org.apache.flink.runtime.rest.RestClient.sendRequest(RestClient.java:306)
at org.apache.flink.client.program.rest.RestClusterClient.lambda$null$37(RestClusterClient.java:931)
at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)

It seems that the client's attempt to submit a job to the Flink cluster through
the REST API failed. Could you provide more information, such as the Kubernetes
configuration of the job, so the community can better help analyze the problem?


Best,
Shammon FY

On Wed, Jun 7, 2023 at 11:35 PM Evgeniy Lyutikov wrote:

> Hello.
> We use Kubernetes operator 1.4.0. The operator serves about 50 jobs, and
> from time to time errors appear in the logs that are also reflected in the
> metrics (FlinkDeployment.JmDeploymentStatus.READY.Count). What causes these
> errors?
>
>
> 2023-06-07 15:28:27,601 o.a.f.k.o.c.FlinkDeploymentController [INFO ][job-name/job-name] Starting reconciliation
> 2023-06-07 15:28:27,602 o.a.f.k.o.s.FlinkResourceContextFactory [INFO ][job-name/job-name] Getting service for job-name
> 2023-06-07 15:28:27,602 o.a.f.k.o.o.JobStatusObserver  [INFO ][job-name/job-name] Observing job status
> 2023-06-07 15:28:39,623 o.a.f.s.n.i.n.c.AbstractChannel [WARN ] Force-closing a channel whose registration task was not accepted by an event loop: [id: 0xd494f516]
> java.util.concurrent.RejectedExecutionException: event executor terminated
> at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:923)
> at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:350)
> at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:343)
> at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:825)
> at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:815)
> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.register(AbstractChannel.java:483)
> at org.apache.flink.shaded.netty4.io.netty.channel.SingleThreadEventLoop.register(SingleThreadEventLoop.java:87)
> at org.apache.flink.shaded.netty4.io.netty.channel.SingleThreadEventLoop.register(SingleThreadEventLoop.java:81)
> at org.apache.flink.shaded.netty4.io.netty.channel.MultithreadEventLoopGroup.register(MultithreadEventLoopGroup.java:86)
> at org.apache.flink.shaded.netty4.io.netty.bootstrap.AbstractBootstrap.initAndRegister(AbstractBootstrap.java:323)
> at org.apache.flink.shaded.netty4.io.netty.bootstrap.Bootstrap.doResolveAndConnect(Bootstrap.java:155)
> at org.apache.flink.shaded.netty4.io.netty.bootstrap.Bootstrap.connect(Bootstrap.java:139)
> at org.apache.flink.shaded.netty4.io.netty.bootstrap.Bootstrap.connect(Bootstrap.java:123)
> at org.apache.flink.runtime.rest.RestClient.submitRequest(RestClient.java:469)
> at org.apache.flink.runtime.rest.RestClient.sendRequest(RestClient.java:392)
> at org.apache.flink.runtime.rest.RestClient.sendRequest(RestClient.java:306)
> at org.apache.flink.client.program.rest.RestClusterClient.lambda$null$37(RestClusterClient.java:931)
> at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
> at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
> at java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
> at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:649)
> at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> 2023-06-07 15:28:39,624 o.a.f.s.n.i.n.u.c.D.rejectedExecution [ERROR] Failed to submit a listener notification task. Event loop shut down?
> java.util.concurrent.RejectedExecutionException: event executor terminated
> at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:923)
> at
>