[jira] [Created] (FLINK-35857) Operator restart failed job without latest checkpoint
chenyuzhi created FLINK-35857:

Summary: Operator restart failed job without latest checkpoint
Key: FLINK-35857
URL: https://issues.apache.org/jira/browse/FLINK-35857
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.6.1
Environment: flink kubernetes operator version: 1.6.1, flink version 1.15.2, flink job config: *execution.shutdown-on-application-finish=false*
Reporter: chenyuzhi
Attachments: image-2024-07-17-15-03-29-618.png, image-2024-07-17-15-04-32-913.png

Using the Flink Kubernetes operator with the config:
{code:java}
kubernetes.operator.job.restart.failed=true
{code}
we observed two different restart behaviours for failed jobs.

Case 1: a job with periodic checkpointing enabled and an initial checkpoint path. When it fails, the operator automatically redeploys the deployment with the same job_id and the latest checkpoint path.

!image-2024-07-17-15-03-29-618.png!

Case 2: a job with periodic checkpointing enabled but no initial checkpoint. When it fails, the operator automatically redeploys the deployment with a different job_id and no initial checkpoint path.

!image-2024-07-17-15-04-32-913.png!

I think the redeploy behaviour in case 2 may cause data inconsistency. For example, a Kafka source connector may restart consuming from the earliest/latest offset. Thus I think a job with periodic checkpointing enabled but no initial checkpoint should also be restarted with the same job_id and the latest checkpoint path, just like case 1.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
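For context, the two cases above differ only in whether an initial state path is set on the job spec. A minimal FlinkDeployment sketch of case 1 (field names taken from the flink-kubernetes-operator CRD as I understand it; the bucket path and job jar are illustrative placeholders only):

```yaml
# Hypothetical FlinkDeployment fragment; values are illustrative.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-job
spec:
  flinkConfiguration:
    kubernetes.operator.job.restart.failed: "true"
    execution.checkpointing.interval: "60s"
  job:
    jarURI: local:///opt/flink/usrlib/job.jar
    upgradeMode: last-state
    # Case 1: an initial checkpoint path is set; on failure the
    # operator redeploys with the same job_id and latest checkpoint.
    initialSavepointPath: s3://example-bucket/checkpoints/chk-42
```

Case 2 would be the same spec without `initialSavepointPath`, which is where the reported job_id change occurs.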
[jira] [Created] (FLINK-35192) operator oom
chenyuzhi created FLINK-35192:

Summary: operator oom
Key: FLINK-35192
URL: https://issues.apache.org/jira/browse/FLINK-35192
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.6.1
Environment: jdk: openjdk11, operator version: 1.6.1
Reporter: chenyuzhi
Attachments: image-2024-04-22-15-47-49-455.png, image-2024-04-22-15-52-51-600.png, image-2024-04-22-15-58-23-269.png, image-2024-04-22-15-58-42-850.png

The kubernetes operator docker process was killed by the kernel due to out of memory (at 2024.04.04 18:16):

!image-2024-04-22-15-47-49-455.png!

Metrics: the pod memory (RSS) has been increasing slowly over the past 7 days:

!image-2024-04-22-15-52-51-600.png!

However, the JVM memory metrics of the operator show no obvious anomaly:

!image-2024-04-22-15-58-23-269.png!
!image-2024-04-22-15-58-42-850.png!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Created] (FLINK-34576) Flink deployment keep staying at RECONCILING/STABLE status
chenyuzhi created FLINK-34576:

Summary: Flink deployment keep staying at RECONCILING/STABLE status
Key: FLINK-34576
URL: https://issues.apache.org/jira/browse/FLINK-34576
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.6.1
Reporter: chenyuzhi
Attachments: image-2024-03-05-15-13-11-032.png

The HA mode of flink-kubernetes-operator is being used. When one of the flink-kubernetes-operator pods restarts, the operator switches leader. However, some FlinkDeployments then stay in the *JOB_STATUS=RECONCILING, STATE=STABLE* state for a long time. Through the command "kubectl describe flinkdeployment xxx" we can see the following error, but there are no exceptions in the flink-kubernetes-operator log.

!image-2024-03-05-15-13-11-032.png!

Versions:
flink-kubernetes-operator: 1.6.1
flink: 1.14.0/1.15.2

[~gyfora]

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Created] (FLINK-34008) Taskmanager blocked with deserialization when starting
chenyuzhi created FLINK-34008:

Summary: Taskmanager blocked with deserialization when starting
Key: FLINK-34008
URL: https://issues.apache.org/jira/browse/FLINK-34008
Project: Flink
Issue Type: Bug
Components: Runtime / Task
Affects Versions: 1.15.2
Reporter: chenyuzhi

When starting a taskmanager in a Kubernetes cluster, some tasks block during deserialization. However, the jobmanager does not consider the taskmanager's heartbeat timed out, so the entire job is in a blocked state and cannot be checkpointed.

Thread dump (for the taskmanager):
{code:java}
2024-01-06 11:37:53
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.202-b08 mixed mode):

"flink-akka.actor.default-dispatcher-19" #339 prio=5 os_prio=0 tid=0x7f9ca84d6000 nid=0x5c8 waiting on condition [0x7f9c831fd000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0xd7d7d2d0> (a akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
        at java.util.concurrent.ForkJoinPool.awaitWork(ForkJoinPool.java:1824)
        at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1693)
        at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

   Locked ownable synchronizers:
        - None

"flink-akka.actor.default-dispatcher-18" #338 prio=5 os_prio=0 tid=0x7f9c8c144000 nid=0x5c7 waiting on condition [0x7f9c8857d000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0xd7d7d2d0> (a akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
        at java.util.concurrent.ForkJoinPool.awaitWork(ForkJoinPool.java:1824)
        at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1693)
        at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

   Locked ownable synchronizers:
        - None

"Attach Listener" #334 daemon prio=9 os_prio=0 tid=0x7f9cbae06000 nid=0x59e waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
        - None

"OkHttp WebSocket https://k596.elk.x.netease.com:6443/...;" #323 prio=5 os_prio=0 tid=0x7f9ca9611800 nid=0x551 waiting on condition [0x7f9c837fc000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0xe7101010> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"flink-taskexecutor-io-thread-3" #316 daemon prio=5 os_prio=0 tid=0x7f9c88584800 nid=0x53c waiting on condition [0x7f9cb0ffc000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0xd8027900> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

"flink-taskexecutor-io-thread-2" #312 daemon prio=5 os_prio=0 tid=0x7f9c9e00a800 nid=0x52d waiting on condition [0x7f9c9dffc000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0xd8027900> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at
{code}
[jira] [Created] (FLINK-32669) Support port-range for taskmanager data port
chenyuzhi created FLINK-32669:

Summary: Support port-range for taskmanager data port
Key: FLINK-32669
URL: https://issues.apache.org/jira/browse/FLINK-32669
Project: Flink
Issue Type: Improvement
Components: Runtime / Network
Reporter: chenyuzhi

We can set a port range for the taskmanager RPC port to avoid occupying an unexpected port (such as the port of a datanode service). However, we can't set a port range for the taskmanager data port (config key: taskmanager.data.port). In a production environment it's unreasonable to require a single specific port, so we usually leave this configuration key unset. The problem is that without setting the taskmanager data port, it can conflict with the ports of other services.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
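The proposed behaviour can be sketched in plain Java: instead of binding one fixed data port, try each port in a configured range and take the first free one. The class and method names below are illustrative, not part of Flink's API.

```java
import java.io.IOException;
import java.net.ServerSocket;

/**
 * Minimal sketch of port-range binding: iterate over a configured
 * range and bind to the first port that is not already occupied.
 */
public class PortRangeBinder {

    /** Returns a socket bound to the first free port in [start, end]. */
    static ServerSocket bindFirstFree(int start, int end) throws IOException {
        for (int port = start; port <= end; port++) {
            try {
                return new ServerSocket(port);
            } catch (IOException occupied) {
                // Port already in use (e.g. by another service); try the next one.
            }
        }
        throw new IOException("No free port in range " + start + "-" + end);
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket first = bindFirstFree(50100, 50110);
             ServerSocket second = bindFirstFree(50100, 50110)) {
            // The second call skips the port taken by the first.
            System.out.println("first bound to " + first.getLocalPort());
            System.out.println("second bound to " + second.getLocalPort());
        }
    }
}
```

This is how Flink already handles `taskmanager.rpc.port` ranges conceptually; the improvement request is to extend the same idea to `taskmanager.data.port`.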
[jira] [Created] (FLINK-32649) Mismatch label and value for prometheus reporter
chenyuzhi created FLINK-32649:

Summary: Mismatch label and value for prometheus reporter
Key: FLINK-32649
URL: https://issues.apache.org/jira/browse/FLINK-32649
Project: Flink
Issue Type: Bug
Affects Versions: 1.17.0
Reporter: chenyuzhi

When running the unit test 'org.apache.flink.metrics.prometheus.PrometheusReporterTest#metricIsRemovedWhileOtherMetricsWithSameNameExist', it produces a wrong response string:
{code:java}
# HELP flink_logical_scope_metric metric (scope: logical_scope)
# TYPE flink_logical_scope_metric gauge
flink_logical_scope_metric{label1="some_value",label2="value1",} 0.0
{code}
In my opinion, the expected correct response is:
{code:java}
# HELP flink_logical_scope_metric metric (scope: logical_scope)
# TYPE flink_logical_scope_metric gauge
flink_logical_scope_metric{label1="value1",label2="some_value",} 0.0
{code}
The reason may be that we create two metrics with the same name but with differently ordered label keys in the variables of the MetricGroup, and we don't sort the keys in variables within the method 'org.apache.flink.metrics.prometheus.AbstractPrometheusReporter#notifyOfAddedMetric'.

Maybe this won't happen in a production environment today, because we always use the same 'MetricGroup' instance as the input parameter of 'notifyOfAddedMetric'. However, it's important to ensure the robustness of the method against unexpected input parameters.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
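The suggested fix can be sketched independently of the reporter: sort the variable keys before rendering the label list, so that two metrics registered with the same keys in different insertion orders still line up. The class below is a self-contained stand-in, not the actual AbstractPrometheusReporter code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Sketch: render Prometheus-style labels with keys in sorted order. */
public class SortedLabels {

    /** Renders {k1="v1",k2="v2",} with keys sorted alphabetically. */
    static String render(Map<String, String> variables) {
        // TreeMap sorts by key, making the output independent of
        // the insertion order of the incoming variables map.
        return new TreeMap<>(variables).entrySet().stream()
                .map(e -> e.getKey() + "=\"" + e.getValue() + "\",")
                .collect(Collectors.joining("", "{", "}"));
    }

    public static void main(String[] args) {
        // Same labels, registered in two different insertion orders.
        Map<String, String> a = new LinkedHashMap<>();
        a.put("label1", "value1");
        a.put("label2", "some_value");
        Map<String, String> b = new LinkedHashMap<>();
        b.put("label2", "some_value");
        b.put("label1", "value1");
        // After sorting, both render identically.
        System.out.println(render(a)); // {label1="value1",label2="some_value",}
        System.out.println(render(b)); // {label1="value1",label2="some_value",}
    }
}
```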
[jira] [Created] (FLINK-32636) Prometheus metric exporter exposes stopped job metrics
chenyuzhi created FLINK-32636:

Summary: Prometheus metric exporter exposes stopped job metrics
Key: FLINK-32636
URL: https://issues.apache.org/jira/browse/FLINK-32636
Project: Flink
Issue Type: Bug
Components: Runtime / Metrics
Affects Versions: 1.14.0
Environment: flink-1.14.0
Reporter: chenyuzhi
Attachments: image-2023-07-20-17-02-01-660.png, image-2023-07-20-17-02-37-625.png

Firstly, I submitted two jobs (say jobA and jobB) to a Flink session. After I stopped jobA, the Prometheus reporter continued to report the metrics of both jobA and jobB, while the JMX reporter only reports the metrics of jobB. It's unreasonable to report the metrics of the stopped jobA.

JMX reporter metric snapshot:

!image-2023-07-20-17-02-01-660.png!

Prometheus metric snapshot:

!image-2023-07-20-17-02-37-625.png!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Created] (FLINK-30987) output source exception for SocketStreamIterator
chenyuzhi created FLINK-30987:

Summary: output source exception for SocketStreamIterator
Key: FLINK-30987
URL: https://issues.apache.org/jira/browse/FLINK-30987
Project: Flink
Issue Type: Improvement
Reporter: chenyuzhi
Attachments: image-2023-02-09-16-47-49-928.png

Sometimes we hit errors when using `org.apache.flink.streaming.experimental.SocketStreamIterator` for testing or output. However, we can't see the source exception in the log. Maybe we could throw the source exception directly.

!image-2023-02-09-16-47-49-928.png!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Created] (FLINK-30818) add metrics to measure the count of window
chenyuzhi created FLINK-30818:

Summary: add metrics to measure the count of window
Key: FLINK-30818
URL: https://issues.apache.org/jira/browse/FLINK-30818
Project: Flink
Issue Type: Improvement
Components: Runtime / Metrics
Reporter: chenyuzhi

When recovering a job that uses a window operator, the taskmanager needs more resources (CPU/memory) to process incoming data, because too many windows are created in a data-backlog scenario. Thus, maybe we should add metrics to measure the number of windows.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
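The proposed metric could be as simple as a gauge over the live window state. The sketch below uses a stand-in `Gauge` interface and an in-memory map in place of Flink's `MetricGroup` and window state; none of these names are real Flink API.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: a gauge that reports how many windows currently hold state. */
public class WindowCountGauge {

    /** Stand-in for a metrics gauge; evaluated on each scrape. */
    interface Gauge<T> { T getValue(); }

    /** Builds a gauge backed by the (hypothetical) window state map. */
    static Gauge<Integer> forState(Map<?, ?> windowState) {
        return windowState::size;
    }

    public static void main(String[] args) {
        // Keyed by window start timestamp; values stand in for pane state.
        Map<Long, String> windowState = new HashMap<>();
        Gauge<Integer> numWindows = forState(windowState);

        windowState.put(0L, "pane-0");
        windowState.put(60_000L, "pane-1");
        System.out.println("open windows: " + numWindows.getValue()); // 2

        windowState.remove(0L); // window fires and is purged
        System.out.println("open windows: " + numWindows.getValue()); // 1
    }
}
```

During backlog recovery, this count would make the "too many windows" pressure described above directly observable on a dashboard.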
[jira] [Created] (FLINK-29195) Expose lastCheckpointId metric
chenyuzhi created FLINK-29195:

Summary: Expose lastCheckpointId metric
Key: FLINK-29195
URL: https://issues.apache.org/jira/browse/FLINK-29195
Project: Flink
Issue Type: New Feature
Reporter: chenyuzhi

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Created] (FLINK-29106) value of metric 'idleTimeMsPerSecond' more than 1000
chenyuzhi created FLINK-29106:

Summary: value of metric 'idleTimeMsPerSecond' more than 1000
Key: FLINK-29106
URL: https://issues.apache.org/jira/browse/FLINK-29106
Project: Flink
Issue Type: Bug
Components: Runtime / Metrics
Affects Versions: 1.11.3
Environment: flink 1.11.3
Reporter: chenyuzhi
Attachments: image-2022-08-25-19-18-52-755.png

As the picture below shows, the value of the metric 'idleTimeMsPerSecond' is more than 1000. That is obviously unreasonable: a task cannot be idle for more than 1000 ms within a single second.

!https://sawiki2.nie.netease.com/media/image/chenyuzhi/20220825181655.png!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Created] (FLINK-28852) Closed metric group could be found in metric dashboard from WebUI
chenyuzhi created FLINK-28852:

Summary: Closed metric group could be found in metric dashboard from WebUI
Key: FLINK-28852
URL: https://issues.apache.org/jira/browse/FLINK-28852
Project: Flink
Issue Type: Bug
Components: Runtime / Metrics
Environment: Flink 1.11.3
Reporter: chenyuzhi

When I close my metric group, the related metrics are unregistered from the metric reporter; however, the closed metrics can still be found in the metric dashboard in the WebUI. This could lead to a memory leak.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Created] (FLINK-25323) Support side output late data for interval join
chenyuzhi created FLINK-25323:

Summary: Support side output late data for interval join
Key: FLINK-25323
URL: https://issues.apache.org/jira/browse/FLINK-25323
Project: Flink
Issue Type: Improvement
Reporter: chenyuzhi

Currently, Flink just discards late data in an interval join: [https://github.com/apache/flink/blob/83a2541475228a4ff9e9a9def4049fb742353549/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/operators/co/IntervalJoinOperator.java#L231]

Maybe we could use two OutputTags to side-output the late data of the left/right streams.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
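The proposal above can be sketched with plain collections: records whose timestamp is already behind the current watermark are routed to a late-data side output instead of being silently dropped. In Flink this would use `OutputTag` inside `IntervalJoinOperator`; the classes here are simplified stand-ins.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: route late records to a side output instead of dropping them. */
public class LateSideOutputSketch {

    record Event(String key, long timestamp) {}

    static final List<Event> mainOutput = new ArrayList<>();
    static final List<Event> lateOutput = new ArrayList<>();

    /** Late records (behind the watermark) go to the side output. */
    static void process(Event e, long currentWatermark) {
        if (e.timestamp() < currentWatermark) {
            lateOutput.add(e); // today's behaviour: silently discarded
        } else {
            mainOutput.add(e); // joined as usual
        }
    }

    public static void main(String[] args) {
        long watermark = 1_000L;
        process(new Event("a", 1_200L), watermark); // on time
        process(new Event("b", 800L), watermark);   // late
        System.out.println(mainOutput.size() + " on time, "
                + lateOutput.size() + " late");
    }
}
```

With two such side outputs (one per input stream), users could monitor or reprocess late records rather than losing them.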