[jira] [Created] (FLINK-35857) Operator restart failed job without latest checkpoint

2024-07-17 Thread chenyuzhi (Jira)
chenyuzhi created FLINK-35857:
-

 Summary: Operator restart failed job without latest checkpoint
 Key: FLINK-35857
 URL: https://issues.apache.org/jira/browse/FLINK-35857
 Project: Flink
  Issue Type: Bug
  Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.6.1
 Environment:  flink kubernetes operator version: 1.6.1

flink version 1.15.2

flink job config:

*execution.shutdown-on-application-finish=false*
Reporter: chenyuzhi
 Attachments: image-2024-07-17-15-03-29-618.png, 
image-2024-07-17-15-04-32-913.png

We are using the Flink Kubernetes operator with the following config:
{code:java}
kubernetes.operator.job.restart.failed=true {code}
We observed different restart behaviour for failed jobs in two cases.

Case 1:

A job with periodic checkpointing enabled and an initial checkpoint path: when it fails, the operator automatically redeploys the deployment with the same job_id and the latest checkpoint path.

 

!image-2024-07-17-15-03-29-618.png!

 

Case 2:

A job with periodic checkpointing enabled but no initial checkpoint: when it fails, the operator automatically redeploys the deployment with a different job_id and no initial checkpoint path.

!image-2024-07-17-15-04-32-913.png!

 

I think that in case 2 the redeploy behaviour may cause data inconsistency. For example, the Kafka source connector may consume data from the earliest/latest offset.

 

Thus I think a job with periodic checkpointing enabled but no initial checkpoint should be restarted with the same job_id and the latest checkpoint path, just like case 1.
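
For reference, a minimal sketch of the configuration involved (the checkpoint interval and the path below are illustrative assumptions, not the exact values from our deployments):
{code:java}
# Operator-level setting (both cases):
kubernetes.operator.job.restart.failed=true

# Flink configuration of the job (both cases):
execution.checkpointing.interval=60s
execution.shutdown-on-application-finish=false

# Case 1 only: the FlinkDeployment job spec additionally carries an initial path, e.g.
# spec.job.initialSavepointPath=hdfs:///flink/checkpoints/<job-id>/chk-123
{code}
With the initial path present (case 1) the operator restarts from the latest checkpoint; without it (case 2) it currently redeploys with a new job_id and no checkpoint path, which is the behaviour questioned above.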



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35192) operator oom

2024-04-22 Thread chenyuzhi (Jira)
chenyuzhi created FLINK-35192:
-

 Summary: operator oom
 Key: FLINK-35192
 URL: https://issues.apache.org/jira/browse/FLINK-35192
 Project: Flink
  Issue Type: Bug
  Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.6.1
 Environment: jdk: openjdk11
operator version: 1.6.1

Reporter: chenyuzhi
 Attachments: image-2024-04-22-15-47-49-455.png, 
image-2024-04-22-15-52-51-600.png, image-2024-04-22-15-58-23-269.png, 
image-2024-04-22-15-58-42-850.png

The Kubernetes operator Docker process was killed by the kernel due to out of memory (at 2024-04-04 18:16).

 !image-2024-04-22-15-47-49-455.png! 


Metrics:
The pod memory (RSS) has been increasing slowly over the past 7 days:
 !image-2024-04-22-15-52-51-600.png! 

However, the JVM memory metrics of the operator show no obvious anomaly:
 !image-2024-04-22-15-58-23-269.png! 
 !image-2024-04-22-15-58-42-850.png! 





--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34576) Flink deployment keep staying at RECONCILING/STABLE status

2024-03-04 Thread chenyuzhi (Jira)
chenyuzhi created FLINK-34576:
-

 Summary: Flink deployment keep staying at RECONCILING/STABLE status
 Key: FLINK-34576
 URL: https://issues.apache.org/jira/browse/FLINK-34576
 Project: Flink
  Issue Type: Bug
  Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.6.1
Reporter: chenyuzhi
 Attachments: image-2024-03-05-15-13-11-032.png

The HA mode of flink-kubernetes-operator is being used. When one of the 
flink-kubernetes-operator pods restarts, the operator switches the leader. 
However, some FlinkDeployments have been stuck in the 
*JOB_STATUS=RECONCILING / STATE=STABLE* state for a long time. Through 
the command "kubectl describe flinkdeployment xxx" we can see the following error, 
but there are no exceptions in the flink-kubernetes-operator log.
 
!image-2024-03-05-15-13-11-032.png!

 

Versions:

flink-kubernetes-operator: 1.6.1

flink: 1.14.0/1.15.2

 

[~gyfora] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34008) Taskmanager blocked with deserialization when starting

2024-01-06 Thread chenyuzhi (Jira)
chenyuzhi created FLINK-34008:
-

 Summary: Taskmanager blocked with deserialization when starting
 Key: FLINK-34008
 URL: https://issues.apache.org/jira/browse/FLINK-34008
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Task
Affects Versions: 1.15.2
Reporter: chenyuzhi


When starting a taskmanager in a Kubernetes cluster, some tasks block on deserialization.

However, the jobmanager does not consider the taskmanager's heartbeat to have timed out, so the entire job is in a blocked state and cannot be checkpointed.

 

Thread dump (for the taskmanager):

 

 
{code:java}
2024-01-06 11:37:53
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.202-b08 mixed mode):

"flink-akka.actor.default-dispatcher-19" #339 prio=5 os_prio=0 tid=0x7f9ca84d6000 nid=0x5c8 waiting on condition [0x7f9c831fd000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0xd7d7d2d0> (a akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
    at java.util.concurrent.ForkJoinPool.awaitWork(ForkJoinPool.java:1824)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1693)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

   Locked ownable synchronizers:
    - None

"flink-akka.actor.default-dispatcher-18" #338 prio=5 os_prio=0 tid=0x7f9c8c144000 nid=0x5c7 waiting on condition [0x7f9c8857d000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0xd7d7d2d0> (a akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
    at java.util.concurrent.ForkJoinPool.awaitWork(ForkJoinPool.java:1824)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1693)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

   Locked ownable synchronizers:
    - None

"Attach Listener" #334 daemon prio=9 os_prio=0 tid=0x7f9cbae06000 nid=0x59e waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
    - None

"OkHttp WebSocket https://k596.elk.x.netease.com:6443/..." #323 prio=5 os_prio=0 tid=0x7f9ca9611800 nid=0x551 waiting on condition [0x7f9c837fc000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0xe7101010> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
    - None

"flink-taskexecutor-io-thread-3" #316 daemon prio=5 os_prio=0 tid=0x7f9c88584800 nid=0x53c waiting on condition [0x7f9cb0ffc000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0xd8027900> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
    - None

"flink-taskexecutor-io-thread-2" #312 daemon prio=5 os_prio=0 tid=0x7f9c9e00a800 nid=0x52d waiting on condition [0x7f9c9dffc000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0xd8027900> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at

[jira] [Created] (FLINK-32669) Support port-range for taskmanager data port

2023-07-25 Thread chenyuzhi (Jira)
chenyuzhi created FLINK-32669:
-

 Summary: Support port-range for taskmanager data port
 Key: FLINK-32669
 URL: https://issues.apache.org/jira/browse/FLINK-32669
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Network
Reporter: chenyuzhi


We can set up a port range for the taskmanager RPC port to avoid occupying an 
unexpected port (such as the port of a datanode service).

However, we can't set up a port range for the taskmanager data port (config key: 
taskmanager.data.port).
In a production environment it's unreasonable to pin a specific port, so we usually 
leave this configuration key unset.

The problem is that without setting the taskmanager data port explicitly, the chosen 
port can conflict with the ports of other services, which is unreasonable.
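
For comparison, a hedged sketch of the relevant flink-conf entries (the concrete ranges are illustrative assumptions only):
{code:java}
# Already supported: the RPC port accepts a list or a range of ports.
taskmanager.rpc.port: 50100-50200

# The data port only accepts a single value today; the default 0 means
# "pick a random port", which is what allows collisions with other services.
taskmanager.data.port: 0

# Proposed improvement: accept a range here as well, e.g.
# taskmanager.data.port: 50201-50300
{code}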



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32649) Mismatch label and value for prometheus reporter

2023-07-22 Thread chenyuzhi (Jira)
chenyuzhi created FLINK-32649:
-

 Summary: Mismatch label and value for prometheus reporter
 Key: FLINK-32649
 URL: https://issues.apache.org/jira/browse/FLINK-32649
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.17.0
Reporter: chenyuzhi


When running the unit test 
'org.apache.flink.metrics.prometheus.PrometheusReporterTest#metricIsRemovedWhileOtherMetricsWithSameNameExist',
 we got the wrong response string:
{code:java}
# HELP flink_logical_scope_metric metric (scope: logical_scope)
# TYPE flink_logical_scope_metric gauge
flink_logical_scope_metric{label1="some_value",label2="value1",} 0.0 {code}
 

In my opinion, the expected response is:

 
{code:java}
# HELP flink_logical_scope_metric metric (scope: logical_scope)
# TYPE flink_logical_scope_metric gauge
flink_logical_scope_metric{label1="value1",label2="some_value",} 0.0
 {code}
 

 

The reason may be that we create two metrics with the same name but with label keys 
in different orders in the variables of the MetricGroup, and we don't sort the keys 
of the variables within the method 
'org.apache.flink.metrics.prometheus.AbstractPrometheusReporter#notifyOfAddedMetric'.
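
To make the suspected root cause concrete, here is a minimal, self-contained sketch (an illustration only, not the reporter code): two groups carry the same variables but in different insertion order, so taking label names from one group and values from the other pairs them up wrongly, exactly as in the output above; sorting the variable keys would make the order deterministic.
{code:java}
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the suspected root cause, not the actual AbstractPrometheusReporter code.
public class LabelOrderSketch {

    public static void main(String[] args) {
        // First metric group: variables inserted as label1, label2.
        Map<String, String> firstGroup = new LinkedHashMap<>();
        firstGroup.put("label1", "value1");
        firstGroup.put("label2", "some_value");

        // Second metric group: same variables, inserted in the opposite order.
        Map<String, String> secondGroup = new LinkedHashMap<>();
        secondGroup.put("label2", "some_value");
        secondGroup.put("label1", "value1");

        // If the label names come from the first group but the values from the
        // second (because both metrics share the same scoped name), the pairing
        // is wrong: label1 -> some_value, label2 -> value1.
        System.out.println("names : " + new ArrayList<>(firstGroup.keySet()));
        System.out.println("values: " + new ArrayList<>(secondGroup.values()));

        // Sorting the variable keys (e.g. via a TreeMap) gives a deterministic
        // order regardless of how the group was populated.
        System.out.println("sorted: " + new TreeMap<>(secondGroup));
    }
}
{code}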

 

Maybe it won't happen in a production environment now, because we always use the same 
'MetricGroup' instance as the input parameter of the method 'notifyOfAddedMetric'.

However, it's important to ensure the robustness of the method with unexpected input 
parameters.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32636) Prometheus metric exporter exposes stopped job metrics

2023-07-20 Thread chenyuzhi (Jira)
chenyuzhi created FLINK-32636:
-

 Summary: Prometheus metric exporter exposes stopped job metrics
 Key: FLINK-32636
 URL: https://issues.apache.org/jira/browse/FLINK-32636
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Metrics
Affects Versions: 1.14.0
 Environment: flink-1.14.0
Reporter: chenyuzhi
 Attachments: image-2023-07-20-17-02-01-660.png, 
image-2023-07-20-17-02-37-625.png

Firstly, I submitted two jobs (let's say jobA/jobB) to a Flink session cluster. When 
I stopped jobA, the Prometheus reporter kept reporting the metrics of both jobA and 
jobB.

However, the JMX reporter only reports the metrics of jobB.

 

 It's unreasonable to report the metrics of the stopped jobA.

 

JMX reporter metric snapshot:

!image-2023-07-20-17-02-01-660.png!

 

Prometheus metric snapshot:

 

!image-2023-07-20-17-02-37-625.png!

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-30987) output source exception for SocketStreamIterator

2023-02-09 Thread chenyuzhi (Jira)
chenyuzhi created FLINK-30987:
-

 Summary: output source exception for SocketStreamIterator
 Key: FLINK-30987
 URL: https://issues.apache.org/jira/browse/FLINK-30987
 Project: Flink
  Issue Type: Improvement
Reporter: chenyuzhi
 Attachments: image-2023-02-09-16-47-49-928.png

Sometimes we can hit errors when using 
`org.apache.flink.streaming.experimental.SocketStreamIterator` for testing or output.

 

However, we can't get the source exception from the log output. Maybe we could 
throw the source exception directly.

 

!image-2023-02-09-16-47-49-928.png!
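
A minimal, self-contained sketch of the suggested direction (an assumption about the fix, not the actual SocketStreamIterator code): wrap and re-throw with the original exception attached as the cause, so the underlying socket error is visible in the logs and stack traces.
{code:java}
import java.io.IOException;

// Sketch only: demonstrates exception chaining, not the real SocketStreamIterator.
public class RethrowWithCauseSketch {

    // Simulates the low-level read that fails inside the iterator.
    static String readNextFromSocket() throws IOException {
        throw new IOException("Connection reset by peer");
    }

    public static void main(String[] args) {
        try {
            readNextFromSocket();
        } catch (IOException e) {
            // Keeping 'e' as the cause preserves the source exception in the output.
            throw new RuntimeException("Failed to read next record from socket", e);
        }
    }
}
{code}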



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-30818) add metrics to measure the count of window

2023-01-29 Thread chenyuzhi (Jira)
chenyuzhi created FLINK-30818:
-

 Summary: add metrics to measure the count of window
 Key: FLINK-30818
 URL: https://issues.apache.org/jira/browse/FLINK-30818
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Metrics
Reporter: chenyuzhi


When recovering a job that uses window operators, the taskmanager needs more 
resources (CPU/memory) to process incoming data, because too many windows have been 
created in a data-backlog scenario.

Thus, maybe we should add a metric to measure the count of windows.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29195) Expose lastCheckpointId metric

2022-09-05 Thread chenyuzhi (Jira)
chenyuzhi created FLINK-29195:
-

 Summary: Expose lastCheckpointId metric 
 Key: FLINK-29195
 URL: https://issues.apache.org/jira/browse/FLINK-29195
 Project: Flink
  Issue Type: New Feature
Reporter: chenyuzhi






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29106) value of metric 'idleTimeMsPerSecond' more than 1000

2022-08-25 Thread chenyuzhi (Jira)
chenyuzhi created FLINK-29106:
-

 Summary: value of metric 'idleTimeMsPerSecond'  more than 1000
 Key: FLINK-29106
 URL: https://issues.apache.org/jira/browse/FLINK-29106
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Metrics
Affects Versions: 1.11.3
 Environment: flink 1.11.3
Reporter: chenyuzhi
 Attachments: image-2022-08-25-19-18-52-755.png

 As shown in the picture below, the value of the metric 'idleTimeMsPerSecond' is more 
than 1000, which is obviously unreasonable.

!https://sawiki2.nie.netease.com/media/image/chenyuzhi/20220825181655.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28852) Closed metric group could be found in metric dashboard from WebUI

2022-08-07 Thread chenyuzhi (Jira)
chenyuzhi created FLINK-28852:
-

 Summary: Closed metric group could be found in metric dashboard 
from WebUI
 Key: FLINK-28852
 URL: https://issues.apache.org/jira/browse/FLINK-28852
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Metrics
 Environment: Flink 1.11.3
Reporter: chenyuzhi


When I close my metric group, the related metrics are unregistered from the metric 
reporter; however, the closed metrics can still be found in the metric dashboard of 
the WebUI.

This could lead to a memory leak.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-25323) Support side output late data for interval join

2021-12-15 Thread chenyuzhi (Jira)
chenyuzhi created FLINK-25323:
-

 Summary: Support side output late data for interval join
 Key: FLINK-25323
 URL: https://issues.apache.org/jira/browse/FLINK-25323
 Project: Flink
  Issue Type: Improvement
Reporter: chenyuzhi


Currently, Flink just discards late data when using interval join:

[https://github.com/apache/flink/blob/83a2541475228a4ff9e9a9def4049fb742353549/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/operators/co/IntervalJoinOperator.java#L231]
 

 

Maybe we could use two OutputTags to side-output late data for the left and right 
streams (see the hypothetical sketch below).
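
A purely hypothetical API sketch of the proposal, modeled on WindowedStream#sideOutputLateData: the two sideOutput* methods do NOT exist in Flink today, and leftStream/rightStream are placeholder Integer streams used only for illustration.
{code:java}
// Hypothetical API sketch only: sideOutputLeftLateData / sideOutputRightLateData
// do NOT exist in Flink today; they illustrate the proposed extension to the
// interval join, by analogy with WindowedStream#sideOutputLateData.
final OutputTag<Integer> lateLeft = new OutputTag<Integer>("late-left") {};
final OutputTag<Integer> lateRight = new OutputTag<Integer>("late-right") {};

SingleOutputStreamOperator<String> joined = leftStream
        .keyBy(v -> v)
        .intervalJoin(rightStream.keyBy(v -> v))
        .between(Time.seconds(-5), Time.seconds(5))
        .sideOutputLeftLateData(lateLeft)    // hypothetical
        .sideOutputRightLateData(lateRight)  // hypothetical
        .process(new ProcessJoinFunction<Integer, Integer, String>() {
            @Override
            public void processElement(Integer left, Integer right, Context ctx,
                                       Collector<String> out) {
                out.collect(left + "," + right);
            }
        });

// Late elements that IntervalJoinOperator currently drops would instead be
// retrievable as side outputs:
DataStream<Integer> droppedLeft = joined.getSideOutput(lateLeft);
DataStream<Integer> droppedRight = joined.getSideOutput(lateRight);
{code}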



--
This message was sent by Atlassian Jira
(v8.20.1#820001)