Hi, Ivan

My gut feeling is that it is related to FLINK-22535. Could @Yun Gao
take another look? If that is the case, you can upgrade to 1.13.1.
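
If you do upgrade, it should only be a patch-level bump. For example, if you
manage the Flink dependencies through a Maven version property, it would be
something like

    <flink.version>1.13.1</flink.version>

plus rolling out the matching 1.13.1 binaries on the cluster.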

Best,
Yangze Guo

On Tue, Jul 27, 2021 at 9:41 AM Ivan Yang <ivanygy...@gmail.com> wrote:
>
> Dear Flink experts,
>
> We recently ran into an issue during job cancellation after upgrading to
> 1.13. After we issue a cancel (from the Flink console or flink cancel
> {jobid}), a few subtasks get stuck in the CANCELLING state. Once we are in
> that situation, the behavior is consistent: those "cancelling" tasks never
> become "canceled". After 3 minutes the job stops and, as a result, a number
> of task managers are lost. It takes about another 5 minutes for those lost
> task managers to rejoin the job manager; only then can we restart the job
> from the previous checkpoint. We captured the following stack trace from the
> hanging (cancelling) Task Manager.
> ==========================================
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
> java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
> org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:705)
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.cleanUpInvoke(SourceStreamTask.java:186)
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:637)
> org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:776)
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
> java.lang.Thread.run(Thread.java:748)
> ===========================================
>
> Here is some background information about our job and setup.
> 1) The job is relatively large: parallelism of 500+ and 2000+ subtasks. It
> mainly reads from a Kinesis stream, performs some transformations, and fans
> out to multiple output S3 buckets. It is a stateless ETL job.
> 2) The same code and setup running in smaller environments do not seem to
> have this cancellation failure.
> 3) We had been running the same job on Flink 1.11.0 and never saw this
> cancellation failure or the killed Task Managers.
> 4) Along with the upgrade to 1.13, we also enabled Kubernetes HA
> (ZooKeeper-less); previously we did not use HA. Our HA settings are roughly
> the ones sketched below.
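>
> For reference, I believe the relevant HA part of our flink-conf.yaml looks
> roughly like this (reproducing from memory, bucket name and cluster id are
> placeholders, so it may not be exact):
>
>   high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
>   high-availability.storageDir: s3://<our-ha-bucket>/flink-ha
>   kubernetes.cluster-id: <our-cluster-id>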
>
> Cancelling and restarting from the previous checkpoint is a regular
> procedure in our daily operations, so this 10-minute TM restart cycle
> basically slows down our throughput. I am trying to understand what leads to
> this situation, hoping that some configuration change will smooth things
> out; any suggestion to shorten the waiting would also be welcome. It appears
> that some timeouts on the TaskManager and JobManager could be adjusted to
> speed this up, but we would really like to avoid getting stuck in
> cancellation in the first place, if we can.
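>
> For reference, these are the timeout settings I have been looking at (names
> and defaults taken from the 1.13 configuration docs; I am not sure these are
> the right knobs or what values would make sense at our scale, but the 3
> minute default of task.cancellation.timeout does seem to match the point at
> which the job stops and the TMs disappear):
>
>   task.cancellation.timeout: 180000        # fatal TM error if a task is still cancelling after this
>   task.cancellation.interval: 30000        # how often cancellation is retried
>   heartbeat.timeout: 50000                 # JobManager <-> TaskManager heartbeat timeout
>   taskmanager.registration.timeout: 5 min  # how long a TM keeps trying to register before giving up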
>
> Thank you, hoping to get some insight here.
>
> Ivan
