Just failed while starting

2021-08-18 Thread Ivan Yang
Dear Flink community, I recently ran into this issue at job startup. It happens from time to time. Here is the exception from the job manager: 2021-08-17 01:21:01,944 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Defence raw event prod05_analytics_outpu

Re: TaskManager crash after cancelling a job

2021-07-28 Thread Ivan Yang
1.13.1. > > Best, > Yangze Guo > > On Tue, Jul 27, 2021 at 9:41 AM Ivan Yang wrote: >> >> Dear Flink experts, >> >> We recently ran into an issue during a job cancellation after upgrading to >> 1.13. After we issue a cancel (from Flink console or flin

TaskManager crash after cancelling a job

2021-07-26 Thread Ivan Yang
Dear Flink experts, We recently ran into an issue during a job cancellation after upgrading to 1.13. After we issue a cancel (from Flink console or flink cancel {jobid}), a few subtasks get stuck in the cancelling state. Once it gets to that situation, the behavior is consistent. Those “cancelling tasks

Re: Flink Kubernetes HA

2021-06-23 Thread Ivan Yang
o many > files exist in the S3 bucket? > > AFAIK, if the K8s HA services work normally, only one completedCheckpoint > file will be retained. Once a > new one is generated, the old one will be deleted. > > > Best, > Yang > > Ivan Yang mailto:ivanygy...@gmail.c

Flink Kubernetes HA

2021-06-22 Thread Ivan Yang
Hi Dear Flink users, We recently enabled ZooKeeper-less HA in our Kubernetes Flink deployment. The setup has high-availability.storageDir: s3://some-bucket/recovery. Since we have a relatively short retention policy of 7 days on the S3 bucket, the HA will fail if the submitte

Re: Exception on s3 committer

2020-08-31 Thread Ivan Yang
Hi Yun, Thank you so much for your suggestion. (1) The job couldn’t restore from the last checkpoint. The exception is in my original email. (2) No, I didn’t change any multipart upload settings. (3) The file is gone. I have another batch process that reads the Flink output S3 bucket and pushes obj

Exception on s3 committer

2020-08-28 Thread Ivan Yang
Hi all, We got this exception after a job restart. Does anyone know what may lead to this situation, and how to get past this checkpoint issue? Prior to this, the job failed due to “Checkpoint expired before completing.” We are S3-heavy, writing out 10K files to S3 every 10 minutes using Stream

Re: sporadic "Insufficient no of network buffers" issue

2020-07-31 Thread Ivan Yang
s the root cause (or) if > the root cause is something else which triggers this issue. > > On Sat, Aug 1, 2020 at 9:36 AM Ivan Yang <mailto:ivanygy...@gmail.com>> wrote: > Hi Rahul, > > Try to increase taskmanager.network.memory.max to 1GB, basically double what > y

Re: sporadic "Insufficient no of network buffers" issue

2020-07-31 Thread Ivan Yang
Hi Rahul, Try increasing taskmanager.network.memory.max to 1GB, basically double what you have now. However, you only have 4GB of RAM for the entire TM, and a 1GB network buffer seems out of proportion to 4GB of total RAM. Reducing the number of shuffles will require fewer network buffers. But if you

Re: Flink 1.11 job stop with save point timeout error

2020-07-24 Thread Ivan Yang
Jul 24, 2020 at 4:03 AM Ivan Yang <mailto:ivanygy...@gmail.com>> wrote: > Hello everyone, > > We recently upgraded Flink from 1.9.1 to 1.11.0. We found one strange behavior: > when we stop a job with a savepoint, we get the following timeout error. > I checked Flink web console, th

Flink 1.11 job stop with save point timeout error

2020-07-23 Thread Ivan Yang
Hello everyone, We recently upgraded Flink from 1.9.1 to 1.11.0. We found one strange behavior: when we stop a job with a savepoint, we get the following timeout error. I checked the Flink web console, and the savepoint is created in S3 in 1 second. The job is fairly simple, so 1 second for savepoint generation is e

Completed Job List in Flink UI

2020-06-18 Thread Ivan Yang
Hello, In the Flink web UI Overview tab, the "Completed Job List" displays recently completed or cancelled jobs only for a short period of time. After a while, they are gone. The Job Manager is up and was never restarted. Is there a config key to keep job history in the Completed Job List for a longer time? I am

Flink on Kubernetes

2020-05-21 Thread Ivan Yang
Hi, I have set up Flink 1.9.1 on Kubernetes on AWS EKS with one job manager pod and 10 task manager pods, one pod per EC2 instance. The job runs fine. After a while, for some reason, one pod (task manager) crashed, then the pod restarted. After that, the job got into a bad state. All the parallelisms a

Flink performance tuning on operators

2020-05-14 Thread Ivan Yang
Hi, We have a Flink job that reads data from an input stream, then converts each event from a JSON string to an Avro object, and finally writes to Parquet files using StreamingFileSink with an OnCheckpointRollingPolicy of 5 mins. Basically a stateless job. Initially, we use one map operator to convert JSON st