Re: taskexecutor .out files

2022-05-16 Thread Zhilong Hong
Hi, Zain: taskmanager.out only contains what is written to stdout. Occasionally, fatal errors such as JVM exit exceptions are also written to the .out file. If you don't specify a file path for the GC log, the GC log output is saved to the .out file as well. Howe
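A minimal flink-conf.yaml sketch of giving the GC log its own file so that it no longer ends up in taskmanager.out. It assumes the TaskManager runs on JDK 11 or later (unified `-Xlog` syntax; on JDK 8 the equivalent flag is `-Xloggc:<file>`), and the log path is a hypothetical placeholder.

```yaml
# Hypothetical example: write TaskManager GC logging to a dedicated file
# instead of stdout, which is what ends up in taskmanager.out.
# Assumes JDK 11+ unified logging; the path is a placeholder.
env.java.opts.taskmanager: -Xlog:gc*:file=/tmp/flink-taskmanager-gc.log
```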

Re: Task Manager shutdown causing jobs to fail

2022-03-07 Thread Zhilong Hong
Hi, Puneet: As Terry says, if your job fails unexpectedly, check the restart-strategy setting in your flink-conf.yaml. If the restart strategy is disabled or set to none, the job transitions to FAILED as soon as it encounters a failover. The job would also fail itself
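For illustration, a flink-conf.yaml sketch that enables a fixed-delay restart strategy instead of none, so a failover triggers a restart rather than an immediate transition to FAILED. The attempt count and delay are hypothetical values, not recommendations.

```yaml
# Illustrative values only: restart the job up to 3 times,
# waiting 10 s between attempts, before it transitions to FAILED.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```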

Re: PyFlink : submission via rest

2022-03-05 Thread Zhilong Hong
Hi, Aryan: You could refer to the official docs [1] for how to submit PyFlink jobs:

    $ ./bin/flink run \
        --target yarn-per-job --python examples/python/table/word_count.py

With this command you can submit a per-job application to YARN. The docs [2] and [3] describe how to submit jobs

Re: Flink failure rate restart not work as expect

2022-03-02 Thread Zhilong Hong
Hi, Jiaqiao: Since your job has checkpointing enabled, you can simply remove the restart strategy configuration. The default will then be fixed-delay with Integer.MAX_VALUE restart attempts and a 1 s delay, as mentioned in [1]. This way, when a failover occurs, your job will wait for 1 second before it
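For reference, a sketch of what that default amounts to if it were written out explicitly in flink-conf.yaml (2147483647 is Integer.MAX_VALUE). The advice above is to leave the key unset; this fragment only makes the resulting behavior visible.

```yaml
# Roughly equivalent to the default used when checkpointing is enabled
# and no restart strategy is configured.
restart-strategy: fixed-delay
# Integer.MAX_VALUE restart attempts
restart-strategy.fixed-delay.attempts: 2147483647
restart-strategy.fixed-delay.delay: 1 s
```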

Re: Flink job recovery after task manager failure

2022-02-24 Thread Zhilong Hong
ptions during the restart, and the task manager was restarted a few times until it was stabilized. You can find the log here: jobmanager-log.txt.gz <https://nokia-my.sharepoint.com/:u:/p/ifat_afek/EUsu4rb_-BpNrkpvSwzI-vgBtBO9OQlIm0CHtW0gsZ7Gqg?email=zh

Re: Flink job recovery after task manager failure

2022-02-23 Thread Zhilong Hong
Hi, Afek! When a TaskManager is killed, the JobManager is not notified until a heartbeat timeout occurs. Currently, the default value of heartbeat.timeout is 50 seconds [1]. That's why it takes more than 30 seconds for Flink to trigger a failover. If you'd like to shorten the time a failover
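A sketch of lowering the heartbeat timeout in flink-conf.yaml so that a lost TaskManager is detected sooner. The values are hypothetical and given in milliseconds; the trade-off is that a shorter timeout also makes long GC pauses or brief network hiccups more likely to be mistaken for a dead TaskManager.

```yaml
# Hypothetical values in milliseconds (heartbeat.timeout defaults to 50000).
# Keep the interval well below the timeout.
heartbeat.interval: 5000
heartbeat.timeout: 20000
```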

Re: New blog post published - Sort-Based Blocking Shuffle Implementation in Flink

2021-11-08 Thread Zhilong Hong
Thank you for writing this blog post, Daisy and Kevin! It helped me understand what sort-based shuffle is and how to use it. I'm looking forward to your future improvements! On Wed, Nov 3, 2021 at 6:32 PM Yuxin Tan wrote: > Thanks Daisy and Kevin! The IO scheduling idea of the sequential reading