RE: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-22 Thread LINZ, Arnaud
Hello, From a user perspective: we have some (rare) use cases where we use “coarse grain” datasets, with big beans and tasks that do lengthy operations (such as ML training). In these cases we had to increase the timeout to huge values (heartbeat.timeout: 50) so that our app is not killed.
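
For reference, the settings under discussion live in flink-conf.yaml. A minimal sketch with the long-standing default values; the numbers are illustrative context, not a recommendation from this thread:

    # flink-conf.yaml
    heartbeat.interval: 10000   # ms between heartbeats (default)
    heartbeat.timeout: 50000    # ms of silence before the peer is declared lost (default)
    # Jobs with long stop-the-world pauses (e.g. ML training inside a task) may need a much larger timeout.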

Re: Flink TaskManager container got restarted by K8S very frequently

2021-07-22 Thread Fabian Paul
CC user ML

Re: Need help of deploying Flink HA on kubernetes cluster

2021-07-22 Thread Fabian Paul
Hi Dhirendra, Thanks for reaching out. A good way to start is to have a look at [1] and [2]. Once you have everything set up, it should be possible to delete the pod of the JobManager while an application is running and have the job successfully recover. You can use one of the example Flink applications …
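
For context, a minimal flink-conf.yaml sketch of the Kubernetes HA setup the referenced docs describe (cluster id and storage path below are hypothetical):

    kubernetes.cluster-id: my-flink-cluster                      # hypothetical
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    high-availability.storageDir: s3://my-bucket/flink-ha        # hypothetical bucket
    # With this in place, deleting the JobManager pod should let the job recover from the HA metadata.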

Re: Questions about keyed streams

2021-07-22 Thread Fabian Paul
Hi Dan, 1) In general, there is no guarantee that your downstream operator is on the same TM even though it works on the same key group. Nevertheless, you can try to force this kind of behaviour to prevent the network transfer by either chaining the two operators (if no shuffle is in between) or confi…
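
For context, a minimal Java sketch of the chaining idea: two map operators placed directly after each other are chained into the same task by default, so no network transfer happens between them. The functions and data below are made up for illustration:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ChainingSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.fromElements("a", "b", "a")
               .keyBy(value -> value)
               // first operator
               .map(new MapFunction<String, String>() {
                   @Override
                   public String map(String value) { return value.toUpperCase(); }
               })
               // second operator, chained with the first (no shuffle in between)
               .map(new MapFunction<String, String>() {
                   @Override
                   public String map(String value) { return value + "!"; }
               })
               .print();
            env.execute("chaining-sketch");
        }
    }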

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-22 Thread Till Rohrmann
Thanks for your inputs Gen and Arnaud. I do agree with you, Gen, that we need better guidance for our users on when to change the heartbeat configuration. I think this should happen in any case. I am, however, not so sure whether we can give a hard threshold like 5000 tasks, for example, because as …

Issue with Flink jobs after upgrading to Flink 1.13.1/Ververica 2.5 - java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration

2021-07-22 Thread Natu Lauchande
Good day Flink community, Apache Flink/Ververica Community Edition - Question. I am having an issue with my Flink SQL jobs since updating from Flink 1.12/Ververica 2.4 to Ververica 2.5. For all the jobs running on Parquet and S3 I am getting the following error continuously: INITIALIZING to FAILED …

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-22 Thread 刘建刚
Thanks, Till. There are many reasons to reduce the heartbeat interval and timeout, but I am not sure what values are suitable. In our cases, the GC time and the job size can be related factors. Since most Flink jobs are pipelined and a total failover can cost some time, we should tolerate some stop-the-world …

Re: Need help of deploying Flink HA on kubernetes cluster

2021-07-22 Thread Fabian Paul
Hi Dhiru, No worries, I completely understand your point. Usually all the executable scripts from Flink can be found in the main repository [1]. We also provide a community edition of our commercial product [2] which manages the lifecycle of the cluster, and you do not have to use these scripts an…

Re: Flink TaskManager container got restarted by K8S very frequently

2021-07-22 Thread David Morávek
If you run `kubectl describe pod ...` on the affected pod, you should see the reason why the previous pod has terminated (e.g. OOM-killed by Kubernetes). Best, D. On Thu, Jul 22, 2021 at 9:30 AM Fabian Paul wrote: > CC user ML
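
For reference, a sketch of the suggested check (pod and namespace names are hypothetical):

    # Look at 'Last State', 'Reason' and 'Exit Code' of the terminated container;
    # e.g. Reason: OOMKilled means Kubernetes killed it for exceeding its memory limit.
    kubectl describe pod flink-taskmanager-1-1 -n my-namespace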

Re: Issue with Flink jobs after upgrading to Flink 1.13.1/Ververica 2.5 - java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration

2021-07-22 Thread Timo Walther
Hi Natu, Ververica Platform 2.5 has updated the bundled Hadoop version but this should not result in a NoClassDefFoundError exception. How are you submitting your SQL jobs? You don't use Ververica's SQL service but have built a regular JAR file, right? If this is the case, can you share your …

Re: Issue with Flink jobs after upgrading to Flink 1.13.1/Ververica 2.5 - java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration

2021-07-22 Thread Natu Lauchande
Hey Timo, Thanks for the reply. No custom JAR file, as we are using Flink SQL and submitting the job directly through the SQL Editor UI. We are using Flink 1.13.1 as the supported Flink version. No custom code, all through Flink SQL in the UI, no JARs. Thanks, Natu On Thu, Jul 22, 2021 at 2:08 PM Timo Walther wrote: …

Re: Issue with Flink jobs after upgrading to Flink 1.13.1/Ververica 2.5 - java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration

2021-07-22 Thread Timo Walther
Maybe you can also share which connector/format you are using? What is the DDL? Regards, Timo On 22.07.21 14:11, Natu Lauchande wrote: …

Re: Issue with Flink jobs after upgrading to Flink 1.13.1/Ververica 2.5 - java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration

2021-07-22 Thread Natu Lauchande
Sure. This is what the table DDL looks like: CREATE TABLE tablea ( `a` BIGINT, `b` BIGINT, `c` BIGINT ) COMMENT '' WITH ( 'auto-compaction' = 'false', 'connector' = 'filesystem', 'format' = 'parquet', 'parquet.block.size' = '134217728', 'parquet.compression' = 'SNAPPY', …
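
For context, a self-contained sketch of such a filesystem/parquet table definition; the preview above is cut off, and the path below is hypothetical, not taken from the thread:

    CREATE TABLE tablea (
      `a` BIGINT,
      `b` BIGINT,
      `c` BIGINT
    ) COMMENT '' WITH (
      'auto-compaction' = 'false',
      'connector' = 'filesystem',
      'format' = 'parquet',
      'parquet.block.size' = '134217728',
      'parquet.compression' = 'SNAPPY',
      'path' = 's3://my-bucket/tablea/'  -- hypothetical path, not from the thread
    );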

Re: Issue with Flink jobs after upgrading to Flink 1.13.1/Ververica 2.5 - java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration

2021-07-22 Thread Timo Walther
Thanks, this should definitely work with the pre-packaged connectors of Ververica Platform. I guess we have to investigate what is going on. Until then, a workaround could be to add Hadoop manually and set the HADOOP_CLASSPATH environment variable. The root cause seems to be that Hadoop cannot be found …
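
A sketch of the workaround mentioned here, assuming a Hadoop distribution is available on the machines (or images) running Flink:

    # Make the Hadoop classes (including org.apache.hadoop.conf.Configuration) visible to Flink
    export HADOOP_CLASSPATH=$(hadoop classpath)
    # then restart the Flink processes / resubmit the job from this environment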

Re: Recover from savepoints with Kubernetes HA

2021-07-22 Thread Austin Cawley-Edwards
Hey Thomas, Hmm, I see no reason why you should not be able to update the checkpoint interval at runtime, and I don't believe that information is stored in a savepoint. Can you share the JobManager logs of the job where this is ignored? Thanks, Austin On Wed, Jul 21, 2021 at 11:47 AM Thms Hmm wrote: …

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-22 Thread Chesnay Schepler
I'm wondering if this discussion isn't going in the wrong direction. It is clear that we cannot support all use-cases with the defaults, so let's not try that. We won't find it. And I would argue that is also not their purpose; they are configurable for a reason. I would say the defaults should p…

Observability around Flink Pipeline/stateful functions

2021-07-22 Thread Deepak Sharma
@d...@spark.apache.org @user I am looking for an example around an observability framework for Apache Flink pipelines. This could be message tracing across multiple Flink pipelines, or querying the past state of a message that was processed by any Flink pipeline. If anyone has done similar work a…

Re: Recover from savepoints with Kubernetes HA

2021-07-22 Thread Yang Wang
Please note that when the job is canceled, the HA data (including the checkpoint pointers) stored in the ConfigMap/ZNode will be deleted. But it is strange that the "-s/--fromSavepoint" option does not take effect when redeploying the Flink application. The JobManager logs could help a lot to find the root cause …
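
For reference, a sketch of how the savepoint path is usually passed on redeployment (paths and jar names are hypothetical):

    # per-job / session deployment
    ./bin/flink run -s s3://my-bucket/savepoints/savepoint-abc123 my-job.jar
    # application mode: the equivalent configuration option is
    #   execution.savepoint.path: s3://my-bucket/savepoints/savepoint-abc123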

Re: Flink TaskManager container got restarted by K8S very frequently

2021-07-22 Thread Yang Wang
David's suggestion makes a lot of sense. You need to check whether the TaskManager is killed by Kubernetes via `kubectl describe pod` for the exit code, or via the kubelet logs. If it is not killed by Kubernetes, then it might have crashed internally. Please use `kubectl logs --previous` to check the logs. Best …
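
The commands from this reply, as a sketch (pod name is hypothetical):

    # exit code / termination reason of the last container instance
    kubectl describe pod flink-taskmanager-1-1 | grep -A 10 'Last State'
    # logs of the previous (crashed) container instance
    kubectl logs --previous flink-taskmanager-1-1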

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-22 Thread Gen Luo
Thanks for sharing the thoughts, Chesnay; I overall agree with you. We can't give a default value suitable for all jobs, but we can figure out whether the current default value is too large for most jobs, and that is the guideline for this topic. Configurability is reserved for the others …

YarnTaskExecutorRunner should contain MapReduce classes

2021-07-22 Thread chenkaibit
Hi: I followed the instructions described in [https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/hive] and tested the Hive streaming sink, and met this exception: Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapred.JobConf [http://apache-flink.147419.n…
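
A commonly tried workaround for missing MapReduce classes on YARN, offered here as an assumption rather than a confirmed fix from this thread: make sure the MapReduce client jars reach the TaskManager classpath, since org.apache.hadoop.mapred.JobConf lives in hadoop-mapreduce-client-core.

    # option 1: let the submitting environment provide the full Hadoop classpath
    export HADOOP_CLASSPATH=$(hadoop classpath)
    # option 2: copy hadoop-mapreduce-client-core-*.jar into Flink's lib/ directory before submitting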