Re: How to work on the Chinese translation of the homepage

2022-05-21 Thread Zhilong Hong
Hi, 振宇:

The source code of the official Flink homepage lives in [1]; every file in
that repository ending with .zh.md is a Chinese version. The Chinese page for
the Documentation Style guide is at [2], and indeed it has not been
translated into Chinese yet. If you are interested, you can follow the
contribution guide [3]. First, create an issue on JIRA [4] and describe the
relevant information in English. Once an Apache Flink committer assigns the
issue to you, you can open a pull request against the repository [1].
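For illustration, the typical flow might look like the following sketch (the
JIRA ticket number and the branch name are placeholders of my own; asf-site
is flink-web's default branch):

$ git clone https://github.com/apache/flink-web.git
$ cd flink-web
$ git checkout -b FLINK-XXXXX-docs-style-zh   # branched off asf-site
# translate the contents of contributing/docs-style.zh.md, commit, and open
# a pull request that references the JIRA issue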

Best,
Zhilong

[1] https://github.com/apache/flink-web
[2]
https://github.com/apache/flink-web/blob/asf-site/contributing/docs-style.zh.md
[3] https://flink.apache.org/zh/contributing/contribute-documentation.html
[4]
https://issues.apache.org/jira/browse/FLINK-24694?jql=project%20%3D%20FLINK%20AND%20component%20%3D%20chinese-translation%20ORDER%20BY%20created%20DESC

On Sat, May 21, 2022 at 11:59 PM 邢振宇  wrote:

> Hi all:
> The source code for the pages under
> https://nightlies.apache.org/flink/flink-docs-master/ (let's call it the
> docs site) is in the docs directory, but I couldn't find the source code
> for the pages under https://flink.apache.org/ (let's call it the main
> site). Can issues be opened to translate those pages as well?
> Specifically:
> 1. The page https://flink.apache.org/contributing/docs-style.html on the
> main site has not been translated into Chinese yet.
> 2. If the contribution guide here refers to the docs site, then the
> information on the page
> https://flink.apache.org/contributing/contribute-documentation.html
> should also be updated. For example, it should first state that the
> documentation actually lives at the docs site's address, and the command
> and port for previewing changes should not be the ones currently in the
> document.
>
> If that works, I'd be glad to take this on.
>
> Thanks!
>


Re: taskexecutor .out files

2022-05-16 Thread Zhilong Hong
Hi, Zain:

The taskmanager.out file only contains what the process writes to stdout.
Sometimes fatal exceptions, such as JVM exit errors, are also written to the
.out file. If you don't specify a file path for the GC log, the GC log's
content will be saved into the .out file as well. However, it's safe to
delete the .out file on the TaskExecutors; once it's deleted, a new .out
file will be created.
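As an aside, if the .out file keeps growing mainly because of GC logging, the
GC log can be routed to its own file instead. A minimal sketch for
flink-conf.yaml, assuming JDK 8 (the log path is a placeholder):

# redirect the TaskManager GC log away from taskmanager.out
env.java.opts.taskmanager: -XX:+PrintGCDetails -Xloggc:/path/to/taskmanager-gc.log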

Best,
Zhilong


On Sun, May 15, 2022 at 7:05 PM Zain Haider Nemati 
wrote:

> Hi,
> I have been running a streaming job that prints data to .out files. The
> size of the file has gotten really large and is exhausting the root
> volume of my VM. Is it OK to delete the .out files? Would that affect any
> other operation or functionality?
>


Re: Flink OLAP vs. Trino TPC-DS comparison

2022-05-08 Thread Zhilong Hong
Many thanks to Yu Li for the reminder. The link to the fifth document in the
original email (i.e., "Test Results on the 10 GiB TPC-DS Dataset") has been
updated to Google Docs [1].

[1]
https://docs.google.com/spreadsheets/d/1nietTOrFg93p7k7L82lGPlUjwCpw97bWfP21xI_MLcE/edit?usp=sharing

Best,
Zhilong Hong

On Fri, May 6, 2022 at 4:51 PM Yu Li  wrote:

> Thanks everyone for the sharing and the analysis, and looking forward to
> Flink's continued optimization in this direction!
>
> Let's make Flink great together. :-)
>
> btw, the Yuque link in reference [5] has expired; I'd suggest using a
> Google Doc and updating the link.
>
> Best Regards,
> Yu
>
>
> On Sun, 1 May 2022 at 21:57, Zhilong Hong  wrote:
>
> > Hello,
> >
> > Over the past few weeks we did an in-depth analysis of the issue LuNing
> > reported, and we'd like to share the conclusions with the community.
> > Special thanks to LuNing for reporting the issue and investigating it
> > together with us.
> >
> > According to our analysis, the large performance gap between Flink 1.14
> > and Trino 359 on the 10 GB TPC-DS dataset with a 2-node cluster mainly
> > comes down to the following 3 points:
> >
> > 1. Submitting Flink jobs via the SQL Client takes rather long (about 4s
> > per query). In OLAP scenarios where jobs are submitted frequently, we
> > recommend submitting jobs through a Flink SQL Gateway, which avoids the
> > extra overhead of repeatedly creating client processes and establishing
> > network connections. We currently use the SQL Gateway open-sourced by
> > Ververica [1]; the community is also preparing an official SQL Gateway,
> > see FLIP-91 [2].
> >
> > 2. The dataset used in the test is rather small (10 GiB), so the Hive
> > source only generates a few splits based on the data volume. Since a
> > split is the smallest unit of data a source processes, even though the
> > source appears to have a parallelism of 32, often only 1-2 parallel
> > instances actually read and process data. Moreover, because automatic
> > parallelism inference for the Hive source was disabled in the test
> > configuration [3], the upstream and downstream operators got the same
> > parallelism and were chained together, which in turn limited the
> > parallelism at which downstream operators actually processed data. We
> > had run into this issue before [4], but never with an impact as large as
> > on a dataset as small as 10 GiB.
> >
> > 3. For some queries of the TPC-DS benchmark, the execution plans
> > generated by Flink SQL are not optimal, so Flink actually processes more
> > data than Trino does. This is consistent with our observations on
> > large-scale datasets, and the contributors of the community's SQL module
> > are continuing to optimize these cases.
> >
> > Overall, the 2nd point above has the largest impact on Flink's
> > performance, and we made some optimizations for it. With the patch
> > applied, although the number of Hive source instances that actually read
> > and process data still falls short of the configured parallelism of 32,
> > the gap with Trino has narrowed substantially; see [5].
> >
> > There is indeed still a gap between Flink and Trino in OLAP scenarios,
> > and the community is working on optimizations for this scenario [6]. On
> > Alibaba's internal development branch we have already caught up with
> > Trino's performance, and the related optimizations are expected to be
> > contributed back to the community in Flink 1.16 and 1.17.
> >
> > Best,
> > Zhilong Hong
> >
> > [1] https://github.com/ververica/flink-sql-gateway
> > [2]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-91%3A+Support+SQL+Gateway
> > [3]
> >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/hive/hive_read_write/#source-parallelism-inference
> > [4] https://issues.apache.org/jira/browse/FLINK-27338
> > [5]
> >
> https://www.yuque.com/docs/share/b89479ab-9c24-45c8-9390-77299ae0c4cd?#AkK9
> > [6] https://issues.apache.org/jira/browse/FLINK-25318
> >
> > On Tue, Apr 19, 2022 at 5:43 PM LuNing Wang 
> wrote:
> >
> > > https://www.yuque.com/docs/share/8625d14b-d465-48a3-8dc1-0be32b138f34?#lUX6
> > > "TPC-DS: time consumed by each engine"
> > > The link is valid until 2022-04-22 10:31:05
> > >
> > > LuNing Wong  wrote on Mon, Apr 18, 2022 at 09:44:
> > >
> > > > To add: the data source is Hive 3.1.2 on Hadoop 3.1.0.
> > > >
> > > > LuNing Wong  wrote on Mon, Apr 18, 2022 at 09:42:
> > > >
> > > > > The Flink version is 1.14.4 and Trino is version 359. I aligned
> > > > > tm.memory.process.size and the CPU resources with Trino: 32 GB of
> > > > > memory, 16 cores, 16 threads on each of the 2 compute nodes.
> > > > >
> > > > > Zhilong Hong  wrote on Fri, Apr 15, 2022 at 18:21:
> > > > >
> > > > >> Hello, Luning!
> > > > >>
> > > > >> We are also following Flink's performance in OLAP scenarios. Which
> > > > >> versions of Flink and Trino did you test? Also, I noticed that the
> > > > >> cluster configuration used in flink-sql-benchmark differs from
> > > > >> yours; you may need to adjust resource settings such as
> > > > >> taskmanager.memory.process.size in flink-conf.yaml according to
> > > > >> your cluster resources.
> > > > >>
> > > > >> Best,
> > > > >> Zhilong
> > > > >>
> > > > >> On Fri, Apr 15, 2022 at 2:38 PM LuNing Wang <wang4lun...@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >> > I ran 100 TPC-DS SQL queries on 10 GB of data, with 2 workers
> > > > >> > (TMs), each with 32 GB of memory and 16 cores.
> > > > >> > Flink took 18 seconds per query on average.
> > > > >> > Trino took 7 seconds per query on average.
> > > > >> >
> > > > >> > I saw that in tests by engineers from ByteDance and Alibaba,
> > > > >> > Flink's OLAP performance is close to Presto's, but the gap in
> > > > >> > my test is large. I'd like to discuss whether something is
> > > > >> > wrong with my Flink settings. I configured the Flink parameters
> > > > >> > basically following the template in the project below.
> > > > >> > https://github.com/ververica/flink-sql-benchmark
> > > > >> >
> > > > >> >
> > > > >> > LuNing Wang  wrote on Fri, Apr 15, 2022 at 14:34:
> > > > >> >
> > > > >> > > I ran 100 SQL queries.
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>


Re: Flink OLAP vs. Trino TPC-DS comparison

2022-05-01 Thread Zhilong Hong
Hello,

Over the past few weeks we did an in-depth analysis of the issue LuNing
reported, and we'd like to share the conclusions with the community. Special
thanks to LuNing for reporting the issue and investigating it together with
us.

According to our analysis, the large performance gap between Flink 1.14 and
Trino 359 on the 10 GB TPC-DS dataset with a 2-node cluster mainly comes
down to the following 3 points:

1. Submitting Flink jobs via the SQL Client takes rather long (about 4s per
query). In OLAP scenarios where jobs are submitted frequently, we recommend
submitting jobs through a Flink SQL Gateway, which avoids the extra overhead
of repeatedly creating client processes and establishing network
connections. We currently use the SQL Gateway open-sourced by Ververica [1];
the community is also preparing an official SQL Gateway, see FLIP-91 [2].

2. The dataset used in the test is rather small (10 GiB), so the Hive source
only generates a few splits based on the data volume. Since a split is the
smallest unit of data a source processes, even though the source appears to
have a parallelism of 32, often only 1-2 parallel instances actually read
and process data. Moreover, because automatic parallelism inference for the
Hive source was disabled in the test configuration [3] (see the sketch after
these three points), the upstream and downstream operators got the same
parallelism and were chained together, which in turn limited the parallelism
at which downstream operators actually processed data. We had run into this
issue before [4], but never with an impact as large as on a dataset as small
as 10 GiB.

3. For some queries of the TPC-DS benchmark, the execution plans generated
by Flink SQL are not optimal, so Flink actually processes more data than
Trino does. This is consistent with our observations on large-scale
datasets, and the contributors of the community's SQL module are continuing
to optimize these cases.
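For reference, parallelism inference for the Hive source can be re-enabled
with the options below (a sketch using SQL Client SET statements; the option
names and their defaults are the ones documented in [3]):

SET table.exec.hive.infer-source-parallelism=true;
SET table.exec.hive.infer-source-parallelism.max=1000;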

Overall, the 2nd point above has the largest impact on Flink's performance,
and we made some optimizations for it. With the patch applied, although the
number of Hive source instances that actually read and process data still
falls short of the configured parallelism of 32, the gap with Trino has
narrowed substantially; see [5].

There is indeed still a gap between Flink and Trino in OLAP scenarios, and
the community is working on optimizations for this scenario [6]. On
Alibaba's internal development branch we have already caught up with Trino's
performance, and the related optimizations are expected to be contributed
back to the community in Flink 1.16 and 1.17.

Best,
Zhilong Hong

[1] https://github.com/ververica/flink-sql-gateway
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-91%3A+Support+SQL+Gateway
[3]
https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/hive/hive_read_write/#source-parallelism-inference
[4] https://issues.apache.org/jira/browse/FLINK-27338
[5]
https://www.yuque.com/docs/share/b89479ab-9c24-45c8-9390-77299ae0c4cd?#AkK9
[6] https://issues.apache.org/jira/browse/FLINK-25318

On Tue, Apr 19, 2022 at 5:43 PM LuNing Wang  wrote:

> https://www.yuque.com/docs/share/8625d14b-d465-48a3-8dc1-0be32b138f34?#lUX6
> "TPC-DS: time consumed by each engine"
> The link is valid until 2022-04-22 10:31:05
>
> LuNing Wong  wrote on Mon, Apr 18, 2022 at 09:44:
>
> > To add: the data source is Hive 3.1.2 on Hadoop 3.1.0.
> >
> > LuNing Wong  wrote on Mon, Apr 18, 2022 at 09:42:
> >
> > > The Flink version is 1.14.4 and Trino is version 359. I aligned
> > > tm.memory.process.size and the CPU resources with Trino: 32 GB of
> > > memory, 16 cores, 16 threads on each of the 2 compute nodes.
> > >
> > > Zhilong Hong  wrote on Fri, Apr 15, 2022 at 18:21:
> > >
> > >> Hello, Luning!
> > >>
> > >> We are also following Flink's performance in OLAP scenarios. Which
> > >> versions of Flink and Trino did you test? Also, I noticed that the
> > >> cluster configuration used in flink-sql-benchmark differs from yours;
> > >> you may need to adjust resource settings such as
> > >> taskmanager.memory.process.size in flink-conf.yaml according to your
> > >> cluster resources.
> > >>
> > >> Best,
> > >> Zhilong
> > >>
> > >> On Fri, Apr 15, 2022 at 2:38 PM LuNing Wang  wrote:
> > >>
> > >> > I ran 100 TPC-DS SQL queries on 10 GB of data, with 2 workers
> > >> > (TMs), each with 32 GB of memory and 16 cores.
> > >> > Flink took 18 seconds per query on average.
> > >> > Trino took 7 seconds per query on average.
> > >> >
> > >> > I saw that in tests by engineers from ByteDance and Alibaba,
> > >> > Flink's OLAP performance is close to Presto's, but the gap in my
> > >> > test is large. I'd like to discuss whether something is wrong with
> > >> > my Flink settings. I configured the Flink parameters basically
> > >> > following the template in the project below.
> > >> > https://github.com/ververica/flink-sql-benchmark
> > >> >
> > >> >
> > >> > LuNing Wang  wrote on Fri, Apr 15, 2022 at 14:34:
> > >> >
> > >> > > I ran 100 SQL queries.
> > >> > >
> > >> >
> > >>
> > >
> >
>


Re: The file STDOUT does not exist on the TaskExecutor exception

2022-04-20 Thread Zhilong Hong
Hello, 卓宇:

This error comes from the REST API. It means you clicked the Stdout tab on
the TaskManager page of the Flink Dashboard, but the stdout file cannot be
accessed on the corresponding TaskManager, hence the error. It does not
affect the normal execution of your job and can be ignored.

Best,
Zhilong

On Wed, Apr 20, 2022 at 3:06 PM 陈卓宇 <2572805...@qq.com.invalid> wrote:

> Hi:
> I'd like to ask what causes this exception, what impact it has on
> production, and how to get rid of it.
>
> java.util.concurrent.CompletionException:
> org.apache.flink.util.FlinkException: The file STDOUT does not exist on the
> TaskExecutor.
>
> at
> org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$requestFileUploadByFilePath$24(TaskExecutor.java:2064)
> ~[flink-dist_2.11-1.13.1.jar:1.13.1]
>
> at
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
> ~[?:1.8.0_202]
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> ~[?:1.8.0_202]
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> ~[?:1.8.0_202]
>
> at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_202]
>
> Caused by: org.apache.flink.util.FlinkException: The file STDOUT does not
> exist on the TaskExecutor.
>
> ... 5 more
>
> 2022-04-20 14:51:47,370 ERROR
> org.apache.flink.runtime.rest.handler.taskmanager.TaskManagerStdoutFileHandler
> [] - Unhandled exception.
>
> org.apache.flink.util.FlinkException: The file STDOUT does not exist on
> the TaskExecutor.
>
> at
> org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$requestFileUploadByFilePath$24(TaskExecutor.java:2064)
> ~[flink-dist_2.11-1.13.1.jar:1.13.1]
>
> at
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
> ~[?:1.8.0_202]
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> ~[?:1.8.0_202]
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> ~[?:1.8.0_202]
>
> at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_202]
> 卓宇
>
>
> 


Re: Flink OLAP vs. Trino TPC-DS comparison

2022-04-15 Thread Zhilong Hong
Hello, Luning!

We are also following Flink's performance in OLAP scenarios. Which versions
of Flink and Trino did you test? Also, I noticed that the cluster
configuration used in flink-sql-benchmark differs from yours; you may need
to adjust resource settings such as taskmanager.memory.process.size in
flink-conf.yaml according to your cluster resources, as sketched below.
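For instance, here is a minimal flink-conf.yaml sketch aligned with the
32 GB / 16-core workers described below; the values are assumptions for
illustration, not recommendations:

# resource settings for a hypothetical 32 GB / 16-core TaskManager
taskmanager.memory.process.size: 32g
taskmanager.numberOfTaskSlots: 16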

Best,
Zhilong

On Fri, Apr 15, 2022 at 2:38 PM LuNing Wang  wrote:

> I ran 100 TPC-DS SQL queries on 10 GB of data, with 2 workers (TMs), each
> with 32 GB of memory and 16 cores.
> Flink took 18 seconds per query on average.
> Trino took 7 seconds per query on average.
>
> I saw that in tests by engineers from ByteDance and Alibaba, Flink's OLAP
> performance is close to Presto's, but the gap in my test is large. I'd
> like to discuss whether something is wrong with my Flink settings. I
> configured the Flink parameters basically following the template in the
> project below.
> https://github.com/ververica/flink-sql-benchmark
>
>
> LuNing Wang  wrote on Fri, Apr 15, 2022 at 14:34:
>
> > I ran 100 SQL queries.
> >
>


Re: io.network.netty.exception

2022-03-07 Thread Zhilong Hong
Hi, 明文:

This error actually means the TM was lost, which is usually caused by the TM
being killed. To find out why, check the TM's Flink logs and GC logs, the
cluster-level NodeManager logs (in a YARN environment), or the Kubernetes
logs. Common causes include GC pauses long enough that the TM's heartbeat
times out and it gets killed, or the TM exceeding its memory budget so that
its container/pod gets killed.
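As an illustration, in a YARN environment the aggregated container logs can
be pulled and searched roughly like this (a sketch; the application ID is a
placeholder):

$ yarn logs -applicationId application_1646000000000_0001 \
    | grep -i "killing container"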

Best,
Zhilong

On Mon, Mar 7, 2022 at 10:18 AM 潘明文  wrote:

> Hi, a Flink job that reads from Kafka and writes to HBase and Kafka keeps
> failing with this error:
>
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
> Connection unexpectedly closed by remote task manager 'cdh02/xxx:42892'.
> This might indicate that the remote task manager was lost.


Re: Task Manager shutdown causing jobs to fail

2022-03-07 Thread Zhilong Hong
Hi, Puneet:

As Terry says, if your job fails unexpectedly, you could check the
restart-strategy configuration in your flink-conf.yaml. If the restart
strategy is set to disabled or none, the job will transition to FAILED as
soon as it encounters a failover. The job will also fail itself if the
failover rate or the number of restart attempts exceeds the configured
limit. For more information, please refer to [1] and [2].
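For example, a fixed-delay strategy could be configured in flink-conf.yaml
like this (a minimal sketch; the values are placeholders rather than
recommendations):

# restart up to 10 times, waiting 10 s between attempts
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 10
restart-strategy.fixed-delay.delay: 10 s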

Best,
Zhilong

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#fault-tolerance
[2]
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#exponential-delay-restart-strategy

On Mon, Mar 7, 2022 at 11:45 PM Puneet Duggal 
wrote:

> Hi Terry Wang,
>
> Adding to the context provided above: whenever a task manager goes down,
> the jobs on it go into a failed state and do not restart, even though
> there are enough free slots available on other task managers for them to
> be restarted on.
>
> Regards,
> Puneet
>
> On 04-Mar-2022, at 4:54 PM, Terry Wang  wrote:
>
> Hi, Puneet~
>
> AFAIK, it is expected behavior that jobs on a crashed TaskManager
> restart. HA means there is no single-point-of-failure risk, but a Flink
> job still needs to go through failover to ensure state and data
> consistency. You may refer to
> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/state/task_failure_recovery/
> for more details.
>
> On Fri, Mar 4, 2022 at 2:50 AM Puneet Duggal 
> wrote:
>
>> Hi,
>>
>> Currently in production, I have an HA session-mode Flink cluster with 3
>> job managers and multiple task managers with more than enough free task
>> slots. But I have seen multiple times that whenever a task manager goes
>> down (e.g. due to a heartbeat issue), so do all the jobs running on it,
>> even when there are standby task managers available with free slots to
>> run them on. Has anyone faced this issue?
>>
>> Regards,
>> Puneet
>
>
>
> --
> Best Regards,
> Terry Wang
>
>
>


Re: PyFlink : submission via rest

2022-03-05 Thread Zhilong Hong
Hi, Aryan:

You could refer to the official docs [1] for how to submit PyFlink jobs.

$ ./bin/flink run \
  --target yarn-per-job \
  --python examples/python/table/word_count.py

With this command you can submit a per-job application to YARN. The docs
[2] and [3] describe how to submit jobs to the YARN session and the
Kubernetes session.

$ ./bin/flink run -t yarn-session \
  -Dyarn.application.id=application__YY \
  --python examples/python/table/word_count.py

$ ./bin/flink run -t kubernetes-session \
  -Dkubernetes.cluster-id=my-first-flink-cluster \
  --python examples/python/table/word_count.py

Best,
Zhilong

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/cli/#submitting-pyflink-jobs
[2]
https://nightlies.apache.org/flink/flink-docs-master/zh/docs/deployment/resource-providers/yarn/#session-mode
[3]
https://nightlies.apache.org/flink/flink-docs-master/zh/docs/deployment/resource-providers/native_kubernetes/#session-mode

On Sun, Mar 6, 2022 at 2:08 AM aryan m  wrote:

> Hi!
> In a session cluster, what is the recommended way to submit a PyFlink
> job via REST? I am on Flink 1.13 and my job code is available at
> web.upload.dir.
>
>   Appreciate the help!
>
>
>


Re: Flink failure rate restart not working as expected

2022-03-02 Thread Zhilong Hong
Hi, Jiaqiao:

Since your job enables checkpointing, you can simply try removing the
restart strategy config. The default will then be fixed-delay with
Integer.MAX_VALUE restart attempts and a '1 s' delay, as mentioned in [1].
This way, when a failover occurs, your job will wait for 1 second before it
restarts. Since the maximum number of restart attempts is Integer.MAX_VALUE,
the job will not transition to FAILED unless a fatal error occurs.
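For reference, the default described above corresponds roughly to this
explicit configuration (a sketch; with checkpointing enabled and no restart
strategy configured, Flink applies these values by itself):

# flink-conf.yaml equivalent of the checkpointing-enabled default
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 2147483647   # Integer.MAX_VALUE
restart-strategy.fixed-delay.delay: 1 s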

Best,
Zhilong

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#restart-strategy

On Wed, Mar 2, 2022 at 1:55 PM 刘 家锹  wrote:

> Hi, all
>
> I think we may have found the reason; it's related to the
> '*jobmanager.execution.failover-strategy*' configuration and the number
> of regions in the job. In our case, we set the failover strategy to
> 'region', and this job has 6 regions running on only one TaskManager. So
> when the container goes down, every region needs to be restarted, because
> they all belong to that single TaskManager.
> It's easy to see that 4 retry attempts are not enough for 6 regions, so
> it is reasonable that this job exited.
> As for why my testing job didn't exit: that job is somewhat different in
> that it only has one region, so its behavior is also expected.
>
> For us, we changed the failover strategy to 'full', since most of our
> jobs have only one TaskManager and a simple topology; that will be
> helpful in most cases. Moreover, since it is rather complex to configure
> the right parameters when combined with region failover, we apply region
> failover to complex jobs only.
>
> Any best practices about pipelined-region failover restarts, or
> documentation about regions, would be helpful.
>
> Again, thanks for your time replying; it helped us a lot.
> --
> *From:* 刘 家锹 
> *Sent:* March 1, 2022 23:06
> *To:* Matthias Pohl ; user ;
> David Morávek 
> *Subject:* Re: Flink failure rate restart not working as expected
>
> I realized I missed mentioning something above: the container exit code
> is 163, which is not a normal code; at least I can't find any meaning for
> it on Google. So my test didn't cover this situation, and I don't know
> whether it affects the results.
>
> --
> *From:* 刘 家锹 
> *Sent:* Tuesday, March 1, 2022 10:23:50 PM
> *To:* Matthias Pohl ; user ;
> David Morávek 
> *Subject:* Re: Flink failure rate restart not working as expected
>
> We didn't find any obvious configuration issues in our cluster. As far
> as I know, it works fine in most cases. I also simulated a failover under
> the current configuration by starting a new job with only one TaskManager
> and then killing the TaskManager container, and that job recovered from
> the failure successfully.
> As you said, the YARN logs suggest there may be some problems; we'll try
> digging into them to see if we can find any hints.
>
> --
> *From:* Matthias Pohl 
> *Sent:* Tuesday, March 1, 2022 9:50:36 PM
> *To:* 刘 家锹 ; user ;
> David Morávek 
> *Subject:* Re: Flink failure rate restart not working as expected
>
> The YARN node manager logs support my observation: The container exits
> with a failure which, if I understand it correctly, should cause a
> container restart on the YARN side. In HA mode, Flink expects the
> underlying resource management to restart the Flink cluster in case of
> failure. This does not seem to happen in your case. Is there a
> configuration issue in your YARN cluster? Or does the container recovery
> usually work in failure cases for you? I'm not that experienced with YARN
> deployments. I'm adding David to this thread. He might have some additional
> insights.
>
> Matthias
>
> On Tue, Mar 1, 2022 at 12:19 PM 刘 家锹  wrote:
>
> Unfortunately we didn't keep the logs properly; this happened too long
> ago, the YARN ResourceManager logs have been cleaned up, and the broken
> machine has been reinstalled. We only found the YARN log of the
> JobManager on the YARN NodeManager, which may be useless. We will post
> the detailed logs to this thread when it happens again; it does happen
> from time to time, roughly every couple of weeks, when one of our cluster
> machines goes down.
> --
> *From:* Matthias Pohl 
> *Sent:* March 1, 2022 17:57
> *To:* Alexander Preuß 
> *Cc:* 刘 家锹 ; user@flink.apache.org <
> user@flink.apache.org>
> *Subject:* Re: Flink failure rate restart not working as expected
>
> Hi,
> I second Alex's observation - based on the logs it looks like the task
> restart functionality worked as expected: It tried to restart the tasks
> until it reached the limit of 4 attempts due to the missing TaskManager.
> The job-cluster shut down with an error code. At this point, YARN should
> pick it up and bring up a new JobManager based on the non-0 exit code of
> the Flink cluster. It would be interesting to see the YARN logs to figure
> out why the cluster failover didn't work.
>
> Best,
> Matthias
>
> On Tue, Mar 1, 2022 at 8:00 AM Alexander Preuß <
> alexanderpre...@ververica.com> wrote:
>
> Hi,
> from a first glance it looks like 

Re: Flink job recovery after task manager failure

2022-02-24 Thread Zhilong Hong
Hi, Afek

I've read the log you provided. Since you've set restart-strategy to
exponential-delay and the value of
restart-strategy.exponential-delay.initial-backoff is 10s, every time a
failover is triggered the JobManager has to wait for 10 seconds before it
restarts the job. If you'd prefer a quicker restart, you could change the
restart strategy to fixed-delay and set a small value for
restart-strategy.fixed-delay.delay.
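For example (a sketch; the delay value is a placeholder chosen only to
illustrate a quicker restart):

# flink-conf.yaml: restart shortly after a failover
restart-strategy: fixed-delay
restart-strategy.fixed-delay.delay: 1 s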

Furthermore, two more failovers happened during the initialization of the
recovered tasks. During initialization, a task tries to recover its state
from the last valid checkpoint, and a FileNotFound exception occurred during
that recovery process. I'm not quite sure of the reason; since the recovery
succeeded after two failovers, I suspect the local disks of your cluster are
not stable.

Sincerely,
Zhilong

On Thu, Feb 24, 2022 at 9:04 PM Afek, Ifat (Nokia - IL/Kfar Sava) <
ifat.a...@nokia.com> wrote:

> Thanks Zhilong.
>
>
>
> The first launch of our job is fast, so I don't think that's the issue.
> I see in the Flink JobManager log that there were several exceptions
> during the restart, and the task manager was restarted a few times until
> it stabilized.
>
>
>
> You can find the log here:
>
> jobmanager-log.txt.gz
> <https://nokia-my.sharepoint.com/:u:/p/ifat_afek/EUsu4rb_-BpNrkpvSwzI-vgBtBO9OQlIm0CHtW0gsZ7Gqg?email=zhlonghong%40gmail.com=ww5Idt>
>
>
>
> Thanks,
>
> Ifat
>
>
>
> *From: *Zhilong Hong 
> *Date: *Wednesday, 23 February 2022 at 19:38
> *To: *"Afek, Ifat (Nokia - IL/Kfar Sava)" 
> *Cc: *"user@flink.apache.org" 
> *Subject: *Re: Flink job recovery after task manager failure
>
>
>
> Hi, Afek!
>
>
>
> When a TaskManager is killed, the JobManager will not notice it until a
> heartbeat timeout happens. Currently, the default value of
> heartbeat.timeout is 50 seconds [1]. That's why it takes more than 30
> seconds for Flink to trigger a failover. If you'd like to shorten the
> time until a failover is triggered in this situation, you could decrease
> the value of heartbeat.timeout in flink-conf.yaml. However, if the value
> is set too small, heartbeat timeouts will happen more frequently and the
> cluster will be unstable. As FLINK-23403 [2] mentions, if you are using
> Flink 1.14 or 1.15, you could try to set the value to 10s.
>
>
>
> You mentioned that it takes 5-6 minutes to restart the jobs, which seems
> a bit weird. How long does it take to deploy your job for a brand-new
> launch? You could compress the JobManager log, upload it to Google Drive
> or OneDrive, and attach the sharing link. Maybe we can find out what
> happened via the log.
>
>
>
> Sincerely,
>
> Zhilong
>
>
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#heartbeat-timeout
>
> [2] https://issues.apache.org/jira/browse/FLINK-23403
>
>
>
> On Thu, Feb 24, 2022 at 12:25 AM Afek, Ifat (Nokia - IL/Kfar Sava) <
> ifat.a...@nokia.com> wrote:
>
> Hi,
>
>
>
> I am trying to use the Flink checkpointing solution in order to support
> task manager recovery.
>
> I'm running Flink via Beam with filesystem storage and the following
> parameters:
>
> checkpointingInterval=3
>
> checkpointingMode=EXACTLY_ONCE.
>
>
>
> What I see is that if I kill a task manager pod, it takes Flink about 30
> seconds to identify the failure and another 5-6 minutes to restart the
> jobs.
>
> Is there a way to shorten the downtime? What is the expected downtime,
> from the task manager being killed until the jobs are recovered? Are
> there any best practices for handling it (e.g. different configuration
> parameters)?
>
>
>
> Thanks,
>
> Ifat
>
>
>
>


Re: Flink job recovery after task manager failure

2022-02-23 Thread Zhilong Hong
Hi, Afek!

When a TaskManager is killed, the JobManager will not notice it until a
heartbeat timeout happens. Currently, the default value of heartbeat.timeout
is 50 seconds [1]. That's why it takes more than 30 seconds for Flink to
trigger a failover. If you'd like to shorten the time until a failover is
triggered in this situation, you could decrease the value of
heartbeat.timeout in flink-conf.yaml. However, if the value is set too
small, heartbeat timeouts will happen more frequently and the cluster will
be unstable. As FLINK-23403 [2] mentions, if you are using Flink 1.14 or
1.15, you could try to set the value to 10s.
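For example (a sketch; note that heartbeat.timeout is specified in
milliseconds, so 10s corresponds to the value below):

# flink-conf.yaml: lower the heartbeat timeout from the 50 s default
heartbeat.timeout: 10000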

You mentioned that it takes 5-6 minutes to restart the jobs, which seems a
bit weird. How long does it take to deploy your job for a brand-new launch?
You could compress the JobManager log, upload it to Google Drive or
OneDrive, and attach the sharing link. Maybe we can find out what happened
via the log.

Sincerely,
Zhilong

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#heartbeat-timeout
[2] https://issues.apache.org/jira/browse/FLINK-23403

On Thu, Feb 24, 2022 at 12:25 AM Afek, Ifat (Nokia - IL/Kfar Sava) <
ifat.a...@nokia.com> wrote:

> Hi,
>
>
>
> I am trying to use the Flink checkpointing solution in order to support
> task manager recovery.
>
> I'm running Flink via Beam with filesystem storage and the following
> parameters:
>
> checkpointingInterval=3
>
> checkpointingMode=EXACTLY_ONCE.
>
>
>
> What I see is that if I kill a task manager pod, it takes Flink about 30
> seconds to identify the failure and another 5-6 minutes to restart the
> jobs.
>
> Is there a way to shorten the downtime? What is the expected downtime,
> from the task manager being killed until the jobs are recovered? Are
> there any best practices for handling it (e.g. different configuration
> parameters)?
>
>
>
> Thanks,
>
> Ifat
>
>
>


Re: When are a TaskManager's slots released?

2022-01-25 Thread Zhilong Hong
Hello, johnjlong:

TaskExecutor#cancel is an RPC call; it carries no information about whether
the TM is alive. TM liveness is detected by the Heartbeat Service, and the
default value of the heartbeat.timeout option [1] is currently 50s, while
the default of the RPC timeout option akka.ask.timeout [2] is 10s. If you
want lost TMs to be detected as soon as possible, you can decrease these two
options, but doing so may make the cluster or jobs unstable.
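For example (a sketch; the values are only illustrative, and as noted above,
values that are too small may destabilize the cluster):

# flink-conf.yaml: detect lost TaskManagers sooner
heartbeat.timeout: 10000   # milliseconds, default 50000
akka.ask.timeout: 5 s      # default 10 s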

There have already been discussions in the community about lowering the
heartbeat timeout; for details, see [3] and [4].

[1]
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/#heartbeat-timeout
[2]
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/#akka-ask-timeout
[3] https://issues.apache.org/jira/browse/FLINK-23403
[4] https://issues.apache.org/jira/browse/FLINK-23209

Sincerely,
Zhilong

On Tue, Jan 25, 2022 at 10:06 AM johnjlong  wrote:

> Hi all, I have a question.
>
> When I actively release a TM's connection by its ResourceID, I find that
> the TM's slots are merely marked as free. They are only really released
> when the JobMaster actively cancels the whole ExecutionGraph, at which
> point it calls the cancel method on the TM hosting each vertex's slot,
> one by one. But by then the associated TM has already been closed, so the
> RPC times out (20s by default), and only after that are the slots
> released.
>
> My question is: why not check whether the TM is alive before calling
> TaskExecutor's cancelTask, and if it is not alive, go directly through
> the cancel flow instead of waiting for the RPC timeout before proceeding?
>
> Screenshot of the logs attached:
>
> johnjlong
> johnjl...@163.com
>
>


Re: Do Flink jobs support automatic resource scaling?

2021-12-11 Thread Zhilong Hong
For streaming jobs, take a look at Reactive Mode [1] and the adaptive
scheduler introduced in 1.13, which adjust a job's parallelism as resources
change: you scale the resources based on job metrics, and Flink then adapts
the job to the resource change. For batch jobs, take a look at the adaptive
batch scheduler coming in 1.15 [2], in which vertex parallelism is adjusted
automatically according to the data volume.
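As an illustration, Reactive Mode is enabled in standalone application mode
roughly as follows (a sketch based on the elastic_scaling docs in [1]; the
example job class ships with the Flink distribution):

# start an application cluster with the reactive scheduler, then scale the
# job by adding or removing TaskManagers
$ ./bin/standalone-job.sh start -Dscheduler-mode=reactive \
    -j org.apache.flink.streaming.examples.windowing.TopSpeedWindowing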

[1]
https://nightlies.apache.org/flink/flink-docs-release-1.14/zh/docs/deployment/elastic_scaling/
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-187%3A+Adaptive+Batch+Job+Scheduler

On Wed, Dec 8, 2021 at 5:42 PM casel.chen  wrote:

>
> A real-time job's resource consumption has peaks and valleys that follow
> the upstream business traffic. Does the latest version of Flink support
> scaling up automatically (adding CPU/memory, increasing the parallelism,
> etc.) when traffic is high, and scaling down when traffic is low?
> If so, how long does scaling usually take, and does it affect the normal
> execution of the job?


Re: New blog post published - Sort-Based Blocking Shuffle Implementation in Flink

2021-11-08 Thread Zhilong Hong
Thank you for writing this blog post, Daisy and Kevin! It helps me to
understand what sort-based shuffle is and how to use it. Looking forward to
your future improvements!

On Wed, Nov 3, 2021 at 6:32 PM Yuxin Tan  wrote:

> Thanks Daisy and Kevin! The IO scheduling idea for sequential reading and
> the benchmark results look really great! Looking forward to the follow-up
> work.
>
> Best,
>
> Yuxin
>
> weijie guo  于2021年11月3日周三 下午5:24写道:
>
>> It's really amazing work that fills a gap in Flink's batch shuffle. I
>> really appreciate the work done on IO scheduling: sequential reading in
>> the shuffle reader can greatly improve disk IO performance and
>> stability, and the sort-based shuffle realizes this in a concise and
>> efficient way. By the way, the default shuffle implementation in Flink's
>> batch mode is still hash-based; maybe we can consider making the new
>> shuffle implementation the default later. Last but not least, thanks to
>> Yingjie Cao (Kevin) and Daisy Tsang for publishing this blog.
>>
>> Lijie Wang  于2021年11月3日周三 下午4:17写道:
>>
>>> Thanks Daisy and Kevin for bringing us this blog; it is very helpful
>>> for understanding the principles of the sort shuffle.
>>>
>>>
>>> Best,
>>>
>>> Lijie
>>>
>>> Guowei Ma  于2021年11月3日周三 下午2:57写道:
>>>

 Thanks Daisy & Kevin for your introduction to the improvements to the TM
 blocking shuffle; credit-based + IO scheduling is indeed a very
 interesting approach. I also look forward to this becoming the default
 setting for the TM blocking shuffle.

 Best,
 Guowei


 On Wed, Nov 3, 2021 at 2:46 PM Gen Luo  wrote:

> Thanks Daisy and Kevin! The benchmark results look really exciting!
>
> On Tue, Nov 2, 2021 at 4:38 PM David Morávek  wrote:
>
>> Thanks Daisy and Kevin for a great write-up! ;) Especially the 2nd part
>> was really interesting; I really like the idea of the single spill file
>> with custom scheduling of read requests.
>>
>> Best,
>> D.
>>
>> On Mon, Nov 1, 2021 at 10:01 AM Daisy Tsang 
>> wrote:
>>
>>> Hey everyone, we have a new two-part post published on the Apache
>>> Flink blog about the sort-based blocking shuffle implementation in 
>>> Flink.
>>> It covers benchmark results, design and implementation details, and 
>>> more!
>>> We hope you like it and welcome any sort of feedback on it. :)
>>>
>>>
>>> https://flink.apache.org/2021/10/26/sort-shuffle-part1.html
>>> https://flink.apache.org/2021/10/26/sort-shuffle-part2.html
>>>
>>


Re: [ANNOUNCE] Apache Flink 1.11.2 released

2020-09-18 Thread Zhilong Hong
Thank you, @ZhuZhu, for driving this release!


Best regards,

Zhilong


From: Zhu Zhu 
Sent: Thursday, September 17, 2020 13:29
To: dev ; user ; user-zh 
; Apache Announce List 
Subject: [ANNOUNCE] Apache Flink 1.11.2 released

The Apache Flink community is very happy to announce the release of Apache
Flink 1.11.2, which is the second bugfix release for the Apache Flink 1.11
series.

Apache Flink® is an open-source stream processing framework for
distributed, high-performing, always-available, and accurate data streaming
applications.

The release is available for download at:
https://flink.apache.org/downloads.html

Please check out the release blog post for an overview of the improvements
for this bugfix release:
https://flink.apache.org/news/2020/09/17/release-1.11.2.html

The full release notes are available in Jira:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12348575

We would like to thank all contributors of the Apache Flink community who
made this release possible!

Thanks,
Zhu