Re: How to achieve at-most-once (At Most Once) semantics in Flink

2020-09-03 Thread Yun Tang
Hi If you rely entirely on the source's offset management, you can achieve semantics similar to at most once. The community has also discussed a more complete checkpoint-based at-most-once implementation; I have CC'd the relevant developer @yuanmei.w...@gmail.com Best Yun Tang From: Paul Lam Sent: Thursday, September 3, 2020 17:28 To: user-zh Subject: Re:

Re: Some data in keyed state cannot be read when restoring from a savepoint/checkpoint

2020-09-03 Thread Yun Tang
Hi I don't think this is the root cause. A transient ListState is actually correct usage, because state should be initialized in the function's open() method, so the transient modifier is fine. Please post the code that initializes and uses this list state. Best Yun Tang From: Liu Rising Sent: Thursday, September 3, 2020 12:26 To: user-zh@flink.apache.org Subject: Re:

Re: Flink Migration

2020-08-28 Thread Yun Tang
be resolved. You can also check the service of 'jobmanager' whether work as expected via 'kubectl get svc' . [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/upgrading.html#compatibility-table Best Yun Tang From: Navneeth Krishnan Sent: Friday

Re: Flink documentation

2020-08-28 Thread Yun Tang
Hi If a SQL statement fails to parse, you can create a ticket at https://issues.apache.org/jira/projects/FLINK/issues to point it out, and a developer will come to help soon. Note, however, that the description needs to be written in English. Best Yun Tang From: Dream-底限 Sent: Friday, August 28, 2020 16:42 To: user-zh@flink.apache.org Subject: Flink documentation hi,

Re: Can't accumulator-type metrics be reported to Prometheus from Flink?

2020-08-28 Thread Yun Tang
Hi There is no metric type named accumulator; currently there are only the four types Counters, Gauges, Histograms and Meters [1]. If you want a cumulative metric, consider counters. [1] https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#metric-types Best Yun Tang From: 赵一旦 Sent: Friday, August 28, 2020

Re: After starting from a savepoint, no last restore information is shown on the checkpoint page

2020-08-27 Thread Yun Tang
Hi 范超 Although I cannot see your image, your launch command is wrong: all options should be placed before the path to the jar file [1]. 1. The class name should come before the jar path [2]. 2. The savepoint/checkpoint path should come before the jar path [3]. This is most likely the reason the job was not restored correctly from the checkpoint. [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/cli.html#usage [2]

Re: Performance issue associated with managed RocksDB memory

2020-08-27 Thread Yun Tang
/dataArtisans/frocksdb/blob/49bc897d5d768026f1eb816d960c1f2383396ef4/include/rocksdb/write_buffer_manager.h#L47 [2] https://github.com/dataArtisans/frocksdb/blob/49bc897d5d768026f1eb816d960c1f2383396ef4/db/column_family.cc#L196 Best Yun Tang From: Juha Mynttinen Sent: Monday, August 24, 2020 15:5

Re: [ANNOUNCE] New PMC member: Dian Fu

2020-08-27 Thread Yun Tang
Congratulations , Dian! Best Yun Tang From: Yang Wang Sent: Friday, August 28, 2020 10:28 To: Arvid Heise Cc: Benchao Li ; dev ; user-zh ; Dian Fu ; user Subject: Re: [ANNOUNCE] New PMC member: Dian Fu Congratulations Dian ! Best, Yang Arvid Heise

Re: Reply: Checkpoint failures in a streaming job

2020-08-27 Thread Yun Tang
Hi Robert Are your two sources, firstSource and secondSource, implemented by yourself? One suspicion is that a source holds the checkpoint lock while it has no data [1][2], so the checkpoint barrier is never emitted downstream. I suggest using jstack while no data is being emitted to inspect the Java call stacks of the source tasks and check whether they are waiting for a lock to be released. [1]

Re: Monitor the usage of keyed state

2020-08-26 Thread Yun Tang
ects/flink/flink-docs-release-1.11/ops/config.html#state-backend-rocksdb-memory-managed [2] https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/config.html#state-backend-rocksdb-metrics-block-cache-usage Best, Yun Tang From: Andrey Zagrebin Sent: Tue

Re: Roughly when will 1.11.2 be released?

2020-08-26 Thread Yun Tang
You can refer to the release history at https://flink.apache.org/downloads.html#all-stable-releases; the interval is usually about three months, i.e. around the end of October. Is there a particular bug fix in 1.11.2 that you are waiting for? Best Yun Tang From: abc15...@163.com Sent: Wednesday, August 26, 2020 15:41 To: user-zh@flink.apache.org Subject: Roughly when will 1.11.2 be released? Roughly when will 1.11.2 be released?

Re: Flink checkpoints cause severe backpressure

2020-08-25 Thread Yun Tang
Hi For a checkpoint that has already been switched to at-least-once, its only impact on job throughput during checkpointing is the snapshot taken in the task's synchronous phase; because that snapshot holds the same lock as the task main thread's data access, it affects data processing in the main thread. Even so, I doubt that checkpointing itself is the culprit preventing the job from keeping up during the 10 a.m. peak. Using an asynchronous state backend that supports incremental checkpoints (e.g. RocksDBStateBackend) would greatly alleviate the problem. Suggested troubleshooting steps: 1. Check which state backend type is used 2.

Re: [ANNOUNCE] Apache Flink 1.10.2 released

2020-08-25 Thread Yun Tang
Thanks for Zhu's work to manage this release and everyone who contributed to this! Best, Yun Tang From: Yangze Guo Sent: Tuesday, August 25, 2020 14:47 To: Dian Fu Cc: Zhu Zhu ; dev ; user ; user-zh Subject: Re: [ANNOUNCE] Apache Flink 1.10.2 released Thanks

Re: Is it possible to use TiKV as a distributed backend for Flink?

2020-08-25 Thread Yun Tang
Hi I think this idea is worth trying, but at the moment a lot would need to change: 1. The logic that creates RocksDB checkpoints would need to be changed to write to TiKV. 2. The logic that resumes RocksDB from a checkpoint would need to be changed. 3. If you want the data to be readable from TiKV, then either the format stored in TiKV is identical to the one stored in RocksDB, in which case lookups must be able to deserialize Flink's RocksDB storage format, or it is a new format, which would lengthen both RocksDB's checkpoint procedure and its duration. 4. Updating the data in TiKV depends on the checkpoint

Re: Is it possible to use TiKV as a distributed backend for Flink?

2020-08-24 Thread Yun Tang
Hi TiKV itself is distributed and multi-replica, comparable to HBase, so rather than pulling it toward Flink's built-in state backends, it is closer to the way Flink reads and writes HBase; that way several Flink jobs writing to TiKV achieve data sharing. If you try to pull TiKV toward being a Flink state backend, TiKV's distributed architecture, multi-replica mechanism and network transport (rather than local disk access) all become drawbacks, or features that no longer need to exist. In the end it would evolve into today's Flink + RocksDB state-backend architecture, and given that TiKV is itself based on RocksDB, there is not much point overall. Best Yun Tang

Re: How to sink invalid data from flatmap

2020-08-24 Thread Yun Tang
Hi Chao I think side output [1] might meet your requirements. [1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/side_output.html Best Yun Tang From: 范超 Sent: Tuesday, August 25, 2020 10:54 To: user Subject: How to sink invalid data from
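
For reference, a minimal hedged sketch of the side-output approach mentioned above (the tag name, the parsing logic and the assumption that input is a DataStream<String> are illustrative, not taken from the original thread):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;
    import org.apache.flink.util.OutputTag;

    // Hypothetical tag for records that fail validation.
    final OutputTag<String> invalidTag = new OutputTag<String>("invalid-records") {};

    SingleOutputStreamOperator<Integer> parsed = input
        .process(new ProcessFunction<String, Integer>() {
            @Override
            public void processElement(String value, Context ctx, Collector<Integer> out) throws Exception {
                try {
                    out.collect(Integer.parseInt(value));   // valid records continue downstream
                } catch (NumberFormatException e) {
                    ctx.output(invalidTag, value);          // invalid records go to the side output
                }
            }
        });

    // The side output can then be written to its own sink, e.g. a dead-letter topic.
    DataStream<String> invalid = parsed.getSideOutput(invalidTag);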

Re: State serialization question

2020-08-21 Thread Yun Tang
What I want to know is: when declaring the type information in a MapStateDescriptor, should I declare the inner Map as a concrete HashMap type rather than the Map interface? Yun Tang wrote on Friday, August 21, 2020 at 00:13: > Hi > > If you want to store Strings in a list, i.e. each element of the list is a String, the type of the ListState should be > ListState<String>, not > ListState<List<String>>; the latter means there is one list in which every element is itself a list > > ListState itself is not a Java collection, so there is no ArrayLis

Re: Flink checkpointing with Azure block storage

2020-08-20 Thread Yun Tang
using default (Memory / JobManager) MemoryStateBackend". You can view the log to see whether your changes printed to search for "Loading configuration property". [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/filesystems/azure.html#credentials-configur

Re: State serialization question

2020-08-20 Thread Yun Tang
Hi If you want to store Strings in a list, i.e. each element of the list is a String, the type of the ListState should be ListState<String>, not ListState<List<String>>; the latter means there is one list in which every element is itself a list. ListState itself is not a Java collection, so there is no distinction between ArrayList and LinkedList. Best Yun Tang From: shizk233 Sent: Thursday, August 20, 2020 18:00 To:
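
For illustration, a minimal sketch of declaring and using such a list state inside a rich function's open() method (the class and state names are invented for the example, and keyed state like this requires the function to run on a keyed stream):

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.ListState;
    import org.apache.flink.api.common.state.ListStateDescriptor;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;

    public class CollectNames extends RichFlatMapFunction<String, String> {
        // transient is fine: the state handle is (re)initialized in open()
        private transient ListState<String> names;

        @Override
        public void open(Configuration parameters) {
            // ListState<String>: every element of the list is a single String
            ListStateDescriptor<String> descriptor =
                new ListStateDescriptor<>("names", Types.STRING);
            names = getRuntimeContext().getListState(descriptor);
        }

        @Override
        public void flatMap(String value, Collector<String> out) throws Exception {
            names.add(value);   // appends one String; no ArrayList vs LinkedList choice is involved
        }
    }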

Re: Incremental checkpoint

2020-08-20 Thread Yun Tang
Hi Incremental checkpoints actually have no direct connection with the information on the web UI. Incremental-checkpoint bookkeeping is reference-counted by the SharedStateRegistry [1] in the CheckpointCoordinator, while how many checkpoints are retained is managed by the CheckpointStore [2]. With 2 retained checkpoints the sequence is as follows: chk-1 completed --> register chk-1 in state registry --> add to checkpoint store chk-2 completed --> register chk-2 in state

Re: Could different timeouts be configured for checkpoints and savepoints?

2020-08-19 Thread Yun Tang
Hi The community already has a ticket for this request [1], but the demand has never been very strong, since in most cases simply increasing the checkpoint timeout is enough, and increasing the checkpoint timeout does not mean the checkpoint will occupy more resources. [1] https://issues.apache.org/jira/browse/FLINK-9465 Best Yun Tang From: 赵一旦 Sent: Tuesday, August 18, 2020 14:38 To: user-zh@flink.apache.org

Re: Question about the Flink 1.11 web UI

2020-08-19 Thread Yun Tang
Hi 1. The number of boxes is the result of operator chaining being enabled by default; the text on the connecting edges (e.g. hash) is determined by how the operators exchange data over the network [2]. 2. "Records received" is 0 because this metric counts records received over Flink's internal channels; since the source node does not obtain its data from a Flink channel (it usually reads from an external system), its records-received count is naturally 0. [1]
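
If you want the web UI to show each operator in its own box while debugging, chaining can be switched off; a small hedged sketch (env, stream and the mapper are assumed to exist, and note that disabling chaining gives up a performance optimization):

    // Disable chaining for the whole job: every operator gets its own box in the web UI.
    env.disableOperatorChaining();

    // Or break the chain only at one operator:
    stream.map(new MyMapper()).disableChaining();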

Re: Flink checkpoint recovery time

2020-08-18 Thread Yun Tang
. Unfortunately, this part of the time is also not recorded in metrics at the moment. If you find the task is in RUNNING state but not consuming any records, it might be stuck restoring the checkpoint/savepoint. Best Yun Tang From: Zhinan Cheng Sent: Tuesday, August 18, 2020 11

Re: How to use FsBackBackend without getting deprecation warning

2020-08-11 Thread Yun Tang
Hi Nikola You could use codes below to get rid of the warnings. StateBackend fsStateBackend = new FsStateBackend("hdfs://namenode:40010/flink/checkpoints"); env.setStateBackend(fsStateBackend); In fact, this warning is actually no harmful. Best Yun Tang ___

Re: Backpressure in a Flink job with large state on the filesystem state backend

2020-08-10 Thread Yun Tang
is that each checkpoint takes longer and longer, and monitoring shows the backlog on the Kafka source side growing and growing, while the same code with only the state backend changed to rocksdb does not have this problem, so it is strange that the in-memory backend performs worse than rocksdb. Yun Tang wrote on Monday, August 10, 2020 at 16:21: > Hi Yang > > When the checkpoint takes long, is it the synchronous phase or the asynchronous phase that is long, or are both short while the end-to-end time is long? > If the asynchronous phase is long, it is usually because the DFS being used performs poorly. > If no single phase is long but the overall time is long, it is very likely still because

Re: Backpressure in a Flink job with large state on the filesystem state backend

2020-08-10 Thread Yun Tang
Hi Yang When the checkpoint takes long, is it the synchronous phase or the asynchronous phase that is long, or are both short while the end-to-end time is long? If the asynchronous phase is long, it is usually because the DFS being used performs poorly. If no single phase is long but the overall time is long, it is very likely still backpressure (if exactly-once checkpointing is enabled, you can check whether a lot of data is buffered). A large backlog at the Kafka source simply means the source side is lagging, which indicates the whole job is still under backpressure; you need to locate where the backpressure actually comes from, and it does not necessarily have a direct relationship with using FsStateBackend. Best Yun Tang

Re: TaskManagers are still up even after job execution completed in PerJob deployment mode

2020-08-10 Thread Yun Tang
d when the JM is shutting down. Moreover, an idle task manager will also be released after 30 seconds by default [1]. [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#resourcemanager-taskmanager-timeout Best Yun Tang From: narasimha Sent:

Re: State keeps growing after SQL group-by and time-window operations in Flink 1.10.1/1.11.1

2020-08-09 Thread Yun Tang
Hi RocksDB's file-compaction policy depends on the number of level-0 files and the total file size of each level from level-1 to level-N; see the RocksDB community's illustrated description of compaction [1]. By default the target size of level-1 is 256MB [2], which means that while the total size of level-1 is below 256MB, no compaction should be triggered to reduce the file sizes. From your UI screenshot the checkpoint size is actually very small; I suggest waiting until the data reaches a certain scale before judging whether the "state keeps growing". [1]

Re: State Restoration issue with flink 1.10.1

2020-07-30 Thread Yun Tang
difference for this? BTW, you could also try Flink-1.9.x and Flink-1.11 to see whether problem still existed. [1] https://github.com/apache/flink/blob/c5915cf87f96e1c7ebd84ad00f7eabade7e7fe37/flink-libraries/flink-cep/src/main/java/org/apache/flink/cep/operator/CepOperator.java#L184 Best Yun Tang

Re: UnsupportedOperatorException with TensorFlow on checkpointing

2020-07-19 Thread Yun Tang
Hi Sung Gon, Have you ever registered protobuf classes with kryo[1]? [1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/custom_serializers.html Best, Yun Tang From: Sung Gon Yi Sent: Thursday, July 16, 2020 23:00 To: user@flink.apache.org Subject
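
For reference, the registration described in the custom-serializers documentation looks roughly like the sketch below (MyProtoMessage stands in for your generated protobuf class, and the com.twitter:chill-protobuf dependency is assumed to be on the classpath):

    import com.twitter.chill.protobuf.ProtobufSerializer;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Tell Kryo to serialize the generated protobuf class with chill's ProtobufSerializer.
    env.getConfig().registerTypeWithKryoSerializer(MyProtoMessage.class, ProtobufSerializer.class);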

Re: Are files in savepoint still needed after restoring if turning on incremental checkpointing

2020-07-19 Thread Yun Tang
Hi Lu Once a new checkpoint is completed when restoring from a savepoint, the previous savepoint would be useless if you decide to restore from new checkpoint. In other words, new incremental checkpoint has no relationship with older savepoint from which restored. Best Yun Tang

Re: Flink FsStatebackend is giving better performance than RocksDB

2020-07-19 Thread Yun Tang
and the rest part for other native memory would be limited to the JVM overhead space[1] and consider to increase the memory of that part. [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/memory/mem_setup.html#capped-fractionated-components Best, Yun Tang

Re: Job fails and cannot restore from checkpoint after upgrading from Flink 1.9.2 to 1.10.0

2020-07-16 Thread Yun Tang
The situation is similar to @chenxyz's. http://apache-flink.147419.n8.nabble.com/rocksdb-Could-not-restore-keyed-state-backend-for-KeyedProcessOperator-td2232.html Switching to 1.10.1 fixed it. Best wishes. Yun Tang wrote on Wednesday, July 15, 2020 at 16:35: > Hi Robin > > Your statement is not quite accurate: the community explicitly guarantees savepoint compatibility > [1], but that does not mean a checkpoint cannot be restored across major versions; the community does not promise, mainly

Re: ERROR submmiting a flink job

2020-07-15 Thread Yun Tang
once the job failed and you will meet the errors you pasted if the strategy is "no restart". [1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/task_failure_recovery.html#restart-strategies Best Yun Tang From: Aissa Elaffani Sent: Wednesda

Re: Migrating state and jobs in Flink 1.9

2020-07-13 Thread Yun Tang
Sent: Tuesday, July 14, 2020 11:57 To: user-zh@flink.apache.org Subject: Re: Migrating state and jobs in Flink 1.9 hi, For the situation below, can the HDFS namespace stored in the checkpoint metadata be modified? >> The checkpoint metadata stores full paths, so the HDFS namespace usually ends up stored in it, which means it cannot be migrated directly. Yun Tang wrote on Tuesday, July 14, 2020 at 11:54: > The checkpoint metadata stores full paths, so the HDFS namespace usually ends up stored in it,

Re: Job fails and cannot restore from checkpoint after upgrading from Flink 1.9.2 to 1.10.0

2020-07-13 Thread Yun Tang
Hi Peihui Your exception most likely occurs when restoring from an incremental checkpoint: the files have already been downloaded locally, and when the hard link is created [1] the source file turns out to be missing. Very probably an exception happened at that moment and caused the restore procedure to exit, so this problem should not be the root cause. [1]

Re: Migrating state and jobs in Flink 1.9

2020-07-13 Thread Yun Tang
The checkpoint metadata stores full paths, so the HDFS namespace usually ends up stored in it, which means it cannot be migrated directly. Flink-1.11 supports relocating savepoints (but not checkpoints) [1], while Flink-1.9 supports neither. [1] https://issues.apache.org/jira/browse/FLINK-5763 Best Yun Tang From: Dream-底限 Sent: Tuesday, July 14, 2020 11:07 To:

Re: Use state problem

2020-07-08 Thread Yun Tang
aming-java/src/main/java/org/apache/flink/streaming/api/datastream/AllWindowedStream.java#L118 Best Yun Tang From: ゞ野蠻遊戲χ Sent: Thursday, July 9, 2020 9:58 To: user Subject: Use state problem Deal all Keyed state (ValueState, ReducingState, ListState, Aggreg

Re: Flink 1.11 on Kubernetes build failure

2020-07-08 Thread Yun Tang
Hi Did you run sed or some other write operation on the files under the /opt/flink/conf directory? The approach used in the community documentation is to mount a ConfigMap as local files such as flink-conf.yaml, and that mounted directory is in fact not writable. Modify the contents of the ConfigMap directly and the mounted files will be updated automatically. Best Yun Tang From: SmileSmile Sent: Wednesday, July 8, 2020 13:03 To: Flink user-zh mailing list Subject: flink 1.11 on

Re: Re:Re: How to clear state when a window closes

2020-07-08 Thread Yun Tang
Hi TTL requires the state descriptor to explicitly call enableTimeToLive [1], whereas once you use a window, the timers and window state used inside the window are not exposed to the user, so TTL cannot be enabled for them; the two are somewhat mutually exclusive in how they are used. Semantically, TTL cleans up expired data, while the default window implementations already clean up the state of windows that have fired, so the two are also somewhat mutually exclusive semantically. From a performance point of view, a one-day window is rather large and often performs poorly; if similar logic can be moved onto TTL it will be friendlier to performance. [1]
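
A minimal sketch of the explicit TTL declaration mentioned above (the one-day duration and the state name are only examples):

    import org.apache.flink.api.common.state.StateTtlConfig;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.common.time.Time;

    StateTtlConfig ttlConfig = StateTtlConfig
        .newBuilder(Time.days(1))                                        // entries expire one day after the last write
        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
        .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
        .build();

    ValueStateDescriptor<Long> descriptor = new ValueStateDescriptor<>("counter", Long.class);
    descriptor.enableTimeToLive(ttlConfig);   // TTL must be enabled explicitly on the descriptor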

Re: Check pointing for simple pipeline

2020-07-07 Thread Yun Tang
Hi Prasanna Using incremental checkpoints is always better than not, as they are faster and consume less memory. However, incremental checkpoints are only supported by the RocksDB state backend. Best Yun Tang From: Prasanna kumar Sent: Tuesday, July 7, 2020 20:43
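
For context, a hedged sketch of enabling incremental checkpoints with the RocksDB state backend (the checkpoint path and interval are placeholders):

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class IncrementalCheckpointJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(60_000);   // checkpoint every 60 s, interval chosen only for illustration

            // The second constructor argument enables incremental checkpoints,
            // which only the RocksDB state backend supports.
            env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));
        }
    }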

Re: Re:[ANNOUNCE] Apache Flink 1.11.0 released

2020-07-07 Thread Yun Tang
Congratulations to every who involved and thanks for Zhijiang and Piotr's work as release manager. From: chaojianok Sent: Wednesday, July 8, 2020 10:51 To: Zhijiang Cc: dev ; user@flink.apache.org ; announce Subject: Re:[ANNOUNCE] Apache Flink 1.11.0 released

Re: Timeout when using RockDB to handle large state in a stream app

2020-07-07 Thread Yun Tang
-stable/ops/config.html#state-backend-rocksdb-memory-managed [2] https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#taskmanager-memory-managed-fraction [3] https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#taskmanager-memory-managed-size Best, Yun Tang

Re: Reply: How should the RocksDB block cache usage metric be used?

2020-07-07 Thread Yun Tang
From: SmileSmile Sent: Monday, July 6, 2020 14:15 To: Yun Tang Cc: user-zh@flink.apache.org Subject: Reply: How should the RocksDB block cache usage metric be used? hi yun tang! I added libjemalloc.so.2 inside the container and added to the configuration containerized.master.env.LD_PRELOAD: "/opt/jemalloc/lib/libjemalloc

Re: Reply: How should the RocksDB block cache usage metric be used?

2020-07-03 Thread Yun Tang
-bug.html Best Yun Tang From: SmileSmile Sent: Friday, July 3, 2020 15:22 To: Yun Tang Cc: user-zh@flink.apache.org Subject: Reply: How should the RocksDB block cache usage metric be used? Hi The job only has a restart strategy configured; if the job fails it simply restarts without restoring historical data. [Once the job fails over, the state backend's data is cleared and then loaded again at startup.] The situation I am seeing is that after the job fails and re

Re: Reply: How should the RocksDB block cache usage metric be used?

2020-07-03 Thread Yun Tang
3, 2020 15:07 To: Yun Tang Cc: user-zh@flink.apache.org Subject: Reply: How should the RocksDB block cache usage metric be used? hi yun tang! The pod is recreated because of network issues or other factors beyond our control, and the job restarts. Checkpointing is currently not enabled, so recovery amounts to continuing to consume the latest data. After running for a while the job easily exceeds its memory and is killed by the OS, then restarts, runs for a while again, the interval gets shorter, and it dies more and more frequently. From the symptoms it looks like memory is not being released; in this scenario, is the data left over from the previous run, which had not yet reached the watermark and been triggered for computation, cleared during the job restart? <ht

Re: Checkpoint is disable, will history data in rocksdb be leak when job restart?

2020-07-03 Thread Yun Tang
your job restarts again and again? I think that problem should be first considered. Best Yun Tang From: SmileSmile Sent: Friday, July 3, 2020 14:30 To: Yun Tang Cc: 'user@flink.apache.org' Subject: Re: Checkpoint is disable, will history data in rocksdb be leak

Re: Reply: How should the RocksDB block cache usage metric be used?

2020-07-03 Thread Yun Tang
Hi Check whether the reported block cache usage exceeds the managed memory of a single slot. The calculation is managed memory / number of slots, which gives the managed memory of one slot; compare that value with the block cache usage to see whether memory is over-used. As for being easily killed by the OS after a restart: are you restoring data from a savepoint? Best Yun Tang From: SmileSmile Sent: Friday, July 3, 2020 14:20 To: Yun Tang Cc: Flink
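
As a worked example with purely illustrative numbers: a TaskManager with 4 GB of managed memory and 4 slots gives 4 GB / 4 = 1 GB of managed memory per slot, so a reported block cache usage noticeably above 1 GB for one subtask would indicate that the memory budget is being exceeded.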

Re: Question about RocksDB performance tunning

2020-07-03 Thread Yun Tang
] https://github.com/apache/flink/blob/master/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/PredefinedOptions.java Best Yun Tang From: Peter Huang Sent: Friday, July 3, 2020 13:31 To: user Subject: Question

Re: Timeout when using RockDB to handle large state in a stream app

2020-07-03 Thread Yun Tang
Hi Felipe, I noticed my previous mail has a typo: RocksDB is executed in task main thread which does not take the role to respond to heart beat. Sorry for previous typo, and the key point I want to clarify is that RocksDB should not have business for heartbeat problem. Best Yun Tang

Re: Checkpoint is disable, will history data in rocksdb be leak when job restart?

2020-07-03 Thread Yun Tang
for native memory usage [1]. After Flink-1.10, RocksDB will stably use 100% of the managed memory, and once some extra memory is used on top of that, the pod might be treated as OOM and killed. [1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_detail.html#overview Best Yun Tang

Re: How should the RocksDB block cache usage metric be used?

2020-07-03 Thread Yun Tang
Hi By default Flink enables managed memory for RocksDB. This touches on how the feature is implemented: in short, all RocksDB instances inside one slot share a single underlying LRU block cache for their "managed" memory, so you can distinguish them using taskmanager and subtask_index as tags, and you will find that within the same TM the block cache values of the different column families belonging to a given subtask are exactly the same. Therefore there is no need to sum these values up. Best Yun Tang From: SmileSmile Sent:

Re: Configuring state expiration in Flink

2020-07-02 Thread Yun Tang
Hi The TTL timestamp is actually stored inside the state [1], together with each entry, which means that when restoring from a checkpoint, the timestamp in the data is the timestamp from when the entry was originally inserted. [1] https://github.com/apache/flink/blob/ba92b3b8b02e099c8aab4b2b23a37dca4558cabd/flink-runtime/src/main/java/org/apache/flink/runtime/state/ttl/TtlValueState.java#L50 Best Yun Tang

Re: Timeout when using RockDB to handle large state in a stream app

2020-06-30 Thread Yun Tang
is really heartbeat timeout instead of crash to exit. Try to increase the heartbeat timeout [1] and watch the GC detail logs to see anything weird. [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#heartbeat-timeout Best Yun Tang From: Ori

Re: Flink 1.10 on YARN exceeds its memory and gets killed

2020-06-18 Thread Yun Tang
Hi How much managed memory does a single slot have (you can see this in the web UI or in the TM logs), and how far does RocksDB's block cache usage grow? Does it keep growing until it eventually exceeds the managed memory of a single slot? RocksDB memory management works in the vast majority of scenarios, but RocksDB's own implementation keeps the feature from working perfectly. Specifically this involves the relationship between the LRU cache and the write buffer manager; currently RocksDB's strict cache limit and the handling of the write buffer

Re: Trouble with large state

2020-06-17 Thread Yun Tang
later one is mainly due to some internal error. 2. Have you checked what reason the remote task manager is lost? If the remote task manager is not crashed, it might be due to GC impact, I think you might need to check task-manager logs and GC logs. Best Yun Tang ___

Re: Implications on incremental checkpoint duration based on RocksDBStateBackend, Number of Slots per TM

2020-06-17 Thread Yun Tang
RocksDB in sync phase is executed in task main thread and one TM could have many task main threads. Since the synchronous checkpoint phase is only triggered after barrier alignment finished, we cannot ensure all RocksDB instances would execute flushing at the same time. Best Yun Tang

Re: Improved performance when using incremental checkpoints

2020-06-17 Thread Yun Tang
Hi Nick I think this thread use the same program as thread "MapState bad performance" talked. Please provide a simple program which could reproduce this so that we can help you more. Best Yun Tang From: Aljoscha Krettek Sent: Tuesday, June 16,

Re: MapState bad performance

2020-06-16 Thread Yun Tang
/mem_setup.html#managed-memory [2] https://ci.apache.org/projects/flink/flink-docs-stable/ops/memory/mem_tuning.html#rocksdb-state-backend Best Yun Tang From: nick toker Sent: Tuesday, June 16, 2020 18:36 To: Yun Tang Cc: user@flink.apache.org Subject: Re: MapState

Re: MapState bad performance

2020-06-16 Thread Yun Tang
-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBMapState.java#L254 Best Yun Tang From: nick toker Sent: Tuesday, June 16, 2020 15:35 To: user@flink.apache.org Subject: MapState bad performance Hello, We wrote a very simple streaming

Re: Improved performance when using incremental checkpoints

2020-06-16 Thread Yun Tang
this problem. Best Yun Tang From: nick toker Sent: Tuesday, June 16, 2020 15:44 To: user@flink.apache.org Subject: Improved performance when using incremental checkpoints Hello, We are using RocksDB as the backend state. At first we didn't enable the checkpoints

Re: The Flink job recovered with wrong checkpoint state.

2020-06-14 Thread Yun Tang
Hi Thomas The answer is yes. Without high availability, once the job manager is down and even the job manager is relaunched via YARN, the job graph and last checkpoint would not be recovered. Best Yun Tang From: Thomas Huang Sent: Sunday, June 14, 2020 22:58

Re: Reply: Handling Flink exceptions and restart fault tolerance

2020-06-13 Thread Yun Tang
Hi I think your problem is that the data source contains a corner case that the previous code did not handle well, so the job enters the FAILED state when processing a particular "dirty" record. Even if you restore from an earlier checkpoint, the job's code logic is unchanged, so the earlier corner case still cannot be handled and the job can only fail endlessly. For this scenario you can set the checkpoint retention policy to RETAIN_ON_CANCELLATION from the start [1]; then, after cancelling the job, change the business logic so the earlier problem can be handled, and bring the job back online restoring from the earlier checkpoint [2]. Done this way, no data is lost. [1]
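
A minimal hedged sketch of the retention setting described above (the checkpoint interval is a placeholder):

    import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(60_000);   // interval chosen only for illustration

    // Keep the externalized checkpoint when the job is cancelled, so that a fixed
    // version of the job can later be resumed from it (bin/flink run -s <checkpoint-path> ...).
    env.getCheckpointConfig()
       .enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);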

Re: How to make a disaster-recovery backup of checkpoints

2020-06-13 Thread Yun Tang
Hi Xingxing Since the job is still running, files under the checkpoint directory are constantly being added and deleted. When using distcp, simply add "-i" [1] to ignore failed copies (e.g. FileNotFoundException). Because the job's original checkpoint directory can ultimately always be restored from normally, even if some files are deleted during the copy because the original job no longer needs them, as long as the final directory structure is consistent you can achieve a disaster-recovery backup on another HDFS. [1]

Re: Flink not restoring from checkpoint when job manager fails even with HA

2020-06-08 Thread Yun Tang
Hi Bhaskar By default, you will get a new job id. There existed some hack and hidden method to set the job id but is not meant to be used by the user Best Yun Tang From: Vijay Bhaskar Sent: Monday, June 8, 2020 15:03 To: Yun Tang Cc: Kathula, Sandeep ; user

Re: Flink not restoring from checkpoint when job manager fails even with HA

2020-06-08 Thread Yun Tang
Hi Bhaskar We strongly discourage using such a hack configuration to make a job always run with the same special job id. If you stick with it, all runs of this job graph will have the same job id. Best Yun Tang From: Vijay Bhaskar Sent: Monday, June 8

Re: Flink not restoring from checkpoint when job manager fails even with HA

2020-06-07 Thread Yun Tang
. If you want to resume the previous job, you should use the -s option to resume from a retained checkpoint. [1] [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html#resuming-from-a-retained-checkpoint Best Yun Tang From: Kathula, Sandeep

Re: State expiration in Flink

2020-06-01 Thread Yun Tang
Hi Vasily As far as I know, the current state TTL lacks this kind of trigger; perhaps onTimer or processing a specific event as a trigger could help in your scenario. Best Yun Tang. From: Vasily Melnik Sent: Monday, June 1, 2020 14:13 To: Yun Tang Cc: user Subject

Re: Discussion of checkpoint failures

2020-06-01 Thread Yun Tang
Hi The error "could only be replicated to 0 nodes instead of minReplication (=1)" is caused by HDFS instability; being unable to replicate the data has nothing to do with Flink itself. Best Yun Tang From: yanggang_it_job Sent: Monday, June 1, 2020 15:30 To: user-zh@flink.apache.org Subject: Discussion of checkpoint failures

Re: State expiration in Flink

2020-05-31 Thread Yun Tang
Hi Vasily After Flink-1.10, state will be cleaned up periodically as CleanupInBackground is enabled by default. Thus, even you never access some specific entry of state and that entry could still be cleaned up. Best Yun Tang From: Vasily Melnik Sent: Saturday

Re: Running and Maintaining Multiple Jobs

2020-05-29 Thread Yun Tang
Hi Prasanna As far as I know, Flink does not allow submitting a new job graph without restarting, and I actually do not understand what your 3rd question means. From: Prasanna kumar Sent: Friday, May 29, 2020 11:18 To: Yun Tang Cc: user Subject: Re: Running

Re: In consistent Check point API response

2020-05-27 Thread Yun Tang
flink/blob/master/docs/monitoring/checkpoint_monitoring.zh.md Best Yun Tang From: Vijay Bhaskar Sent: Tuesday, May 26, 2020 15:18 To: Yun Tang Cc: user Subject: Re: In consistent Check point API response Thanks Yun. How can i contribute better documentation o

Re: RocksDB savepoint recovery performance improvements

2020-05-27 Thread Yun Tang
is for quick fix at his scenario. Best Yun Tang From: Steven Wu Sent: Wednesday, May 27, 2020 0:36 To: Joey Pereira Cc: user@flink.apache.org ; Yun Tang ; Mike Mintz ; Shahid Chohan ; Aaron Levin Subject: Re: RocksDB savepoint recovery performanc

Re: In consistent Check point API response

2020-05-26 Thread Yun Tang
"Restored" is for last restored checkpoint and "completed" is for last completed checkpoint, they are actually not the same thing. The only scenario that they're the same in numbers is when Flink just restore successfully before a new checkpoint completes. Best Yun Tang ___

Re: In consistent Check point API response

2020-05-25 Thread Yun Tang
, to tell your story. [1] https://github.com/apache/flink/blob/8f992e8e868b846cf7fe8de23923358fc6b50721/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1250 Best Yun Tang From: Vijay Bhaskar Sent: Monday, May 25, 2020 17

Re: In consistent Check point API response

2020-05-25 Thread Yun Tang
nk, it will create the checkpoint statics >without restored checkpoint and assign it once the latest savepoint/checkpoint >is restored. [1] [1] https://github.com/apache/flink/blob/50253c6b89e3c92cac23edda6556770a63643c90/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/Checkpoi

Re: Using Queryable State within 1 job + docs suggestion

2020-05-21 Thread Yun Tang
that they could query each other, which provide better performance. Best Yun Tang From: Annemarie Burger Sent: Thursday, May 21, 2020 19:45 To: user@flink.apache.org Subject: Re: Using Queryable State within 1 job + docs suggestion Hi, Thanks for your response! What if I'm

Re: RocksDB savepoint recovery performance improvements

2020-05-18 Thread Yun Tang
https://github.com/apache/flink/blob/f35679966eac9e3bb53a02bcdbd36dbd1341d405/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/snapshot/RocksFullSnapshotStrategy.java#L308 Best Yun Tang From: Joey Pereira Sent: Tuesday,

Re: Using Queryable State within 1 job + docs suggestion

2020-05-18 Thread Yun Tang
this feature on the server side, as the Flink job needs to access the queryable-state classes. If you're just running your Flink job locally, adding the dependency lets your local job access the queryable-state classes, which is actually what the doc wants to tell users. Best Yun Tang

Re: Rocksdb implementation

2020-05-18 Thread Yun Tang
-queryable-state-test/src/main/java/org/apache/flink/streaming/tests/queryablestate/QsStateClient.java#L108-L122 Best Yun Tang From: Arvid Heise Sent: Monday, May 18, 2020 23:40 To: Jaswin Shah Cc: user@flink.apache.org Subject: Re: Rocksdb implementation Hi Jaswin, I'd

Re: Protection against huge values in RocksDB List State

2020-05-15 Thread Yun Tang
Yun Tang From: Robin Cassan Sent: Friday, May 15, 2020 20:59 To: Yun Tang Cc: user Subject: Re: Protection against huge values in RocksDB List State Hi Yun, thanks for your answer! And sorry I didn't see this limitation from the documentation, makes sense

Re: the savepoint problem of upgrading job from blink-1.5 to flink-1.10

2020-05-15 Thread Yun Tang
blink. As you can see, we use "org.apache.flink.runtime.state.KeyGroupsStateSnapshot" instead of "org.apache.flink.runtime.state.KeyGroupsStateHandle", and thus the savepoint generated by Blink cannot be easily consumed by Flink. Best Yun Tang ___

Re: Incremental state with purging

2020-05-13 Thread Yun Tang
seconds. [1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/state.html#state-time-to-live-ttl Best Yun Tang From: Annemarie Burger Sent: Wednesday, May 13, 2020 2:46 To: user@flink.apache.org Subject: Incremental state with purging Hi, I

Re: Prometheus Pushgateway Reporter Can not DELETE metrics on pushgateway

2020-05-13 Thread Yun Tang
;operator_id;task_id;task_attempt_id", which are rarely used, in >metrics.reporter..scope.variables.excludes. [1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/metrics.html#reporter Best Yun Tang From: Thomas Huang Sent: Wednesda

Re: Question about the watermark alignment logic in the Flink source code

2020-05-11 Thread Yun Tang
Hi Precisely because the minimum across all input channels is taken, if one upstream source never receives real data, the watermark it emits downstream stays at Long.MIN_VALUE, which prevents windows from being triggered. The community works around this problem with idle sources [1]. [1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/event_time.html#idling-sources Best Yun Tang From: Benchao Li Sent:
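
For illustration, a hedged sketch of declaring idleness with the Flink 1.11+ WatermarkStrategy API (MyEvent, the stream variable and the durations are assumptions; record timestamps are taken from the source, e.g. Kafka):

    import java.time.Duration;
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;

    // A source/partition that stays silent for one minute is marked idle,
    // so the downstream minimum-watermark calculation ignores it until data arrives again.
    WatermarkStrategy<MyEvent> strategy = WatermarkStrategy
        .<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
        .withIdleness(Duration.ofMinutes(1));

    stream.assignTimestampsAndWatermarks(strategy);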

Re: Flink on Kubernetes unable to Recover from failure

2020-05-08 Thread Yun Tang
ontainers running do not mean they're all registered to the job manager, I think you could refer to the JM and TM log to see whether the register connection is lost. Best Yun Tang From: Robert Metzger Sent: Friday, May 8, 2020 22:33 To: Morgan Geldenhuys Cc: user

Re: What is the RocksDB local directory in flink checkpointing?

2020-05-08 Thread Yun Tang
strapTools.java#L478-L489 Best Yun Tang From: Till Rohrmann Sent: Wednesday, May 6, 2020 17:35 To: LakeShen Cc: dev ; user ; user-zh Subject: Re: What is the RocksDB local directory in flink checkpointing? Hi LakeShen, `state.backend.rocksdb.localdir` defines th

Re: Flink monitoring: some metrics scraped by Prometheus do not contain the corresponding Flink job_name or job_id

2020-05-05 Thread Yun Tang
Hi Metrics such as flink_jobmanager_Status belong to the jobmanager level, which is conceptually different from job-level metrics. Flink supports running several jobs in one JM at the same time, while there is in fact only one JM JVM, so adding a job_id tag indicating which job a JM metric belongs to would not match the semantics. Admittedly, if there are multiple JMs on one host, Flink cannot easily tell them apart today; at the moment only the TM-level tm_id can distinguish different TMs. If you really must add job_name or job_id

Re: Savepoint memory overhead

2020-04-30 Thread Yun Tang
-backend Best Yun Tang From: Lasse Nedergaard Sent: Thursday, April 30, 2020 12:39 To: Yun Tang Cc: user Subject: Re: Savepoint memory overhead We using Flink 1.10 running on Mesos. Med venlig hilsen / Best regards Lasse Nedergaard Den 30. apr. 2020 kl. 04.53

Re: RocksDB default logging configuration

2020-04-29 Thread Yun Tang
d: rocksdb state.checkpoints.dir: hdfs:///checkpoint-path Best Yun Tang From: Bajaj, Abhinav Sent: Wednesday, April 29, 2020 3:16 To: Yun Tang ; user@flink.apache.org Cc: Chesnay Schepler Subject: Re: RocksDB default logging configuration Thanks Yun for your response.

Re: Savepoint memory overhead

2020-04-29 Thread Yun Tang
Hi Lasse Which version of Flink did you use? Before Flink-1.10, there might exist memory problem when RocksDB executes savepoint with write batch[1]. [1] https://issues.apache.org/jira/browse/FLINK-12785 Best Yun Tang From: Lasse Nedergaard Sent: Wednesday

Re: RocksDB default logging configuration

2020-04-27 Thread Yun Tang
db.localdir" [1] should work for RocksDB in Flink-1.7.1. [1] https://github.com/apache/flink/blob/808cc1a23abb25bd03d24d75537a1e7c6987eef7/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBStateBackend.java#L285

Re: K8s native - checkpointing to S3 with RockDBStateBackend

2020-04-23 Thread Yun Tang
Hi Averell Please build your own flink docker with S3 plugin as official doc said [1] [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/docker.html#using-plugins Best Yun Tang From: Averell Sent: Thursday, April 23, 2020 20:58 To: user

Re: Flink incremental checkpointing - how long does data is kept in the share folder

2020-04-22 Thread Yun Tang
job, the files which were created after the completion time of the last checkpoint should also be filtered out (if you do not retain multiple checkpoints). The rest of the files are safe to remove. A simpler way is to stop the job and remove all files not recorded in the checkpoint metadata. Best Yun Tang

Re: Flink 1.10.0 stop command

2020-04-22 Thread Yun Tang
Hi I think you could still use ./bin/flink cancel to cancel the job. What is the exception thrown? Best Yun Tang From: seeksst Sent: Wednesday, April 22, 2020 18:17 To: user Subject: Flink 1.10.0 stop command Hi, When i test 1.10.0, i found i must

Re: Latency tracking together with broadcast state can cause job failure

2020-04-22 Thread Yun Tang
FLINK-17322 [1] to track this problem, and related owner would take a look at this problem. Really thank you for reporting this bug. [1] https://issues.apache.org/jira/browse/FLINK-17322 Best Yun Tang From: Yun Tang Sent: Wednesday, April 22, 2020 1:43 To: Lasse

Re: Latency tracking together with broadcast state can cause job failure

2020-04-21 Thread Yun Tang
Hi Lasse Really sorry for missing your reply. I'll run your project and find the root cause in my day time. And thanks for @Robert Metzger<mailto:rmetz...@apache.org> 's kind remind. Best Yun Tang From: Robert Metzger Sent: Tuesday, April 21, 2020
