Re: State leak in tumbling windows

2024-06-06 Thread Yanfei Lei
Hi Adam, Is your job a datastream job or a sql job? After I looked through the window-related code(I'm not particularly familiar with this part of the code), this problem should only exist in datastream. Adam Domanski 于2024年6月3日周一 16:54写道: > > Dear Flink users, > > I spotted the ever growing

Re: TTL issue with large RocksDB keyed state

2024-06-03 Thread Yanfei Lei
Hi, > 1. After multiple full checkpoints and a NATIVE savepoint the size was > unchanged. I'm wondering if RocksDb compaction is because we never update > key values? The state is nearly fully composed of keys' space. Do keys not > get freed using RocksDb compaction filter for TTL? Regarding

Re: Flink 1.18.1 ,重启状态恢复

2024-05-16 Thread Yanfei Lei
看起来和 FLINK-34063 / FLINK-33863 是同样的问题,您可以升级到1.18.2 试试看。 [1] https://issues.apache.org/jira/browse/FLINK-33863 [2] https://issues.apache.org/jira/browse/FLINK-34063 陈叶超 于2024年5月16日周四 16:38写道: > > 升级到 flink 1.18.1 ,任务重启状态恢复的话,遇到如下报错: > 2024-04-09 13:03:48 > java.lang.Exception: Exception while

Re: Re: java.io.IOException: Could not load the native RocksDB library

2024-05-06 Thread Yanfei Lei
ming-java-1.19.0.jar:1.19.0] > at > org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:399) > ~[flink-streaming-java-1.19.0.jar:1.19.0] > at > org.apache.flink.streaming.api.operators.StreamTaskStat

Re: java.io.IOException: Could not load the native RocksDB library

2024-05-06 Thread Yanfei Lei
请问是什么开发环境呢? windows吗? 可以分享一下更详细的报错吗?比如.dll 找不到 ha.fen...@aisino.com 于2024年5月7日周二 09:34写道: > > Configuration config = new Configuration(); > config.set(StateBackendOptions.STATE_BACKEND, "rocksdb"); > config.set(CheckpointingOptions.CHECKPOINT_STORAGE, "filesystem"); >

Re: Flink 1.18: Unable to resume from a savepoint with error InvalidPidMappingException

2024-04-23 Thread Yanfei Lei
ike 7 days, as a 7 days old savepoints is effectively worthless, and > probably adjust "transaction.timeout.ms" to be close to this. > > But can you explain how "transactional.id.expiration.ms" influences the > InvalidPidMappingException, or why having "transact

Re: Flink 1.18: Unable to resume from a savepoint with error InvalidPidMappingException

2024-04-21 Thread Yanfei Lei
Hi JM, Yes, `InvalidPidMappingException` occurs because the transaction is lost in most cases. For short-term, " transaction.timeout.ms" > "transactional.id.expiration.ms" can ignore the `InvalidPidMappingException`[1]. For long-term, FLIP-319[2] provides a solution. [1]

Re: [ANNOUNCE] Apache Paimon is graduated to Top Level Project

2024-03-28 Thread Yanfei Lei
Congratulations! Best, Yanfei Zhanghao Chen 于2024年3月28日周四 19:59写道: > > Congratulations! > > Best, > Zhanghao Chen > > From: Yu Li > Sent: Thursday, March 28, 2024 15:55 > To: d...@paimon.apache.org > Cc: dev ; user > Subject: Re: [ANNOUNCE] Apache Paimon is

Re: Flink job unable to restore from savepoint

2024-03-27 Thread Yanfei Lei
Hi Prashant, Compared to the job that generated savepoint, are there any changes in the new job? For example, data fields were added or deleted, or the type serializer was changed? More detailed job manager logs may help. prashant parbhane 于2024年3月27日周三 14:20写道: > > Hello, > > We have been

Re: There is no savepoint operation with triggerId

2024-03-25 Thread Yanfei Lei
Hi Lars, It looks like the relevant logs when retrieving savepoint. Have you frequently retrieved savepoints through the REST interface? Lars Skjærven 于2024年3月26日周二 07:17写道: > > Hello, > My job manager is constantly complaining with the following error: > > "Exception occurred in REST handler:

Re: Is there any options to control the file names in file sink

2024-03-20 Thread Yanfei Lei
Hi Lasse, If the datastream job is used, you can try setting `OutputFileConfig` for file sink, something like[1]: ``` OutputFileConfig config = OutputFileConfig .builder() .withPartPrefix("prefix") .withPartSuffix(".ext") .build(); FileSink> sink = FileSink .forRowFormat((new

Re: [ANNOUNCE] Apache Flink 1.19.0 released

2024-03-18 Thread Yanfei Lei
Congrats, thanks for the great work! Sergey Nuyanzin 于2024年3月18日周一 19:30写道: > > Congratulations, thanks release managers and everyone involved for the great > work! > > On Mon, Mar 18, 2024 at 12:15 PM Benchao Li wrote: >> >> Congratulations! And thanks to all release managers and everyone >>

Re: [ANNOUNCE] Apache Flink 1.19.0 released

2024-03-18 Thread Yanfei Lei
Congrats, thanks for the great work! Sergey Nuyanzin 于2024年3月18日周一 19:30写道: > > Congratulations, thanks release managers and everyone involved for the great > work! > > On Mon, Mar 18, 2024 at 12:15 PM Benchao Li wrote: >> >> Congratulations! And thanks to all release managers and everyone >>

Re: Flink Checkpoint & Offset Commit

2024-03-07 Thread Yanfei Lei
Hi Jacob, > I have multiple upstream sources to connect to depending on the business > model which are not Kafka. Based on criticality of the system and publisher > dependencies, we cannot switch to Kafka for these. Sounds like you want to implement some custom connectors, [1][2] may be

Re: SecurityManager in Flink

2024-03-06 Thread Yanfei Lei
Hi Kirti Dhar, What is your java version? I guess this problem may be related to FLINK-33309[1]. Maybe you can try adding "-Djava.security.manager" to the java options. [1] https://issues.apache.org/jira/browse/FLINK-33309 Kirti Dhar Upadhyay K via user 于2024年3月6日周三 18:10写道: > > Hi Team, > > >

Re: What to do about local disks with RocksDB with Kubernetes Operator

2023-10-18 Thread Yanfei Lei
Hi Alex, AFAIK, the emptyDir[1] can be used directly as local disks, and emptyDir can be defined by referring to this pod template[2]. If you want to use local disks through PV, you can first create a statefulSet and mount the PV through volume claim templates[3], the example “Local Recovery

Re: Failure to restore from last completed checkpoint

2023-09-07 Thread Yanfei Lei
Hey Jacqlyn, According to the stack trace, it seems that there is a problem when the checkpoint is triggered. Is this the problem after the restore? would you like to share some logs related to restoring? Best, Yanfei Jacqlyn Bender via user 于2023年9月8日周五 05:11写道: > > Hey folks, > > > We

Re: Flink 窗口触发条件

2023-08-09 Thread Yanfei Lei
hi, 感觉和[1]的问题比较像,事件时间的window在onElement和onEventTime时会触发,这两个方法又会根据watermark判断,可以看看o.a.f.table.runtime.operators.window.triggers包和o.a.f.table.runtime.operators.wmassigners包。 [1] https://juejin.cn/post/6850418110010179597 小昌同学 于2023年8月10日周四 10:52写道: > >

Re: Query around Rocksdb

2023-07-05 Thread Yanfei Lei
avepoint > size. Do you suggest adding cleanupInRocksdbCompactFilter(1000) as well? What > will be the impact of this configuration? > > On Tue, Jul 4, 2023 at 8:13 AM Yanfei Lei wrote: >> >> Hi neha, >> >> Due to the limitation of RocksDB, we cannot create a >> strict-c

Re: Query around Rocksdb

2023-07-03 Thread Yanfei Lei
Hi neha, Due to the limitation of RocksDB, we cannot create a strict-capacity-limit LRUCache which shared among rocksDB instance(s), FLINK-15532[1] is created to track this. BTW, have you set TTL for this job[2], TTL can help control the state size. [1]

Re: Checkpointed data size is zero

2023-07-03 Thread Yanfei Lei
Hi Kamal, Is the Full Checkpoint Data Size[1] also zero? If not, it may be that no data is processed during this checkpoint. [1] https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/ops/monitoring/checkpoint_monitoring/ Shammon FY 于2023年7月4日周二 09:10写道: > > Hi Kamal, > > You can

Re: RocksdbStateBackend.enableTtlCompactionFilter

2023-06-20 Thread Yanfei Lei
Hi patricia, The TTL compaction filter in RocksDB has been enabled in 1.10 by default and it is always enabled in 1.11+[1], I think there is no need to explicitly enable the ttl compaction filter. [1]

Re: Changelog fail leads to job fail regardless of tolerable-failed-checkpoints config

2023-06-20 Thread Yanfei Lei
Hi Dongwoo, State changelogs are continuously uploaded to the durable storage when Changelog state backend is enabled. In other words, it will also persist data **outside the checkpoint phase**, and the exception at this time will directly cause the job to fail. And only exceptions in the

Re: Flink1.14 需求超大内存

2023-06-19 Thread Yanfei Lei
Hi, 从 ResourceProfile{taskHeapMemory=1024.000gb (1099511627776 bytes), taskOffHeapMemory=1024.000gb (1099511627776 bytes), managedMemory=128.000mb (134217728 bytes), networkMemory=64.000mb (67108864 bytes)}, numberOfRequiredSlots=1}] 来看,sink节点想申请 1T的 heap memory 和 1T的 off heap

Re: The JobManager is taking minutes to complete and finalize checkpoints despite the Task Managers seem to complete them in a few seconds

2023-05-03 Thread Yanfei Lei
Hi Francesco, The overall checkpoint duration in Flink UI is EndToEndDuration[1], which is the time from Jobmanager triggering checkpoint to collecting the last ack message sent from task manager, depending on the slowest task manager. > "-_message__:__"Completed checkpoint 2515895 for job >

Re: Flink 误报checkpoint失败

2023-05-03 Thread Yanfei Lei
hi, 扩缩容会重启作业,在作业重启期间,job manager 先启动了,还有部分task manager没启动就有可能报“Not all required tasks are currently running..”的错误,作业的所有task完全启动后这个错误就会消失。 Best, Yanfei Chen Yang 于2023年5月4日周四 09:44写道: > > 您好, > > 我的 Flink job是以 reactive 模式运行,然后用了 Kubernetes HPA 来自动扩容/缩容 > TaskManager。每当TaskManager >

Re: Flink SQL State

2023-04-26 Thread Yanfei Lei
Hi Giannis, Except “default” Colume Family(CF), all other CFs represent the state in rocksdb state backend, the name of a CF is the name of a StateDescriptor. - deduplicate-state is a value state, you can find it in DeduplicateFunctionBase.java and MiniBatchDeduplicateFunctionBase.java, they are

Re: Flink rocksDB疑似内存泄露,导致被Linux kernel killed

2023-04-23 Thread Yanfei Lei
Hi, 请问作业有配置ttl吗? 另外可以参考下是否与下面两个问题类似: 1. pin L0 index in memory : https://issues.apache.org/jira/browse/FLINK-31089 2. max open files:https://issues.apache.org/jira/browse/FLINK-31225 Biao Geng 于2023年4月23日周日 15:35写道: > > Hi, > 可以配置下jemalloc来进行堆外内存泄漏的定位。 > 具体操作可以参考下这两篇文章。 >

Re: [ANNOUNCE] Flink Table Store Joins Apache Incubator as Apache Paimon(incubating)

2023-03-27 Thread Yanfei Lei
Congratulations! Best Regards, Yanfei ramkrishna vasudevan 于2023年3月27日周一 21:46写道: > > Congratulations !!! > > On Mon, Mar 27, 2023 at 2:54 PM Yu Li wrote: >> >> Dear Flinkers, >> >> >> As you may have noticed, we are pleased to announce that Flink Table Store >> has joined the Apache

Re: [ANNOUNCE] Flink Table Store Joins Apache Incubator as Apache Paimon(incubating)

2023-03-27 Thread Yanfei Lei
Congratulations! Best Regards, Yanfei ramkrishna vasudevan 于2023年3月27日周一 21:46写道: > > Congratulations !!! > > On Mon, Mar 27, 2023 at 2:54 PM Yu Li wrote: >> >> Dear Flinkers, >> >> >> As you may have noticed, we are pleased to announce that Flink Table Store >> has joined the Apache

Re: Re: flink on yarn 异常停电问题咨询

2023-03-09 Thread Yanfei Lei
Hi 可以尝试去flink配置的checkpoint dir下面去找一找历史chk-x文件夹,如果能找到历史的chk-x,可以尝试手工指定 chk重启[1]。 > flink任务是10个checkpoint,每个checkpoint间隔5秒,如果突然停电,为什么所有的checkpoint都损坏了。 请问作业开启增量checkpoint了吗?在开启了增量的情况下,如果是比较早的一个checkpoint的文件损坏了,会影响后续基于它进行增量的checkpoint。 >

[ANNOUNCE] FRocksDB 6.20.3-ververica-2.0 released

2023-01-30 Thread Yanfei Lei
It is very happy to announce the release of FRocksDB 6.20.3-ververica-2.0. Compiled files for Linux x86, Linux arm, Linux ppc64le, MacOS x86, MacOS arm, and Windows are included in FRocksDB 6.20.3-ververica-2.0 jar, and the FRocksDB in Flink 1.17 would be updated to 6.20.3-ververica-2.0. Release

[ANNOUNCE] FRocksDB 6.20.3-ververica-2.0 released

2023-01-30 Thread Yanfei Lei
It is very happy to announce the release of FRocksDB 6.20.3-ververica-2.0. Compiled files for Linux x86, Linux arm, Linux ppc64le, MacOS x86, MacOS arm, and Windows are included in FRocksDB 6.20.3-ververica-2.0 jar, and the FRocksDB in Flink 1.17 would be updated to 6.20.3-ververica-2.0. Release

Re: How does Flink plugin system work?

2023-01-02 Thread Yanfei Lei
Hi Ruibin, "metrics.reporter.prom.class" is deprecated in 1.16, maybe " metrics.reporter.prom.factory.class"[1] can solve your problem. After reading the related code[2], I think the root cause is that " metrics.reporter.prom.class" would load the code via flink's classpath instead of

Re: ZLIB Vulnerability Exposure in Flink statebackend RocksDB

2022-12-12 Thread Yanfei Lei
>> would like to understand the impact if we make changes in our local Flink >> code with regards to testing efforts and any other affected modules? >> >> Can you please clarify this? >> >> Thanks, >> Vidya Sagar. >> >> >> On Wed,

Re: SNI issue

2022-12-08 Thread Yanfei Lei
Hi, I didn't face this issue, and I'm guessing it might have something to do with the configuration of SSL[1], have you configured the "security.ssl.rest.enabled" option? [1] https://cnightlies.apache.org/flink/flink-docs-master/docs/deployment/security/security-ssl/#configuring-ssl Jean-Damien

Re: Exceeded Checkpoint tolerable failure

2022-12-08 Thread Yanfei Lei
Hi Madan, Maybe you can check the value of " *execution.checkpointing.tolerable-failed-checkpoints"*[1] in your application configuration, and try to increase this value? [1]

Re: ZLIB Vulnerability Exposure in Flink statebackend RocksDB

2022-12-07 Thread Yanfei Lei
Hi Vidya Sagar, Thanks for bringing this up. The RocksDB state backend defaults to Snappy[1]. If the compression option is not specifically configured, this vulnerability of ZLIB has no effect on the Flink application for the time being. *> is there any plan in the coming days to address this?

Re: Task Manager restart and RocksDB incremental checkpoints issue.

2022-11-14 Thread Yanfei Lei
Questions: >>>> - What do you think about the older files that are pulled from the >>>> hostpath to mount path should be deleted first and then create the new >>>> instanceBasepath? >>>> Otherwise, we are going to be ended with the GBs of unwanted d

Re: Task Manager restart and RocksDB incremental checkpoints issue.

2022-11-10 Thread Yanfei Lei
Hi Vidya Sagar, Could you please share the reason for TaskManager restart? If the machine or JVM process of TaskManager crashes, the `RocksDBKeyedStateBackend` can't be disposed/closed normally, so the existing rocksdb instance directory would remain. BTW, if you use Application Mode on k8s, if

Re: 设置slot是vcore的几倍会有什么影响

2022-11-07 Thread Yanfei Lei
Hi junjie, 一个slot可以看作JVM中的一个线程[1],因此可以设置taskmanager.numberOfTaskSlots超过cpu core的数目。 > 这样设置slot是vcore的几倍会有什么影响吗? 设置slot是vcore的几倍可能导致资源bound(如cpu、内存、磁盘、网络带宽等),我曾经遇到过slot数目过多(每个slot上的subtask的状态较大)引起的磁盘不足问题。 [1]

Re: Does reduce function on keyed window gives any guarantee on the order of elements?

2022-11-04 Thread Yanfei Lei
ion instance will only receive elements from the > same key in order. > > > > *From:* Yanfei Lei > *Sent:* 03 November 2022 03:06 > *To:* Qing Lim > *Cc:* User > *Subject:* Re: Does reduce function on keyed window gives any guarantee > on the order of elements? >

Re: Does reduce function on keyed window gives any guarantee on the order of elements?

2022-11-02 Thread Yanfei Lei
Hi Qing, > Does it guarantee that it will be called in the same order of elements in the stream, where value2 is always 1 element after value1? Order is maintained within each parallel stream partition. If the reduce operator only has one sending- sub-task, the answer is YES, but if reduce

Re: Broadcast state and OutOfMemoryError: Direct buffer memory

2022-10-21 Thread yanfei lei
Hi Dan, Usually broadcast state needs more network buffers, the network buffer used to exchange data records between tasks would request a portion of direct memory[1], I think it is possible to get the “Direct buffer memory” OOM errors in this scenarios. Maybe you can try to increase

Re: Presto S3 filesystem access issue - checkpointing - EKS

2022-10-17 Thread yanfei lei
Hi Vignesh, 403 status code makes this look like an authorization issue. > * Some digging into the presto configs and I had this one turned off topresto.s3.use-instance-credentials: "false". (Is this right?)* >From the document[1], it is recommended that set hive. *s3.use-instance-credentials*

Re: Re: flink1.15.1 stop 任务失败

2022-10-14 Thread yanfei lei
Hi yidan && hjw, 我用FlinkKafkaConsumer在本地也复现了这一问题,但用KafakaSource是可以正常做stop-with-savepoint的。FlinkKafkaConsumer在Flink 1.15后被deprecated了[1],推荐用新的KafkaSource再试试。 [1] https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/kafka/#kafka-sourcefunction Best, Yanfei hjw

Re: Re: OutOfMemoryError: Direct buffer memory

2022-10-11 Thread yanfei lei
er.memory.task.off-heap.size 应该能解决部分问题, > > 我这里还有些疑问,部署的session集群,每次集群只跑这一个任务,执行结束才开始下一个任务,如果是前面的任务read/write申请的堆外内存,执行结束的时候,会立即释放吗? > 执行几次任务之后,才出现这种异常,前面任务都是成功的,后面任务就异常了,感觉有内存泄漏的现象。Flink > taskmanager的off-heap内存管理有更多的介绍吗?(官网的看过了) > Thanks > > 在 2022-10-10 12:34:55,"yan

Re: Re: table store 和connector-kafka包冲突吗?

2022-10-09 Thread yanfei lei
Hi, 从table store的pom中看,table store的dist包shade了一份connector-kafka。 https://repo1.maven.org/maven2/org/apache/flink/flink-table-store-dist/0.2.0/flink-table-store-dist-0.2.0.pom 把flink-connector-kafka-1.15.1.jar 去掉再试试? RS 于2022年10月8日周六 17:19写道: > Hi, > 报错如下: > > > [ERROR] Could not execute SQL

Re: OutOfMemoryError: Direct buffer memory

2022-10-09 Thread yanfei lei
从报错看是Direct memory不够导致的,可以将taskmanager.memory.task.off-heap.size调大试试看。 Best, Yanfei allanqinjy 于2022年10月8日周六 21:19写道: > > 看堆栈信息是内存不够,调大一些看看。我之前在读取hdfs上的一个获取地理位置的离线库,也是内存溢出,通过调整内存大小解决的。用的streamingapi开发的作业,1.12.5版本。 > > > | | > allanqinjy > | > | > allanqi...@163.com > | > 签名由网易邮箱大师定制 > > > On

Re: flink 的数据传输,是上游算子推给下游, 还是下游算子拉取上游, 设计的考虑是啥?

2022-09-21 Thread yanfei lei
Hi, Flink社区有一篇关于Credit-based Flow Control的blog post ,里面介绍了反压机制的原理和优劣势,希望有帮助。 Shammon FY 于2022年9月21日周三 11:43写道: > Hi > 我个人觉得简单的说flink数据传输是pull模型可能会有歧义,一般来讲大家理解的两个模型的执行流程如下 > 1. push模型 >

Re: native k8s部署模式下使用HA架构的HDFS集群无法正常连接

2022-09-21 Thread yanfei lei
Hi Tino, 从org.apache.flink.core.fs.FileSystem.java 来看,Flink直接将fs.default-scheme当作URI来解析,并没有解析相关xml配置的操作,看起来Flink目前是不支持HA架构的HDFS集群的。 Best, Yanfei Xuyang 于2022年9月21日周三

Re: 关于flink的state

2022-08-30 Thread yanfei lei
Hi, 1) state无法在不同的算子共享,如yue ma的建议,或许可以把需要共享的部分存储在外部系统,然后在两个map里访问同一个外部系统以实现共享 2) 除开operatorState,或许自定义一个总是返回相同值的keySelector,也可以把所有的key都聚合到一起。 yue ma 于2022年8月30日周二 14:20写道: > hi > 1) flink 内部的 state 算子之间是不可以共享的,所以你可能需要借助外部的存储(比如 redis)来做类似的事情 > 2) 你可以看看 operatorState 的使用方式 > >

Re: In-flight data within aligned checkpoints/savepoints

2022-08-22 Thread yanfei lei
Hi Darin, > I often see my checkpoints contain "Processed (persisted) in-flight data". The values outside the parentheses represent Processed in-flight data[1], and the values inside the parentheses represent persisted in-flight data[1] , what kind of case did you see in your WEB UI?If the >0

Re: get state from window

2022-08-17 Thread yanfei lei
Hi, there are two methods on the Context object that a process() invocation receives that allows access to the two types of state: - globalState(), which allows access to keyed state that is not scoped to a window - windowState(), which allows access to keyed state that is also scoped

Re: get state from window

2022-08-17 Thread yanfei lei
Hi, there are two methods on the Context object that a process() invocation receives that allows access to the two types of state: - globalState(), which allows access to keyed state that is not scoped to a window - windowState(), which allows access to keyed state that is also scoped

Re: flink中文邮件列表不显示其他用户提问的flink问题。

2022-07-13 Thread yanfei lei
hi, 列表是指您的收件箱列表吗?您可以通过 https://lists.apache.org/list.html?user-zh@flink.apache.org 查看其他用户的问题和答案。 Best, Yanfei 张锴 于2022年7月14日周四 09:17写道: > 我重新订阅了flink中文邮件,但是列表里没有显示其他用户提问或者解答有关flink相关的问题和答案,这是什么原因? >

Re: Flink状态过期时是否可以将其输出到日志中

2022-07-08 Thread yanfei lei
Hi, Flink暂时不支持过期清理时的回调函数。如果用得是cleanupIncrementally策略(https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/state/#cleanup-of-expired-state),可以自行在`TtlIncrementalCleanup`类中添加相应的log。 > 2022年6月27日 下午2:09,haishui 写道: > > Hi, >

Re: How to clean up RocksDB local directory in K8s statefulset

2022-06-27 Thread yanfei lei
Hi Allen, what volumes do you use for your TM pod? If you want your data to be deleted when the pod restarts, you can use an ephemeral volume like EmptyDir. And Flink should remove temporary files automatically when they are not needed anymore(see this discussion