[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693937#comment-17693937 ]

Yun Tang commented on FLINK-31089:
--

[~zhoujira86] If a user forgets to set the state TTL config, I think they will also face disk usage problems and performance regression. I think it would be good to add such documentation.

BTW, does FLINK-31225 have any relationship with this one?

> pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit
> ---
>
> Key: FLINK-31089
> URL: https://issues.apache.org/jira/browse/FLINK-31089
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / State Backends
> Affects Versions: 1.16.1
> Reporter: xiaogang zhou
> Priority: Major
> Attachments: image-2023-02-15-20-26-58-604.png, image-2023-02-15-20-32-17-993.png, image-2023-02-17-16-48-59-535.png, l0pin_open.png
>
> With setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned memory keep growing (in the picture below, from 48G to 50G in about 5 hours). If we switch it to false, the pinned memory stays relatively static. In our environment, a lot of tasks restart because k8s kills them for exceeding the memory limit.
> !image-2023-02-15-20-26-58-604.png|width=899,height=447!
>
> !image-2023-02-15-20-32-17-993.png|width=853,height=464!
> The two graphs were recorded yesterday and today, so the number of records per second in the data stream should not differ a lot.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693581#comment-17693581 ]

xiaogang zhou commented on FLINK-31089:
---

[~Yanfei Lei] Yes, your summary is pretty accurate, with one exception: pinning L0 can improve performance, but disabling it does not hurt much. That is not the main topic, though. My job is a DataStream job. My point is that we should print a warning, because developers may forget to set the StateTtlConfig while turning on PinTopLevelIndexAndFilter, and that will certainly lead to an OOM issue.
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693022#comment-17693022 ]

Yanfei Lei commented on FLINK-31089:

Let me try to summarize this issue:
# Enabling PinL0FilterAndIndexBlocksInCache or PinTopLevelIndexAndFilter with TTL disabled results in OOM.
## PinTopLevelIndexAndFilter can significantly affect performance.
## PinL0FilterAndIndexBlocksInCache does NOT affect performance.
# Enabling PinL0FilterAndIndexBlocksInCache or PinTopLevelIndexAndFilter with TTL enabled: the memory does not keep growing.
## Due to https://issues.apache.org/jira/browse/FLINK-22957 , TTL can't take effect for the Rank operator in Flink 1.13.

Is the TTL set by "table.exec.state.ttl"? If the job is a DataStream job, maybe you can set TTL for the rank operator via StateTtlConfig.
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692566#comment-17692566 ]

xiaogang zhou commented on FLINK-31089:
---

Some more info:
1. The task with TTL on has been running for a long time without pinned block cache growth.
2. We have many tasks running on 1.13, which means they lack the fix from https://issues.apache.org/jira/browse/FLINK-22957 . These tasks also have partitioned-index-filters on, and they also hit OOM occasionally.
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692546#comment-17692546 ]

xiaogang zhou commented on FLINK-31089:
---

[~yunta] If I turn off PinTopLevelIndexAndFilter, the task cannot run properly because it spends a lot of time loading the cache. I also found in the LOG file that some rank operators do not have a compaction filter.
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691622#comment-17691622 ]

Yun Tang commented on FLINK-31089:
--

[~zhoujira86] Thanks for sharing the result. What happens if neither 'partitioned-index-filters' nor TTL is configured? Will the memory still keep growing?

BTW, I have sent you an email to ask for your DingTalk account.
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691338#comment-17691338 ]

xiaogang zhou commented on FLINK-31089:
---

[~yunta] Master, after turning on the compaction filter, the pinned block cache size stopped growing. Should we add a warning for the situation where 'partitioned-index-filters' is on and no TTL is configured?
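The warning proposed in this comment could look roughly like the following. This is only a minimal sketch in Python of the check's logic, not Flink code: missing_ttl_warning and ttl_is_set are hypothetical helpers, the configuration is assumed to be available as a plain dict of strings, and treating a missing or zero table.exec.state.ttl as "no TTL" is an assumption of the sketch. It only looks at the SQL/Table option; a DataStream job sets TTL in code via StateTtlConfig, which a check on configuration keys cannot see.

```python
import re

# Real Flink option names mentioned in this thread.
PARTITIONED_INDEX_FILTERS = "state.backend.rocksdb.memory.partitioned-index-filters"
TABLE_STATE_TTL = "table.exec.state.ttl"


def ttl_is_set(value):
    """Return True if the TTL string starts with a positive number (e.g. '36 h')."""
    if value is None:
        return False
    match = re.match(r"\s*(\d+)", str(value))
    return bool(match) and int(match.group(1)) > 0


def missing_ttl_warning(conf):
    """Return a warning string for the risky combination, or None (hypothetical helper)."""
    pinned = str(conf.get(PARTITIONED_INDEX_FILTERS, "false")).strip().lower() == "true"
    if pinned and not ttl_is_set(conf.get(TABLE_STATE_TTL)):
        return (PARTITIONED_INDEX_FILTERS + " is enabled but no state TTL is configured; "
                "pinned block cache memory may keep growing")
    return None


# Example: partitioned index filters on, TTL left unset.
print(missing_ttl_warning({PARTITIONED_INDEX_FILTERS: "true"}))
```

With a TTL such as "36 h" in the dict, the helper returns None and nothing needs to be logged.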
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691153#comment-17691153 ]

xiaogang zhou commented on FLINK-31089:
---

I created another task with TTL enabled and will keep monitoring the memory growth.
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691129#comment-17691129 ]

Yun Tang commented on FLINK-31089:
--

Thanks for sharing the profiling results. I will take a look soon.

BTW, the compaction filter is used for TTL state cleanup, and I think you did not enable TTL for this job.
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691114#comment-17691114 ]

xiaogang zhou commented on FLINK-31089:
---

I found this in the RocksDB log:

2023/02/20-17:55:33.357582 7f4092f42700 Options.compaction_filter: None
2023/02/20-17:55:33.357583 7f4092f42700 Options.compaction_filter_factory: None

Could this lead to the index 'OOM' issue?
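The log check described in this comment can be scripted. The following is a minimal sketch in Python, assuming the RocksDB LOG lines have the same shape as the two lines quoted above; compaction_filter_settings and has_no_compaction_filter are hypothetical helper names, and the sample input is exactly those two lines.

```python
import re

# Matches "Options.compaction_filter: X" and "Options.compaction_filter_factory: X".
OPTION_LINE = re.compile(r"Options\.(compaction_filter(?:_factory)?):\s*(\S+)")


def compaction_filter_settings(log_lines):
    """Collect (option name, value) pairs for the two compaction filter options."""
    settings = []
    for line in log_lines:
        match = OPTION_LINE.search(line)
        if match:
            settings.append((match.group(1), match.group(2)))
    return settings


def has_no_compaction_filter(log_lines):
    """True only if the options were found and every one of them is 'None'."""
    settings = compaction_filter_settings(log_lines)
    return bool(settings) and all(value == "None" for _, value in settings)


# The two lines quoted in the comment.
sample = [
    "2023/02/20-17:55:33.357582 7f4092f42700 Options.compaction_filter: None",
    "2023/02/20-17:55:33.357583 7f4092f42700 Options.compaction_filter_factory: None",
]
settings = compaction_filter_settings(sample)
print(settings, has_no_compaction_filter(sample))
```

To run it against a real file, the lines of the LOG can be passed in directly, for example by iterating over the opened file. As a later comment in this thread points out, the compaction filter is what cleans up TTL state, so a result of "no compaction filter" is consistent with TTL not being in effect for that column family.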
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690999#comment-17690999 ]

xiaogang zhou commented on FLINK-31089:
---

[~yunta] I have an update with L0 pinning enabled; see the attachment l0pin_open. I configured table.exec.state.ttl to 36 hours. I wonder whether it fails to change the RocksDB default TTL configuration?
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690346#comment-17690346 ]

xiaogang zhou commented on FLINK-31089:
---

Yes, it is not large at startup, but it keeps growing, so no matter how large the TM memory is, it will eventually OOM. I have now started another task with setPinL0FilterAndIndexBlocksInCache set to true, which should grow faster. I will collect another visual profile over the weekend and post it here.

I also think it would be convenient to communicate via DingTalk. I work at an e-commerce company in Shanghai and am in charge of Flink. You can send your DingTalk account to my mail zhou16...@163.com
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690343#comment-17690343 ]

Yun Tang commented on FLINK-31089:
--

It seems that less than 1.6GB of memory is occupied by RocksDB. Is this still larger than your configured managed memory?

BTW, could you share the visual profiling result from jeprof?
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690329#comment-17690329 ]

xiaogang zhou commented on FLINK-31089:
---

[~yunta] Thanks. Some background info:

jemalloc version: I updated jemalloc from 3.6.0-11 to 5.0.1.

For the first set of data I collected, setPinL0FilterAndIndexBlocksInCache was false and the Flink Kafka offset was set to 2 days ago. I saw that

N4 [label="rocksdb\nUncompressBlockContentsForCompressionType\n1724255408 (63.3%)\r",shape=box,fontsize=47.8];

is the major memory consumer.
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690271#comment-17690271 ]

Yun Tang commented on FLINK-31089:
--

[~zhoujira86] You can set the prof_prefix as "/tmp/jeprof" to ensure 100% write permission.
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690261#comment-17690261 ]

xiaogang zhou commented on FLINK-31089:
---

[~yunta] Master, I rebuilt jemalloc from source with the config below, and the "Invalid conf pair" warning disappeared. But I can't find the prof files in the location I set. Could you please suggest how to collect the evidence?

!image-2023-02-17-16-48-59-535.png|width=625,height=101!
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689482#comment-17689482 ]

xiaogang zhou commented on FLINK-31089:
---

[~yunta] Master, do you know what these messages mean?

: Invalid conf pair: prof:true
: Invalid conf pair: lg_prof_interval:29
: Invalid conf pair: lg_prof_sample:17
: Invalid conf pair: prof_prefix:/opt/flink/jeprof.out

BTW, could you please share your WeChat?
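For reference, the four options named in these warnings are jemalloc heap-profiling settings, and the two lg_ values are base-2 logarithms of byte counts. The sketch below is a small Python illustration of that arithmetic, not part of Flink or jemalloc. It assumes the options were passed as one comma-separated MALLOC_CONF-style string, which the comment does not show, and parse_malloc_conf is a hypothetical helper.

```python
def parse_malloc_conf(conf):
    """Split a comma-separated 'key:value' option string into a dict."""
    options = {}
    for pair in conf.split(","):
        key, _, value = pair.partition(":")
        options[key.strip()] = value.strip()
    return options


# Option values copied from the warnings; the comma-joined form is an assumption.
conf = "prof:true,lg_prof_interval:29,lg_prof_sample:17,prof_prefix:/opt/flink/jeprof.out"
options = parse_malloc_conf(conf)

# Both lg_ options are base-2 logarithms of a number of bytes of allocation activity.
dump_interval_bytes = 1 << int(options["lg_prof_interval"])
sample_interval_bytes = 1 << int(options["lg_prof_sample"])

print(dump_interval_bytes // 2**20, "MiB of allocation activity between profile dumps (on average)")
print(sample_interval_bytes // 2**10, "KiB of allocation activity between samples (on average)")
```

With these values a profile dump is written roughly every 512 MiB of allocation activity, using prof_prefix as the file name prefix, and allocations are sampled about every 128 KiB.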
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689471#comment-17689471 ]

Yun Tang commented on FLINK-31089:
--

[~zhoujira86] Maybe you did not get what I meant. I am sure that you use jemalloc as the default memory allocator. What I hope is that you can use jemalloc to profile the memory usage and share the dump results, so that we can see what occupies the memory.
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689454#comment-17689454 ]

xiaogang zhou commented on FLINK-31089:
---

[~yunta] It looks like we are already using jemalloc:

$ /usr/bin/pmap -x 1 | grep malloc
7f434e9aa000 204 204 0 r-x-- libjemalloc.so.1
7f434e9aa000 0 0 0 r-x-- libjemalloc.so.1
7f434e9dd000 2044 0 0 - libjemalloc.so.1
7f434e9dd000 0 0 0 - libjemalloc.so.1
7f434ebdc000 8 8 8 r libjemalloc.so.1
7f434ebdc000 0 0 0 r libjemalloc.so.1
7f434ebde000 4 4 4 rw--- libjemalloc.so.1
7f434ebde000 0 0 0 rw--- libjemalloc.so.1

And yes, we configured 'state.backend.rocksdb.memory.partitioned-index-filters' as true. Without the two_level_index_cache, RocksDB performance is really low.

Also, flink_taskmanager_job_task_operator_.*rocksdb_block_cache_pinned_usage can grow quickly if PinL0FilterAndIndexBlocksInCache is left true.
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689134#comment-17689134 ]

Yun Tang commented on FLINK-31089:
--

[~zhoujira86] Thanks for reporting this.

First of all, I think you have set {{state.backend.rocksdb.memory.partitioned-index-filters}} to true, right? Did you get similar results with this value set to false?

Could you also use jemalloc heap profiling to analyze the native memory usage (FLINK-19125 , https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Heap-Profiling )? I think that could be the tool to figure out this problem.
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689097#comment-17689097 ]

xiaogang zhou commented on FLINK-31089:
---

[~yunta] Master, please kindly review.

I have also tested the performance:
Disabling PinTopLevelIndexAndFilter can significantly decrease performance.
Disabling PinL0FilterAndIndexBlocksInCache does not harm performance much.

[https://github.com/facebook/rocksdb/issues/4112#issuecomment-405859006] also mentions that caching the top level is enough, and there are a lot of complaints about RocksDB memory growth in the RocksDB issue tracker.