[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit

2023-02-27 Thread Yun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693937#comment-17693937
 ] 

Yun Tang commented on FLINK-31089:
--

[~zhoujira86] If a user forgets to set the state TTL config, I think he will 
also face disk usage problems and performance regression.
I think it would be good to add such documentation.

BTW, is FLINK-31225 related to this one?

> pin L0 index in memory can lead to slow memory grow finally lead to memory 
> beyond limit
> ---
>
> Key: FLINK-31089
> URL: https://issues.apache.org/jira/browse/FLINK-31089
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / State Backends
>Affects Versions: 1.16.1
>Reporter: xiaogang zhou
>Priority: Major
> Attachments: image-2023-02-15-20-26-58-604.png, 
> image-2023-02-15-20-32-17-993.png, image-2023-02-17-16-48-59-535.png, 
> l0pin_open.png
>
>
> With setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned
> memory keep growing (in the picture below, from 48G to 50G in about 5 hours). But if
> we switch it to false, the pinned memory stays relatively static. In
> our environment, a lot of tasks restart because k8s kills them for exceeding the memory limit.
> !image-2023-02-15-20-26-58-604.png|width=899,height=447!
>  
> !image-2023-02-15-20-32-17-993.png|width=853,height=464!
> The two graphs were recorded yesterday and today, so the number of records per
> second should not differ much.
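The reported trend can be turned into a rough time-to-limit estimate. The sketch below assumes the growth stays linear and uses a hypothetical 10 GB of headroom, since the report does not state the container limit; the class name is likewise made up for illustration.

```java
/**
 * Rough linear extrapolation of the pinned block-cache growth reported above.
 * Assumptions: growth is linear; the headroom value is hypothetical.
 */
final class PinnedGrowthEstimate {

    /** Growth rate in GB per hour between two samples taken `hours` apart. */
    static double rateGbPerHour(double startGb, double endGb, double hours) {
        if (hours <= 0) {
            throw new IllegalArgumentException("hours must be positive");
        }
        return (endGb - startGb) / hours;
    }

    /** Hours until the headroom is used up at the given rate (infinite if not growing). */
    static double hoursUntilExhausted(double headroomGb, double rateGbPerHour) {
        return rateGbPerHour <= 0 ? Double.POSITIVE_INFINITY : headroomGb / rateGbPerHour;
    }

    public static void main(String[] args) {
        // 48G -> 50G in about 5 hours, as stated in the issue description.
        double rate = rateGbPerHour(48.0, 50.0, 5.0);
        double hypotheticalHeadroomGb = 10.0; // assumption, not taken from the report
        System.out.printf("rate=%.2f GB/h, ~%.1f h until %.0f GB of headroom is used%n",
                rate, hoursUntilExhausted(hypotheticalHeadroomGb, rate), hypotheticalHeadroomGb);
    }
}
```

At 0.4 GB per hour the pinned usage grows by roughly 9.6 GB per day, so a larger TaskManager would only postpone the kill, which matches the later observation in this thread that the memory eventually runs out regardless of its size.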



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit

2023-02-25 Thread xiaogang zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693581#comment-17693581
 ] 

xiaogang zhou commented on FLINK-31089:
---

[~Yanfei Lei] Yes, your summary is pretty accurate, except that pinning L0 does 
improve performance, although disabling it does not have much impact. But that 
is not the main topic.

My job is a DataStream job. My point is that we should emit a warning, because 
developers may forget to set the StateTtlConfig while turning on 
PinTopLevelIndexAndFilter, and this will certainly lead to OOM issues.
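A sketch of the kind of warning proposed here, assuming a simple per-state view of the settings. `PinningTtlCheck` and `StateSettings` are hypothetical names, not Flink classes; a real check would have to read the RocksDB table options and each state descriptor's TTL configuration.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical startup check, not an existing Flink API: flag states that pin
 * index/filter blocks in the block cache while having no TTL configured.
 */
final class PinningTtlCheck {

    /** Hypothetical per-state view of the settings discussed in this issue. */
    static final class StateSettings {
        final String stateName;
        final boolean pinL0FilterAndIndexBlocks;  // corresponds to setPinL0FilterAndIndexBlocksInCache
        final boolean pinTopLevelIndexAndFilter;  // corresponds to setPinTopLevelIndexAndFilter
        final boolean ttlEnabled;                 // StateTtlConfig or table.exec.state.ttl in effect

        StateSettings(String stateName, boolean pinL0FilterAndIndexBlocks,
                      boolean pinTopLevelIndexAndFilter, boolean ttlEnabled) {
            this.stateName = stateName;
            this.pinL0FilterAndIndexBlocks = pinL0FilterAndIndexBlocks;
            this.pinTopLevelIndexAndFilter = pinTopLevelIndexAndFilter;
            this.ttlEnabled = ttlEnabled;
        }
    }

    /** Returns one warning per state that pins index/filter blocks but has no TTL. */
    static List<String> warnings(List<StateSettings> states) {
        List<String> result = new ArrayList<>();
        for (StateSettings s : states) {
            boolean pinning = s.pinL0FilterAndIndexBlocks || s.pinTopLevelIndexAndFilter;
            if (pinning && !s.ttlEnabled) {
                result.add("State '" + s.stateName + "' pins index/filter blocks in the block cache"
                        + " but has no TTL; pinned memory may keep growing.");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<StateSettings> states = List.of(
                new StateSettings("rank-state", false, true, false),
                new StateSettings("agg-state", false, true, true));
        warnings(states).forEach(System.out::println);
    }
}
```

Whether such a check should also consider `state.backend.rocksdb.memory.partitioned-index-filters`, which comes up elsewhere in the thread, is left open here.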

 

 



[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit

2023-02-23 Thread Yanfei Lei (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693022#comment-17693022
 ] 

Yanfei Lei commented on FLINK-31089:


Let me try to summarize this issue:
 # Enabling PinL0FilterAndIndexBlocksInCache or PinTopLevelIndexAndFilter with TTL disabled will result in OOM.

 ## Turning off PinTopLevelIndexAndFilter can significantly affect performance.
 ## Turning off PinL0FilterAndIndexBlocksInCache will NOT affect performance.
 # Enabling PinL0FilterAndIndexBlocksInCache or PinTopLevelIndexAndFilter with TTL enabled keeps the memory from growing.
 ## Due to https://issues.apache.org/jira/browse/FLINK-22957 , TTL does not take effect for the Rank operator in Flink 1.13.

Is the TTL set by "table.exec.state.ttl"? If the job is a DataStream job, maybe 
you can set TTL for the rank operator via StateTtlConfig.



[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit

2023-02-23 Thread xiaogang zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692566#comment-17692566
 ] 

xiaogang zhou commented on FLINK-31089:
---

Some more info:

1. The task with TTL on has been running for a long time without the pinned 
block cache growing.

2. We have many tasks running on 1.13, which means they do not have the fix from

https://issues.apache.org/jira/browse/FLINK-22957

These tasks also have partitioned-index-filters on, and they also hit OOM 
occasionally.



[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit

2023-02-22 Thread xiaogang zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692546#comment-17692546
 ] 

xiaogang zhou commented on FLINK-31089:
---

[~yunta] If I turn off PinTopLevelIndexAndFilter, the task cannot run properly 
because it spends a lot of time loading the cache. I also found that some Rank 
operators do not have a compaction filter in the LOG file.



[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit

2023-02-21 Thread Yun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691622#comment-17691622
 ] 

Yun Tang commented on FLINK-31089:
--

[~zhoujira86] Thanks for sharing the result. What happens if neither 
'partitioned-index-filters' nor TTL is configured? Will the memory still keep 
growing?

BTW, I have sent you an email asking for your DingTalk account.



[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit

2023-02-20 Thread xiaogang zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691338#comment-17691338
 ] 

xiaogang zhou commented on FLINK-31089:
---

[~yunta] Master, after turning on the compaction filter, the pinned block cache 
size stopped growing.

Should we add a warning when 'partitioned-index-filters' is on and no TTL is 
configured?



[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit

2023-02-20 Thread xiaogang zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691153#comment-17691153
 ] 

xiaogang zhou commented on FLINK-31089:
---

I created another task with TTL enabled and will keep monitoring the memory growth.



[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit

2023-02-20 Thread Yun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691129#comment-17691129
 ] 

Yun Tang commented on FLINK-31089:
--

Thanks for sharing the profiling results; I will take a look soon.
BTW, the compaction filter is used to clean up TTL state, and I think you did 
not enable TTL for this job.
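To illustrate why the missing TTL matters: without TTL there is nothing for the compaction filter to remove, so every key ever written stays in the database. The toy model below (a plain map standing in for RocksDB, with a hypothetical write pattern of one new key per tick; `TtlCompactionToy` is a made-up name) contrasts the live key count with and without a TTL-based drop during periodic compactions.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy model, not RocksDB or Flink code: one new key is written per tick, and a
 * periodic "compaction" optionally drops entries older than the TTL, similar in
 * spirit to a TTL compaction filter. Without it, every key ever written stays live.
 */
final class TtlCompactionToy {

    static int liveKeysAfter(int ticks, long ttlTicks, boolean ttlFilterEnabled, int compactionEvery) {
        Map<Integer, Long> lastWrite = new HashMap<>(); // key -> tick of its last write
        for (int t = 0; t < ticks; t++) {
            lastWrite.put(t, (long) t); // a fresh key every tick
            if (ttlFilterEnabled && (t + 1) % compactionEvery == 0) {
                final long now = t;
                // Drop entries whose last write is at least ttlTicks old.
                lastWrite.values().removeIf(ts -> now - ts >= ttlTicks);
            }
        }
        return lastWrite.size();
    }

    public static void main(String[] args) {
        System.out.println("no TTL:   " + liveKeysAfter(1000, 100, false, 10) + " live keys");
        System.out.println("with TTL: " + liveKeysAfter(1000, 100, true, 10) + " live keys");
    }
}
```

With the filter, the live set stays near the TTL multiplied by the write rate (100 keys right after the last compaction here); without it, the live set grows with every new key. How much of that growth shows up as pinned index and filter memory depends on the RocksDB configuration, so the model only shows the direction of the effect.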



[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit

2023-02-20 Thread xiaogang zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691114#comment-17691114
 ] 

xiaogang zhou commented on FLINK-31089:
---

I found this in the RocksDB log:

2023/02/20-17:55:33.357582 7f4092f42700        Options.compaction_filter: None
2023/02/20-17:55:33.357583 7f4092f42700        Options.compaction_filter_factory: None

Could this lead to the index 'OOM' issue?
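To check which column families were opened without a compaction filter, the option dump lines in the RocksDB LOG can be scanned programmatically. A minimal sketch, assuming only the `Options.compaction_filter: <value>` line shape shown in the excerpt; `CompactionFilterScan` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: find compaction-filter options that RocksDB logged as "None". */
final class CompactionFilterScan {

    // Matches "Options.compaction_filter: X" and "Options.compaction_filter_factory: X".
    private static final Pattern OPTION =
            Pattern.compile("Options\\.(compaction_filter(?:_factory)?):\\s*(\\S+)");

    /** Returns the option names whose logged value is "None", in order of appearance. */
    static List<String> optionsSetToNone(List<String> logLines) {
        List<String> missing = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = OPTION.matcher(line);
            if (m.find() && "None".equals(m.group(2))) {
                missing.add(m.group(1));
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "2023/02/20-17:55:33.357582 7f4092f42700        Options.compaction_filter: None",
                "2023/02/20-17:55:33.357583 7f4092f42700        Options.compaction_filter_factory: None");
        System.out.println(optionsSetToNone(lines)); // [compaction_filter, compaction_filter_factory]
    }
}
```

The sketch expects one option per line; log lines wrapped by a mail client have to be re-joined first.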



[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit

2023-02-19 Thread xiaogang zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690999#comment-17690999
 ] 

xiaogang zhou commented on FLINK-31089:
---

[~yunta] 

I got an update with L0 pinning enabled; see the attachment l0pin_open.

I configured table.exec.state.ttl to 36 hours. I suspect it does not change 
the RocksDB default TTL configuration?

 



[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit

2023-02-17 Thread xiaogang zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690346#comment-17690346
 ] 

xiaogang zhou commented on FLINK-31089:
---

Yes, this is not large at startup, but it keeps growing, so no matter how 
large the TM memory is, it will eventually OOM.

Now I have started another task with setPinL0FilterAndIndexBlocksInCache set to 
true, which should grow faster. I will collect another visual profile over the 
weekend and post it here.

Also, I think it would be convenient to communicate via DingTalk. I work at an 
e-commerce company in Shanghai and am in charge of Flink. You can send your 
DingTalk contact to my email zhou16...@163.com



[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit

2023-02-17 Thread Yun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690343#comment-17690343
 ] 

Yun Tang commented on FLINK-31089:
--

It seems that less than 1.6GB of memory is occupied by RocksDB. Is this also 
larger than your configured managed memory?
BTW, could you share the visual profiling result from jeprof?



[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit

2023-02-17 Thread xiaogang zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690329#comment-17690329
 ] 

xiaogang zhou commented on FLINK-31089:
---

[~yunta] Thanks.

Some background info:

jemalloc version: I updated jemalloc from 3.6.0-11 to 5.0.1.

The first set of data I collected was with setPinL0FilterAndIndexBlocksInCache 
set to false and the Flink Kafka offset reset to 2 days ago. I saw that

N4 [label="rocksdb\nUncompressBlockContentsForCompressionType\n1724255408 (63.3%)\r",shape=box,fontsize=47.8];

is the major memory consumer.
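The node line above comes from a jeprof `.dot` graph. A small parser makes the figure easier to read. `JeprofNodeLabel` is a hypothetical helper, and joining the label segments back with `::` assumes that jeprof split the C++ name at its scope separators.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: parse one node line of a jeprof/pprof .dot graph. */
final class JeprofNodeLabel {

    // The quoted label of a node line such as: N4 [label="...",shape=box,...];
    private static final Pattern LABEL = Pattern.compile("label=\"([^\"]*)\"");
    // "<bytes> (<percent>%)" inside the last label segment.
    private static final Pattern BYTES = Pattern.compile("(\\d+) \\(([\\d.]+)%\\)");

    static final class Node {
        final String function;
        final long bytes;
        final double percent;

        Node(String function, long bytes, double percent) {
            this.function = function;
            this.bytes = bytes;
            this.percent = percent;
        }

        double gib() {
            return bytes / (1024.0 * 1024.0 * 1024.0);
        }
    }

    static Node parse(String dotLine) {
        Matcher label = LABEL.matcher(dotLine);
        if (!label.find()) {
            throw new IllegalArgumentException("no label in: " + dotLine);
        }
        // Segments are separated by a literal backslash followed by 'n' in the .dot text.
        String[] parts = label.group(1).split("\\\\n");
        Matcher size = BYTES.matcher(parts[parts.length - 1]);
        if (parts.length < 2 || !size.find()) {
            throw new IllegalArgumentException("unexpected label: " + label.group(1));
        }
        // Assumption: the preceding segments are pieces of a C++ name split at "::".
        String function = String.join("::", Arrays.copyOf(parts, parts.length - 1));
        return new Node(function, Long.parseLong(size.group(1)), Double.parseDouble(size.group(2)));
    }

    public static void main(String[] args) {
        String line = "N4 [label=\"rocksdb\\nUncompressBlockContentsForCompressionType\\n"
                + "1724255408 (63.3%)\\r\",shape=box,fontsize=47.8];";
        Node n = parse(line);
        System.out.printf("%s: %d bytes (%.2f GiB, %.1f%%)%n", n.function, n.bytes, n.gib(), n.percent);
    }
}
```

For this line it reports 1,724,255,408 bytes, about 1.61 GiB (1.72 GB), attributed to block decompression.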



[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit

2023-02-17 Thread Yun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690271#comment-17690271
 ] 

Yun Tang commented on FLINK-31089:
--

[~zhoujira86] You can set prof_prefix to "/tmp/jeprof" to make sure the process 
has write permission.
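For reference, the jemalloc options used in this thread can be assembled into a `MALLOC_CONF` value with `prof_prefix` set to `/tmp/jeprof`. The `lg_*` options are base-2 logarithms of byte counts of allocation activity, so `lg_prof_interval:29` requests a profile dump roughly every 2^29 bytes (512 MiB) allocated and `lg_prof_sample:17` a sample roughly every 2^17 bytes (128 KiB). These options only take effect if jemalloc was built with profiling support (`--enable-prof`); otherwise jemalloc reports them as invalid conf pairs. `MallocConf` is a hypothetical helper name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper: build a jemalloc MALLOC_CONF value and decode lg_* options. */
final class MallocConf {

    /** Joins options into jemalloc's comma-separated "key:value" syntax. */
    static String build(Map<String, String> options) {
        StringJoiner joiner = new StringJoiner(",");
        options.forEach((key, value) -> joiner.add(key + ":" + value));
        return joiner.toString();
    }

    /** Converts a jemalloc lg_* option (log base 2 of a byte count) into bytes. */
    static long lgToBytes(int lg) {
        return 1L << lg;
    }

    public static void main(String[] args) {
        Map<String, String> opts = new LinkedHashMap<>(); // keeps insertion order
        opts.put("prof", "true");
        opts.put("lg_prof_interval", "29");
        opts.put("lg_prof_sample", "17");
        opts.put("prof_prefix", "/tmp/jeprof");
        System.out.println("MALLOC_CONF=" + build(opts));
        System.out.println("dump every ~" + (lgToBytes(29) >> 20) + " MiB allocated");
        System.out.println("sample every ~" + (lgToBytes(17) >> 10) + " KiB allocated");
    }
}
```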



[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit

2023-02-17 Thread xiaogang zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690261#comment-17690261
 ] 

xiaogang zhou commented on FLINK-31089:
---

[~yunta] 

Master, I rebuilt jemalloc from source with the config below, and the Invalid 
conf pair warning disappeared. But I can't find the prof files in the location 
I set. Could you suggest how to collect the evidence?

!image-2023-02-17-16-48-59-535.png|width=625,height=101!



[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit

2023-02-15 Thread xiaogang zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689482#comment-17689482
 ] 

xiaogang zhou commented on FLINK-31089:
---

[~yunta] 

Master, do you know what these warnings mean?

: Invalid conf pair: prof:true
: Invalid conf pair: lg_prof_interval:29
: Invalid conf pair: lg_prof_sample:17
: Invalid conf pair: prof_prefix:/opt/flink/jeprof.out

BTW, could you please share your WeChat?



[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit

2023-02-15 Thread Yun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689471#comment-17689471
 ] 

Yun Tang commented on FLINK-31089:
--

[~zhoujira86] Maybe you did not get what I meant. I am sure that you use 
jemalloc as the default memory allocator; I hope you can use jemalloc to 
profile the memory usage and share the dump results so we can see what occupies 
the memory.



[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit

2023-02-15 Thread xiaogang zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689454#comment-17689454
 ] 

xiaogang zhou commented on FLINK-31089:
---

[~yunta] It looks like we are already using jemalloc:

$ /usr/bin/pmap -x 1 | grep malloc
7f434e9aa000     204     204       0 r-x-- libjemalloc.so.1
7f434e9aa000       0       0       0 r-x-- libjemalloc.so.1
7f434e9dd000    2044       0       0 - libjemalloc.so.1
7f434e9dd000       0       0       0 - libjemalloc.so.1
7f434ebdc000       8       8       8 r libjemalloc.so.1
7f434ebdc000       0       0       0 r libjemalloc.so.1
7f434ebde000       4       4       4 rw--- libjemalloc.so.1
7f434ebde000       0       0       0 rw--- libjemalloc.so.1
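The same check can be done without pmap by scanning /proc/<pid>/maps for a jemalloc mapping. A minimal Java sketch; the class name is hypothetical, and a mapped libjemalloc only suggests (it does not by itself prove) that jemalloc is the active malloc, e.g. when it is loaded via LD_PRELOAD:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

final class JemallocCheck {

    // True if any mapping in /proc/<pid>/maps content refers to a jemalloc shared library.
    static boolean mapsJemalloc(List<String> mapsLines) {
        for (String line : mapsLines) {
            // The mapped file path, if any, is the last whitespace-separated field.
            String[] fields = line.trim().split("\\s+");
            String path = fields[fields.length - 1];
            if (path.contains("libjemalloc")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the current JVM; pass "1" to inspect PID 1, as in the pmap call.
        String pid = args.length > 0 ? args[0] : "self";
        Path maps = Paths.get("/proc", pid, "maps");
        if (!Files.isReadable(maps)) {
            System.out.println("cannot read " + maps + " (not Linux, or no permission)");
            return;
        }
        System.out.println("jemalloc mapped: " + mapsJemalloc(Files.readAllLines(maps)));
    }
}
```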

 

As for 'state.backend.rocksdb.memory.partitioned-index-filters': yes, we configured 
it as true. Without the two_level_index_cache, RocksDB performance is 
really low. 

 

And flink_taskmanager_job_task_operator_.*rocksdb_block_cache_pinned_usage can 
grow quickly if PinL0FilterAndIndexBlocksInCache is left as true.
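Summing that metric across operators gives a single number to watch over time. As an illustration, a minimal Java sketch that sums every sample whose name matches the quoted pattern in a Prometheus text-format scrape; it assumes the RocksDB pinned-usage native metric is enabled and exported through the Prometheus reporter, and the class name and sample values are hypothetical:

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PinnedUsageSum {

    // Metric-name pattern quoted in the comment above.
    static final Pattern PINNED_USAGE = Pattern.compile(
            "flink_taskmanager_job_task_operator_.*rocksdb_block_cache_pinned_usage");

    // ASSUMED Prometheus text-format sample: name{labels} value
    // (simplified: label values must not contain '}').
    static final Pattern SAMPLE = Pattern.compile(
            "^([a-zA-Z_:][a-zA-Z0-9_:]*)(\\{[^}]*\\})?\\s+(\\S+)");

    // Sums all pinned-usage samples (in bytes) found in one scrape.
    static double sumPinnedUsage(List<String> scrapeLines) {
        double total = 0.0;
        for (String line : scrapeLines) {
            if (line.startsWith("#")) {
                continue; // skip # HELP / # TYPE comment lines
            }
            Matcher m = SAMPLE.matcher(line);
            if (m.find() && PINNED_USAGE.matcher(m.group(1)).matches()) {
                total += Double.parseDouble(m.group(3));
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // HYPOTHETICAL scrape, for illustration only.
        List<String> scrape = List.of(
                "flink_taskmanager_job_task_operator_cf1_rocksdb_block_cache_pinned_usage{tm_id=\"tm-1\"} 1.5E9",
                "flink_taskmanager_Status_JVM_Memory_Heap_Used{tm_id=\"tm-1\"} 8.0E8");
        System.out.println("pinned usage (bytes): " + sumPinnedUsage(scrape));
    }
}
```

Comparing this sum between runs with the flag set to true and false is one way to reproduce the comparison in the attached graphs.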



[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit

2023-02-15 Thread Yun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689134#comment-17689134
 ] 

Yun Tang commented on FLINK-31089:
--

[~zhoujira86] Thanks for reporting this.
First of all, I think you have set 
{{state.backend.rocksdb.memory.partitioned-index-filters}} to true, right? Did 
you get similar results with this option set to false?
Could you also use jemalloc's heap profiling to analyze the native memory usage 
(FLINK-19125, 
https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Heap-Profiling )? I think 
that is the right tool to track down this problem.
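The heap-profiling use case linked above is driven by jemalloc's MALLOC_CONF environment variable. A minimal Java sketch that assembles such a value; the option names prof, lg_prof_interval, lg_prof_sample and prof_prefix are jemalloc's own, while the class name, the chosen intervals and the dump prefix are illustrative assumptions. The options only take effect if the jemalloc in use was built with profiling enabled (--enable-prof), which is not guaranteed for a distribution package such as the libjemalloc.so.1 in the pmap output:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class JemallocProfConf {

    // Joins options into jemalloc's MALLOC_CONF "key:value,key:value" syntax.
    static String toMallocConf(Map<String, String> options) {
        StringJoiner joiner = new StringJoiner(",");
        for (Map.Entry<String, String> e : options.entrySet()) {
            joiner.add(e.getKey() + ":" + e.getValue());
        }
        return joiner.toString();
    }

    // Heap-profiling options; only effective if jemalloc was built with --enable-prof.
    static Map<String, String> heapProfilingOptions(String dumpPrefix) {
        Map<String, String> opts = new LinkedHashMap<>(); // preserves insertion order
        opts.put("prof", "true");            // turn on heap profiling
        opts.put("lg_prof_interval", "30");  // dump about every 2^30 bytes (1 GiB) allocated
        opts.put("lg_prof_sample", "17");    // sample about every 2^17 bytes (128 KiB) allocated
        opts.put("prof_prefix", dumpPrefix); // file-name prefix for the dumps
        return opts;
    }

    public static void main(String[] args) {
        // HYPOTHETICAL dump location, for illustration only.
        String conf = toMallocConf(heapProfilingOptions("/tmp/jeprof"));
        System.out.println("MALLOC_CONF=" + conf);
    }
}
```

The printed value would be exported as MALLOC_CONF in the TaskManager's environment before the process starts; the resulting dump files can then be inspected with jemalloc's jeprof script (called pprof in older jemalloc releases).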



[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory grow finally lead to memory beyond limit

2023-02-15 Thread xiaogang zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689097#comment-17689097
 ] 

xiaogang zhou commented on FLINK-31089:
---

[~yunta] Please kindly review. I have also tested the performance:

disabling PinTopLevelIndexAndFilter significantly decreases performance;

disabling PinL0FilterAndIndexBlocksInCache does not hurt performance much.

 

[https://github.com/facebook/rocksdb/issues/4112#issuecomment-405859006] also 
mentions that caching the top level is enough, and RocksDB memory growth is a 
frequent complaint in the RocksDB issue tracker.



