[jira] [Comment Edited] (FLINK-25328) Improvement of share memory manager between jobs if they use the same slot in TaskManager for flink olap queries

2021-12-16 Thread Xintong Song (Jira)


[ https://issues.apache.org/jira/browse/FLINK-25328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461218#comment-17461218 ]

Xintong Song edited comment on FLINK-25328 at 12/17/21, 6:24 AM:
-

[~zjureel],
The reason we have operators, RocksDB and Python sharing one memory pool is 
that Flink can then make use of whatever amount of memory is available. The 
problem with a dedicated memory pool is that its memory may go unused in some 
scenarios. The network buffer pool does not have this problem because it is 
always needed; unfortunately, operators that use memory segments, the RocksDB 
state backend and Python UDFs are not always needed.
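
For illustration, here is a minimal sketch of the idea behind the shared pool (the class and method names are made up for this sketch, and the 32 KiB page size is an assumption; this is not Flink's actual MemoryManager API): page-style allocations from operators and opaque reservations from RocksDB or the Python VM draw on the same budget, so whichever consumers a job actually has can use all of the available memory, whereas dedicated pools would leave the unused pool idle.

{code:java}
// Hypothetical sketch, not Flink's MemoryManager: one budget shared by page-based
// consumers (operators) and opaque reservations (RocksDB, Python).
public class SharedBudgetSketch {

    static final int PAGE_SIZE = 32 * 1024; // 32 KiB pages, an assumption for this sketch

    static class SharedBudget {
        private long available;

        SharedBudget(long totalBytes) {
            this.available = totalBytes;
        }

        /** Operators take memory as fixed-size pages (segments). */
        synchronized void allocatePages(int numPages) {
            long bytes = (long) numPages * PAGE_SIZE;
            if (bytes > available) {
                throw new IllegalStateException("not enough managed memory for " + numPages + " pages");
            }
            available -= bytes;
        }

        /** RocksDB / Python reserve an opaque byte budget and allocate the memory themselves. */
        synchronized void reserve(long bytes) {
            if (bytes > available) {
                throw new IllegalStateException("not enough managed memory to reserve " + bytes + " bytes");
            }
            available -= bytes;
        }

        synchronized long available() {
            return available;
        }
    }

    public static void main(String[] args) {
        // A pure batch job: only operators show up, and they can take (almost) everything.
        SharedBudget batchJob = new SharedBudget(64L * 1024 * 1024); // 64 MiB of managed memory
        batchJob.allocatePages(2000);
        System.out.println("left after operators: " + batchJob.available());

        // A streaming job on the same budget could instead have reserved most of it for RocksDB.
        SharedBudget streamingJob = new SharedBudget(64L * 1024 * 1024);
        streamingJob.reserve(48L * 1024 * 1024);
        System.out.println("left after RocksDB reservation: " + streamingJob.available());
    }
}
{code}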



> Improvement of share memory manager between jobs if they use the same slot in 
> TaskManager for flink olap queries
> 
>
> Key: FLINK-25328
> URL: https://issues.apache.org/jira/browse/FLINK-25328
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.14.0, 1.12.5, 1.13.3
>Reporter: Shammon
>Priority: Major
>
> We submit batch jobs to a Flink session cluster as OLAP queries, and these 
> jobs' subtasks in the TaskManager are frequently created and destroyed because 
> they finish their work quickly. Each slot in the TaskManager manages a 
> `MemoryManager` for the multiple tasks of one job, and the `MemoryManager` is 
> closed when all the subtasks are finished. Operators such as Join/Aggregate/Sort 
> in the subtasks allocate `MemorySegment`s via the `MemoryManager`, and these 
> `MemorySegment`s are freed when the operators finish.
> 
> This causes a large amount of `MemorySegment` allocation and freeing in the 
> TaskManager. For example, suppose a TaskManager contains 50 slots, one job runs 
> 3 join/agg operators in each slot, and each operator allocates and initializes 
> 2000 segments. If the subtasks of a job take 100ms to execute, the TaskManager 
> executes 10 jobs' subtasks per second, so it allocates and frees 
> 2000 * 3 * 50 * 10 = 3,000,000 segments per second for them. Allocating and 
> freeing so many segments causes two issues:
> 1) It increases the CPU usage of the TaskManager.
> 2) It increases the cost of the subtasks in the TaskManager, which increases 
> job latency and decreases QPS.
> To improve the reuse of memory segments between jobs in the same slot, 
> we propose not to drop the memory manager when all the subtasks in the slot are 
> finished. The slot will hold the `MemoryManager` and not free its allocated 
> `MemorySegment`s immediately. When subtasks of another job are assigned to the 
> slot, they do not need to allocate new segments from memory and 
> can reuse the `MemoryManager` and the `MemorySegment`s in it.  WDYT?  [~xtsong] THX
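
For illustration, a minimal sketch of the proposal in the description above (the names and the 32 KiB segment size are hypothetical, not Flink's actual API): the slot keeps the segments released by a finished job's subtasks and hands them to the next job's subtasks, so the hot path avoids repeated allocation and initialization.

{code:java}
// Hypothetical sketch of a slot-level segment cache; names are made up, not Flink APIs.
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

public class SlotSegmentCacheSketch {

    static final int SEGMENT_SIZE = 32 * 1024;

    static class SlotSegmentCache {
        private final Deque<ByteBuffer> cached = new ArrayDeque<>();

        /** Reuse a cached segment if one is available, otherwise allocate a new one. */
        ByteBuffer acquire() {
            ByteBuffer seg = cached.poll();
            return seg != null ? seg : ByteBuffer.allocateDirect(SEGMENT_SIZE);
        }

        /** Called when a subtask finishes: keep the segment instead of freeing it. */
        void release(ByteBuffer segment) {
            segment.clear();
            cached.push(segment);
        }

        int size() {
            return cached.size();
        }
    }

    public static void main(String[] args) {
        SlotSegmentCache cache = new SlotSegmentCache();

        // Job 1's operator allocates and then releases 2000 segments.
        ByteBuffer[] segments = new ByteBuffer[2000];
        for (int i = 0; i < segments.length; i++) {
            segments[i] = cache.acquire();
        }
        for (ByteBuffer segment : segments) {
            cache.release(segment);
        }

        // Job 2's operator in the same slot can reuse them: no new allocations needed.
        System.out.println("cached segments available for the next job: " + cache.size());
    }
}
{code}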





[jira] [Comment Edited] (FLINK-25328) Improvement of share memory manager between jobs if they use the same slot in TaskManager for flink olap queries

2021-12-15 Thread Xintong Song (Jira)


[ https://issues.apache.org/jira/browse/FLINK-25328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459891#comment-17459891 ]

Xintong Song edited comment on FLINK-25328 at 12/15/21, 12:13 PM:
--

[~zjureel],

I get that frequently allocating & deallocating lots of memory segments is 
indeed a problem. However, I'm afraid the proposed solution won't work.

I can see two problems in the proposed solution:
- With the fine-grained resource management introduced in 1.14, a TaskManager no 
longer has a fixed number of slots nor a fixed size for each slot. That means the 
number of MemoryManagers in a TaskManager and their sizes can also vary, which 
makes reuse hard.
- Managed memory is used not only by operators (as segments), but also by the 
RocksDBStateBackend and the Python VM. The latter two do not use memory segments; 
they reserve a memory budget from the MemoryManager and allocate the memory 
themselves. If segments are not released in a timely manner, there won't be enough 
budget left for them to reserve.

I think solving this problem needs a more careful design, one that caches the 
segments only when reuse is possible, without affecting the other cases.
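
For illustration, a minimal sketch of the second point above (hypothetical names and sizes, not Flink's actual MemoryManager): segments cached for reuse still count against the managed-memory budget, so a later RocksDB- or Python-style reservation can only succeed if the cached segments can be evicted; roughly, that is what a design that "caches the segments only when reuse is possible" has to guarantee.

{code:java}
// Hypothetical sketch: an evictable segment cache that still honors opaque reservations.
import java.util.ArrayDeque;
import java.util.Deque;

public class EvictableCacheSketch {

    static final long SEGMENT_SIZE = 32 * 1024;

    private long availableBytes;                                      // unallocated managed memory
    private final Deque<Object> cachedSegments = new ArrayDeque<>();  // allocated but currently idle

    EvictableCacheSketch(long totalBytes) {
        this.availableBytes = totalBytes;
    }

    /** An operator takes one segment's worth of budget, reusing a cached segment if possible. */
    Object allocateSegment() {
        Object seg = cachedSegments.poll();
        if (seg != null) {
            return seg;                                   // reuse: budget already accounted for
        }
        if (availableBytes < SEGMENT_SIZE) {
            throw new IllegalStateException("managed memory exhausted");
        }
        availableBytes -= SEGMENT_SIZE;
        return new byte[(int) SEGMENT_SIZE];
    }

    /** A finished operator hands the segment back; keep it cached instead of freeing it. */
    void releaseToCache(Object segment) {
        cachedSegments.push(segment);
    }

    /** RocksDB / Python style reservation; cached segments must be evictable or this fails. */
    boolean reserve(long bytes) {
        while (availableBytes < bytes && !cachedSegments.isEmpty()) {
            cachedSegments.pop();                         // drop an idle cached segment...
            availableBytes += SEGMENT_SIZE;               // ...and return its budget to the pool
        }
        if (availableBytes < bytes) {
            return false;
        }
        availableBytes -= bytes;
        return true;
    }

    public static void main(String[] args) {
        EvictableCacheSketch pool = new EvictableCacheSketch(64L * 1024 * 1024);

        // A batch job's operator allocates 2000 segments, which are cached when it finishes.
        Object[] segments = new Object[2000];
        for (int i = 0; i < segments.length; i++) {
            segments[i] = pool.allocateSegment();
        }
        for (Object segment : segments) {
            pool.releaseToCache(segment);
        }

        // A later RocksDB-style reservation only succeeds because the cache is evictable.
        System.out.println("reservation succeeded: " + pool.reserve(48L * 1024 * 1024));
    }
}
{code}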






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

