morningman opened a new issue #2016: [Proposal] Limit the memory usage of 
Compaction
URL: https://github.com/apache/incubator-doris/issues/2016
 
 
   With the widespread use of the new load framework, existing compaction 
strategies no longer work in some scenarios. This document focuses on the 
problems that the new load framework brings to the compaction logic and how to 
address them.
   
   ## Problem
   
   In the new load framework, loaded data forms a series of `Memtables` in 
memory. When the size of a memtable reaches the threshold (100MB by default), it 
is written to disk to form a `Segment`. One batch of load corresponds to one 
`Version`. When a batch of loaded data is relatively large, or the rows of a 
table are large, a single batch of load may generate thousands of segments.
   
   In the compaction logic, each compaction selects at least one whole 
version. Compaction is an external sort that opens one `RowBlock` for each 
segment, with 1024 rows per RowBlock. So each RowBlock occupies 
(1024 * row size) of memory.
   
   Assuming that a Compaction has 1000 Segments and each row is 4K in size, 
the RowBlocks will take up about 4G of memory (1000 * 1024 * 4K). When multiple 
Compactions run at the same time, the system may run out of memory (OOM).
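
   The arithmetic above can be checked with a minimal sketch. The function 
name and constant below are illustrative only and are not taken from the Doris 
code base; the only inputs are the figures stated in this section.

```python
ROWS_PER_ROW_BLOCK = 1024  # rows held by one RowBlock, as stated above


def row_block_memory_bytes(num_segments: int, row_size_bytes: int) -> int:
    """Memory needed when one RowBlock is kept open per segment."""
    return num_segments * ROWS_PER_ROW_BLOCK * row_size_bytes


# 1000 segments, 4K per row: the example from this section.
estimate = row_block_memory_bytes(num_segments=1000, row_size_bytes=4 * 1024)
print(f"{estimate / 1024**3:.2f} GiB")  # prints "3.91 GiB", i.e. roughly 4G
```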
   
   ## Solution
   
   This proposal is to ensure that Compaction can run stably with less memory 
by estimating and limiting the amount of memory used by Compaction. This work 
is divided into the following three steps.
   
   ### Compaction ratio statistic
   
   To estimate the amount of memory used by a Compaction, we mainly need to 
estimate the size of a row in memory. We can simply use the ratio of the 
in-memory size of a memtable to the size of the file it is written to on disk 
as the compaction ratio. With this ratio, the size of the data file on disk, 
and the number of rows in the file, we can calculate the approximate memory 
footprint of a single row.
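
   A minimal sketch of this estimate follows. The function names and the 
sample numbers (a 100MB memtable flushed to a 25MB file holding 100,000 rows) 
are assumptions made for illustration, not measurements from Doris.

```python
def compaction_ratio(memtable_bytes_in_memory: int, flushed_file_bytes: int) -> float:
    """Ratio of a memtable's in-memory size to the size of the file it produced."""
    if flushed_file_bytes <= 0:
        raise ValueError("flushed_file_bytes must be positive")
    return memtable_bytes_in_memory / flushed_file_bytes


def estimated_row_bytes_in_memory(ratio: float, file_bytes_on_disk: int, num_rows: int) -> float:
    """Approximate in-memory size of one row: expand the file size, then average."""
    if num_rows <= 0:
        raise ValueError("num_rows must be positive")
    return ratio * file_bytes_on_disk / num_rows


MB = 1024 * 1024
ratio = compaction_ratio(100 * MB, 25 * MB)                  # hypothetical flush
row_bytes = estimated_row_bytes_in_memory(ratio, 25 * MB, 100_000)
print(ratio, row_bytes)
```

   The ratio is recorded once at flush time and can then be reused for any 
file of the same table, which is why only the file size and row count are 
needed later.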
   
   ### Support compaction within a version
   
   Currently a Compaction must cover at least one whole version. If there are 
too many Segments in a single version, it still consumes a lot of memory. So we 
need to support compaction over a subset of the segments within a single 
version. 
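
   One possible way to choose such subsets is sketched below, assuming a 
per-Compaction memory budget and the row-size estimate from the previous step. 
The grouping policy (consecutive segments, at least two per group) and all 
names are hypothetical; the proposal itself does not specify them.

```python
from typing import List, Sequence

ROWS_PER_ROW_BLOCK = 1024  # rows held by one RowBlock


def max_segments_per_compaction(memory_budget_bytes: int, row_bytes_in_memory: float) -> int:
    """How many segments can be open at once, with one RowBlock per segment."""
    per_segment_bytes = ROWS_PER_ROW_BLOCK * row_bytes_in_memory
    # Merging needs at least two inputs to make progress, so never go below 2,
    # even if that slightly exceeds the budget for very wide rows.
    return max(2, int(memory_budget_bytes // per_segment_bytes))


def split_segments(segment_ids: Sequence[int], max_per_group: int) -> List[List[int]]:
    """Split one version's segments into consecutive groups of bounded size."""
    return [
        list(segment_ids[i:i + max_per_group])
        for i in range(0, len(segment_ids), max_per_group)
    ]


# Hypothetical budget of 512MB per Compaction with 4K rows.
limit = max_segments_per_compaction(512 * 1024 * 1024, 4 * 1024)
groups = split_segments(range(1000), limit)
print(limit, len(groups), len(groups[-1]))
```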
   
   ### Limiting Compaction memory usage
   
   With the previous two steps, it becomes possible to estimate and limit the 
memory usage of a single Compaction. Finally, we need an overall limit to 
ensure that the total memory overhead stays within a reasonable range when 
multiple Compactions are running at the same time.
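
   A global limit of this kind could be tracked with a shared reservation 
counter, as in the sketch below. The class and its methods are hypothetical 
and only illustrate the idea; the limit values are example numbers.

```python
import threading


class CompactionMemoryLimiter:
    """Tracks estimated memory reserved by all running Compactions."""

    def __init__(self, total_limit_bytes: int) -> None:
        self._limit = total_limit_bytes
        self._used = 0
        self._lock = threading.Lock()

    def try_reserve(self, estimated_bytes: int) -> bool:
        """Reserve memory for one Compaction; refuse if the total would exceed the limit."""
        with self._lock:
            if self._used + estimated_bytes > self._limit:
                return False
            self._used += estimated_bytes
            return True

    def release(self, estimated_bytes: int) -> None:
        """Return a reservation once the Compaction has finished."""
        with self._lock:
            self._used = max(0, self._used - estimated_bytes)

    @property
    def used_bytes(self) -> int:
        with self._lock:
            return self._used


MB = 1024 * 1024
limiter = CompactionMemoryLimiter(total_limit_bytes=1024 * MB)
first = limiter.try_reserve(512 * MB)    # fits within the limit
second = limiter.try_reserve(768 * MB)   # would exceed the limit, so it is refused
limiter.release(512 * MB)
third = limiter.try_reserve(768 * MB)    # fits after the first one is released
print(first, second, third)
```

   A Compaction that is refused would wait or be rescheduled rather than 
start, so the sum of the estimates never exceeds the configured limit.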
