jt2594838 commented on issue #211:
URL: https://github.com/apache/tsfile/issues/211#issuecomment-2295531166

   When flush is not called manually, memory usage is mainly controlled by 
`chunk_group_size_threshold_`, which is 128MB by default. That is pretty close 
to the 144MB measured above; considering that other things also occupy memory 
(like metadata and the program itself), `chunk_group_size_threshold_` seems to 
work well enough.
   
   When flushing after inserting each tablet, the memory is mainly consumed by 
metadata. A simple calculation: in the last experiment, each tablet held only 
10 rows, so one ChunkMetadata was generated for every 10 points of a time 
series. A ChunkMetadata is typically around 50-80 bytes, so each data point 
still consumes 5-8 bytes on average even after being flushed.
   
   `chunk_group_size_threshold_` sounds promising, but we cannot rely on it 
alone. The reason is that the memory check is performed only after each 
insertion, so if a single Tablet is larger than `chunk_group_size_threshold_`, 
the parameter does not take effect. In any case, the size of a tablet must be 
carefully controlled to hold memory below a strict threshold, e.g., 10MB. If 
we set `chunk_group_size_threshold_` to 10MB and there are two tablets of 9MB 
each, we end up using 80% more memory than expected. As a result, we should 
still call flush after each insertion to enforce strict memory control.
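   The overshoot in that example works out as follows (numbers mirror the 
10MB/9MB scenario in the text):

```python
# Why chunk_group_size_threshold_ alone cannot bound memory: the check runs
# only after an insertion completes, so tablets that individually fit under
# the threshold can together overshoot it before any flush is triggered.
threshold_mb = 10
tablets_mb = [9, 9]        # two tablets, each just under the threshold

peak_mb = sum(tablets_mb)  # both pass the post-insertion check, so both stay buffered
overshoot = peak_mb / threshold_mb - 1
print(f"peak={peak_mb}MB, overshoot={overshoot:.0%}")  # peak=18MB, overshoot=80%
```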
   
   Additionally, we were estimating the proper size of a Tablet the wrong way. 
We previously divided the memory budget (10MB) by the number of time series 
(2500) times the point size (8+4 bytes), but those 2500 time series come from 
50 devices, and we flushed once per device. Consequently, the row count a 
Tablet should have was significantly underestimated.
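   A sketch of the corrected sizing, using the figures from the text (10MB 
budget, 8-byte timestamp plus 4-byte value per point, 2500 series over 50 
devices, one flush per device):

```python
# Wrong vs. corrected tablet row-count estimate under a fixed memory budget.
budget_bytes = 10 * 1024 * 1024
point_bytes = 8 + 4                           # timestamp + value
total_series = 2500
devices = 50
series_per_device = total_series // devices   # 50 series per device

# Wrong: spreading the budget across all 2500 series at once.
rows_wrong = budget_bytes // (total_series * point_bytes)
# Corrected: each flush covers only one device's 50 series.
rows_right = budget_bytes // (series_per_device * point_bytes)
print(rows_wrong, rows_right)  # 349 17476 -> ~50x more rows per Tablet
```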
   
   In conclusion, for the next experiments:
   1. The flush-after-each-tablet policy should be kept.
   2. The row count of each Tablet should be recalculated; it will be much 
higher than the current value.
   3. Even after 2. is done, metadata will still accumulate in memory, only at 
a much slower speed. If it still has a major impact on memory, we should 
either switch to the next file after a certain number of flushes or implement 
the DiskTSMIterator that is provided in the Java Edition.
    

