Tian Jiang created IOTDB-2189:
---------------------------------

             Summary: Shared chunks to reduce I/Os with massive timeseries
                 Key: IOTDB-2189
                 URL: https://issues.apache.org/jira/browse/IOTDB-2189
             Project: Apache IoTDB
          Issue Type: Improvement
          Components: Core/Engine, Core/TsFile
            Reporter: Tian Jiang
         Attachments: image-2021-12-22-12-03-03-966.png

When the number of timeseries explodes, the average memory available to each series becomes very limited. For example, with 10 million timeseries, storing 100 points for each series results in 1 billion points in memory. If each point has an 8-byte timestamp and an 8-byte value, the memory footprint is 16 GB. In this case, each timeseries generates a chunk of only 100 points, which is less than 1 KB after encoding when flushed to disk. As a chunk is the I/O unit during queries, such extremely small chunks significantly reduce I/O performance. Moreover, with so few points per chunk, some encoding algorithms may not work very well. Compaction may solve these problems to some extent, but compaction itself also suffers from small chunks.

We notice that timeseries are generally queried together. For example, device queries read all timeseries of one or more devices, and compactions also read timeseries in batches. So, if we encapsulate more than one timeseries in a chunk, the chunk can be much larger and I/O efficiency improves accordingly. The enlarged chunk size may also improve the compression ratio.

!image-2021-12-22-12-03-03-966.png!

The figure above shows the proposed change. When 3 timeseries are put into the same chunk, a single I/O for timeseries0 fetches all of them. As the chunk is cached, the other two timeseries can be served from the cached chunk, so additional I/O is avoided.

The disadvantage is also obvious. If only some of the timeseries in a chunk are queried, the bandwidth spent reading the others is wasted. So the point is to choose wisely which timeseries should be grouped together and which should not. One option is to simply group the timeseries of the same device, provided whole-device queries are very common. A more sophisticated method could be based on statistics or even machine learning. The method can also be dynamic, as it only affects newly generated chunks.
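
To make the trade-off described above concrete, here is a minimal Python sketch. It is not IoTDB code: the function names, the "one shared chunk per device" policy, the example series paths, and the use of unencoded sizes (16 bytes per point) are all assumptions made for illustration. It reproduces the 16 GB estimate and counts how many chunk reads a query needs, and how many bytes it reads versus how many it actually wanted, when chunks are cached after the first read.

```python
from collections import defaultdict

POINT_BYTES = 8 + 8  # 8-byte timestamp + 8-byte value, unencoded (assumption)


def memory_footprint_bytes(num_series, points_per_series):
    """Raw in-memory size of all buffered points."""
    return num_series * points_per_series * POINT_BYTES


def one_chunk_per_series(series_paths):
    """Current layout: every timeseries gets its own chunk."""
    return {path: path for path in series_paths}


def group_by_device(series_paths):
    """Hypothetical policy: all timeseries of a device share one chunk.

    The device is taken to be everything before the last '.' in the path.
    """
    return {path: path.rpartition(".")[0] for path in series_paths}


def simulate_query(chunk_of, queried, points_per_series):
    """Return (io_count, bytes_read, bytes_wanted) for one query.

    A chunk is read from disk at most once; later series in the same
    chunk are served from the cache.
    """
    members = defaultdict(list)
    for path, chunk in chunk_of.items():
        members[chunk].append(path)

    cached = set()
    io_count = 0
    bytes_read = 0
    for path in queried:
        chunk = chunk_of[path]
        if chunk not in cached:
            cached.add(chunk)
            io_count += 1
            bytes_read += len(members[chunk]) * points_per_series * POINT_BYTES

    bytes_wanted = len(set(queried)) * points_per_series * POINT_BYTES
    return io_count, bytes_read, bytes_wanted


if __name__ == "__main__":
    print(memory_footprint_bytes(10_000_000, 100))  # 16 GB in bytes

    series = ["root.sg.d0.s0", "root.sg.d0.s1", "root.sg.d0.s2"]
    shared = group_by_device(series)
    separate = one_chunk_per_series(series)

    # Whole-device query: shared chunk needs 1 I/O, separate chunks need 3.
    print(simulate_query(shared, series, 100))
    print(simulate_query(separate, series, 100))

    # Single-series query on a shared chunk: 1 I/O, but 2/3 of the bytes are unused.
    print(simulate_query(shared, series[:1], 100))
```

In this toy setup, a whole-device query over the shared chunk needs one read instead of three for the same number of bytes, while a query for a single series still reads the full shared chunk. That gap between bytes read and bytes wanted is the wasted bandwidth a grouping policy would try to keep small.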