Hi,
I'd like to share some experiment results on how chunk size impacts query performance.

Hardware:
  MacBook Pro (Retina, 15-inch, Mid 2015)
  CPU: 2.2 GHz Intel Core i7
  Memory: 16 GB 1600 MHz DDR3
  Storage: a mobile HDD (SEAGATE 1 TB, Model SRD00F1)

Workload:
  1 storage group, 1 device, 100 measurements of long type,
  with 1 million randomly generated data points per time series.

Some background: the size of an originally flushed chunk is

  chunk size = memtable_size_threshold / series number / bytes per data point (16 for long)

so I adjusted memtable_size_threshold to control the chunk size. For example, memtable_size_threshold = 160000 with 100 series and 16 bytes per point yields chunks of 160000 / 100 / 16 = 100 points.

IoTDB configuration:
  enable_parameter_adapter=false
  avg_series_point_number_threshold=10000000 (so that memtable_size_threshold is the effective flush trigger)
  page_size_in_byte=1000000000 (so that each chunk contains exactly one page)
  tsfile_size_threshold = memtable_size_threshold = 160000, 1600000, 16000000, 160000000, and 1600000000 in turn

I used SessionExample.insertTablet to insert the data under each configuration, which produced chunk sizes from 100 to 1000000 points. I then used SessionExample.queryByIterator to iterate over the result set of "select s1 from root.sg1.d1" without constructing any other data structures. (Rough sketches of both steps are in the P.S. below.)

The results:

| chunk size (points) | query time (ms) |
| 100                 | 47620           |
| 1000                | 13984           |
| 10000               | 2416            |
| 100000              | 1322            |

As we can see, chunk size has a dominant impact on raw data query performance. In the current query engine, the Chunk is the basic unit read from disk, and reading each Chunk costs one seek plus one I/O operation, so a larger chunk size means fewer Chunks to read. It is therefore better to enlarge memtable_size_threshold to accelerate queries.

However, a larger memtable_size_threshold requires more memory, which is not always available. Therefore we need compaction, either hot compaction triggered during flush or a timed compaction strategy, to merge small chunks into a large one.

Thanks,
--
Jialin Qiao
School of Software, Tsinghua University
乔嘉林 清华大学 软件学院
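
P.S. For anyone who wants to reproduce this, below is a rough sketch of the insert step against the Java Session API. The host, port, and credentials are placeholders, and the exact Tablet methods (rowSize, addTimestamp, addValue, reset) may differ slightly between IoTDB versions; SessionExample in the IoTDB repository is the authoritative reference.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.iotdb.session.Session;
import org.apache.iotdb.tsfile.file.metadata.enums.TSDataType;
import org.apache.iotdb.tsfile.write.record.Tablet;
import org.apache.iotdb.tsfile.write.schema.MeasurementSchema;

public class InsertSketch {
  public static void main(String[] args) throws Exception {
    Session session = new Session("127.0.0.1", 6667, "root", "root");
    session.open();

    // 100 measurements of long type under one device, as in the workload.
    List<MeasurementSchema> schemas = new ArrayList<>();
    for (int i = 1; i <= 100; i++) {
      schemas.add(new MeasurementSchema("s" + i, TSDataType.INT64));
    }

    Tablet tablet = new Tablet("root.sg1.d1", schemas, 10000);
    Random random = new Random();

    // 1 million randomly generated points per time series.
    for (long time = 0; time < 1000000; time++) {
      int row = tablet.rowSize++;
      tablet.addTimestamp(row, time);
      for (int i = 1; i <= 100; i++) {
        tablet.addValue("s" + i, row, random.nextLong());
      }
      // Send a full batch and reuse the tablet.
      if (tablet.rowSize == tablet.getMaxRowNumber()) {
        session.insertTablet(tablet);
        tablet.reset();
      }
    }
    if (tablet.rowSize != 0) {
      session.insertTablet(tablet);
      tablet.reset();
    }
    session.close();
  }
}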
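
The query step drains the result set through the iterator-style API, which is what SessionExample.queryByIterator does to avoid materializing RowRecord objects. A minimal sketch under the same assumptions (iterator() and closeOperationHandle() may vary by version):

import org.apache.iotdb.session.Session;
import org.apache.iotdb.session.SessionDataSet;

public class QuerySketch {
  public static void main(String[] args) throws Exception {
    Session session = new Session("127.0.0.1", 6667, "root", "root");
    session.open();

    SessionDataSet dataSet =
        session.executeQueryStatement("select s1 from root.sg1.d1");
    SessionDataSet.DataIterator iterator = dataSet.iterator();

    long start = System.currentTimeMillis();
    long rows = 0;
    // Drain the result set without building any other data structure.
    while (iterator.next()) {
      rows++;
    }
    long elapsed = System.currentTimeMillis() - start;
    System.out.println(rows + " rows iterated in " + elapsed + " ms");

    dataSet.closeOperationHandle();
    session.close();
  }
}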