Hi Jialin,

Great experiment! Thanks for sharing.

Looking forward to the hot compaction feature.


Best,
-----------------------------------
Zesong Sun
School of Software, Tsinghua University

孙泽嵩
清华大学 软件学院

> On Jul 8, 2020, at 16:39, Jialin Qiao <qj...@mails.tsinghua.edu.cn> wrote:
> 
> Hi,
> 
> 
> I'd like to share with you some experiment results on how chunk size 
> impacts query performance. 
> 
> 
> Hardware: 
> MacBook Pro (Retina, 15-inch, Mid 2015)
> CPU: 2.2 GHz Intel Core i7
> Memory: 16 GB 1600 MHz DDR3
> I use a mobile HDD (SEAGATE, 1 TB, Model SRD00F1) as the storage.
> 
> 
> Workload: 1 storage group, 1 device, 100 measurements of long type. 1 million 
> data points generated randomly for each time series. 
> 
> 
> Some background: the number of points in a flushed chunk = 
> memtable_size_threshold / number of series / bytes per data point (16 for 
> long data points).
> 
> 
> I adjust memtable_size_threshold to control the chunk size.
> 
> 
> Configurations of IoTDB:
> 
> 
> enable_parameter_adapter=false
> avg_series_point_number_threshold=10000000 (to make the 
> memtable_size_threshold valid)
> page_size_in_byte=1000000000 (each chunk has one page)
> tsfile_size_threshold = memtable_size_threshold = 
> 160000/1600000/16000000/160000000/1600000000
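> Plugging the configured thresholds into the formula above gives the resulting 
> chunk sizes (points per chunk). A minimal sketch of this arithmetic, assuming 
> the workload's 100 series and 16 bytes per long data point:

```java
public class ChunkSizeCalc {
    public static void main(String[] args) {
        // The five memtable_size_threshold values used in the experiment
        long[] thresholds = {160_000L, 1_600_000L, 16_000_000L, 160_000_000L, 1_600_000_000L};
        int seriesNumber = 100;  // 100 measurements in the workload
        int bytesPerPoint = 16;  // 8-byte timestamp + 8-byte long value

        for (long t : thresholds) {
            // chunk size = memtable_size_threshold / number of series / bytes per point
            long pointsPerChunk = t / seriesNumber / bytesPerPoint;
            System.out.println(t + " -> " + pointsPerChunk + " points per chunk");
        }
    }
}
```

> This reproduces the chunk sizes 100, 1000, 10000, 100000, and 1000000 
> mentioned below.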
> 
> 
> I use SessionExample.insertTablet to insert data under each configuration, 
> which yields chunk sizes from 100 to 1000000.
> 
> 
> Then I use SessionExample.queryByIterator to iterate the result set of 
> "select s1 from root.sg1.d1" without constructing other data structures.
> 
> 
> The results are:
> 
> 
> | chunk size | query time (ms) |
> |------------|-----------------|
> |        100 |           47620 |
> |       1000 |           13984 |
> |      10000 |            2416 |
> |     100000 |            1322 |
> 
> 
> As we can see, chunk size has a dominant impact on raw data query 
> performance. In the current query engine, a Chunk is the basic unit read 
> from disk. Reading each Chunk costs one seek plus one IO operation, so a 
> larger chunk size means fewer Chunks to read. 
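> To make the seek count concrete: iterating the 1 million points of s1 takes 
> roughly totalPoints / chunkSize chunk reads. A back-of-the-envelope sketch 
> (ignoring metadata reads and page-level details):

```java
public class SeekCount {
    public static void main(String[] args) {
        long totalPoints = 1_000_000L;  // points per series in the workload
        long[] chunkSizes = {100L, 1_000L, 10_000L, 100_000L};

        for (long c : chunkSizes) {
            // one seek + one IO per chunk read
            long chunkReads = totalPoints / c;
            System.out.println("chunk size " + c + " -> " + chunkReads + " chunk reads");
        }
    }
}
```

> Going from a chunk size of 100 to 100000 cuts the number of seek+IO 
> operations from 10000 to 10, which matches the direction of the measured 
> query times above.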
> 
> 
> Therefore, it is better to enlarge memtable_size_threshold to accelerate 
> queries. However, enlarging memtable_size_threshold requires more memory, 
> which is not always available in some scenarios. Therefore, we need 
> compaction, either hot compaction triggered during flushing or a timed 
> compaction strategy, to merge small chunks into large ones.
> 
> 
> Thanks,
> --
> Jialin Qiao
> School of Software, Tsinghua University
> 
> 乔嘉林
> 清华大学 软件学院
