Hi,

I compared the snappy and zstd compressors using TPCH for carbondata.


For TPCH lineitem table:
              carbon-zstd   carbon-snappy
loading (s)   53            51
size          795MB         1.2GB

TPCH-query:
        carbon-zstd   carbon-snappy
Q1      4.289         8.29
Q2      12.609        12.986
Q3      14.902        14.458
Q4      6.276         5.954
Q5      23.147        21.946
Q6      1.12          0.945
Q7      23.017        28.007
Q8      14.554        15.077
Q9      28.472        27.473
Q10     24.067        24.682
Q11     3.321         3.79
Q12     5.311         5.185
Q13     14.08         11.84
Q14     2.262         2.087
Q15     5.496         4.772
Q16     29.919        29.833
Q17     7.018         7.057
Q18     17.367        17.795
Q19     2.931         2.865
Q20     11.347        10.937
Q21     26.416        28.414
Q22     5.923         6.311
sum     283.844       290.704


As you can see, with zstd the table size is about 33% smaller than with 
snappy (795MB vs 1.2GB). The difference in data loading time (53 s vs 51 s) 
and in total query time (283.844 vs 290.704) is negligible. So I suggest 
changing the default compressor in carbondata from snappy to zstd.


To change the default compressor, we need to:
1. Append the compressor name to the carbondata file name, so that users can 
tell from the file name which compressor was used.
For example, the file name will change from
 part-0-0_batchno0-0-0-1580982686749.carbondata 
to  part-0-0_batchno0-0-0-1580982686749.snappy.carbondata 
or  part-0-0_batchno0-0-0-1580982686749.zstd.carbondata


2. Change the compressor constant in the CarbonCommonConstants.java file to 
use zstd as the default compressor.
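
To make the naming scheme in step 1 concrete, here is a minimal sketch in Java. The class and method names (CompressorFileName, withCompressor) are hypothetical and are not part of the existing carbondata code; the sketch only assumes that the compressor name is inserted right before the ".carbondata" extension, as in the examples above.

```java
// Hypothetical helper illustrating the proposed file naming only;
// it is not carbondata's actual API.
final class CompressorFileName {
    private static final String EXTENSION = ".carbondata";

    // Inserts the compressor name before the ".carbondata" extension.
    static String withCompressor(String fileName, String compressor) {
        if (!fileName.endsWith(EXTENSION)) {
            throw new IllegalArgumentException("not a carbondata file: " + fileName);
        }
        String base = fileName.substring(0, fileName.length() - EXTENSION.length());
        return base + "." + compressor + EXTENSION;
    }

    public static void main(String[] args) {
        String original = "part-0-0_batchno0-0-0-1580982686749.carbondata";
        System.out.println(withCompressor(original, "snappy"));
        System.out.println(withCompressor(original, "zstd"));
    }
}
```

With the example file name from step 1, this produces the two proposed names ending in ".snappy.carbondata" and ".zstd.carbondata". How readers should treat existing files that have no compressor in the name is left open here.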


What do you think?


Regards,
Jacky
