Hi Jacky and Ravindra,

We have tested ZSTD vs Snappy again with the latest code on a 3-node Spark 2.3 cluster on HDFS, using 500 GB of TPCH data. Below is the summary:
1. The ZSTD store is 28.8% smaller compared to Snappy.
2. Overall query time degraded by 18.35% with ZSTD compared to Snappy.
3. Load time with ZSTD shows a negligible degradation of 0.7% compared to Snappy.

Based on this, I guess we cannot use ZSTD as the default due to the huge degradation in query time.

Thanks,
Ajantha

On Fri, Feb 7, 2020 at 4:54 PM Ravindra Pesala <ravi.pes...@gmail.com> wrote:

> Hi Jacky,
>
> As per the original PR, https://github.com/apache/carbondata/pull/2628,
> query performance decreased by 20% ~ 50% compared to Snappy, so I am
> concerned about the performance. Please prepare a proper TPCH performance
> report on the regular cluster, as we do for every version, and decide
> based on that.
>
> Regards,
> Ravindra.
>
> On Fri, 7 Feb 2020 at 10:40 AM, Jacky Li <jacky.li...@qq.com> wrote:
>
> > Hi Ajantha,
> >
> > Yes, the decoder will use the compressorName stored in the
> > ChunkCompressionMeta of the file header, but I think it is better to put
> > it in the file name as well, so that users can see the compressor from
> > the shell without launching an engine to read the file.
> >
> > In Spark, the file name written for Parquet/ORC is:
> > part-00115-e2758995-4b10-4bd2-bf15-b4c176e587fe-c000.snappy.orc
> >
> > In PR3606, I will handle the compatibility.
> >
> > Regards,
> > Jacky
> >
> > ------------------ Original message ------------------
> > From: "Ajantha Bhat" <ajanthab...@gmail.com>
> > Sent: Thursday, Feb 6, 2020, 11:51 PM
> > To: "dev" <dev@carbondata.apache.org>
> > Subject: Re: Discussion: change default compressor to ZSTD
> >
> > Hi,
> >
> > 33% is a huge reduction in store size. If there is a negligible
> > difference in load and query time, we should definitely go for it.
> >
> > And does the user really need to know which compression is used? A
> > change in the file name may need compatibility handling. The thrift
> > FileHeader's ChunkCompressionMeta already stores the compressor name,
> > so decoding at query time can be based on that.
> >
> > Thanks,
> > Ajantha
> >
> > On Thu, Feb 6, 2020 at 4:27 PM Jacky Li <jacky.li...@qq.com> wrote:
> >
> > > Hi,
> > >
> > > I compared the Snappy and ZSTD compressors for CarbonData using TPCH.
> > >
> > > For the TPCH lineitem table:
> > >
> > >                carbon-zstd   carbon-snappy
> > > loading (s)    53            51
> > > size           795 MB        1.2 GB
> > >
> > > TPCH query times (s):
> > >
> > >        carbon-zstd   carbon-snappy
> > > Q1     4.289         8.29
> > > Q2     12.609        12.986
> > > Q3     14.902        14.458
> > > Q4     6.276         5.954
> > > Q5     23.147        21.946
> > > Q6     1.12          0.945
> > > Q7     23.017        28.007
> > > Q8     14.554        15.077
> > > Q9     28.472        27.473
> > > Q10    24.067        24.682
> > > Q11    3.321         3.79
> > > Q12    5.311         5.185
> > > Q13    14.08         11.84
> > > Q14    2.262         2.087
> > > Q15    5.496         4.772
> > > Q16    29.919        29.833
> > > Q17    7.018         7.057
> > > Q18    17.367        17.795
> > > Q19    2.931         2.865
> > > Q20    11.347        10.937
> > > Q21    26.416        28.414
> > > Q22    5.923         6.311
> > > sum    283.844       290.704
> > >
> > > As you can see, after using ZSTD the table size is reduced by 33%
> > > compared to Snappy, and the difference in data loading and query time
> > > is negligible. So I suggest changing the default compressor in
> > > CarbonData from Snappy to ZSTD.
> > >
> > > To change the default compressor, we need to:
> > > 1. Append the compressor name to the CarbonData file name, so that
> > >    users can tell from the file name which compressor was used.
> > >    For example, the file name will change from
> > >    part-0-0_batchno0-0-0-1580982686749.carbondata
> > >    to part-0-0_batchno0-0-0-1580982686749.snappy.carbondata
> > >    or part-0-0_batchno0-0-0-1580982686749.zstd.carbondata
> > > 2. Change the compressor constant in CarbonCommonConstants.java to
> > >    use ZSTD as the default compressor.
> > >
> > > What do you think?
> > >
> > > Regards,
> > > Jacky
>
> --
> Thanks & Regards,
> Ravi
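The headline figures in Jacky's single-node table can be sanity-checked from the reported numbers themselves; a minimal sketch (taking the reported 1.2 GB as 1200 MB, an assumption on my part):

```python
# Cross-check the reported TPCH lineitem numbers from the thread.
zstd_size_mb, snappy_size_mb = 795, 1200  # 1.2 GB taken as 1200 MB
size_reduction = (1 - zstd_size_mb / snappy_size_mb) * 100
print(f"size reduction: {size_reduction:.1f}%")  # roughly 33-34%, matching the reported "33%"

# Sums of the per-query times from the table.
zstd_sum, snappy_sum = 283.844, 290.704
query_delta = (zstd_sum - snappy_sum) / snappy_sum * 100
print(f"query-time delta: {query_delta:+.2f}%")  # about -2.4%: zstd marginally faster in this run
```

This confirms why the two benchmarks diverge in conclusion: in Jacky's run the total query-time difference is within noise, whereas Ajantha's 3-node, 500 GB run showed an 18.35% degradation.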
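The file-renaming proposal above follows Spark's Parquet/ORC convention of embedding the compressor as the second-to-last dot-separated token. A hypothetical sketch (not CarbonData or PR3606 code; `compressor_from_file_name` is a name I made up) of how a shell-side tool could read it back, with `None` signalling a legacy name that needs the compatibility handling Jacky mentions:

```python
def compressor_from_file_name(name: str):
    """Return the compressor token from a data file name, or None for legacy names.

    e.g. "part-...-c000.snappy.orc" -> "snappy"
         "part-0-0_batchno0-0-0-1580982686749.carbondata" -> None (pre-rename format)
    """
    parts = name.split(".")
    # Need at least <base>.<compressor>.<extension> to carry a compressor token.
    return parts[-2] if len(parts) >= 3 else None

print(compressor_from_file_name(
    "part-00115-e2758995-4b10-4bd2-bf15-b4c176e587fe-c000.snappy.orc"))  # snappy
```

For legacy names the reader would fall back to the compressorName in the thrift ChunkCompressionMeta, which, as noted in the thread, remains the authoritative source at decode time.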