Re: Discussion: change default compressor to ZSTD

Jacky Li Thu, 20 Feb 2020 23:59:14 -0800

OK, thanks for the test.
Then for PR3606, I will only add the compressor name to the file name and will
not change the default compressor to ZSTD.


Regards,
Jacky
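
A minimal sketch of how the compressor name could be read back from a file
name laid out as in the thread's examples (<name>.<compressor>.carbondata),
with older files that have no compressor token treated as "unknown". The
function name and the set of compressor names are hypothetical and are not
CarbonData APIs; the authoritative value remains the compressor name stored
in the file header.

```python
from typing import Optional

# Assumption: only the two compressors named in this thread are recognized.
KNOWN_COMPRESSORS = {"snappy", "zstd"}


def compressor_from_file_name(file_name: str) -> Optional[str]:
    """Return the compressor token embedded in a carbondata file name.

    Returns None for files without a recognized token (for example files
    written before the naming change) and for non-carbondata files.
    """
    suffix = ".carbondata"
    if not file_name.endswith(suffix):
        return None
    stem = file_name[: -len(suffix)]
    # The compressor, if present, is the last dot-separated token of the stem.
    _, dot, token = stem.rpartition(".")
    if dot and token in KNOWN_COMPRESSORS:
        return token
    return None


if __name__ == "__main__":
    for name in (
        "part-0-0_batchno0-0-0-1580982686749.carbondata",
        "part-0-0_batchno0-0-0-1580982686749.snappy.carbondata",
        "part-0-0_batchno0-0-0-1580982686749.zstd.carbondata",
    ):
        print(name, "->", compressor_from_file_name(name))
```

Returning None for the old layout keeps existing files readable, which is
the compatibility concern raised in the thread.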

> On Feb 20, 2020, at 12:52 PM, Ajantha Bhat <ajanthab...@gmail.com> wrote:
> 
> Hi Jacky and Ravindra,
> 
> We tested ZSTD vs snappy again with the latest code on a 3-node Spark
> 2.3 cluster on HDFS with 500 GB of TPCH data.
> Below is the summary:
> 
> *1.  ZSTD store is 28.8% smaller compared to snappy*
> *2.  Overall query time is degraded by 18.35% in ZSTD compared to snappy*
> *3.  Load time in ZSTD has negligible degradation of 0.7% compared to
> snappy*
> 
> Based on this, I guess we cannot use ZSTD as the default due to the large
> degradation in query time.
> 
> Thanks,
> Ajantha
> 
> 
> 
> 
> On Fri, Feb 7, 2020 at 4:54 PM Ravindra Pesala <ravi.pes...@gmail.com>
> wrote:
> 
>> Hi Jacky,
>> 
>> As per the original PR
>> https://github.com/apache/carbondata/pull/2628 , query performance
>> decreased by 20% ~ 50% compared to snappy, so I am concerned about the
>> performance. Please prepare a proper TPCH performance report on the
>> regular cluster, as we do for every version, and decide based on that.
>> 
>> Regards,
>> Ravindra.
>> 
>> On Fri, 7 Feb 2020 at 10:40 AM, Jacky Li <jacky.li...@qq.com> wrote:
>> 
>>> Hi Ajantha,
>>> 
>>> 
>>> Yes, the decoder will use the compressorName stored in ChunkCompressionMeta
>>> from the file header,
>>> but I think it is better to put it in the file name so that the user can see
>>> the compressor from the shell without launching an engine to read it.
>>> 
>>> 
>>> In spark, for parquet/orc the file name written is:
>>> part-00115-e2758995-4b10-4bd2-bf15-b4c176e587fe-c000.snappy.orc
>>> 
>>> 
>>> In PR3606, I will handle the compatibility.
>>> 
>>> 
>>> Regards,
>>> Jacky
>>> 
>>> 
>>> ------------------ Original Message ------------------
>>> From: "Ajantha Bhat"<ajanthab...@gmail.com>;
>>> Sent: Thursday, Feb 6, 2020, 11:51 PM
>>> To: "dev"<dev@carbondata.apache.org>;
>>> 
>>> Subject: Re: Discussion: change default compressor to ZSTD
>>> 
>>> 
>>> 
>>> Hi,
>>> 
>>> 33% is a huge reduction in store size. If the difference in load and
>>> query time is negligible, we should definitely go for it.
>>> 
>>> And does the user really need to know which compression is used? A change
>>> in the file name may need compatibility handling.
>>> The thrift *FileHeader, ChunkCompressionMeta* already stores the compressor
>>> name, so query-time decoding can be based on this.
>>> 
>>> Thanks,
>>> Ajantha
>>> 
>>> 
>>> On Thu, Feb 6, 2020 at 4:27 PM Jacky Li <jacky.li...@qq.com> wrote:
>>> 
>>> > Hi,
>>> >
>>> >
>>> > I compared the snappy and zstd compressors using TPCH for carbondata.
>>> >
>>> >
>>> > For TPCH lineitem table:
>>> >               carbon-zstd   carbon-snappy
>>> > loading (s)   53            51
>>> > size          795MB         1.2GB
>>> >
>>> > TPCH-query:
>>> >               carbon-zstd   carbon-snappy
>>> > Q1            4.289         8.29
>>> > Q2            12.609        12.986
>>> > Q3            14.902        14.458
>>> > Q4            6.276         5.954
>>> > Q5            23.147        21.946
>>> > Q6            1.12          0.945
>>> > Q7            23.017        28.007
>>> > Q8            14.554        15.077
>>> > Q9            28.472        27.473
>>> > Q10           24.067        24.682
>>> > Q11           3.321         3.79
>>> > Q12           5.311         5.185
>>> > Q13           14.08         11.84
>>> > Q14           2.262         2.087
>>> > Q15           5.496         4.772
>>> > Q16           29.919        29.833
>>> > Q17           7.018         7.057
>>> > Q18           17.367        17.795
>>> > Q19           2.931         2.865
>>> > Q20           11.347        10.937
>>> > Q21           26.416        28.414
>>> > Q22           5.923         6.311
>>> > sum           283.844       290.704
>>> >
>>> >
>>> > As you can see, with zstd the table size is 33% smaller than with
>>> > snappy, and the difference in data loading and query time is negligible.
>>> > So I suggest changing the default compressor in carbondata from snappy
>>> > to zstd.
>>> >
>>> >
>>> > To change the default compressor, we need to:
>>> > 1. Append the compressor name to the carbondata file name, so that the
>>> > user can tell from the file name which compressor was used.
>>> > For example, the file name will change from
>>> >   part-0-0_batchno0-0-0-1580982686749.carbondata
>>> > to
>>> >   part-0-0_batchno0-0-0-1580982686749.snappy.carbondata
>>> > or
>>> >   part-0-0_batchno0-0-0-1580982686749.zstd.carbondata
>>> >
>>> >
>>> > 2. Change the compressor constant in the CarbonCommonConstants.java file
>>> > to use zstd as the default compressor.
>>> >
>>> >
>>> > What do you think?
>>> >
>>> >
>>> > Regards,
>>> > Jacky
>> 
>> --
>> Thanks & Regards,
>> Ravi
>>
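
The figures quoted from the first lineitem test can be cross-checked with a
few lines of arithmetic. The sketch below uses only the numbers given in the
thread; reading 1.2GB as 1200 MB and treating the query times as seconds are
assumptions, and the helper name relative_change is illustrative.

```python
def relative_change(new: float, base: float) -> float:
    """Fractional change of `new` relative to `base` (negative = smaller)."""
    return (new - base) / base


# Per-query times for Q1..Q22 as quoted in the thread (units assumed: seconds).
zstd_times = [4.289, 12.609, 14.902, 6.276, 23.147, 1.12, 23.017, 14.554,
              28.472, 24.067, 3.321, 5.311, 14.08, 2.262, 5.496, 29.919,
              7.018, 17.367, 2.931, 11.347, 26.416, 5.923]
snappy_times = [8.29, 12.986, 14.458, 5.954, 21.946, 0.945, 28.007, 15.077,
                27.473, 24.682, 3.79, 5.185, 11.84, 2.087, 4.772, 29.833,
                7.057, 17.795, 2.865, 10.937, 28.414, 6.311]

zstd_total = sum(zstd_times)      # thread reports 283.844
snappy_total = sum(snappy_times)  # thread reports 290.704

# Assumption: "1.2GB" is rounded and read here as 1200 MB.
size_change = relative_change(795.0, 1200.0)
query_change = relative_change(zstd_total, snappy_total)
load_change = relative_change(53.0, 51.0)

if __name__ == "__main__":
    print(f"store size: {size_change:+.1%}")
    print(f"query time: {query_change:+.1%}")
    print(f"load time:  {load_change:+.1%}")
```

On this single-table run the store is roughly a third smaller, total query
time is about 2.4% lower with zstd, and load time is about 3.9% higher. The
later 500 GB cluster test reported in this thread measured an 18.35% query
slowdown instead, which is why the default was left unchanged.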
