Re: Discussion: change default compressor to ZSTD

Jacky Li Thu, 06 Feb 2020 18:41:25 -0800

Hi Ajantha,


Yes, the decoder will use the compressorName stored in ChunkCompressionMeta
from the file header,
but I think it is better to also put it in the file name, so that the user can
see the compressor from the shell without launching an engine to read the file.


In Spark, for Parquet/ORC the written file name looks like:
part-00115-e2758995-4b10-4bd2-bf15-b4c176e587fe-c000.snappy.orc


In PR3606, I will handle the compatibility.
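
A minimal sketch of what reading the compressor from the file name could look like, assuming the `<name>.<compressor>.carbondata` pattern proposed in this thread. It is written in Python for brevity and is not CarbonData code; `compressor_from_file_name` and `KNOWN_COMPRESSORS` are hypothetical names. Old-style names without a compressor token return None, in which case a reader would fall back to the compressor name stored in the file header.

```python
from typing import Optional

# Hypothetical token set: this thread only mentions snappy and zstd.
KNOWN_COMPRESSORS = {"snappy", "zstd"}


def compressor_from_file_name(file_name: str) -> Optional[str]:
    """Return the compressor token embedded in a data file name, or None.

    Names written before the change carry no token, so None means
    "look at the compressor name in the file header instead".
    """
    # Split off at most the last two dot-separated suffixes:
    # "<base>.<compressor>.<extension>"
    parts = file_name.rsplit(".", 2)
    if len(parts) == 3 and parts[1] in KNOWN_COMPRESSORS:
        return parts[1]
    return None


if __name__ == "__main__":
    for name in (
        "part-0-0_batchno0-0-0-1580982686749.carbondata",
        "part-0-0_batchno0-0-0-1580982686749.zstd.carbondata",
        "part-00115-e2758995-4b10-4bd2-bf15-b4c176e587fe-c000.snappy.orc",
    ):
        print(name, "->", compressor_from_file_name(name))
```

Accepting both the old and the new pattern is one way to keep existing files readable; how PR3606 actually handles it is not shown here.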


Regards,
Jacky


------------------ Original Message ------------------
From: "Ajantha Bhat" <ajanthab...@gmail.com>
Sent: Thursday, February 6, 2020, 11:51 PM
To: "dev" <dev@carbondata.apache.org>

Subject: Re: Discussion: change default compressor to ZSTD



Hi,

33% is a huge reduction in store size. If the difference in load and query
time is negligible, we should definitely go for it.

Does the user really need to know which compression is used? A change
in the file name may need compatibility handling.
The thrift *FileHeader, ChunkCompressionMeta* already stores the compressor
name, so query-time decoding can be based on that.

Thanks,
Ajantha


On Thu, Feb 6, 2020 at 4:27 PM Jacky Li <jacky.li...@qq.com> wrote:

> Hi,
>
>
> I compared the snappy and zstd compressors using TPCH for carbondata.
>
>
> For TPCH lineitem table:
>               carbon-zstd   carbon-snappy
> loading (s)   53            51
> size          795MB         1.2GB
>
> TPCH-query:
>        carbon-zstd   carbon-snappy
> Q1     4.289         8.29
> Q2     12.609        12.986
> Q3     14.902        14.458
> Q4     6.276         5.954
> Q5     23.147        21.946
> Q6     1.12          0.945
> Q7     23.017        28.007
> Q8     14.554        15.077
> Q9     28.472        27.473
> Q10    24.067        24.682
> Q11    3.321         3.79
> Q12    5.311         5.185
> Q13    14.08         11.84
> Q14    2.262         2.087
> Q15    5.496         4.772
> Q16    29.919        29.833
> Q17    7.018         7.057
> Q18    17.367        17.795
> Q19    2.931         2.865
> Q20    11.347        10.937
> Q21    26.416        28.414
> Q22    5.923         6.311
> sum    283.844       290.704
>
>
> As you can see, with zstd the table size is 33% smaller than with
> snappy, and the difference in data loading and query time is negligible. So I
> suggest changing the default compressor in carbondata from snappy to zstd.
>
>
> To change the default compressor, we need to:
> 1. Append the compressor name to the carbondata file name, so that the
> user can tell from the file name which compressor is used.
> For example, the file name will change from
>   part-0-0_batchno0-0-0-1580982686749.carbondata
> to
>   part-0-0_batchno0-0-0-1580982686749.snappy.carbondata
> or
>   part-0-0_batchno0-0-0-1580982686749.zstd.carbondata
>
>
> 2. Change the compressor constant in CarbonCommonConstants.java to
> use zstd as the default compressor.
>
>
> What do you think?
>
>
> Regards,
> Jacky
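
The headline figures in the quoted benchmark can be rechecked with a few lines of arithmetic. This is only a sketch over the numbers posted above. It assumes 1.2GB means 1200MB (the thread does not say which convention was used), and the variable names are illustrative. Under that assumption zstd comes out roughly a third smaller, with about 4% longer load time and about 2% lower total query time.

```python
# Figures copied from the benchmark in this thread.
zstd_size_mb = 795.0
snappy_size_mb = 1.2 * 1000  # assumption: 1 GB = 1000 MB

zstd_load_s, snappy_load_s = 53.0, 51.0
zstd_query_sum, snappy_query_sum = 283.844, 290.704

# Relative change of zstd versus snappy (negative means zstd is lower).
size_reduction = 1 - zstd_size_mb / snappy_size_mb
load_overhead = zstd_load_s / snappy_load_s - 1
query_change = zstd_query_sum / snappy_query_sum - 1

print(f"size reduction:    {size_reduction:.1%}")
print(f"load time change:  {load_overhead:+.1%}")
print(f"query time change: {query_change:+.1%}")
```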
