Hi all, here I'd like to explain the modifications in 'Support Zstd as Column Compressor' (PR2628). Please share your feedback if you see any problems.
# BACKGROUND

Zstd is a compressor that achieves a higher compression ratio than Snappy while offering similar compression/decompression speed (decompression is slightly slower). It is already used in other products in our company and is regarded as a replacement for Snappy that trades a small slowdown in decompression for a noticeably higher compression ratio. So we want to introduce the Zstd compressor to compress the column values in the final CarbonData file. (The last clause is meant to distinguish it from the compressor used for sort temp files.)

# DESIGN & MODIFICATIONS

1. The metadata of the compressor for a column is stored in DataChunk3. CarbonData defines the compressor in thrift. Previously only Snappy was supported, so I
1.1 add Zstd to the thrift definition
1.2 add a ZstdCompressor and update the CompressorFactory (see the first two sketches after this list)

2. For data loading, before the load starts, CarbonData gets the compressor from the system property file and passes the compressor info on to the subsequent procedures, so that all pages in all blocklets of this load use the same compressor. This avoids inconsistencies if the property is changed concurrently during a load (see the third sketch after this list). For this modification, we
2.1 add the compressor info to CarbonLoadModel and CarbonFactDataHandlerModel
2.2 add the compressor as a member of ColumnPage
2.3 add the compressor as an input parameter when creating a ColumnPage

3. For data querying, CarbonData gets the compressor info from the DataChunk3 in the chunk, then uses that compressor to decompress the content (see the fourth sketch after this list). This means we
3.1 get the compressor from the dimension/measure chunk during reading

4. For other uses of the compressor, such as compressing the configuration, we keep using Snappy just as before. This means we
4.1 explicitly specify Snappy as the compressor there

5. The legacy store uses Snappy, so we simply
5.1 specify Snappy as the compressor while reading the legacy store

6. Streaming segments also compress their (streaming) blocklets. Because files in a streaming segment did not store the compressor info before, we (see the last sketch after this list)
6.1 add the compressor to the FileHeader in the thrift file
6.2 during loading to a streaming segment, if the stream file already exists, read the compressor info from the FileHeader of that file and reuse that compressor
6.3 if the stream file does not exist, read the compressor info from the system property and set it in the FileHeader
6.4 for the streaming legacy store, which has no compressor in the FileHeader, use Snappy to write & read the following streaming blocklets

7. Compaction and handoff reuse the read procedure, so no extra modification is needed for them. We still
7.1 add test cases for them; please refer to 'TestLoadDataWithCompression.scala'

8. Extending to other compressors is simple. Take LZ4 for example; the following changes would be required:
8.1 add LZ4 to the thrift definition
8.2 add an Lz4Compressor
8.3 add the Lz4Compressor to the compressor factory
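To make 1.2 concrete, below is a minimal sketch of what a ZstdCompressor could look like. The Compressor interface here is a simplified stand-in (the real one in carbondata-core has many more overloads for the different page data types), and the use of the zstd-jni library and the compression level are my assumptions, not necessarily what the PR does:

```java
import com.github.luben.zstd.Zstd;

// Simplified stand-in for CarbonData's Compressor interface; the real
// interface has many more overloads (short/int/long/float/double pages).
interface Compressor {
  String getName();
  byte[] compressByte(byte[] unCompInput);
  byte[] unCompressByte(byte[] compInput);
}

class ZstdCompressor implements Compressor {
  // Assumed compression level; the actual PR may pick a different one.
  private static final int COMPRESS_LEVEL = 3;

  @Override public String getName() {
    return "zstd";
  }

  @Override public byte[] compressByte(byte[] unCompInput) {
    return Zstd.compress(unCompInput, COMPRESS_LEVEL);
  }

  @Override public byte[] unCompressByte(byte[] compInput) {
    // The zstd frame header records the original size, so the exact
    // output buffer can be allocated up front.
    long decompressedSize = Zstd.decompressedSize(compInput);
    return Zstd.decompress(compInput, (int) decompressedSize);
  }
}
```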
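Next, a sketch of how the CompressorFactory lookup could work once more than one codec exists, building on the Compressor interface above. The registration-by-name pattern is also what makes step 8 (adding LZ4) a small change; all names here are illustrative:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative factory: maps a compressor name (as stored in the thrift
// metadata or read from the system property) to a singleton instance.
class CompressorFactory {
  private static final CompressorFactory INSTANCE = new CompressorFactory();
  private final Map<String, Compressor> compressors = new HashMap<>();

  private CompressorFactory() {
    // The real factory would also register the existing SnappyCompressor.
    register(new ZstdCompressor());
    // Step 8: adding LZ4 would just mean another register(new Lz4Compressor())
    // here, plus the thrift enum value and the Lz4Compressor class itself.
  }

  public static CompressorFactory getInstance() {
    return INSTANCE;
  }

  private void register(Compressor compressor) {
    compressors.put(compressor.getName(), compressor);
  }

  public Compressor getCompressor(String name) {
    Compressor compressor = compressors.get(name.toLowerCase(Locale.ROOT));
    if (compressor == null) {
      throw new IllegalArgumentException("Unsupported compressor: " + name);
    }
    return compressor;
  }
}
```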
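For step 2, the point of carrying the compressor in the load model is that the property is read exactly once per load. Here is a toy illustration of that pattern, with a stand-in for CarbonLoadModel and a plain JVM system property instead of the real property lookup; the key 'carbon.column.compressor' is my assumption:

```java
// Toy stand-in for CarbonLoadModel, just to show the
// "read once, pass along" pattern of step 2.
class LoadModelSketch {
  private String columnCompressor;

  void setColumnCompressor(String name) {
    this.columnCompressor = name;
  }

  String getColumnCompressor() {
    return columnCompressor;
  }
}

class LoadSetupSketch {
  static void prepareLoad(LoadModelSketch model) {
    // Read the configured compressor exactly once, at the start of the
    // load; a concurrent change of the property afterwards can no longer
    // make pages within this load use different compressors.
    String name = System.getProperty("carbon.column.compressor", "snappy");
    model.setColumnCompressor(name);
    // From here on, CarbonFactDataHandlerModel and every ColumnPage would
    // be created with model.getColumnCompressor(), never re-reading the
    // property.
  }
}
```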
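For step 3, the read path simply asks the factory for whatever compressor name was persisted in the chunk metadata. This reuses the Compressor and CompressorFactory sketches above; the parameter name is illustrative, since I have not shown the thrift-generated DataChunk3 class:

```java
class ChunkReaderSketch {
  // Illustrative read path for step 3: the decompressor is chosen per
  // chunk from the name written at load time, so a table whose loads
  // used different compressors still reads back correctly.
  static byte[] decompressChunk(String compressorNameFromDataChunk3,
      byte[] compressedData) {
    Compressor compressor = CompressorFactory.getInstance()
        .getCompressor(compressorNameFromDataChunk3);
    return compressor.unCompressByte(compressedData);
  }
  // Step 5 falls out of the same call: for a legacy store the reader
  // passes "snappy" explicitly instead of a name from DataChunk3.
}
```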
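Finally, a sketch of the decision in steps 6.2-6.4 for streaming segments. 'StreamFileHeaderSketch' stands in for the thrift FileHeader with the new field from 6.1; its accessor and the null-for-legacy convention are my assumptions:

```java
// Stand-in for the thrift FileHeader with the new compressor field (6.1).
class StreamFileHeaderSketch {
  private final String compressorName; // null for legacy streaming files

  StreamFileHeaderSketch(String compressorName) {
    this.compressorName = compressorName;
  }

  String getCompressorName() {
    return compressorName;
  }
}

class StreamCompressorResolver {
  static String resolve(StreamFileHeaderSketch existingHeader) {
    if (existingHeader == null) {
      // 6.3: the stream file does not exist yet - take the compressor
      // from the system property and persist it in the new FileHeader.
      return System.getProperty("carbon.column.compressor", "snappy");
    }
    String name = existingHeader.getCompressorName();
    // 6.4: legacy streaming files carry no compressor in the header, so
    // fall back to snappy for the following blocklets.
    // 6.2: otherwise reuse whatever the existing file was written with.
    return name == null ? "snappy" : name;
  }
}
```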
