Hi all,
I would like to propose supporting QATCodec [1] in CarbonData. The QAT Codec 
project provides a compression and decompression library for Apache 
Hadoop/Spark that makes use of Intel(R) QuickAssist Technology (abbrev. QAT) 
[2]. The project was open-sourced this year, as were its underlying native 
dependencies (QATZip), which users can install with a Linux package-management 
utility (e.g. Yum on CentOS). The project has two major benefits:
1) Wide ecosystem support
It supports Hadoop and Spark directly by implementing the Hadoop and Spark 
de/compression APIs, and it also provides patches to integrate with Parquet 
and ORC-Hive.
2) High performance and space efficiency
We measured the performance and compression ratio of QATCodec against Snappy 
in different workloads.
For the sort workload on MapReduce (input, intermediate data, and output all 
compression-enabled; 3 TB data scale; 5 workers; 2 replicas for data), 
QATCodec brings a 7.29% performance gain and a 7.5% better compression ratio. 
For the sort workload on Spark (input and intermediate data 
compression-enabled; 3 TB data scale), it brings a 14.3% performance gain and 
a 7.5% better compression ratio. We also measured Hive on MR with the TPCx-BB 
workload [3] (3 TB data scale), where it brings a 12.98% performance gain and 
a 13.65% better compression ratio.
Regarding the hardware requirement, the current implementation supports 
falling back to a software implementation when no QAT device is present.
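The fallback idea can be sketched roughly as follows. This is a minimal 
sketch, not QATCodec's actual code: the device-path probe and class name are 
assumptions, and java.util.zip's Deflater/Inflater stand in for the software 
codec that would be used when the QAT hardware path is unavailable.

```java
import java.io.ByteArrayOutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch of a codec that probes for QAT hardware and otherwise
// falls back to a pure-software implementation.
public class FallbackCodec {
    // Hypothetical probe; the real library detects the device differently.
    private static final String QAT_DEVICE_PATH = "/dev/qat_dev_processes";

    static boolean qatAvailable() {
        return Files.exists(Paths.get(QAT_DEVICE_PATH));
    }

    static byte[] compress(byte[] input) throws Exception {
        if (qatAvailable()) {
            // Real implementation would call into QATZip via JNI here.
            // This sketch always takes the software path below.
        }
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] input) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf));
        }
        inflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "hello hello hello hello".getBytes("UTF-8");
        byte[] roundTrip = decompress(compress(data));
        System.out.println(new String(roundTrip, "UTF-8"));
    }
}
```

The point of the pattern is that the caller sees one codec interface, and the 
hardware/software decision is made once at codec-creation time rather than on 
every call.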
CarbonData currently supports two compression codecs, Snappy and Zstd. I 
think an extra compression option with hardware acceleration would be a 
benefit to users.
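For context, users select the codec today via the carbon.column.compressor 
property; a QAT-backed codec would presumably register under a similar name. 
The "qat" value below is hypothetical, not an existing option:

```properties
# carbon.properties: current options
carbon.column.compressor=snappy    # or zstd
# a QAT-backed codec might look like (hypothetical):
# carbon.column.compressor=qat
```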

Please feel free to share your comments on this proposal.


[1] https://github.com/intel-hadoop/QATCodec
[2] https://01.org/zh/intel-quickassist-technology
[3] http://www.tpc.org/tpcx-bb/default.asp

Best Regards
Ferdinand Xu
