Peng Wang created ORC-1356:
------------------------------

             Summary: Use Intel AVX-512 instructions to accelerate the 
Rle-bit-packing decode
                 Key: ORC-1356
                 URL: https://issues.apache.org/jira/browse/ORC-1356
             Project: ORC
          Issue Type: Improvement
          Components: C++, ORCv2, RLE
    Affects Versions: master
            Reporter: Peng Wang


The original ORC Rle-bit-packing decoder decodes values one at a time. Intel 
AVX-512 provides 512-bit vector operations that can accelerate this decode 
process: far fewer CPU instructions are needed to unpack the same amount of 
data, so the AVX-512 vector decode performs much better than the original. In 
the functional micro-performance test, I expect the AVX-512 vector decode could 
bring an average 6X ~ 7X latency improvement when comparing the vector function 
unrolledUnpackVectorN with the original Rle-bit-packing decode function 
plainUnpackLongs. In the real world, users store large amounts of data in the 
ORC format and need to decode hundreds or thousands of bytes at a time, so the 
AVX-512 vector decode will be more efficient and help speed up this 
processing.
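
To illustrate what "one at a time" decoding means, here is a minimal scalar 
sketch. It is not the ORC implementation: the name unpackScalar, its signature, 
and the most-significant-bit-first layout are assumptions made for this 
example only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar decoder (not ORC's plainUnpackLongs): reads `count`
// values of `bitWidth` bits each (1..32), assumed to be packed
// most-significant-bit first, and decodes them one value at a time.
std::vector<uint64_t> unpackScalar(const std::vector<uint8_t>& in,
                                   uint32_t bitWidth, size_t count) {
  std::vector<uint64_t> out;
  out.reserve(count);
  size_t bytePos = 0;     // index of the next byte to load
  uint32_t bitsLeft = 0;  // unread bits remaining in `current`
  uint8_t current = 0;
  for (size_t i = 0; i < count; ++i) {
    uint64_t value = 0;
    uint32_t need = bitWidth;
    while (need > 0) {
      if (bitsLeft == 0) {
        if (bytePos >= in.size()) return out;  // input exhausted
        current = in[bytePos++];
        bitsLeft = 8;
      }
      uint32_t take = need < bitsLeft ? need : bitsLeft;
      // Extract the top `take` unread bits of the current byte.
      uint32_t shift = bitsLeft - take;
      uint64_t chunk = (current >> shift) & ((1u << take) - 1u);
      value = (value << take) | chunk;
      bitsLeft -= take;
      need -= take;
    }
    out.push_back(value);
  }
  return out;
}
```

Every value costs at least one loop iteration with shifts and masks, which is 
the per-value work a 512-bit vector path can amortize across many values.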

 

In practice, values stored in ORC are usually less than 32 bits wide, so this 
PR supplies the vector decode code for value sizes of less than 32 bits. For 
values that are 8 bits, 16 bits, or another multiple of 8 bits wide, the 
performance improvement will be relatively small compared with value sizes 
that are not a multiple of 8 bits.

 

Official Intel intrinsics guide (covers the AVX-512 instructions):

https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html

 

1. Added a CMake option named "ENABLE_AVX512_BIT_PACKING" to turn this feature 
on or off at build time.

The default value of ENABLE_AVX512_BIT_PACKING is OFF.

For example, cmake .. -DCMAKE_CXX_FLAGS="-mavx512vbmi -march=native" 
-DCMAKE_BUILD_TYPE=debug -DBUILD_JAVA=OFF -DENABLE_AVX512_BIT_PACKING=ON 
-DSNAPPY_HOME=/usr/local

2. Added the macro "ENABLE_AVX512" to control whether this feature's code is 
built into ORC.

3. Added the function "detect_platform" to dynamically detect whether the 
current platform supports AVX-512. When ORC is built with AVX-512 enabled but 
the platform it runs on doesn't support AVX-512, it will use the original 
bit-packing decode function instead of the AVX-512 vector decode.

4. Added the functions "unrolledUnpackVectorN" to decode N-bit values in place 
of the original functions plainUnpackLongs and unrolledUnpackN.

5. Added the test cases "RleV2_basic_vector_decode_Nbit" to verify the AVX-512 
vector decode of N-bit values, in the new test file TestRleVectorDecoder.cc.

6. Modified the function plainUnpackLongs by adding an output parameter 
uint64_t& startBit. This parameter stores the number of bits left over after 
unpacking.

7. The AVX-512 vector decode processes 512 bits of data in each unpacking 
step. If the data being unpacked is long enough, almost all of it can be 
processed by AVX-512. If the data length (or block size) is shorter than 512 
bits, AVX-512 is not used, and decoding falls back to the original path that 
unpacks values one by one.
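
The runtime fallback in item 3 and the 512-bit threshold in item 7 can be 
sketched together as follows. This is a simplified illustration, not the code 
in the PR: cpuHasAvx512, UnpackPlan and planUnpack are hypothetical names, the 
check uses the GCC/Clang builtin __builtin_cpu_supports with only the 
"avx512f" feature (the real detect_platform may test other feature bits), and 
the split assumes values are counted only when they fit in whole 512-bit 
blocks.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a runtime platform check. On non-x86 targets or
// other compilers it conservatively reports no AVX-512 support.
inline bool cpuHasAvx512() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
  return __builtin_cpu_supports("avx512f") != 0;
#else
  return false;
#endif
}

struct UnpackPlan {
  size_t vectorValues;  // values handled by the 512-bit vector path
  size_t scalarValues;  // values left for the one-by-one path
};

// Split `count` values of `bitWidth` bits into the part covered by whole
// 512-bit blocks and a scalar remainder. If AVX-512 is unavailable or the
// input is shorter than 512 bits, everything goes to the scalar path.
inline UnpackPlan planUnpack(size_t count, uint32_t bitWidth, bool useAvx512) {
  const size_t kBlockBits = 512;
  const size_t totalBits = count * bitWidth;
  if (!useAvx512 || bitWidth == 0 || totalBits < kBlockBits) {
    return {0, count};
  }
  const size_t blocks = totalBits / kBlockBits;
  const size_t vectorValues = (blocks * kBlockBits) / bitWidth;
  return {vectorValues, count - vectorValues};
}
```

For example, planUnpack(1000, 8, cpuHasAvx512()) would assign 960 values to 
the vector path and 40 to the scalar path on a machine where the check 
succeeds, and all 1000 to the scalar path otherwise.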



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
