Thanks Gopal and Xiening for your input.
The main purpose of this discussion is to pick the "best" encoding choice on 
behalf of end users. "Best" here means a decent compression ratio with high 
throughput, as measured in my initial benchmark. The good news is that we are 
now focusing on SPLIT and FLIP instead of the original FIVE encodings, while 
the bad news is that we still have TWO encodings to choose between. As Gopal 
mentioned, each encoding wins on one main aspect: 1) FLIP has the better 
compression ratio; 2) SPLIT has better memory bandwidth for ZLIB. 

(1) higher compression ratio
> There's more disk savings that can come out of FLIP.
Agreed, as shown in my original results, but at the cost of throughput. 
Additionally, SPLIT already gets some compression from its RLE. As shown in my 
previous results, enabling a general-purpose codec on top of it brings only a 
little extra compression ratio at a considerable cost in throughput.

(2) better memory bandwidth
> SPLIT is very memory bandwidth friendly and is probably the best format to 
> cache in-memory, because it doesn't explode in size when Zlib decompressed 
> into a buffer.
Does this also benefit other compression codecs (e.g. ZSTD, LZO)? For ZLIB 
data it may bring some benefit, as shown in my original results, but not for 
all of the other compression codecs.

Considering this, we could provide two options for the double encoding 
configuration: 1) a throughput-friendly encoding - FLIP; 2) a 
compression-friendly encoding - SPLIT.
Any thoughts on this?

Thanks
Ferdinand Xu

-----Original Message-----
From: Xiening Dai [mailto:[email protected]] 
Sent: Monday, March 26, 2018 11:07 AM
To: Gopal Vijayaraghavan <[email protected]>; [email protected]; 
[email protected]
Subject: Re: ORC double encoding optimization proposal

Interesting discussion. Thanks Gopal and Cheng.

One drawback I can see with SPLIT is the fragmented IO pattern. Since SPLIT 
creates two separate streams, reading one data batch needs an additional seek 
to reconstruct the column data. This creates extra burden for clusters with 
very busy IO.

________________________________
From: Gopal Vijayaraghavan <[email protected]>
Sent: Monday, March 19, 2018 3:45 PM
To: [email protected]; [email protected]
Subject: Re: ORC double encoding optimization proposal

> existing work [1] from Teddy Choi and Owen O'Malley with some new compression 
> codecs (e.g. ZSTD and Brotli), we proposed to promote FLIP as the default 
> encoding for ORC double type to move this feature forwards.

Since we're discussing these, I'm going to summarize my existing notes on this, 
before you conclude.

FLIP & SPLIT are the two best algorithms from different ends of the spectrum & 
they have their own strengths.

FLIP was designed with Intel C++ code in mind; while the Java implementation 
is somewhat slower today, the C++ impl should be very fast.

In an ideal world, the entire FLIP should unroll into a single instruction - 
PSHUFB (AVX512 will be able to unpack an 8x8x8 matrix; this is common on many 
platforms due to the similarity to RGBA data transforms).
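For readers who haven't looked at the transform itself: FLIP is essentially a byte-level transpose of a block of doubles (byte 0 of every value gathered into one run, then byte 1, and so on), which is why it maps so naturally onto a shuffle instruction like PSHUFB. A minimal round-trip sketch in Python - function names and the encoding of the block are illustrative, not the ORC implementation:

```python
import struct

def flip_encode(values):
    """Byte-transpose a block of doubles: all 0th bytes first,
    then all 1st bytes, etc. (illustrative, not the ORC impl)."""
    raw = b"".join(struct.pack("<d", v) for v in values)
    n = len(values)
    # raw[i*8 + b] is byte b of value i; output groups by b.
    return bytes(raw[i * 8 + b] for b in range(8) for i in range(n))

def flip_decode(data):
    """Invert the transpose and reassemble the doubles."""
    n = len(data) // 8
    raw = bytes(data[b * n + i] for i in range(n) for b in range(8))
    return [struct.unpack_from("<d", raw, i * 8)[0] for i in range(n)]

vals = [1.0, 1.5, 2.0, 2.5]
assert flip_decode(flip_encode(vals)) == vals
```

The transpose clusters the high-order sign/exponent bytes - nearly constant for similarly-scaled data - into long runs that a codec like Zlib or Zstd matches easily, which is where the disk savings come from.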

At some point, we'll be able to rewrite it using Long8 native types (assuming 
JIT can understand a shuffle op).

http://hg.openjdk.java.net/panama/panama/jdk/file/tip/src/java.base/share/classes/java/lang/Long8.java#l16

Here are the tools to run your own data through to determine if FLIP will work 
for you (the byteuniq UDF).

https://github.com/t3rmin4t0r/hive-dq-tools

I haven't run HEPMASS through that script, but you can see the bit level has 
even neater entropy skews than the whole byte, though the byte packing will 
still offer enough dictionary items.

https://github.com/t3rmin4t0r/zlib-zoo

shows how the LZ77 in Zlib picks its matches, mostly detecting the 7-byte 
patterns instead of the 8-byte patterns, which are definitely common enough 
(we could have much tighter symbol detection in LZ77, though I'm more 
interested in poking at the Zstd search depth now).

There's more disk savings that can come out of FLIP.

> It's compression friendly unlike Split and FPC.

SPLIT is very memory bandwidth friendly and is probably the best format to 
cache in-memory, because it doesn't explode in size when Zlib decompressed into 
a buffer.
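As a rough illustration of why the SPLIT form stays compact in memory: it decomposes each double into separate streams (per this thread, two of them) that are each RLE-friendly, so what you cache is already small rather than a Zlib-inflated buffer. A hypothetical sketch of one such decomposition - splitting at the sign+exponent / mantissa boundary of the IEEE 754 bit layout; the field split and names are illustrative, not the actual ORC SPLIT stream layout:

```python
import struct

def split_encode(values):
    """Split doubles into two streams: sign+exponent (top 12 bits)
    and mantissa (low 52 bits). Illustrative only - not the
    actual ORC SPLIT layout."""
    exps, mants = [], []
    for v in values:
        bits = struct.unpack("<Q", struct.pack("<d", v))[0]
        exps.append(bits >> 52)            # sign + 11-bit exponent
        mants.append(bits & ((1 << 52) - 1))
    return exps, mants

def split_decode(exps, mants):
    """Reassemble the two streams back into doubles."""
    out = []
    for e, m in zip(exps, mants):
        bits = (e << 52) | m
        out.append(struct.unpack("<d", struct.pack("<Q", bits))[0])
    return out

vals = [3.14, 2.71, 1.41]
e, m = split_encode(vals)
assert split_decode(e, m) == vals
```

For similarly-scaled values the exponent stream is nearly constant, so RLE alone captures most of that redundancy before any general-purpose codec runs - and the cached representation never has to be inflated back to 8 bytes per value until decode.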

SPLIT+LLAP cache is likely to be faster than FLIP+LLAP cache, purely from the 
memory bandwidth needs of the loops & the cache overflow rate of FLIP (hitting 
the cache saves the entire Zlib CPU cost, which is roughly 30%).

The core perf issue with the SPLIT algorithm is that it doesn't decompose 
neatly at the bit-level in the java memory model - the current loop has a lot 
of room for improvement.

Basically, right now there are at least 3 branches for SPLIT and 1 for FLIP - 
nextNumber() is assembling values 1 at a time instead of 8 at a time.

Purely speaking from a decode loop perspective, we have a lot of performance 
improvements to be made in the SPLIT algorithm which are pending - that should 
ideally come as part of RLEv3 & indirectly make the SPLIT reader faster.

With an 8x unrolled impl, SPLIT is going to catch up in the total decode rate 
& I started ORC-187 after digging into some of those branching loops.
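To make the unrolling point concrete, here is a toy contrast between a nextNumber()-style loop that reassembles one value per iteration and a batch loop that reassembles 8 per outer iteration. This is purely illustrative Python (the real win in the Java reader comes from eliminating data-dependent branches so the JIT can unroll and vectorize the fixed-count inner loop); the reassembly step borrows the hypothetical exponent/mantissa split from earlier in this thread:

```python
def decode_scalar(exps, mants):
    """nextNumber()-style: reassemble one value per iteration."""
    return [(e << 52) | m for e, m in zip(exps, mants)]

def decode_unrolled(exps, mants):
    """Batch 8 values per outer iteration; the fixed-count inner
    loop has no data-dependent branches, so a compiler/JIT can
    unroll or vectorize it. The tail loop handles n % 8 leftovers."""
    n = len(exps)
    out = [0] * n
    i = 0
    while i + 8 <= n:
        for j in range(8):
            out[i + j] = (exps[i + j] << 52) | mants[i + j]
        i += 8
    while i < n:
        out[i] = (exps[i] << 52) | mants[i]
        i += 1
    return out

exps = [1023] * 10
mants = list(range(10))
assert decode_scalar(exps, mants) == decode_unrolled(exps, mants)
```

Both produce identical output; the structural difference - hoisting the per-value branching out of the hot path - is what an 8x unrolled SPLIT reader would exploit.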

For the other two next() calls, this is the equivalent unrolling that was done 
by Prasanth with Integers.

https://www.slideshare.net/t3rmin4t0r/orc-2015/23

The Long -> Double conversion similarly has to do more register work instead 
of calling nextNumber() one value at a time. And I like SPLIT because it is 
very natural in its implementation & therefore easier to parallelize than FPC.

The current numbers aren't the final numbers by a long shot, but FLIP and SPLIT 
are the ones where I feel like more work is useful.

Cheers,
Gopal

