Re: ORC double encoding optimization proposal

Gopal Vijayaraghavan Mon, 26 Mar 2018 12:33:21 -0700

> the bad thing is that we still have TWO encodings to discuss. 

Two is exactly what we need, not five. That matches the existing ORC config:


hive.exec.orc.encoding.strategy=[SPEED, COMPRESSION];

FLIP8 was my original suggestion to Teddy from the byteuniq UDF runs, though 
the regressions in compression over PlainV2 are still bothering me (which is 
why I went digging into the Zlib dictionary builder impl with infgen).

All comparisons below are for size, measured against PlainV2.

For Zlib, this is pretty bad for FLIP.

ZLIB:HIGGS Regressing on FLIP by 6 points
ZLIB:DISCOUNT_AMT Regressing on FLIP by 10 points
ZLIB:IOT_METER Regressing on FLIP by 32 points
ZLIB:LIST_PRICE Regressing on FLIP by 36 points
ZLIB:PHONE Regressing on FLIP by 50 points

With Zlib, SPLIT has no size regressions.

With ZSTD, SPLIT has a couple of regressions in size

ZSTD:DISCOUNT_AMT Regressing on FLIP by 5 points
ZSTD:IOT_METER Regressing on FLIP by 17 points
ZSTD:HIGGS Regressing on FLIP by 18 points
ZSTD:LIST_PRICE Regressing on FLIP by 30 points
ZSTD:PHONE Regressing on FLIP by 55 points

ZSTD:HIGGS Regressing on SPLIT by 10 points
ZSTD:PHONE Regressing on SPLIT by 3 points

but FLIP still has more size regressions there, and bigger ones.
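For anyone who wants to reproduce this kind of size comparison, here is a minimal sketch of the arithmetic. It rests on two assumptions that are not stated in this thread: that "points" means the size change as a percentage of the PlainV2 baseline, and that a byte-transposed layout is an acceptable stand-in for a candidate encoding. `transpose_bytes` is a hypothetical placeholder, not the FLIP or SPLIT implementation, and the sample values are made up.

```python
import struct
import zlib


def pack_doubles(values):
    """Pack floats as plain little-endian IEEE 754 doubles (the baseline layout)."""
    return struct.pack(f"<{len(values)}d", *values)


def transpose_bytes(raw, width=8):
    """Hypothetical stand-in encoding: group byte 0 of every value, then byte 1, ...

    This is NOT FLIP or SPLIT; it only gives the comparison something to measure.
    """
    return b"".join(raw[i::width] for i in range(width))


def compressed_size(raw, level=6):
    """Number of bytes zlib needs for the given byte layout."""
    return len(zlib.compress(raw, level))


def regression_points(baseline_size, candidate_size):
    """Size change as a percentage of the baseline (assumed meaning of "points").

    Positive means the candidate is larger, i.e. a size regression.
    """
    return 100.0 * (candidate_size - baseline_size) / baseline_size


# Made-up price-like column, only to exercise the functions.
values = [round(0.99 + 0.25 * (i % 400), 2) for i in range(10_000)]
plain = pack_doubles(values)
candidate = transpose_bytes(plain)

baseline_size = compressed_size(plain)
candidate_size = compressed_size(candidate)
print(f"{regression_points(baseline_size, candidate_size):+.1f} points vs. baseline")
```

Swapping `zlib.compress` for another codec and `transpose_bytes` for a real encoder would give one row per codec and column, in the same shape as the lists above.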

I'm continuing to mess with both algorithms, but I have wider problems to fix 
in FLIP & at a lower algorithm level than in SPLIT.

Cheers,
Gopal
