> the bad thing is that we still have TWO encodings to discuss.

Two is exactly what we need, not five - from the existing ORC configs

hive.exec.orc.encoding.strategy=[SPEED, COMPRESSION];

FLIP8 was my original suggestion to Teddy from the byteuniq UDF runs, though the regressions in compression over PlainV2 are still bothering me (which is why I went digging into the Zlib dictionary builder impl with infgen).

All comparisons below are for size & against PlainV2.

For Zlib, this is pretty bad for FLIP.

ZLIB:HIGGS Regressing on FLIP by 6 points
ZLIB:DISCOUNT_AMT Regressing on FLIP by 10 points
ZLIB:IOT_METER Regressing on FLIP by 32 points
ZLIB:LIST_PRICE Regressing on FLIP by 36 points
ZLIB:PHONE Regressing on FLIP by 50 points

SPLIT has no size regressions.

With ZSTD, SPLIT has a couple of regressions in size

ZSTD:DISCOUNT_AMT Regressing on FLIP by 5 points
ZSTD:IOT_METER Regressing on FLIP by 17 points
ZSTD:HIGGS Regressing on FLIP by 18 points
ZSTD:LIST_PRICE Regressing on FLIP by 30 points
ZSTD:PHONE Regressing on FLIP by 55 points
ZSTD:HIGGS Regressing on SPLIT by 10 points
ZSTD:PHONE Regressing on SPLIT by 3 points

but FLIP still has more size regressions & big ones there.

I'm continuing to mess with both algorithms, but I have wider problems to fix in FLIP & at a lower algorithm level than in SPLIT.

Cheers,
Gopal
