Re: Column Compression and Encoding

Todd Lipcon Tue, 08 May 2018 09:10:30 -0700

Hi Saeid,

We've tried to make the default compression/encoding a reasonable tradeoff
of performance for most common workloads. A couple quick tips I've found
from my experiments:


- high-cardinality strings won't be automatically compressed by
dictionary encoding. So, if you have a high-cardinality string column whose
values might share repeated substrings (e.g. a set of URLs), then enabling
LZ4 compression is a good idea.
- if you have strings with a lot of common prefixes, you might consider
PREFIX_ENCODING.
- for integer types, choose the smallest size that fits your intended
range, e.g. don't use int64 for storing a customer's age. On disk it will
compress to about the same size, but in memory the larger type will use a
lot more space.
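
As a rough illustration of that last point, here is a minimal Python sketch.
It is not Kudu code: the `array` module and `zlib` are only stand-ins
(assumptions for the sake of the example) for in-memory column storage and
on-disk compression, and the sample data is synthetic, so the exact numbers
will differ from what Kudu produces.

```python
import array
import random
import zlib

# Synthetic "customer age" column: 100,000 values in the range 0..100.
rng = random.Random(0)
ages = [rng.randint(0, 100) for _ in range(100_000)]

# Store the same values in an 8-bit and a 64-bit integer array.
narrow = array.array("b", ages)  # signed 8-bit, 1 byte per value
wide = array.array("q", ages)    # signed 64-bit, 8 bytes per value

# Uncompressed (in-memory) footprint of each array's data.
narrow_mem = len(narrow) * narrow.itemsize
wide_mem = len(wide) * wide.itemsize

# Compressed footprint, using zlib as a stand-in for a real column codec.
narrow_disk = len(zlib.compress(narrow.tobytes()))
wide_disk = len(zlib.compress(wide.tobytes()))

print(f"in memory:  8-bit={narrow_mem} bytes, 64-bit={wide_mem} bytes")
print(f"compressed: 8-bit={narrow_disk} bytes, 64-bit={wide_disk} bytes")
```

The 64-bit array always takes 8x the memory of the 8-bit one, while the
padding bytes in the wider type are highly redundant and compress well, so
the gap after compression is much smaller than the gap in memory.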

Perhaps others can jump in with further recommendations based on experience.

-Todd

On Mon, May 7, 2018 at 1:45 AM, Saeid Sattari <saeid.satt...@gmail.com>
wrote:

> Hi all,
>
> Folks who have used the column compression and encoding in Kudu tables:
> can you share your experiences with the performance? What types of fields
> are worse/better to compress (IO bottleneck vs. query return time, etc.)? We
> can collect a knowledge base regarding these subjects that users can use in
> the future. Thanks.
>
> Regards,
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
