[ https://issues.apache.org/jira/browse/KUDU-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881716#comment-16881716 ]
Todd Lipcon commented on KUDU-2888:
-----------------------------------

Attached a little test file I wrote. NOTE: it has some weirdness where it gets
the compression ratio of bitshuffle off by four, and maybe there are some perf
problems too (didn't spend a lot of time on it).

> Better encoding for dictionary code-words
> -----------------------------------------
>
>                 Key: KUDU-2888
>                 URL: https://issues.apache.org/jira/browse/KUDU-2888
>             Project: Kudu
>          Issue Type: Bug
>          Components: cfile, perf
>            Reporter: Todd Lipcon
>            Priority: Major
>         Attachments: codec-test.py
>
>
> Currently we use bitshuffle for all ints, including dictionary codewords. For
> dictionary codewords, we know the maximum possible value up-front, and we
> also know that the ints will be non-negative and small. This set of
> constraints makes it much better to use a specialized bitpacking algorithm
> rather than a more generic compression like bitshuffle+lz4. Based on some
> quick experiments I ran, we can probably get a several-fold decoding speedup
> with no loss of compression by switching to a codec like simdbitpacking for
> these codewords.
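
For readers unfamiliar with the idea in the description, here is a minimal scalar sketch of fixed-width bitpacking in Python. It is not the attached codec-test.py, not Kudu's cfile code, and not the simdbitpacking library; the function names (bits_needed, pack, unpack) are made up for illustration. It only shows why knowing the maximum codeword up-front helps: every codeword can be stored in the same small number of bits, with no general-purpose compression pass.

```python
from typing import List, Sequence


def bits_needed(max_value: int) -> int:
    """Bits required to store any integer in [0, max_value]."""
    if max_value < 0:
        raise ValueError("codewords must be non-negative")
    return max(1, max_value.bit_length())


def pack(codewords: Sequence[int], bit_width: int) -> bytes:
    """Pack codewords back to back, bit_width bits each, little-endian."""
    acc = 0
    for i, cw in enumerate(codewords):
        if cw < 0 or cw >> bit_width:
            raise ValueError(f"codeword {cw} does not fit in {bit_width} bits")
        acc |= cw << (i * bit_width)
    nbytes = (len(codewords) * bit_width + 7) // 8
    return acc.to_bytes(nbytes, "little")


def unpack(data: bytes, bit_width: int, count: int) -> List[int]:
    """Inverse of pack(): recover `count` codewords of bit_width bits each."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bit_width) - 1
    return [(acc >> (i * bit_width)) & mask for i in range(count)]


# Hypothetical example: a dictionary with 1000 entries has codewords 0..999,
# so each codeword fits in 10 bits instead of a full 32-bit int.
codes = [0, 999, 17, 512, 3, 3, 998, 1]
width = bits_needed(999)
packed = pack(codes, width)
restored = unpack(packed, width, len(codes))
```

In this example the eight codewords occupy 10 bytes packed, versus 32 bytes as 32-bit ints. A SIMD codec applies the same layout idea to many values per instruction, which is where the decoding speedup mentioned above would come from; this scalar version makes no performance claims.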