[ https://issues.apache.org/jira/browse/ARROW-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated ARROW-6933: ---------------------------------- Labels: pull-request-available (was: ) > [Java] Suppor linear dictionary encoder > --------------------------------------- > > Key: ARROW-6933 > URL: https://issues.apache.org/jira/browse/ARROW-6933 > Project: Apache Arrow > Issue Type: New Feature > Components: Java > Reporter: Liya Fan > Assignee: Liya Fan > Priority: Major > Labels: pull-request-available > > For many scenarios, the distribution of dictionary entries is highly skewed. > In other words, a few dictionary entries occurs much more frequently than > others. If we can sort the dictionary by the non-increasing order of entry > frequencies, and compare each value to encode from the beginning of the > dictionary, we get the following benefits: > 1) We need no extra memory space or data structure. > 2) The search is extremely efficient, as we are likely to find a match > in the first few entries of the dictionary. > This is the basic idea behind the linear dictionary encoder. When the > scenario is right (highly skewed dictionary distribution), it outperforms > both search based encoder and hash table based encoders. -- This message was sent by Atlassian Jira (v8.3.4#803005)