Sven Cattell created ARROW-18090:
------------------------------------

             Summary: Dictionary Style array for Keywords or Tags 
                 Key: ARROW-18090
                 URL: https://issues.apache.org/jira/browse/ARROW-18090
             Project: Apache Arrow
          Issue Type: New Feature
            Reporter: Sven Cattell


I want to efficiently encode a list of tags for each record in my database. In 
my case there are 30 distinct tags, and a few are assigned to each of my 
~20 million records. Here's a simplified example of 5 records:
 * pe, keylogger, cryptojack
 * pe, packed
 * pe, cryptojack, c2
 * pe, keylogger, c2
 * pe

Right now I have to store these in a List<Utf8>, which duplicates a huge amount 
of string data. The dictionary array looks almost perfect for this task. I just 
want a dictionary's indices to be allowed to be a List<T> of a primitive index 
type T, instead of only T itself.
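
To make the request concrete, here is a minimal sketch of the layout I have in mind, applied to the 5 records above. It is plain Python using only the standard library, not Arrow API: the names encode_tag_lists and decode_row are hypothetical, and the int8 index and int32 offset widths are assumptions that happen to fit 30 tags.

```python
from array import array

# The five example records from this issue.
records = [
    ["pe", "keylogger", "cryptojack"],
    ["pe", "packed"],
    ["pe", "cryptojack", "c2"],
    ["pe", "keylogger", "c2"],
    ["pe"],
]


def encode_tag_lists(rows):
    """Encode lists of strings as (dictionary, indices, offsets).

    Hypothetical helper for illustration only; not an Arrow function.
    """
    dictionary = []            # each distinct tag stored once, first-seen order
    positions = {}             # tag -> its index in `dictionary`
    indices = array("b")       # signed 8-bit indices (assumes <= 128 distinct tags)
    offsets = array("i", [0])  # row i spans indices[offsets[i]:offsets[i + 1]]
    for row in rows:
        for tag in row:
            if tag not in positions:
                positions[tag] = len(dictionary)
                dictionary.append(tag)
            indices.append(positions[tag])
        offsets.append(len(indices))
    return dictionary, indices, offsets


def decode_row(dictionary, indices, offsets, i):
    """Rebuild the tag list of row i from the encoded buffers."""
    return [dictionary[j] for j in indices[offsets[i]:offsets[i + 1]]]


dictionary, indices, offsets = encode_tag_lists(records)
```

With this encoding each tag string is stored once, and every record costs one small integer per tag plus one offset, rather than a repeated copy of each string.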

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
