[ https://issues.apache.org/jira/browse/ARROW-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918629#comment-16918629 ]
Antoine Pitrou commented on ARROW-3408:
---------------------------------------

[~wesmckinn] Are chunked dictionary arrays still supposed to have the same dictionary for all chunks?

> [C++] Add option to CSV reader to dictionary encode individual columns or all string / binary columns
> -----------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-3408
>                 URL: https://issues.apache.org/jira/browse/ARROW-3408
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>              Labels: csv, dataset
>             Fix For: 1.0.0
>
> For many datasets, dictionary encoding everything can result in drastically lower memory usage and, consequently, better performance when doing analytics.
>
> One difficulty of dictionary encoding in multithreaded conversions is that ideally you end up with a single dictionary at the end. So you have two options:
> * Implement a concurrent hashing scheme -- for low-cardinality dictionaries the overhead associated with mutex contention will not be meaningful; for high cardinality it can be more of a problem (sketched below)
> * Hash each chunk separately, then normalize at the end (sketched below)
>
> My guess is that a crude concurrent hash table with a mutex to protect mutations and resizes is going to outperform the latter.
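
As a rough illustration of the first option, here is a minimal sketch (hypothetical names, not Arrow's actual implementation) of a shared dictionary protected by a single mutex: worker threads encode their chunks against the same table, so every chunk ends up indexing into one dictionary.

{code:cpp}
// Sketch of option 1: one shared dictionary, guarded by a mutex.
// All names here (ConcurrentDictionary, GetOrInsert) are illustrative only.
#include <cstdint>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

class ConcurrentDictionary {
 public:
  // Return the code for `value`, inserting it if unseen.
  // The mutex protects both insert-on-lookup and table resizes.
  int32_t GetOrInsert(const std::string& value) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = index_.find(value);
    if (it != index_.end()) return it->second;
    int32_t code = static_cast<int32_t>(values_.size());
    values_.push_back(value);
    index_.emplace(value, code);
    return code;
  }

  std::vector<std::string> values() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return values_;
  }

 private:
  mutable std::mutex mutex_;
  std::unordered_map<std::string, int32_t> index_;
  std::vector<std::string> values_;
};

int main() {
  ConcurrentDictionary dict;
  std::vector<std::vector<std::string>> chunks = {
      {"red", "green", "red"}, {"blue", "green", "blue"}};
  std::vector<std::vector<int32_t>> encoded(chunks.size());

  // One worker per chunk; all workers share the same dictionary.
  std::vector<std::thread> workers;
  for (size_t i = 0; i < chunks.size(); ++i) {
    workers.emplace_back([&, i] {
      for (const auto& v : chunks[i]) encoded[i].push_back(dict.GetOrInsert(v));
    });
  }
  for (auto& t : workers) t.join();

  // Every chunk's indices now refer to this single dictionary.
  for (const auto& v : dict.values()) std::cout << v << " ";
  std::cout << "\n";
  return 0;
}
{code}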
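And a corresponding sketch of the second option (again with illustrative names only): each chunk is encoded independently against its own local dictionary, with no locking, and a final normalization pass merges the local dictionaries and remaps each chunk's indices into the merged one.

{code:cpp}
// Sketch of option 2: per-chunk dictionaries, unified at the end.
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct EncodedChunk {
  std::vector<std::string> dictionary;  // local dictionary for this chunk
  std::vector<int32_t> indices;         // codes into `dictionary`
};

// Encode one chunk with its own dictionary; safe to run in parallel.
EncodedChunk EncodeChunk(const std::vector<std::string>& values) {
  EncodedChunk chunk;
  std::unordered_map<std::string, int32_t> index;
  for (const auto& v : values) {
    auto [it, inserted] =
        index.emplace(v, static_cast<int32_t>(chunk.dictionary.size()));
    if (inserted) chunk.dictionary.push_back(v);
    chunk.indices.push_back(it->second);
  }
  return chunk;
}

// Normalize: build one merged dictionary and rewrite every chunk's indices.
std::vector<std::string> Unify(std::vector<EncodedChunk>* chunks) {
  std::vector<std::string> merged;
  std::unordered_map<std::string, int32_t> index;
  for (auto& chunk : *chunks) {
    // Map each local code to its code in the merged dictionary.
    std::vector<int32_t> transpose(chunk.dictionary.size());
    for (size_t i = 0; i < chunk.dictionary.size(); ++i) {
      auto [it, inserted] = index.emplace(
          chunk.dictionary[i], static_cast<int32_t>(merged.size()));
      if (inserted) merged.push_back(chunk.dictionary[i]);
      transpose[i] = it->second;
    }
    for (auto& code : chunk.indices) code = transpose[code];
  }
  return merged;
}

int main() {
  std::vector<EncodedChunk> chunks = {EncodeChunk({"red", "green", "red"}),
                                      EncodeChunk({"blue", "green", "blue"})};
  std::vector<std::string> merged = Unify(&chunks);
  for (const auto& v : merged) std::cout << v << " ";  // red green blue
  std::cout << "\n";
  return 0;
}
{code}

Note the trade-off this makes concrete: the second option avoids all contention during encoding but pays for a serial merge pass that touches every local dictionary entry and rewrites every index, which is consistent with the guess above that the mutex approach wins when cardinality is low.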