Liya Fan created ARROW-5917:
-------------------------------

             Summary: [Java] Redesign the dictionary encoder
                 Key: ARROW-5917
                 URL: https://issues.apache.org/jira/browse/ARROW-5917
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Java
            Reporter: Liya Fan
            Assignee: Liya Fan

The current dictionary encoder implementation (org.apache.arrow.vector.dictionary.DictionaryEncoder) has heavy performance overhead, which prevents it from being useful in practice:

# It repeatedly converts between Java objects and bytes (e.g. vector.getObject(i)).
# It copies memory unnecessarily (the vector data must be copied to the hash table).
# The hash table cannot be reused for encoding multiple vectors (other data structures and results cannot be reused either).
# The output vector should not be created or managed by the encoder (just like in the out-of-place sorter).
# The hash table requires that the hashCode and equals methods be implemented appropriately, but this is not guaranteed.

We plan to implement a new one in the algorithm module, and gradually deprecate the current one.
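
To make the points above concrete, here is a rough sketch of the shape such an encoder could take. It is only an illustration under stated assumptions, not the proposed implementation and not Arrow API: the class name ByteDictionaryEncoderSketch and its methods are hypothetical, and plain byte[] / int[] offset arrays stand in for a variable-width vector's data and offset buffers. The sketch hashes and compares raw bytes directly (no getObject and no reliance on hashCode/equals), stores only dictionary indices in the hash table (no copy of the values), is built once and reused for several inputs, and writes into an output array supplied by the caller.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical sketch, not part of Arrow: encodes raw byte ranges against a fixed dictionary. */
final class ByteDictionaryEncoderSketch {
  private final byte[] dictData;   // all dictionary values, concatenated
  private final int[] dictOffsets; // value i occupies [dictOffsets[i], dictOffsets[i + 1])
  private final int[] slots;       // open-addressing table of dictionary indices; -1 means empty
  private final int mask;          // capacity - 1, where capacity is a power of two

  ByteDictionaryEncoderSketch(byte[] dictData, int[] dictOffsets) {
    this.dictData = dictData;
    this.dictOffsets = dictOffsets;
    int count = dictOffsets.length - 1;
    // Power-of-two capacity that keeps the load factor below 0.5.
    int capacity = Integer.highestOneBit(Math.max(4, count * 2)) << 1;
    this.slots = new int[capacity];
    this.mask = capacity - 1;
    Arrays.fill(slots, -1);
    for (int i = 0; i < count; i++) {
      int slot = probe(dictData, dictOffsets[i], dictOffsets[i + 1]);
      if (slots[slot] == -1) {
        slots[slot] = i; // duplicates in the dictionary keep their first index
      }
    }
  }

  /** Returns the slot holding a matching dictionary entry, or the first empty slot. */
  private int probe(byte[] data, int start, int end) {
    int slot = hash(data, start, end) & mask;
    while (slots[slot] != -1 && !matches(slots[slot], data, start, end)) {
      slot = (slot + 1) & mask; // linear probing
    }
    return slot;
  }

  /** Byte-wise comparison against the dictionary buffer; no objects are created. */
  private boolean matches(int dictIndex, byte[] data, int start, int end) {
    int dictStart = dictOffsets[dictIndex];
    int length = end - start;
    if (dictOffsets[dictIndex + 1] - dictStart != length) {
      return false;
    }
    for (int i = 0; i < length; i++) {
      if (dictData[dictStart + i] != data[start + i]) {
        return false;
      }
    }
    return true;
  }

  /** Simple multiplicative hash over a byte range (any byte hash would do here). */
  private static int hash(byte[] data, int start, int end) {
    int h = 1;
    for (int i = start; i < end; i++) {
      h = 31 * h + data[i];
    }
    return h ^ (h >>> 16);
  }

  /** Writes one dictionary index per input value into the caller-owned output array. */
  void encode(byte[] data, int[] offsets, int[] out) {
    for (int i = 0; i < offsets.length - 1; i++) {
      int index = slots[probe(data, offsets[i], offsets[i + 1])];
      if (index == -1) {
        throw new IllegalArgumentException("value " + i + " is not in the dictionary");
      }
      out[i] = index;
    }
  }

  public static void main(String[] args) {
    // Dictionary: "a", "bb", "ccc"
    byte[] dict = "abbccc".getBytes(StandardCharsets.US_ASCII);
    ByteDictionaryEncoderSketch encoder =
        new ByteDictionaryEncoderSketch(dict, new int[] {0, 1, 3, 6});

    int[] first = new int[3];
    encoder.encode("bbabb".getBytes(StandardCharsets.US_ASCII), new int[] {0, 2, 3, 5}, first);
    System.out.println(Arrays.toString(first)); // prints [1, 0, 1]

    // The same encoder (and hash table) is reused for a second input.
    int[] second = new int[2];
    encoder.encode("ccca".getBytes(StandardCharsets.US_ASCII), new int[] {0, 3, 4}, second);
    System.out.println(Arrays.toString(second)); // prints [2, 0]
  }
}
```

A real implementation would presumably read from and write to Arrow vectors and buffers rather than Java arrays; the arrays are used here only to keep the sketch self-contained.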