[ https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated ARROW-3772:
----------------------------------
    Labels: parquet pull-request-available  (was: parquet)

> [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow
> DictionaryArray
> -----------------------------------------------------------------------------------------
>
>                 Key: ARROW-3772
>                 URL: https://issues.apache.org/jira/browse/ARROW-3772
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Stav Nir
>            Assignee: Wes McKinney
>            Priority: Major
>              Labels: parquet, pull-request-available
>             Fix For: 1.0.0
>
> Dictionary data is very common in Parquet. In the current implementation, parquet-cpp always decodes dictionary-encoded data before creating a plain Arrow array. This is wasteful, since we could use Arrow's DictionaryArray directly and gain several benefits:
> # Smaller memory footprint, both during decoding and in the resulting Arrow table, especially when the dictionary values are large.
> # Better decoding performance, mostly as a consequence of the first point: fewer memory fetches and fewer allocations.
> I think these benefits could yield significant runtime improvements.
> My direction for the implementation is to read the indices (through the DictionaryDecoder, after the RLE decoding) and the values separately into two arrays, and to create a DictionaryArray from them.
> There are some questions to discuss:
> # Should this be the default behavior for dictionary-encoded data?
> # Should it be controlled with a parameter in the API?
> # What should the policy be when some of the chunks are dictionary encoded and some are not?
> I started implementing this but would like to hear your opinions.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)