[ https://issues.apache.org/jira/browse/PARQUET-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney updated PARQUET-1324:
----------------------------------
    Fix Version/s: cpp-1.6.0

> [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow
> DictionaryArray
> -----------------------------------------------------------------------------------------
>
>                 Key: PARQUET-1324
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1324
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp
>            Reporter: Stav Nir
>            Priority: Major
>             Fix For: cpp-1.6.0
>
>
> Dictionary-encoded data is very common in Parquet. In the current
> implementation, parquet-cpp always decodes dictionary-encoded data before
> creating a plain Arrow array. This is wasteful, since we could use Arrow's
> DictionaryArray directly and gain several benefits:
> # Smaller memory footprint, both during decoding and in the resulting Arrow
> table, especially when the dictionary values are large.
> # Better decoding performance, mostly as a result of the first point: fewer
> memory fetches and fewer allocations.
> I think these benefits could significantly improve runtime.
> My direction for the implementation is to read the indices (through the
> DictionaryDecoder, after the RLE decoding) and the values separately into two
> arrays, and create a DictionaryArray from them.
> There are some questions to discuss:
> # Should this be the default behavior for dictionary-encoded data?
> # Should it be controlled with a parameter in the API?
> # What should the policy be when some of the chunks are dictionary encoded
> and some are not?
> I started implementing this but would like to hear your opinions.
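
To make the proposal above concrete, here is a minimal sketch in plain Python of keeping indices and values as two separate arrays instead of expanding them into a dense array. It is an illustration only: it uses neither parquet-cpp nor Arrow, and the names DictionaryColumn, dictionary_encode, unify, dense_payload_bytes and dictionary_payload_bytes are hypothetical. The 4-byte index width is an assumption, and the byte counts cover payload only (no offsets or validity bitmaps). The unify step shows one possible answer to the mixed-chunk question (encode plain chunks, then remap every chunk onto a shared dictionary); the issue itself does not settle that policy.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class DictionaryColumn:
    """Hypothetical stand-in for a dictionary-encoded column (not an Arrow class)."""
    indices: List[Optional[int]]  # one small integer per row, None for a null row
    values: List[str]             # each distinct value stored exactly once

    def decode(self) -> List[Optional[str]]:
        """Expand to the dense ("plain") form by looking up every index."""
        return [None if i is None else self.values[i] for i in self.indices]


def dictionary_encode(rows: Sequence[Optional[str]]) -> DictionaryColumn:
    """Build the indices and values arrays separately, in first-seen order."""
    positions: Dict[str, int] = {}
    indices: List[Optional[int]] = []
    values: List[str] = []
    for row in rows:
        if row is None:
            indices.append(None)
            continue
        if row not in positions:
            positions[row] = len(values)
            values.append(row)
        indices.append(positions[row])
    return DictionaryColumn(indices, values)


def unify(chunks: Sequence[DictionaryColumn]) -> DictionaryColumn:
    """Concatenate chunks whose dictionaries differ by remapping their indices
    onto one shared dictionary (an assumed policy, not one chosen in the issue)."""
    positions: Dict[str, int] = {}
    values: List[str] = []
    indices: List[Optional[int]] = []
    for chunk in chunks:
        # Map each local dictionary slot to its slot in the shared dictionary.
        remap: List[int] = []
        for value in chunk.values:
            if value not in positions:
                positions[value] = len(values)
                values.append(value)
            remap.append(positions[value])
        indices.extend(None if i is None else remap[i] for i in chunk.indices)
    return DictionaryColumn(indices, values)


def dense_payload_bytes(rows: Sequence[Optional[str]]) -> int:
    """Bytes of string data when every row stores its own copy of the value."""
    return sum(len(r.encode("utf-8")) for r in rows if r is not None)


def dictionary_payload_bytes(col: DictionaryColumn, index_width: int = 4) -> int:
    """Bytes of string data plus fixed-width indices (assumed 4 bytes each)."""
    value_bytes = sum(len(v.encode("utf-8")) for v in col.values)
    return value_bytes + index_width * len(col.indices)


if __name__ == "__main__":
    rows = ["x" * 100, "y" * 100] * 500  # 1000 rows, 2 distinct 100-byte values
    col = dictionary_encode(rows)
    print(dense_payload_bytes(rows), dictionary_payload_bytes(col))
```

Run as a script, this prints 100000 and 4200: 1000 rows of 100-byte strings stored densely, against two dictionary values plus 1000 four-byte indices. The gap grows with the size of the dictionary values, which matches the first benefit listed in the issue.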