I implemented the first stage of this https://github.com/apache/arrow/pull/3492
Basically what this allows is for column data decoders to write directly into
an Arrow builder class, including BinaryDictionaryBuilder. So once that's
merged, the remaining steps are:

* Define a public API for requesting that a particular field be returned as
  DictionaryArray
* In RecordReader, toggle between arrow::internal::ChunkedBinaryBuilder and
  arrow::BinaryDictionaryBuilder based on the option for that column

If we want to exactly preserve the dictionary values (for data that
originated in a DictionaryArray), more work is needed, and we need to do this
in order to faithfully round-trip pandas.Categorical types. The way I would
do it:

* Add an API to BinaryDictionaryBuilder so that the hash table can be
  "initialized" with the exact dictionary in the Parquet row group. This
  method should return an indicator of whether the encoded dictionary indices
  can be inserted directly, skipping hashing (this will yield significantly
  better performance)
* Add an API to BinaryDictionaryBuilder that enables bulk insertion of
  dictionary indices, bypassing the hash logic

So in subsequent row groups there are two cases to consider:

* Dictionary the same, or the prior dictionary is a "prefix" of the new one:
  the "initialization" method will verify this, so you can use the second API
  to bulk-insert the indices
* Dictionaries different: two possibilities, either you hash the values or
  you compute the dictionary permutation and then insert that

On that first case, what I mean is that if we have

Row group 1, dictionary ['A', 'B', 'C']
Row group 2, dictionary ['A', 'B', 'C', 'D']

then both row groups can have their indices bulk-appended without having to
hash (see the sketch below).
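To make the "prefix" case concrete, here is a minimal standalone sketch of
the check involved. This is plain C++ with std::vector<std::string> standing
in for the builder's memo table, and CanBulkAppendIndices is just an
illustrative name, not a proposed method on BinaryDictionaryBuilder:

// Standalone sketch: given the dictionary the builder has accumulated so far
// (prior) and the dictionary of the next row group (next), decide whether
// the encoded indices can be bulk-appended as-is or must be re-hashed /
// permuted first.
#include <cassert>
#include <string>
#include <vector>

// True when `prior` is a prefix of `next`, i.e. every index that was valid
// against `prior` still points at the same value in `next`.
bool CanBulkAppendIndices(const std::vector<std::string>& prior,
                          const std::vector<std::string>& next) {
  if (prior.size() > next.size()) return false;
  for (size_t i = 0; i < prior.size(); ++i) {
    if (prior[i] != next[i]) return false;
  }
  return true;
}

int main() {
  std::vector<std::string> row_group1 = {"A", "B", "C"};
  std::vector<std::string> row_group2 = {"A", "B", "C", "D"};
  std::vector<std::string> reordered = {"B", "A", "C"};

  // Row group 2's dictionary extends row group 1's, so indices decoded from
  // either row group can be appended directly, skipping the hash table.
  assert(CanBulkAppendIndices(row_group1, row_group2));

  // A reordered dictionary would require permuting the indices (or hashing
  // the values) before appending.
  assert(!CanBulkAppendIndices(row_group1, reordered));
  return 0;
}

The "initialization" API on BinaryDictionaryBuilder would essentially perform
this comparison against its internal memo table and report whether the
bulk-insert path is safe, falling back to hashing or index permutation
otherwise.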
thanks
Wes

On Fri, Jan 25, 2019 at 8:04 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> I'm undertaking some refactoring to expose some of the decoding
> internals in a more direct way to the Arrow builder classes, see
> https://issues.apache.org/jira/browse/PARQUET-1508.
>
> I'll try to have a patch up sometime today or over the weekend for you
> to review (WIP:
> https://github.com/wesm/arrow/tree/parquet-decode-into-arrow-builder)
>
> After that we should be able to alter the Arrow read path to read
> directly into an arrow::BinaryDictionaryBuilder.
>
> On Thu, Jan 24, 2019 at 10:59 AM Hatem Helal
> <hatem.he...@mathworks.co.uk> wrote:
> >
> > Thanks Wes,
> >
> > Glad to hear this is in your plan.
> >
> > I probably should have done this earlier... but here are some JIRA
> > tickets that seem to cover this:
> >
> > https://issues.apache.org/jira/browse/ARROW-3772
> > https://issues.apache.org/jira/browse/ARROW-3325
> > https://issues.apache.org/jira/browse/ARROW-3769
> >
> > On 1/24/19, 4:27 PM, "Wes McKinney" <wesmck...@gmail.com> wrote:
> >
> > hi Hatem,
> >
> > There are several issues open about this already (I'll have to dig
> > them up), so this is something that we have desired for a long time
> > but have not gotten around to implementing.
> >
> > Since many Parquet writers use dictionary encoding, it would make the
> > most sense to have an option to return DictionaryArray (which can be
> > converted to pandas.Categorical) from any column, and internally we
> > will perform the conversion from the encoded Parquet format as
> > efficiently as possible.
> >
> > There are many cases to consider:
> >
> > * Dictionary encoded, but different dictionaries in each row group
> >   (this is actually the most likely scenario)
> > * Dictionary encoded, but the same dictionary in all row groups
> > * PLAIN encoded data that we pass through DictionaryBuilder as it is
> >   decoded to yield DictionaryArray
> > * Dictionary encoded, but switching over to PLAIN encoding mid-stream
> >
> > Having column metadata to automatically "opt in" to the
> > DictionaryArray conversion sounds reasonable for usability (so long as
> > Arrow readers have a way to opt out, probably via a global flag to
> > ignore such custom metadata fields).
> >
> > Part of the reason this work was not done in the past was that some of
> > our hash table machinery was a bit immature. Antoine has recently
> > improved things significantly, so it should be a lot easier now to do
> > this work. This is quite a large project, though, and one that affects
> > a _lot_ of users, so I would be willing to take an initial pass on the
> > implementation.
> >
> > Along with completing the nested data read/write path, I would say
> > this is the 2nd highest priority project in parquet-cpp for Arrow
> > users.
> >
> > - Wes
> >
> > On Thu, Jan 24, 2019 at 9:59 AM Hatem Helal
> > <hatem.he...@mathworks.co.uk> wrote:
> > >
> > > Hi everyone,
> > >
> > > I wanted to gauge interest and feasibility for adding support for
> > > natively reading an arrow::DictionaryArray from a Parquet file.
> > > Currently, an arrow::DictionaryArray that is written out is read back
> > > as the native index type [0]. I came across a prior discussion of
> > > this problem in the context of pandas [1], but I think this would be
> > > useful for other Arrow clients (C++ or otherwise).
> > >
> > > The solution I had in mind would be to add Arrow type information as
> > > column metadata. This metadata would then be used when reading back
> > > the Parquet file to determine which Arrow type to create for the
> > > column data.
> > >
> > > I'm willing to contribute this feature but first wanted to get some
> > > feedback on whether this would be generally useful and if the
> > > high-level proposed solution makes sense.
> > >
> > > Thanks!
> > >
> > > Hatem
> > >
> > > [0] This test demonstrates the behavior:
> > > https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/arrow-reader-writer-test.cc#L1848
> > > [1] https://github.com/apache/arrow/issues/1688