Ok. Thanks for the suggestions. I'll see if I can use the finer-grained writing API to handle this; a rough sketch of what I'm planning is below.
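In case it helps anyone else who hits this, here's roughly what I have in mind (a sketch only, untested; it combines the finer-grained ParquetWriter API with the cast-to-dense workaround, using the toy data from the thread below -- the file name is just illustrative):

import pyarrow as pa
import pyarrow.parquet as papq

# Toy data from the thread below: 100 "A"s then 100 "B"s, dictionary encoded.
d = pa.DictionaryArray.from_arrays((100 * [0]) + (100 * [1]), ["A", "B"])
t = pa.table({"col": d})

# Cast to the dense value type and write in row-group-sized slices, so the
# decoded (non-dictionary) data never has to be materialized all at once.
dense_schema = pa.schema([pa.field("col", pa.string())])
writer = papq.ParquetWriter("sample_dense.parquet", dense_schema)
try:
    for offset in range(0, len(t), 100):
        chunk = t.slice(offset, 100).cast(dense_schema)
        writer.write_table(chunk)  # each call writes at least one row group
finally:
    writer.close()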
I filed ARROW-11634 a bit before you responded, since it did seem like a
bug. Hope that's sufficient for tracking.

-Dan Nugent

On Tue, Feb 16, 2021 at 1:00 AM Micah Kornfield <[email protected]> wrote:

> Hi Dan,
> This seems suboptimal to me as well (and we should probably open a JIRA
> to track a solution). I think the problematic code is [1], since we
> don't appear to update statistics from the actual indices, only from the
> overall dictionary (and of course there is that TODO).
>
> There are a couple of potential workarounds:
> 1. Make finer-grained tables with smaller dictionaries and use the
> finer-grained writing API [2]. This still might not work (it could cause
> the fallback to dense encoding if the object lifecycles aren't correct).
> 2. Before writing, cast the column to dense (not dictionary encoded).
> You might still want to iterate the table in chunks in this case, to
> avoid excessive memory usage from the loss of dictionary-encoding
> compactness.
>
> Hope this helps.
>
> -Micah
>
> [1] https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1492
> [2] https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
>
> On Sat, Feb 13, 2021 at 12:28 AM Nugent, Daniel <[email protected]> wrote:
>
>> Pyarrow version is 3.0.0.
>>
>> Naively, I would expect the max and min to reflect not just the max and
>> min values of the dictionary, but the max and min of the actual values
>> in each row group.
>>
>> I looked at the Parquet spec, which seems to support this, since it
>> says the statistics apply to the logical type of the column; but I may
>> be misunderstanding.
>>
>> This is just a toy example, of course. The real data I'm working with
>> is quite a bit larger and is ordered on the column this applies to, so
>> being able to use the statistics for predicate pushdown would be ideal.
>>
>> If pyarrow.parquet.write_table is not the preferred way to write
>> Parquet files from Arrow data and there is a more germane method, I'd
>> appreciate being enlightened. I'd also appreciate any workaround
>> suggestions for the time being.
>>
>> Thank you,
>> -Dan Nugent
>>
>> >>> import pyarrow as pa
>> >>> import pyarrow.parquet as papq
>> >>> d = pa.DictionaryArray.from_arrays((100*[0]) + (100*[1]), ["A","B"])
>> >>> t = pa.table({"col": d})
>> >>> papq.write_table(t, 'sample.parquet', row_group_size=100)
>> >>> f = papq.ParquetFile('sample.parquet')
>> >>> (f.metadata.row_group(0).column(0).statistics.min,
>> ...  f.metadata.row_group(0).column(0).statistics.max)
>> ('A', 'B')
>> >>> (f.metadata.row_group(1).column(0).statistics.min,
>> ...  f.metadata.row_group(1).column(0).statistics.max)
>> ('A', 'B')
>> >>> f.read_row_groups([0]).column(0)
>> <pyarrow.lib.ChunkedArray object at 0x7f37346abe90>
>> [
>>   -- dictionary: ["A", "B"]
>>   -- indices: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...,
>>                0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
>> ]
>> >>> f.read_row_groups([1]).column(0)
>> <pyarrow.lib.ChunkedArray object at 0x7f37346abef0>
>> [
>>   -- dictionary: ["A", "B"]
>>   -- indices: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...,
>>                1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>> ]
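P.S. If the dense rewrite in the sketch above works the way I expect, then
reading the metadata back the same way as in the session above should show
per-row-group bounds rather than the whole-dictionary bounds -- something
like this (illustrative, not output I've actually captured yet):

>>> f = papq.ParquetFile('sample_dense.parquet')
>>> [(f.metadata.row_group(i).column(0).statistics.min,
...   f.metadata.row_group(i).column(0).statistics.max)
...  for i in range(f.metadata.num_row_groups)]
[('A', 'A'), ('B', 'B')]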
