> > I filed ARROW-11634 a bit before you responded because it did seem like a > bug. Hope that's sufficient for tracking.
Yes, thank you. I added a pointer to the code mentioned above. On Mon, Feb 15, 2021 at 11:44 PM Daniel Nugent <[email protected]> wrote: > Ok. Thanks for the suggestions. I'll see if I can use the finer grained > writing to handle this. > > I filed ARROW-11634 a bit before you responded because it did seem like a > bug. Hope that's sufficient for tracking. > > -Dan Nugent > > > On Tue, Feb 16, 2021 at 1:00 AM Micah Kornfield <[email protected]> > wrote: > >> Hi Dan, >> This seems suboptimal to me as well (and we should probably open a JIRA >> to track a solution). I think the problematic code is [1] since we don't >> appear to update statistics for the actual indices but simply the overall >> dictionary (and of course there is that TODO) >> >> There are a couple of potential workarounds: >> 1. Try to make finer grained tables, with smaller dictionaries and use >> the fine grained writing API [2]. This still might not work (it could >> cause the fallback to dense if the object lifecycles aren't correct). >> 2. Before writing, cast the column to dense (not dictionary encoded) >> (you might want to still iterate the table in chunks in this case to avoid >> excessive memory usage due to the loss of dictionary encoding compactness). >> >> Hope this helps. >> >> -Micah >> >> [1] >> https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1492 >> [2] >> https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing >> >> On Sat, Feb 13, 2021 at 12:28 AM Nugent, Daniel <[email protected]> >> wrote: >> >>> Pyarrow version is 3.0.0 >>> >>> >>> >>> Naively, I would expect the max and min to not just reflect the max and >>> min value of the dictionary for each row group, but the max and min value >>> of the actual values in the rowgroup. >>> >>> >>> >>> I looked at the Parquet spec which seems to reflect this as it refers to >>> the statistics applying to the logical type of the column, but I may be >>> misunderstanding. >>> >>> >>> >>> This is just a toy example, of course. The real data I'm working with is >>> quite a bit larger and ordered on the column this applies to, so being able >>> to use the statistics for predicate pushdown would be ideal. >>> >>> >>> >>> If pyarrow.parquet.write_table is not the preferred way to write Parquet >>> files out from Arrow data and there is a more germane method, I'd >>> appreciate being elucidated. I'd also appreciate any workaround suggestions >>> for the time being. >>> >>> >>> >>> Thank you, >>> >>> -Dan Nugent >>> >>> >>> >>> >>> import pyarrow as pa >>> >>> >>> import pyarrow.parquet as papq >>> >>> >>> d = pa.DictionaryArray.from_arrays((100*[0]) + (100*[1]),["A","B"]) >>> >>> >>> t = pa.table({"col":d}) >>> >>> >>> papq.write_table(t,'sample.parquet',row_group_size=100) >>> >>> >>> f = papq.ParquetFile('sample.parquet') >>> >>> >>> (f.metadata.row_group(0).column(0).statistics.min, >>> f.metadata.row_group(0).column(0).statistics.max) >>> >>> ('A', 'B') >>> >>> >>> (f.metadata.row_group(1).column(0).statistics.min, >>> f.metadata.row_group(1).column(0).statistics.max) >>> >>> ('A', 'B') >>> >>> >>> f.read_row_groups([0]).column(0) >>> >>> <pyarrow.lib.ChunkedArray object at 0x7f37346abe90> >>> >>> [ >>> >>> >>> >>> -- dictionary: >>> >>> [ >>> >>> "A", >>> >>> "B" >>> >>> ] >>> >>> -- indices: >>> >>> [ >>> >>> 0, >>> >>> 0, >>> >>> 0, >>> >>> 0, >>> >>> 0, >>> >>> 0, >>> >>> 0, >>> >>> 0, >>> >>> 0, >>> >>> 0, >>> >>> ... >>> >>> 0, >>> >>> 0, >>> >>> 0, >>> >>> 0, >>> >>> 0, >>> >>> 0, >>> >>> 0, >>> >>> 0, >>> >>> 0, >>> >>> 0 >>> >>> ] >>> >>> ] >>> >>> >>> f.read_row_groups([1]).column(0) >>> >>> <pyarrow.lib.ChunkedArray object at 0x7f37346abef0> >>> >>> [ >>> >>> >>> >>> -- dictionary: >>> >>> [ >>> >>> "A", >>> >>> "B" >>> >>> ] >>> >>> -- indices: >>> >>> [ >>> >>> 1, >>> >>> 1, >>> >>> 1, >>> >>> 1, >>> >>> 1, >>> >>> 1, >>> >>> 1, >>> >>> 1, >>> >>> 1, >>> >>> 1, >>> >>> ... >>> >>> 1, >>> >>> 1, >>> >>> 1, >>> >>> 1, >>> >>> 1, >>> >>> 1, >>> >>> 1, >>> >>> 1, >>> >>> 1, >>> >>> 1 >>> >>> ] >>> >>> ] >>> >>> ###################################################################### >>> >>> The information contained in this communication is confidential and >>> >>> may contain information that is privileged or exempt from disclosure >>> >>> under applicable law. If you are not a named addressee, please notify >>> >>> the sender immediately and delete this email from your system. >>> >>> If you have received this communication, and are not a named >>> >>> recipient, you are hereby notified that any dissemination, >>> >>> distribution or copying of this communication is strictly prohibited. >>> ###################################################################### >>> >>>
