Ok. Thanks for the suggestions. I'll see if I can use the finer-grained writing API to handle this; a rough sketch of what I'm planning is below.
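In case it helps anyone else who hits this, here's roughly what I have in mind (a sketch only, untested; it combines the finer-grained ParquetWriter API with the cast-to-dense workaround, using the toy data from the thread below -- the file name is just illustrative):

import pyarrow as pa
import pyarrow.parquet as papq

# Toy data from the thread below: 100 "A"s then 100 "B"s, dictionary encoded.
d = pa.DictionaryArray.from_arrays((100 * [0]) + (100 * [1]), ["A", "B"])
t = pa.table({"col": d})

# Cast to the dense value type and write in row-group-sized slices, so the
# decoded (non-dictionary) data never has to be materialized all at once.
dense_schema = pa.schema([pa.field("col", pa.string())])
writer = papq.ParquetWriter("sample_dense.parquet", dense_schema)
try:
    for offset in range(0, len(t), 100):
        chunk = t.slice(offset, 100).cast(dense_schema)
        writer.write_table(chunk)  # each call writes at least one row group
finally:
    writer.close()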
I filed ARROW-11634 a bit before you responded, since it did seem like a
bug. Hope that's sufficient for tracking.

-Dan Nugent

On Tue, Feb 16, 2021 at 1:00 AM Micah Kornfield <[email protected]> wrote:

> Hi Dan,
> This seems suboptimal to me as well (and we should probably open a JIRA
> to track a solution). I think the problematic code is [1], since we
> don't appear to update statistics from the actual indices, only from the
> overall dictionary (and of course there is that TODO).
>
> There are a couple of potential workarounds:
> 1. Make finer-grained tables with smaller dictionaries and use the
> finer-grained writing API [2]. This still might not work (it could cause
> the fallback to dense encoding if the object lifecycles aren't correct).
> 2. Before writing, cast the column to dense (not dictionary encoded).
> You might still want to iterate the table in chunks in this case, to
> avoid excessive memory usage from the loss of dictionary-encoding
> compactness.
>
> Hope this helps.
>
> -Micah
>
> [1] https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1492
> [2] https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
>
> On Sat, Feb 13, 2021 at 12:28 AM Nugent, Daniel <[email protected]> wrote:
>
>> Pyarrow version is 3.0.0.
>>
>> Naively, I would expect the max and min to reflect not just the max and
>> min values of the dictionary, but the max and min of the actual values
>> in each row group.
>>
>> I looked at the Parquet spec, which seems to support this, since it
>> says the statistics apply to the logical type of the column; but I may
>> be misunderstanding.
>>
>> This is just a toy example, of course. The real data I'm working with
>> is quite a bit larger and is ordered on the column this applies to, so
>> being able to use the statistics for predicate pushdown would be ideal.
>>
>> If pyarrow.parquet.write_table is not the preferred way to write
>> Parquet files from Arrow data and there is a more germane method, I'd
>> appreciate being enlightened. I'd also appreciate any workaround
>> suggestions for the time being.
>>
>> Thank you,
>> -Dan Nugent
>>
>> >>> import pyarrow as pa
>> >>> import pyarrow.parquet as papq
>> >>> d = pa.DictionaryArray.from_arrays((100*[0]) + (100*[1]), ["A","B"])
>> >>> t = pa.table({"col": d})
>> >>> papq.write_table(t, 'sample.parquet', row_group_size=100)
>> >>> f = papq.ParquetFile('sample.parquet')
>> >>> (f.metadata.row_group(0).column(0).statistics.min,
>> ...  f.metadata.row_group(0).column(0).statistics.max)
>> ('A', 'B')
>> >>> (f.metadata.row_group(1).column(0).statistics.min,
>> ...  f.metadata.row_group(1).column(0).statistics.max)
>> ('A', 'B')
>> >>> f.read_row_groups([0]).column(0)
>> <pyarrow.lib.ChunkedArray object at 0x7f37346abe90>
>> [
>>   -- dictionary: ["A", "B"]
>>   -- indices: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...,
>>                0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
>> ]
>> >>> f.read_row_groups([1]).column(0)
>> <pyarrow.lib.ChunkedArray object at 0x7f37346abef0>
>> [
>>   -- dictionary: ["A", "B"]
>>   -- indices: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...,
>>                1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>> ]
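P.S. If the dense rewrite in the sketch above works the way I expect, then
reading the metadata back the same way as in the session above should show
per-row-group bounds rather than the whole-dictionary bounds -- something
like this (illustrative, not output I've actually captured yet):

>>> f = papq.ParquetFile('sample_dense.parquet')
>>> [(f.metadata.row_group(i).column(0).statistics.min,
...   f.metadata.row_group(i).column(0).statistics.max)
...  for i in range(f.metadata.num_row_groups)]
[('A', 'A'), ('B', 'B')]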
