Re: Are the Parquet file statistics correct in the following example?

Micah Kornfield Mon, 15 Feb 2021 23:54:50 -0800

>
> I filed ARROW-11634 a bit before you responded because it did seem like a
> bug. Hope that's sufficient for tracking.



Yes, thank you.  I added a pointer to the code mentioned above.


On Mon, Feb 15, 2021 at 11:44 PM Daniel Nugent <[email protected]> wrote:

> Ok. Thanks for the suggestions. I'll see if I can use the finer grained
> writing to handle this.
>
> I filed ARROW-11634 a bit before you responded because it did seem like a
> bug. Hope that's sufficient for tracking.
>
> -Dan Nugent
>
>
> On Tue, Feb 16, 2021 at 1:00 AM Micah Kornfield <[email protected]>
> wrote:
>
>> Hi Dan,
>> This seems suboptimal to me as well (and we should probably open a JIRA
>> to track a solution).  I think the problematic code is [1] since we don't
>> appear to update statistics for the actual indices but simply the overall
>> dictionary (and of course there is that TODO)
>>
>> There are a couple of potential workarounds:
>> 1. Try to make finer grained tables, with smaller dictionaries and use
>> the fine grained writing API [2].  This still might not work (it could
>> cause the fallback to dense if the object lifecycles aren't correct).
>> 2.  Before writing, cast the column to dense (not dictionary encoded)
>> (you might want to still iterate the table in chunks in this case to avoid
>> excessive memory usage due to the loss of dictionary encoding compactness).
>>
>> Hope this helps.
>>
>> -Micah
>>
>> [1]
>> https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1492
>> [2]
>> https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
>>
>> On Sat, Feb 13, 2021 at 12:28 AM Nugent, Daniel <[email protected]>
>> wrote:
>>
>>> Pyarrow version is 3.0.0
>>>
>>>
>>>
>>> Naively, I would expect the max and min to not just reflect the max and
>>> min value of the dictionary for each row group, but the max and min value
>>> of the actual values in the rowgroup.
>>>
>>>
>>>
>>> I looked at the Parquet spec which seems to reflect this as it refers to
>>> the statistics applying to the logical type of the column, but I may be
>>> misunderstanding.
>>>
>>>
>>>
>>> This is just a toy example, of course. The real data I'm working with is
>>> quite a bit larger and ordered on the column this applies to, so being able
>>> to use the statistics for predicate pushdown would be ideal.
>>>
>>>
>>>
>>> If pyarrow.parquet.write_table is not the preferred way to write Parquet
>>> files out from Arrow data and there is a more germane method, I'd
>>> appreciate being elucidated. I'd also appreciate any workaround suggestions
>>> for the time being.
>>>
>>>
>>>
>>> Thank you,
>>>
>>> -Dan Nugent
>>>
>>>
>>>
>>> >>> import pyarrow as pa
>>>
>>> >>> import pyarrow.parquet as papq
>>>
>>> >>> d = pa.DictionaryArray.from_arrays((100*[0]) + (100*[1]),["A","B"])
>>>
>>> >>> t = pa.table({"col":d})
>>>
>>> >>> papq.write_table(t,'sample.parquet',row_group_size=100)
>>>
>>> >>> f = papq.ParquetFile('sample.parquet')
>>>
>>> >>> (f.metadata.row_group(0).column(0).statistics.min,
>>> f.metadata.row_group(0).column(0).statistics.max)
>>>
>>> ('A', 'B')
>>>
>>> >>> (f.metadata.row_group(1).column(0).statistics.min,
>>> f.metadata.row_group(1).column(0).statistics.max)
>>>
>>> ('A', 'B')
>>>
>>> >>> f.read_row_groups([0]).column(0)
>>>
>>> <pyarrow.lib.ChunkedArray object at 0x7f37346abe90>
>>>
>>> [
>>>
>>>
>>>
>>>   -- dictionary:
>>>
>>>     [
>>>
>>>       "A",
>>>
>>>       "B"
>>>
>>>     ]
>>>
>>>   -- indices:
>>>
>>>     [
>>>
>>>       0,
>>>
>>>       0,
>>>
>>>       0,
>>>
>>>       0,
>>>
>>>       0,
>>>
>>>       0,
>>>
>>>       0,
>>>
>>>       0,
>>>
>>>       0,
>>>
>>>       0,
>>>
>>>       ...
>>>
>>>       0,
>>>
>>>       0,
>>>
>>>       0,
>>>
>>>       0,
>>>
>>>       0,
>>>
>>>       0,
>>>
>>>       0,
>>>
>>>       0,
>>>
>>>       0,
>>>
>>>       0
>>>
>>>     ]
>>>
>>> ]
>>>
>>> >>> f.read_row_groups([1]).column(0)
>>>
>>> <pyarrow.lib.ChunkedArray object at 0x7f37346abef0>
>>>
>>> [
>>>
>>>
>>>
>>>   -- dictionary:
>>>
>>>     [
>>>
>>>       "A",
>>>
>>>       "B"
>>>
>>>     ]
>>>
>>>   -- indices:
>>>
>>>     [
>>>
>>>       1,
>>>
>>>       1,
>>>
>>>       1,
>>>
>>>       1,
>>>
>>>       1,
>>>
>>>       1,
>>>
>>>       1,
>>>
>>>       1,
>>>
>>>       1,
>>>
>>>       1,
>>>
>>>       ...
>>>
>>>       1,
>>>
>>>       1,
>>>
>>>       1,
>>>
>>>       1,
>>>
>>>       1,
>>>
>>>       1,
>>>
>>>       1,
>>>
>>>       1,
>>>
>>>       1,
>>>
>>>       1
>>>
>>>     ]
>>>
>>> ]
>>>
>>> ######################################################################
>>>
>>> The information contained in this communication is confidential and
>>>
>>> may contain information that is privileged or exempt from disclosure
>>>
>>> under applicable law. If you are not a named addressee, please notify
>>>
>>> the sender immediately and delete this email from your system.
>>>
>>> If you have received this communication, and are not a named
>>>
>>> recipient, you are hereby notified that any dissemination,
>>>
>>> distribution or copying of this communication is strictly prohibited.
>>> ######################################################################
>>>
>>>

Re: Are the Parquet file statistics correct in the following example?

Reply via email to