Stephen Simmons created ARROW-10444:
---------------------------------------

             Summary: [Python] Timestamp metadata min/max stored as INT96 
cannot be read in
                 Key: ARROW-10444
                 URL: https://issues.apache.org/jira/browse/ARROW-10444
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 2.0.0
            Reporter: Stephen Simmons


I am working with Parquet files produced by AWS Redshift's UNLOAD command. The 
schema has several timestamp columns stored as INT96. I have noticed that their 
min/max values are omitted from PyArrow's metadata.

For example, for the column {{dv_startdateutc: timestamp[ns]}} in my table 
schema, the statistics section of the column metadata is None, i.e. not filled 
in with the min/max values that are present for the other, non-timestamp 
columns:
{code:python}
<pyarrow._parquet.ColumnChunkMetaData object at 0x7ff5000d1a10>
 file_offset: 1342723
 file_path: 
 physical_type: INT96
 num_values: 150144
 path_in_schema: dv_startdateutc
 is_stats_set: False
 statistics: None
 compression: SNAPPY
 encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
 has_dictionary_page: True
 dictionary_page_offset: 1342659
 data_page_offset: 1342687
 total_compressed_size: 64
 total_uncompressed_size: 60
{code}
This means PyArrow cannot use metadata to filter dataset reads by date/time.
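The behaviour above can be reproduced without Redshift output by writing a 
timestamp column with PyArrow's own {{use_deprecated_int96_timestamps}} flag; 
the temporary file path and column name below are just illustrative:
{code:python}
import datetime
import os
import tempfile

import pyarrow as pa
import pyarrow.parquet as pq

# A one-column table of nanosecond timestamps.
table = pa.table({
    "dv_startdateutc": pa.array(
        [datetime.datetime(2020, 10, 1), datetime.datetime(2020, 10, 2)],
        type=pa.timestamp("ns"),
    )
})

path = os.path.join(tempfile.mkdtemp(), "int96.parquet")
# Write the timestamps as INT96, the same physical type Redshift UNLOAD emits.
pq.write_table(table, path, use_deprecated_int96_timestamps=True)

col = pq.ParquetFile(path).metadata.row_group(0).column(0)
print(col.physical_type)  # INT96
print(col.is_stats_set)   # False on 2.0.0, matching the metadata dump above
{code}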
  
I suspect this bug arises in {{_cast_statistic_raw_min()}} and 
{{_cast_statistic_raw_max()}} in {{/python/pyarrow/_parquet.pyx}} at L180. The 
code extracts below show there are casts for {{ParquetType_INT32}} and 
{{ParquetType_INT64}}, but none for {{ParquetType_INT96}}.

Can a case be added for {{ParquetType_INT96}} in both of these?

The raw {{ParquetType_INT96}} values would then be converted to the appropriate 
timestamp type by {{_box_logical_type_value(raw, statistics.descr())}}.

Thanks
 Stephen
{code:python}
cdef _cast_statistic_raw_min(CStatistics* statistics):
    cdef ParquetType physical_type = statistics.physical_type()
    cdef uint32_t type_length = statistics.descr().type_length()
    if physical_type == ParquetType_BOOLEAN:
        return (<CBoolStatistics*> statistics).min()
    elif physical_type == ParquetType_INT32:
        return (<CInt32Statistics*> statistics).min()
    elif physical_type == ParquetType_INT64:
        return (<CInt64Statistics*> statistics).min()
    # ADD ParquetType_INT96 here!!!
    elif physical_type == ParquetType_FLOAT:
        return (<CFloatStatistics*> statistics).min()
    elif physical_type == ParquetType_DOUBLE:
        return (<CDoubleStatistics*> statistics).min()
    elif physical_type == ParquetType_BYTE_ARRAY:
        return _box_byte_array((<CByteArrayStatistics*> statistics).min())
    elif physical_type == ParquetType_FIXED_LEN_BYTE_ARRAY:
        return _box_flba((<CFLBAStatistics*> statistics).min(), type_length)

cdef _cast_statistic_raw_max(CStatistics* statistics):
    cdef ParquetType physical_type = statistics.physical_type()
    cdef uint32_t type_length = statistics.descr().type_length()
    if physical_type == ParquetType_BOOLEAN:
        return (<CBoolStatistics*> statistics).max()
    elif physical_type == ParquetType_INT32:
        return (<CInt32Statistics*> statistics).max()
    elif physical_type == ParquetType_INT64:
        return (<CInt64Statistics*> statistics).max()
    # ADD ParquetType_INT96 here!!!
    elif physical_type == ParquetType_FLOAT:
        return (<CFloatStatistics*> statistics).max()
    elif physical_type == ParquetType_DOUBLE:
        return (<CDoubleStatistics*> statistics).max()
    elif physical_type == ParquetType_BYTE_ARRAY:
        return _box_byte_array((<CByteArrayStatistics*> statistics).max())
    elif physical_type == ParquetType_FIXED_LEN_BYTE_ARRAY:
        return _box_flba((<CFLBAStatistics*> statistics).max(), type_length)
{code}
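For reference on what such a branch would have to box: a Parquet INT96 
timestamp is 12 bytes, a little-endian uint64 of nanoseconds within the day 
followed by a little-endian uint32 Julian day number. A plain-Python sketch of 
that conversion (the function name is illustrative, not part of PyArrow):
{code:python}
import datetime
import struct

JULIAN_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01
NANOS_PER_DAY = 86400 * 10**9

def int96_to_datetime(raw: bytes) -> datetime.datetime:
    """Decode a 12-byte Parquet INT96 timestamp into a naive UTC datetime."""
    # First 8 bytes: nanoseconds within the day; last 4 bytes: Julian day.
    nanos_of_day, julian_day = struct.unpack("<QI", raw)
    nanos = (julian_day - JULIAN_UNIX_EPOCH) * NANOS_PER_DAY + nanos_of_day
    # datetime only resolves microseconds, so sub-microsecond detail is dropped.
    return datetime.datetime(1970, 1, 1) + datetime.timedelta(
        microseconds=nanos // 1000)

# 1970-01-01T00:00:00 encodes as Julian day 2440588 with zero nanoseconds.
raw = struct.pack("<QI", 0, 2440588)
print(int96_to_datetime(raw))  # 1970-01-01 00:00:00
{code}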



--
This message was sent by Atlassian Jira
(v8.3.4#803005)