Stephen Simmons created ARROW-10444:
---------------------------------------

             Summary: [Python] Timestamp metadata min/max stored as INT96 cannot be read in
                 Key: ARROW-10444
                 URL: https://issues.apache.org/jira/browse/ARROW-10444
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 2.0.0
            Reporter: Stephen Simmons


I am working with Parquet files produced by AWS Redshift's UNLOAD command. The schema has several timestamp columns stored as INT96. I have noticed that their min/max values are omitted from PyArrow's metadata.

For example, for this column in my table schema: {{dv_startdateutc: timestamp[ns]}}, the statistics section of the column metadata is None, i.e. it is not filled in with the min/max values that are present for the other, non-timestamp columns:

{code:python}
<pyarrow._parquet.ColumnChunkMetaData object at 0x7ff5000d1a10>
  file_offset: 1342723
  file_path: 
  physical_type: INT96
  num_values: 150144
  path_in_schema: dv_startdateutc
  is_stats_set: False
  statistics:
    None
  compression: SNAPPY
  encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
  has_dictionary_page: True
  dictionary_page_offset: 1342659
  data_page_offset: 1342687
  total_compressed_size: 64
  total_uncompressed_size: 60
{code}

This means PyArrow cannot use metadata to filter dataset reads by date/time.

I suspect this bug arises in {{_cast_statistic_raw_min()}} and {{_cast_statistic_raw_max()}} in {{/python/pyarrow/_parquet.pyx}} at L180. The code extracts below show there are casts for {{ParquetType_INT32}} and {{ParquetType_INT64}}, but not for {{ParquetType_INT96}}. Can a case be added for {{ParquetType_INT96}} in both of these? Those raw {{ParquetType_INT96}} values will be converted to the appropriate timestamp type in {{_box_logical_type_value(raw, statistics.descr())}}.
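
For reference, here is a minimal pure-Python sketch of what decoding one raw INT96 min/max value involves. It assumes the Impala/Spark-style layout commonly used for INT96 timestamps (8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, both little-endian). The helper names are hypothetical and are not part of PyArrow; the actual fix would presumably live in the Cython/C++ layer.

```python
import struct
from datetime import datetime, timedelta

# Julian day number of the Unix epoch, 1970-01-01.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
NANOS_PER_DAY = 86_400 * 1_000_000_000


def int96_to_unix_nanos(raw: bytes) -> int:
    """Decode a 12-byte INT96 timestamp into nanoseconds since the Unix epoch.

    Hypothetical helper; assumes the layout described above.
    """
    if len(raw) != 12:
        raise ValueError(f"INT96 value must be 12 bytes, got {len(raw)}")
    # "<qi" = little-endian int64 (nanoseconds of day) + int32 (Julian day).
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day - JULIAN_DAY_OF_UNIX_EPOCH) * NANOS_PER_DAY + nanos_of_day


def int96_to_datetime(raw: bytes) -> datetime:
    """Convert a raw INT96 value to a naive datetime.

    datetime only has microsecond resolution, so sub-microsecond
    digits are dropped here.
    """
    nanos = int96_to_unix_nanos(raw)
    return datetime(1970, 1, 1) + timedelta(microseconds=nanos // 1000)


# Example: noon on Julian day 2459154, packed the same way as a raw statistic.
raw_min = struct.pack("<qi", 12 * 3600 * 1_000_000_000, 2459154)
print(int96_to_datetime(raw_min))  # 2020-10-31 12:00:00
```

Because the Julian day sits in the last four bytes, comparing the raw 12-byte values as plain byte strings does not give chronological order, so min/max have to be compared on the decoded values.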

Thanks
Stephen

{code:python}
cdef _cast_statistic_raw_min(CStatistics* statistics):
    cdef ParquetType physical_type = statistics.physical_type()
    cdef uint32_t type_length = statistics.descr().type_length()
    if physical_type == ParquetType_BOOLEAN:
        return (<CBoolStatistics*> statistics).min()
    elif physical_type == ParquetType_INT32:
        return (<CInt32Statistics*> statistics).min()
    elif physical_type == ParquetType_INT64:
        return (<CInt64Statistics*> statistics).min()
    # ADD ParquetType_INT96 here!!!
    elif physical_type == ParquetType_FLOAT:
        return (<CFloatStatistics*> statistics).min()
    elif physical_type == ParquetType_DOUBLE:
        return (<CDoubleStatistics*> statistics).min()
    elif physical_type == ParquetType_BYTE_ARRAY:
        return _box_byte_array((<CByteArrayStatistics*> statistics).min())
    elif physical_type == ParquetType_FIXED_LEN_BYTE_ARRAY:
        return _box_flba((<CFLBAStatistics*> statistics).min(), type_length)


cdef _cast_statistic_raw_max(CStatistics* statistics):
    cdef ParquetType physical_type = statistics.physical_type()
    cdef uint32_t type_length = statistics.descr().type_length()
    if physical_type == ParquetType_BOOLEAN:
        return (<CBoolStatistics*> statistics).max()
    elif physical_type == ParquetType_INT32:
        return (<CInt32Statistics*> statistics).max()
    elif physical_type == ParquetType_INT64:
        return (<CInt64Statistics*> statistics).max()
    # ADD ParquetType_INT96 here!!!
    elif physical_type == ParquetType_FLOAT:
        return (<CFloatStatistics*> statistics).max()
    elif physical_type == ParquetType_DOUBLE:
        return (<CDoubleStatistics*> statistics).max()
    elif physical_type == ParquetType_BYTE_ARRAY:
        return _box_byte_array((<CByteArrayStatistics*> statistics).max())
    elif physical_type == ParquetType_FIXED_LEN_BYTE_ARRAY:
        return _box_flba((<CFLBAStatistics*> statistics).max(), type_length)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)