isichei commented on pull request #10575: URL: https://github.com/apache/arrow/pull/10575#issuecomment-870414553
I've gotten stuck on the dataset API and am not sure what is missing:

- `pq.ParquetFile` definitely works with the new `coerce_int96_timestamp_unit` parameter (see the sketch at the end of this comment).
- I've added the new parameter to `ParquetScanOptions`, since in `_dataset.pyx` that class has access to the `ArrowReaderProperties` C++ class, which has the setter for the parameter.
- I have also added the parameter to `_ParquetDatasetV2` (in `parquet.py`), allowing it to be passed down to `ParquetFileFormat` and `ParquetScanOptions`.
- I've added a test (in `test_dataset.py::test_parquet_scan_options`) to check that the option is actually being set, and as far as I can tell it is.
- However, when I call `pq.read_table`, which uses `_ParquetDatasetV2`, the test fails: it looks like `coerce_int96_timestamp_unit` is not being applied when reading the Parquet file, and I can't figure out why (see `parquet/test_datetime.py::test_coerce_int96_timestamp_overflow[read_table]`).

### Next Steps

If no one else has time to look at this in more detail, it might be best to limit this PR to exposing the parameter on `ParquetFile` and drop it from the Dataset API (so it is ready to merge for the V5 release). I could then open a new JIRA feature request to expose the parameter through the Dataset API. Let me know what you think.
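For reference, a minimal sketch of the usage described above. The file path is a placeholder; the only thing taken from this PR is the `coerce_int96_timestamp_unit` parameter name, and the `read_table` call is the path that currently fails in the test.

```python
import pyarrow.parquet as pq

# Placeholder path: a Parquet file containing INT96 timestamps that would
# overflow the default nanosecond resolution.
path = "int96_overflow.parquet"

# Works: ParquetFile accepts the new parameter and coerces INT96 values to
# millisecond-resolution timestamps on read.
pf = pq.ParquetFile(path, coerce_int96_timestamp_unit="ms")
table = pf.read()

# Intended but currently failing: read_table goes through _ParquetDatasetV2,
# which should forward the option down to ParquetFileFormat / ParquetScanOptions.
table2 = pq.read_table(path, coerce_int96_timestamp_unit="ms")
```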