isichei commented on pull request #10575:
URL: https://github.com/apache/arrow/pull/10575#issuecomment-870414553


   I've gotten stuck on the dataset API and am not sure what is missing:
   
   - `pq.ParquetFile` definitely works with the new 
`coerce_int96_timestamp_unit` parameter.
   - I've added the new parameter to `ParquetScanOptions` in `_dataset.pyx`, 
since that class has access to the `ArrowReaderProperties` C++ class, which 
has the setter for the parameter.
   - I have also added the parameter to the `_ParquetDatasetV2` (in 
`parquet.py`) allowing it to be passed down to `ParquetFileFormat` and 
`ParquetScanOptions`.
   - I've added a test (in `test_dataset.py::test_parquet_scan_options`) to 
check that this is actually being set properly and as far as I can tell it is 
being set.
   - However, when I call `pq.read_table`, which uses `_ParquetDatasetV2`, the 
test fails: it looks like `coerce_int96_timestamp_unit` is not being applied 
when reading the parquet file, and I can't figure out why (see 
`parquet/test_datetime.py::test_coerce_int96_timestamp_overflow[read_table]`).
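
For context on why the coercion unit matters, here is a minimal stdlib-only sketch. It assumes the usual INT96 timestamp layout (8 bytes of nanoseconds-of-day followed by a 4-byte Julian day, little-endian). The helper names are hypothetical and are not pyarrow APIs. It only illustrates the arithmetic behind the overflow, not what the reader does internally.

```python
import struct
from datetime import date

# Assumption: INT96 timestamps are 8 bytes nanoseconds-of-day + 4 bytes Julian day.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01
SECONDS_PER_DAY = 86_400
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
UNITS_PER_SECOND = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}


def encode_int96(day: date, nanos_of_day: int = 0) -> bytes:
    """Pack a date and nanoseconds-of-day into a 12-byte INT96 value (hypothetical helper)."""
    days_since_epoch = day.toordinal() - date(1970, 1, 1).toordinal()
    julian_day = days_since_epoch + JULIAN_DAY_OF_UNIX_EPOCH
    return struct.pack("<qi", nanos_of_day, julian_day)


def decode_int96(raw: bytes, unit: str) -> int:
    """Return the time since the Unix epoch in `unit` as an unbounded Python int."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    per_second = UNITS_PER_SECOND[unit]
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    whole_days = days_since_epoch * SECONDS_PER_DAY * per_second
    # Truncate the nanosecond part to the requested resolution.
    within_day = nanos_of_day // (10**9 // per_second)
    return whole_days + within_day


def fits_in_int64(value: int) -> bool:
    """True if the value can be stored in a signed 64-bit timestamp."""
    return INT64_MIN <= value <= INT64_MAX


if __name__ == "__main__":
    far_future = encode_int96(date(9999, 12, 31))
    for unit in ("ns", "ms"):
        value = decode_int96(far_future, unit)
        print(unit, "fits in int64:", fits_in_int64(value))
```

A date such as 9999-12-31 does not fit in a 64-bit nanosecond timestamp but does fit at millisecond resolution, so if the unit is silently dropped on the `read_table` path, a default of nanoseconds would overflow.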
   
   ### Next Steps
   
   If nobody has time to look at this in more detail, it might be better for 
me to have this PR expose the parameter only on `ParquetFile` and drop it from 
the Dataset API (so it is ready to merge for the V5 release). Then I can create 
a new feature request on JIRA to expose the new parameter through the Dataset 
API.
   
   Let me know what you think.



