scovich commented on PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-2808130256

   > I think we need to be very careful to balance adding new features in the 
   > parquet reader with keeping it fast and maintainable. I haven't had a chance to 
   > look at this PR yet, but I do worry about performance and complexity
   
   100% agreed that simplicity and maintainability are paramount... but row 
numbers are a pretty fundamental feature that's very hard to emulate in higher 
layers if the parquet reader doesn't support them. Back when 
https://github.com/delta-io/delta first took a dependency on row numbers, 
Spark's parquet reader did not yet support them; we had to disable row group 
pruning and other optimizations in order to make it (mostly) safe to manually 
compute row numbers in the query engine. It was really painful. 
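   
   To make the row group pruning problem concrete, here is a minimal sketch 
(plain Rust, no arrow-rs APIs). `RowGroup`, `naive_row_numbers`, 
`reader_row_numbers` and `demo_groups` are hypothetical names made up for 
illustration; they are not part of this PR or of the parquet crate. The point 
is only that a higher layer counting the rows it receives cannot see the rows 
in skipped row groups, while the reader knows every group's row count from the 
file metadata. 

```rust
/// Hypothetical stand-in for per-row-group metadata; not an arrow-rs type.
struct RowGroup {
    /// Number of rows stored in this row group (known from the file footer).
    num_rows: u64,
    /// True if a filter (e.g. min/max statistics) lets the reader skip it.
    pruned: bool,
}

/// What a higher layer can do on its own: number the rows it actually
/// receives, starting from zero. Skipped row groups are invisible to it.
fn naive_row_numbers(groups: &[RowGroup]) -> Vec<u64> {
    let emitted: u64 = groups
        .iter()
        .filter(|g| !g.pruned)
        .map(|g| g.num_rows)
        .sum();
    (0..emitted).collect()
}

/// What the reader can do: track the file-level index of each row group's
/// first row, advancing it even for groups that are skipped.
fn reader_row_numbers(groups: &[RowGroup]) -> Vec<u64> {
    let mut out = Vec::new();
    let mut first_row = 0u64;
    for g in groups {
        if !g.pruned {
            out.extend(first_row..first_row + g.num_rows);
        }
        first_row += g.num_rows;
    }
    out
}

/// Example file layout: three row groups, the middle one pruned.
fn demo_groups() -> Vec<RowGroup> {
    vec![
        RowGroup { num_rows: 3, pruned: false },
        RowGroup { num_rows: 2, pruned: true },
        RowGroup { num_rows: 2, pruned: false },
    ]
}
```

   With the middle group pruned, the naive count yields 0..=4 while the true 
file positions of the surviving rows are 0, 1, 2, 5, 6. The two only agree 
when nothing is skipped, which is why pruning had to be turned off. 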
   
   AFAIK, most parquet readers now support row numbers. We can add 
[DuckDB](https://github.com/duckdb/duckdb/blob/main/extension/parquet/include/reader/row_number_column_reader.hpp)
 and 
[Iceberg](https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReaders.java#L292)
 to the ones already mentioned above. I was actually surprised to trip over 
this PR and learn that arrow-parquet does not yet support row numbers. 

