scovich commented on PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-2808130256

> I think we need to be very careful to balance adding new features in the parquet reader with keeping it fast and maintainable. I haven't had a chance to look at this PR yet, but I do worry about performance and complexity

100% agreed that simplicity and maintainability are paramount... but row numbers are a pretty fundamental feature that's very hard to emulate in higher layers if the parquet reader doesn't support them.

Back when https://github.com/delta-io/delta first took a dependency on row numbers, spark's parquet reader did not yet support them; we had to disable row group pruning and other optimizations in order to make it (mostly) safe to manually compute row numbers in the query engine. It was really painful.

AFAIK, most parquet readers now support row numbers. We can add [DuckDB](https://github.com/duckdb/duckdb/blob/main/extension/parquet/include/reader/row_number_column_reader.hpp) and [Iceberg](https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReaders.java#L292) to the ones already mentioned above. I was actually surprised to trip over this PR and learn that arrow-parquet does not yet support row numbers.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org