Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]

via GitHub Tue, 15 Apr 2025 20:26:50 -0700


scovich commented on PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-2808130256

> I think we need to be very careful to balance adding new features in the
> parquet reader with keeping it fast and maintainable. I haven't had a chance to
> look at this PR yet, but I do worry about performance and complexity

100% agreed that simplicity and maintainability are paramount... but row
numbers are a pretty fundamental feature that's very hard to emulate in higher
layers if the parquet reader doesn't support them. Back when
https://github.com/delta-io/delta first took a dependency on row numbers,
Spark's parquet reader did not yet support them; we had to disable row group
pruning and other optimizations in order to make it (mostly) safe to manually
compute row numbers in the query engine. It was really painful.
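
To make the pruning problem concrete, here is a minimal sketch of why counting rows in the engine breaks as soon as a row group is skipped. It is not the arrow-rs API or the approach taken in this PR: `RowGroup`, `first_row_numbers`, `row_numbers_for` and `naive_row_numbers` are hypothetical names used only for illustration, and the only thing assumed about the file is that each row group's row count is known from the footer.

```rust
/// Hypothetical stand-in for per-row-group metadata (not an arrow-rs type).
struct RowGroup {
    num_rows: u64,
}

/// First file-level row number of each row group: a prefix sum over row counts.
fn first_row_numbers(groups: &[RowGroup]) -> Vec<u64> {
    let mut next = 0u64;
    groups
        .iter()
        .map(|g| {
            let first = next;
            next += g.num_rows;
            first
        })
        .collect()
}

/// Row numbers a reader can emit when only the `selected` row groups are read.
/// Each surviving row keeps its position in the file.
fn row_numbers_for(groups: &[RowGroup], selected: &[usize]) -> Vec<u64> {
    let firsts = first_row_numbers(groups);
    selected
        .iter()
        .flat_map(|&i| firsts[i]..firsts[i] + groups[i].num_rows)
        .collect()
}

/// What a higher layer computes if it simply counts the rows it receives.
/// It cannot see the rows that were pruned, so the numbering drifts.
fn naive_row_numbers(groups: &[RowGroup], selected: &[usize]) -> Vec<u64> {
    let total: u64 = selected.iter().map(|&i| groups[i].num_rows).sum();
    (0..total).collect()
}
```

With row groups of 3, 2 and 4 rows and the middle one pruned (`selected = [0, 2]`), `row_numbers_for` yields `[0, 1, 2, 5, 6, 7, 8]` while `naive_row_numbers` yields `[0, 1, 2, 3, 4, 5, 6]`. The second result is wrong for every row after the pruned group, which is why the counting had to happen with pruning disabled.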

AFAIK, most parquet readers now support row numbers. We can add
[DuckDB](https://github.com/duckdb/duckdb/blob/main/extension/parquet/include/reader/row_number_column_reader.hpp)
and
[Iceberg](https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReaders.java#L292)
to the ones already mentioned above. I was actually surprised to trip over
this PR and learn that arrow-parquet does not yet support row numbers.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
