[I] Return file row number in Parquet readers [arrow-rs]

via GitHub Sun, 16 Mar 2025 04:15:26 -0700


jkylling opened a new issue, #7299:
URL: https://github.com/apache/arrow-rs/issues/7299


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   Deletion vectors in the Delta Lake and Iceberg table formats are defined in 
terms of row numbers within individual Parquet files. To filter out the rows 
that a deletion vector marks as deleted, we need a way to know the file row 
number of each row read by the Arrow Parquet reader.
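
   As a rough illustration of the use case, here is a minimal sketch of how a 
consumer could apply a deletion vector once file row numbers are available. 
The function name `filter_deleted` is hypothetical, a `HashSet<u64>` stands in 
for whatever bitmap encoding the table format actually uses, and row numbers 
are assumed to be zero-based.

```rust
use std::collections::HashSet;

/// Keeps only the values whose file row number is absent from the deletion
/// vector. `row_numbers[i]` is the file-level row number of `values[i]`.
/// (Hypothetical helper; not part of arrow-rs, delta-rs or iceberg-rs.)
fn filter_deleted<T: Clone>(
    values: &[T],
    row_numbers: &[u64],
    deleted: &HashSet<u64>,
) -> Vec<T> {
    // Every value needs exactly one row number.
    assert_eq!(values.len(), row_numbers.len());
    values
        .iter()
        .zip(row_numbers)
        // Drop each row whose file row number is marked as deleted.
        .filter(|(_, n)| !deleted.contains(*n))
        .map(|(v, _)| v.clone())
        .collect()
}
```

   With rows `a, b, c, d` at row numbers `0..=3` and a deletion vector 
containing `1` and `3`, this keeps `a` and `c`.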
   
   **Describe the solution you'd like**
   The Arrow Parquet reader should optionally return a column containing the 
row number of each row. We would add a method 
`ArrowReaderBuilder::with_row_numbers(self, with_row_numbers: bool) -> Self`, 
which configures the Arrow Parquet reader to add an extra column named 
`row_number` to its schema. Alternatively, the method could be 
`ArrowReaderBuilder::with_row_number_column(self, with_row_numbers: 
Option<String>) -> Self` to make the column name configurable. This column 
contains the row number within the file.
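
   To make the intended contents of the column concrete, here is a minimal 
sketch of how file row numbers relate to row groups. It is not the proposed 
implementation: `file_row_numbers` is a hypothetical helper, the row group 
sizes are passed in directly rather than read from Parquet metadata, and row 
numbers are assumed to be zero-based. It shows why the position of a row in 
the reader's output is not enough once row groups are skipped.

```rust
/// Returns the file-level row number of every row in the selected row groups.
/// `row_group_sizes[g]` is the number of rows in row group `g`;
/// `selected_groups` lists the row groups that are actually read, in order.
/// (Hypothetical helper for illustration only.)
fn file_row_numbers(row_group_sizes: &[u64], selected_groups: &[usize]) -> Vec<u64> {
    // The first row of each row group sits at the running sum of the sizes
    // of all preceding row groups.
    let mut offsets = Vec::with_capacity(row_group_sizes.len());
    let mut next = 0u64;
    for &size in row_group_sizes {
        offsets.push(next);
        next += size;
    }
    // Emit a contiguous range of row numbers for each selected row group.
    selected_groups
        .iter()
        .flat_map(|&g| offsets[g]..offsets[g] + row_group_sizes[g])
        .collect()
}
```

   For row groups of 3, 2 and 4 rows, reading only groups 0 and 2 yields row 
numbers `0, 1, 2, 5, 6, 7, 8`: the skipped group leaves a gap that a consumer 
cannot reconstruct from output positions alone.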
   
   **Describe alternatives you've considered**
   There is a corresponding DataFusion issue, 
https://github.com/apache/datafusion/issues/13261. It considers an alternative 
using primary keys and existing SQL primitives, but this comes with a 
performance penalty. The consensus on that issue is:
   
   > I agree with the assessment that the information must be coming from the 
file reader itself.
   
   That is, the Arrow Parquet reader.
   
   **Additional context**
   Please see https://github.com/apache/datafusion/issues/13261 for the 
corresponding issue in DataFusion. There is also a discussion in DataFusion 
about adding system/metadata columns in 
https://github.com/apache/datafusion/pull/14057, through which this additional 
file row number column could be exposed. However, system/metadata columns are 
not required to support deletion vectors in delta-rs or iceberg-rs, since 
their DataFusion-based readers use the DataFusion ParquetSource directly to 
construct the execution plans for the scans of their TableProviders.


