Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
jkylling closed pull request #7307: Add support for file row numbers in Parquet readers URL: https://github.com/apache/arrow-rs/pull/7307 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
vustef commented on PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-3446831909 > > One concern I have with the approach here is how to provide exact row numbers if we start selectively reading row group metadata. If we don't have metadata for all preceding row groups, we can't know the starting row number. This at least argues for reverting back to using an `Option` for the start index. > > I don't think we will be able to provide row numbers if we don't have all the preceding row group metadata > > Given the main usecase I have heard so far is indexing and delete vectors, which require exact and accurate row numbers, I think it would be better if the reader simply returned an error if it was configured to read row numbers but didn't have enough information to do so Sounds good, I'll bring back the `Option`, and make row number array reader error out if it lacks that information. I'll hold off with pushing changes here until we finish discussion on the GH issue. And most likely will create a new PR, because I'm not able to update description of this one nor move it between draft and active. (If jkylling doesn't object. Will add due credits to him in there ofc, and keep his commits). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
alamb commented on PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-3446585278 > One concern I have with the approach here is how to provide exact row numbers if we start selectively reading row group metadata. If we don't have metadata for all preceding row groups, we can't know the starting row number. This at least argues for reverting back to using an `Option` for the start index. I don't think we will be able to provide row numbers if we don't have all the preceding row group metadata Given the main usecase I have heard so far is indexing and delete vectors, which require exact and accurate row numbers, I think it would be better if the reader simply returned an error if it was configured to read row numbers but didn't have enough information to do so -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
etseidl commented on PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-3443712138 I agree that before putting too much effort into this PR we agree on the correct way to implement row numbers (I defer to @alamb and others for the arrow side of this). One concern I have with the approach here is how to provide exact row numbers if we start selectively reading row group metadata. If we don't have metadata for all preceding row groups, we can't know the starting row number. This at least argues for reverting back to using an `Option` for the start index. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
vustef commented on PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-3443679111 @jkylling at some point I'm going to need to change the description of the PR, not sure how best to do that. Also perhaps it'd be better fit for the current stage to be a draft, until the API is agreed upon (although that is not too important). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
jkylling commented on PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-3436960965 > @jkylling what tests did you use to run on each push? Wondering about the minimal dev loop I can do before pushing. It's been a while since I worked on this, but this is what showed up in my shell history: ``` # variations of the below for running tests cargo test -p parquet --lib --features async cargo test -p parquet --lib --features async -- test_read_only_row_numbers # Also this to format the crate https://github.com/apache/arrow-rs/issues/6179 cargo fmt -p parquet -- --config skip_children=true `find ./parquet -name "*.rs" \! -name format.rs` ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
vustef commented on PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-3436858937 @jkylling what tests did you use to run on each push? Wondering about the minimal dev loop I can do before pushing. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
vustef commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2455083318
##
parquet/src/file/metadata/mod.rs:
##
@@ -700,7 +708,11 @@ impl RowGroupMetaData {
}
/// Method to convert from Thrift.
-pub fn from_thrift(schema_descr: SchemaDescPtr, mut rg: RowGroup) ->
Result {
+pub fn from_thrift(
+schema_descr: SchemaDescPtr,
+mut rg: RowGroup,
+first_row_number: Option,
Review Comment:
I merged main now and resolved these conflicts (next step is addressing the
API questions). @jkylling could we make `first_row_number` non-optional now?
Also now I think that it's not a breaking change, when it comes to metadata.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
scovich commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2445582645
##
parquet/src/file/metadata/mod.rs:
##
@@ -700,7 +708,11 @@ impl RowGroupMetaData {
}
/// Method to convert from Thrift.
-pub fn from_thrift(schema_descr: SchemaDescPtr, mut rg: RowGroup) ->
Result {
+pub fn from_thrift(
+schema_descr: SchemaDescPtr,
+mut rg: RowGroup,
+first_row_number: Option,
Review Comment:
If this lands in arrow-57, it can be a breaking change. Otherwise... agree
we need to be careful.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
vustef commented on code in PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2445251543 ## parquet/src/arrow/array_reader/builder.rs: ## Review Comment: Hey @jkylling, yes, please do that if you can, happy to continue where you left. I'd also need some guidance from @scovich and @alamb on the preferred path forward. And potentially help from @etseidl if I hit a wall with merging metadata changes that happened in the meanwhile (but more on that once I try it out). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
jkylling commented on code in PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2445234561 ## parquet/src/arrow/array_reader/builder.rs: ## Review Comment: Hey @vustef! I'd be very happy if you want to help get row number support into the Parquet reader, either with this PR or through other alternatives. If you want to pick up this PR I can give you commit rights to the branch? Sadly, I don't have capacity to work on this PR at the moment. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
vustef commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2445150435
##
parquet/src/arrow/arrow_reader/mod.rs:
##
@@ -89,6 +91,7 @@ impl ArrowReaderBuilder {
selection: None,
limit: None,
offset: None,
+row_number_column: None,
Review Comment:
Should we also change `schema` and/or `parquet_schema` functions to include
the `row_number_column`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
vustef commented on code in PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2444804619 ## parquet/src/arrow/array_reader/builder.rs: ## @@ -52,12 +70,13 @@ fn build_reader( field: &ParquetField, mask: &ProjectionMask, row_groups: &dyn RowGroups, +row_number_column: Option<&str>, Review Comment: Any chance we could also return this column with a field ID specified by the user? For Iceberg, we would like the field ID to be according to the spec, and ideally the record batches come already with such schema, rather than having to post-process them. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
vustef commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2444641611
##
parquet/src/arrow/array_reader/builder.rs:
##
Review Comment:
Thrilled to see this!
Please let me know if I can help in any way. I can make it my top priority
to work on this, as we need to make use of it in the next few weeks.
Our use-case is to leverage this from iceberg-rust, which uses
`ParquetRecordBatchStreamBuilder`. The API seems to work for that, but I
understand from other comments that it may not be the most desirable one -
happy to help either with research/proposal or with the implementation of the
chosen option.
##
parquet/src/file/metadata/mod.rs:
##
@@ -700,7 +708,11 @@ impl RowGroupMetaData {
}
/// Method to convert from Thrift.
-pub fn from_thrift(schema_descr: SchemaDescPtr, mut rg: RowGroup) ->
Result {
+pub fn from_thrift(
+schema_descr: SchemaDescPtr,
+mut rg: RowGroup,
+first_row_number: Option,
Review Comment:
@etseidl seems to have refactored quite a lot of metadata / thrift
functionality, but I'm wondering, since this function was used directly by
delta-rs, and is thus considered an API, shouldn't we avoid changing it, and
rather factor out the code that does the core of conversion, and add another
entry-point that has `first_row_number`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
scovich commented on code in PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2380158074 ## parquet/src/arrow/array_reader/builder.rs: ## @@ -52,12 +70,13 @@ fn build_reader( field: &ParquetField, mask: &ProjectionMask, row_groups: &dyn RowGroups, +row_number_column: Option<&str>, Review Comment: That could potentially work, but the problem is a row number column is only meaningful in a read request schema. Once row numbers hit the output, they're just normal int64 values from then on. Things get a lot harder to reason about if the extension type persists. For example, in a join of multiple tables, where each scan is producing row numbers for its respective files, one could easily end up with two row number columns in the join's output. And the parquet writer would definitely need to block writing such columns, or at least strip away the metadata? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
alamb commented on code in PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2380110050 ## parquet/src/arrow/array_reader/builder.rs: ## @@ -52,12 +70,13 @@ fn build_reader( field: &ParquetField, mask: &ProjectionMask, row_groups: &dyn RowGroups, +row_number_column: Option<&str>, Review Comment: 🤔 maybe we could add an arrow extension type, similarly to what we are doing with Variant and Geometry -- so someone would request a column "foo: int64" with an arrow extension type of "row_number" or something 🤔 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
scovich commented on code in PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2364598624 ## parquet/src/arrow/array_reader/builder.rs: ## @@ -52,12 +70,13 @@ fn build_reader( field: &ParquetField, mask: &ProjectionMask, row_groups: &dyn RowGroups, +row_number_column: Option<&str>, Review Comment: > What do other parquet readers do to represent row numbers in their output schema? https://github.com/apache/arrow-rs/pull/7307#issuecomment-2808130256, posted Apr 15, might be a starting point? > AFAIK, most parquet readers now support row numbers. We can add [DuckDB](https://github.com/duckdb/duckdb/blob/main/extension/parquet/include/reader/row_number_column_reader.hpp) and [Iceberg](https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReaders.java#L292) to the ones already mentioned above. Duckdb uses a [column schema type](https://github.com/duckdb/duckdb/blob/main/extension/parquet/parquet_reader.cpp#L411-L412) approach. Interestingly, that's new -- last time I looked (nearly a year go) it required the reader to pass options along with the schema, and one of the options was to request row numbers (which then became an extra unnamed column at the end of the regular schema). I think that approach didn't scale as they started needing more and more special column types. I see geometry, variant, and non-materialiaed expressions, for example. Iceberg's parquet reader works almost exclusively from field ids, and row index has a baked in field id from the range of metadata row ids. Spark uses a metadata column approach, identified by a special name (`_metadata._rowid`); I don't remember how precisely that maps to the underlying parquet reader. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
scovich commented on PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-3313868161 > I think the biggest thing we need to do is to sort out the API for "how does a user request the (virtual) row number column" as todays `ProjectionMask` is insufficient > > @scovich and @etseidl 's idea to use some sort of Arrow metadata is interesting, but I am not quite sure how it would look The "standard" way in most engines I've seen would (in arrow-rs) include an extension type, that the parquet reader recognizes, in the parquet reader's read schema. Nice, because other readers could choose to honor the same extension type and produce row indexes as well. But it does open the question of whether the parquet reader's output should strip away the metadata -- since arguably the row indexes are just normal data once they've been produced -- and if not, how to prevent e.g. writing the values of a "row index" field back to parquet. Is that a bearable approach? Of should we keep thinking of other ways? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
alamb commented on PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-3312598827 > > row numbers are a pretty fundamental feature that's very hard to emulate in higher layers if the parquet reader doesn't support them > > +1 on this painpoint, working around this lack of capability from a client perspective is very challenging and comes with a bunch of correctness risks (e.g. we can write some rowIndex column client-side, but then we have to be 100% sure that the emitted rowIndex will perfectly match the Parquet files, which can get quite tricky especially in multi-threaded executions etc.) > > Would love to this feature landing in Arrow-rs + Datafusion Thanks @16pierre -- I agree there is no good workaround for adding row numbers to the output of the parquet reader I think the biggest thing we need to do is to sort out the API for "how does a user request the (virtual) row number column" as todays `ProjectionMask` is insufficient @scovich and @etseidl 's idea to use some sort of Arrow metadata is interesting, but I am not quite sure how it would look -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
alamb commented on code in PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2363247491 ## parquet/src/arrow/array_reader/builder.rs: ## @@ -52,12 +70,13 @@ fn build_reader( field: &ParquetField, mask: &ProjectionMask, row_groups: &dyn RowGroups, +row_number_column: Option<&str>, Review Comment: What do other parquet readers do to represent row numbers in their output schema? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
16pierre commented on PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-3307738470 > row numbers are a pretty fundamental feature that's very hard to emulate in higher layers if the parquet reader doesn't support them +1 on this painpoint, working around this lack of capability from a client perspective is very challenging and comes with a bunch of correctness risks (e.g. we can write some rowIndex column client-side, but then we have to be 100% sure that the emitted rowIndex will perfectly match the Parquet files, which can get quite tricky especially in multi-threaded executions etc.) Would love to this feature landing in Arrow-rs + Datafusion -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
alamb commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2084966343
##
parquet/src/arrow/array_reader/row_number.rs:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::arrow::array_reader::ArrayReader;
+use crate::errors::{ParquetError, Result};
+use crate::file::metadata::RowGroupMetaData;
+use arrow_array::{ArrayRef, Int64Array};
+use arrow_schema::DataType;
+use std::any::Any;
+use std::collections::VecDeque;
+use std::sync::Arc;
+
+pub(crate) struct RowNumberReader {
+row_numbers: Vec,
+row_groups: RowGroupSizeIterator,
+}
+
+impl RowNumberReader {
+pub(crate) fn try_new(row_groups: impl IntoIterator) ->
Result
+where
+I: TryInto,
+{
+let row_groups = RowGroupSizeIterator::try_new(row_groups)?;
+Ok(Self {
+row_numbers: Vec::new(),
+row_groups,
+})
+}
+}
+
+impl ArrayReader for RowNumberReader {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn get_data_type(&self) -> &DataType {
+&DataType::Int64
+}
+
+fn read_records(&mut self, batch_size: usize) -> Result {
+let read = self
+.row_groups
+.read_records(batch_size, &mut self.row_numbers);
+Ok(read)
+}
+
+fn consume_batch(&mut self) -> Result {
+Ok(Arc::new(Int64Array::from_iter(self.row_numbers.drain(..
+}
+
+fn skip_records(&mut self, num_records: usize) -> Result {
+let skipped = self.row_groups.skip_records(num_records);
+Ok(skipped)
+}
+
+fn get_def_levels(&self) -> Option<&[i16]> {
+None
+}
+
+fn get_rep_levels(&self) -> Option<&[i16]> {
+None
+}
+}
+
+struct RowGroupSizeIterator {
+row_groups: VecDeque,
+}
+
+impl RowGroupSizeIterator {
+fn try_new(row_groups: impl IntoIterator) -> Result
+where
+I: TryInto,
Review Comment:
FWIW @zhuqi-lucas and I are working on improvements to the filter
application here, which may result in some additional API churn:
- https://github.com/apache/arrow-rs/pull/7454
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
jkylling commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2083606207
##
parquet/src/arrow/array_reader/row_number.rs:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::arrow::array_reader::ArrayReader;
+use crate::errors::{ParquetError, Result};
+use crate::file::metadata::RowGroupMetaData;
+use arrow_array::{ArrayRef, Int64Array};
+use arrow_schema::DataType;
+use std::any::Any;
+use std::collections::VecDeque;
+use std::sync::Arc;
+
+pub(crate) struct RowNumberReader {
+row_numbers: Vec,
+row_groups: RowGroupSizeIterator,
+}
+
+impl RowNumberReader {
+pub(crate) fn try_new(row_groups: impl IntoIterator) ->
Result
+where
+I: TryInto,
+{
+let row_groups = RowGroupSizeIterator::try_new(row_groups)?;
+Ok(Self {
+row_numbers: Vec::new(),
+row_groups,
+})
+}
+}
+
+impl ArrayReader for RowNumberReader {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn get_data_type(&self) -> &DataType {
+&DataType::Int64
+}
+
+fn read_records(&mut self, batch_size: usize) -> Result {
+let read = self
+.row_groups
+.read_records(batch_size, &mut self.row_numbers);
+Ok(read)
+}
+
+fn consume_batch(&mut self) -> Result {
+Ok(Arc::new(Int64Array::from_iter(self.row_numbers.drain(..
+}
+
+fn skip_records(&mut self, num_records: usize) -> Result {
+let skipped = self.row_groups.skip_records(num_records);
+Ok(skipped)
+}
+
+fn get_def_levels(&self) -> Option<&[i16]> {
+None
+}
+
+fn get_rep_levels(&self) -> Option<&[i16]> {
+None
+}
+}
+
+struct RowGroupSizeIterator {
+row_groups: VecDeque,
+}
+
+impl RowGroupSizeIterator {
+fn try_new(row_groups: impl IntoIterator) -> Result
+where
+I: TryInto,
Review Comment:
@scovich I see you are involved in the maintenance of delta-kernel-rs. If
you are interested, I've started on an implementation of deletion vector read
support in delta-rs in [this
branch](https://github.com/delta-io/delta-rs/compare/main...jkylling:delta-rs:feature/deletion-vectors),
based on a back port of an early version of this PR to arrow-54.2.1. The PR is
still very rough, but the read path has got okay test coverage and it's able to
read tables with deletion vectors produced by Spark correctly. The write
support for deletion vectors is rudimentary (deletion vectors are only used for
deletes when configured, and deleting from the same file twice is unsupported),
and is mostly there to be able to unit test the read support. Unfortunately,
I've not had time to wokr on this lately.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
jkylling commented on code in PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2083595646 ## parquet/src/arrow/array_reader/builder.rs: ## @@ -52,12 +70,13 @@ fn build_reader( field: &ParquetField, mask: &ProjectionMask, row_groups: &dyn RowGroups, +row_number_column: Option<&str>, Review Comment: This would simplify usage of the feature. Having to keep track of the additional row number column is quite cumbersome in clients of this API. One option could be to extend [ParquetFieldType](https://github.com/apache/arrow-rs/blob/fc6936aff50b0b1c8bea984bba2d57dac68a98b3/parquet/src/arrow/schema/complex.rs#L84) with an additional row number type and add it based on the extension type in [ArrowReaderMetadata::with_supplied_metadata](https://github.com/apache/arrow-rs/blob/5350728bb61a3a0a93a8fe797c969a1e6d264169/parquet/src/arrow/arrow_reader/mod.rs#L452)? @etseidl @alamb what do you think about this approach? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
jkylling commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2083589878
##
parquet/src/arrow/array_reader/row_number.rs:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::arrow::array_reader::ArrayReader;
+use crate::errors::{ParquetError, Result};
+use crate::file::metadata::RowGroupMetaData;
+use arrow_array::{ArrayRef, Int64Array};
+use arrow_schema::DataType;
+use std::any::Any;
+use std::collections::VecDeque;
+use std::sync::Arc;
+
+pub(crate) struct RowNumberReader {
+row_numbers: Vec,
+row_groups: RowGroupSizeIterator,
+}
+
+impl RowNumberReader {
+pub(crate) fn try_new(row_groups: impl IntoIterator) ->
Result
+where
+I: TryInto,
+{
+let row_groups = RowGroupSizeIterator::try_new(row_groups)?;
+Ok(Self {
+row_numbers: Vec::new(),
+row_groups,
+})
+}
+}
+
+impl ArrayReader for RowNumberReader {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn get_data_type(&self) -> &DataType {
+&DataType::Int64
+}
+
+fn read_records(&mut self, batch_size: usize) -> Result {
+let read = self
+.row_groups
+.read_records(batch_size, &mut self.row_numbers);
+Ok(read)
+}
+
+fn consume_batch(&mut self) -> Result {
+Ok(Arc::new(Int64Array::from_iter(self.row_numbers.drain(..
+}
+
+fn skip_records(&mut self, num_records: usize) -> Result {
+let skipped = self.row_groups.skip_records(num_records);
+Ok(skipped)
+}
+
+fn get_def_levels(&self) -> Option<&[i16]> {
+None
+}
+
+fn get_rep_levels(&self) -> Option<&[i16]> {
+None
+}
+}
+
+struct RowGroupSizeIterator {
+row_groups: VecDeque,
+}
+
+impl RowGroupSizeIterator {
+fn try_new(row_groups: impl IntoIterator) -> Result
+where
+I: TryInto,
Review Comment:
This is much simpler. Thank you! I suspect we are missing out on some
performance in `skip_records` with this, but the bulk of the data pruning will
likely have happened by pruning Parquet row groups already.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
scovich commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2072470885
##
parquet/src/arrow/array_reader/builder.rs:
##
@@ -52,12 +70,13 @@ fn build_reader(
field: &ParquetField,
mask: &ProjectionMask,
row_groups: &dyn RowGroups,
+row_number_column: Option<&str>,
Review Comment:
Maybe a crazy idea, but wouldn't the implementation be simpler (and more
flexible) with a `RowNumber` [extension
type](https://arrow.apache.org/rust/arrow_schema/extension/trait.ExtensionType.html)?
Then users could do e.g.
```rust
Field::new("row_index", DataType::Int64,
false).with_extension_type(RowNumber))
```
and `build_primitive_reader` could just check for it, no matter where in the
schema it hides, instead of implicitly adding an extra column to the schema?
##
parquet/src/arrow/array_reader/row_number.rs:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::arrow::array_reader::ArrayReader;
+use crate::errors::{ParquetError, Result};
+use crate::file::metadata::RowGroupMetaData;
+use arrow_array::{ArrayRef, Int64Array};
+use arrow_schema::DataType;
+use std::any::Any;
+use std::collections::VecDeque;
+use std::sync::Arc;
+
+pub(crate) struct RowNumberReader {
+row_numbers: Vec,
+row_groups: RowGroupSizeIterator,
+}
+
+impl RowNumberReader {
+pub(crate) fn try_new(row_groups: impl IntoIterator) ->
Result
+where
+I: TryInto,
+{
+let row_groups = RowGroupSizeIterator::try_new(row_groups)?;
+Ok(Self {
+row_numbers: Vec::new(),
+row_groups,
+})
+}
+}
+
+impl ArrayReader for RowNumberReader {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn get_data_type(&self) -> &DataType {
+&DataType::Int64
+}
+
+fn read_records(&mut self, batch_size: usize) -> Result {
+let read = self
+.row_groups
+.read_records(batch_size, &mut self.row_numbers);
+Ok(read)
+}
+
+fn consume_batch(&mut self) -> Result {
+Ok(Arc::new(Int64Array::from_iter(self.row_numbers.drain(..
+}
+
+fn skip_records(&mut self, num_records: usize) -> Result {
+let skipped = self.row_groups.skip_records(num_records);
+Ok(skipped)
+}
+
+fn get_def_levels(&self) -> Option<&[i16]> {
+None
+}
+
+fn get_rep_levels(&self) -> Option<&[i16]> {
+None
+}
+}
+
+struct RowGroupSizeIterator {
+row_groups: VecDeque,
+}
+
+impl RowGroupSizeIterator {
+fn try_new(row_groups: impl IntoIterator) -> Result
+where
+I: TryInto,
Review Comment:
It seems like this whole `RowGroupSizeIterator` thing is a complicated and
error-prone way of chaining several `Range`? Can we use standard iterator
machinery instead?
```rust
pub(crate) struct RowNumberReader {
buffered_row_numbers: Vec,
remaining_row_numbers:
std::iter::Flatten>>,
}
impl RowNumberReader {
pub(crate) fn try_new<'a>(
row_groups: impl Iterator,
) -> Result {
let ranges = row_groups
.map(|rg| {
let first_row_number =
rg.first_row_index().ok_or(ParquetError::General(
"Row group missing row number".to_string(),
))?;
Ok(first_row_number..first_row_number + rg.num_rows())
})
.collect::>>()?;
Ok(Self {
buffered_row_numbers: Vec::new(),
remaining_row_numbers: ranges.into_iter().flatten(),
})
}
// Use `take` on a `&mut Iterator` to consume a number of elements
without consuming the iterator.
fn take(&mut self, batch_size: usize) -> impl Iterator {
(&mut self.remaining_row_numbers).take(batch_size)
}
}
impl ArrayReader for RowNumberReader {
fn read_records(&mut self, batch_size: usize) -> Result {
let starting_len = self.buffered_row_numbers.len();
self.buffered_row_numbers.extend(self.take(batch_size));
Ok(self.buffered_row_numbers.len() - starting_len
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
scovich commented on PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-2808130256 > I think we need to be very careful to balance adding new features in the parquet reader with keeping it fast and maintainable. I haven't had a chance to look at this PR yet, but I do worry about performance and complexity 100% agreed that simplicity and maintainability are paramount... but row numbers are a pretty fundamental feature that's very hard to emulate in higher layers if the parquet reader doesn't support them. Back when https://github.com/delta-io/delta first took a dependency on row numbers, spark's parquet reader did not yet support them; we had to disable row group pruning and other optimizations in order to make it (mostly) safe to manually compute row numbers in the query engine. It was really painful. AFAIK, most parquet readers now support row numbers. We can add [DuckDB](https://github.com/duckdb/duckdb/blob/main/extension/parquet/include/reader/row_number_column_reader.hpp) and [Iceberg](https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReaders.java#L292) to the ones already mentioned above. I was actually surprised to trip over this PR and learn that arrow-parquet does not yet support row numbers. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
etseidl commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2043080056
##
parquet/src/arrow/array_reader/row_number.rs:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::arrow::array_reader::ArrayReader;
+use crate::errors::{ParquetError, Result};
+use crate::file::metadata::RowGroupMetaData;
+use arrow_array::{ArrayRef, Int64Array};
+use arrow_schema::DataType;
+use std::any::Any;
+use std::collections::VecDeque;
+use std::sync::Arc;
+
+pub(crate) struct RowNumberReader {
+row_numbers: Vec,
+row_groups: RowGroupSizeIterator,
+}
+
+impl RowNumberReader {
+pub(crate) fn try_new(row_groups: impl IntoIterator) ->
Result
+where
+I: TryInto,
+{
+let row_groups = RowGroupSizeIterator::try_new(row_groups)?;
+Ok(Self {
+row_numbers: Vec::new(),
+row_groups,
+})
+}
+}
+
+impl ArrayReader for RowNumberReader {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn get_data_type(&self) -> &DataType {
+&DataType::Int64
+}
+
+fn read_records(&mut self, batch_size: usize) -> Result {
+let read = self
+.row_groups
+.read_records(batch_size, &mut self.row_numbers);
+Ok(read)
+}
+
+fn consume_batch(&mut self) -> Result {
+Ok(Arc::new(Int64Array::from_iter(self.row_numbers.drain(..
+}
+
+fn skip_records(&mut self, num_records: usize) -> Result {
+let skipped = self.row_groups.skip_records(num_records);
+Ok(skipped)
+}
+
+fn get_def_levels(&self) -> Option<&[i16]> {
+None
+}
+
+fn get_rep_levels(&self) -> Option<&[i16]> {
+None
+}
+}
+
+struct RowGroupSizeIterator {
+row_groups: VecDeque,
+}
+
+impl RowGroupSizeIterator {
+fn try_new(row_groups: impl IntoIterator) -> Result
+where
+I: TryInto,
+{
+Ok(Self {
+row_groups: VecDeque::from(
+row_groups
+.into_iter()
+.map(TryInto::try_into)
+.collect::>>()?,
+),
+})
Review Comment:
I don't think this is necessary...the full `ParquetMetaData` is already
available everywhere this trait is implemented, so I don't see a need to worry
about adding another metadata structure here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
etseidl commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2043160346
##
parquet/src/file/metadata/reader.rs:
##
@@ -767,13 +767,17 @@ impl ParquetMetaDataReader {
file_decryptor = Some(get_file_decryptor(algo,
file_decryption_properties)?);
}
+let mut first_row_number = 0;
let mut row_groups = Vec::new();
+t_file_metadata.row_groups.sort_by_key(|rg| rg.ordinal);
Review Comment:
Are these sorts necessary? Would the ordinal ever be out of order? They
shouldn't be if I understand the [encryption
spec](https://github.com/apache/parquet-format/blob/master/Encryption.md#52-crypto-structures)
correctly.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
etseidl commented on PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-2803285654 Yes, a benchmark that shows minimal impact with no row numbers would be nice (and hopefully adding row numbers won't be bad either :smile:). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
jkylling commented on PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-2798867336 Thank you for the update, and totally understand other responsibilities are taking up your time. I'll keep on being patient, and maybe do some minor improvements to this PR (use a smaller struct than the full RowGroupMetadata, and add some benchmarks for the RowNumberReader). Just want to make sure we have this PR ready before the next major release approaches. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
alamb commented on PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-2794415909 Yeah, sorry I also have been slammed with many other projects. I'll try and find time to look but I suspect it may be a while -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
etseidl commented on PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-2790791553 > @etseidl @alamb what would it take to move this PR forward? We could add benchmarks to settle concerns about performance, or refactor to try to reduce complexity? Sorry @jkylling, things have been rather hectic lately. I'll try to give it another look this week, along with some benchmarking (but I don't expect any perf hit). I'll just note that since this is a breaking change, it won't be able to be merged until the next major release (July-ish IIRC), so there's plenty of time to get this right. Also, I'll be deferring to those with more project history (e.g. @alamb @tustvold) as to whether the approach here is the best way to achieve the goal. Thank you for your contribution and your patience! 😄 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
jkylling commented on PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-2790217458 @etseidl @alamb what would it take to move this PR forward? We could add benchmarks to settle concerns about performance, or refactor to try to reduce complexity? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
jkylling commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2019443876
##
parquet/src/arrow/array_reader/row_number.rs:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::arrow::array_reader::ArrayReader;
+use crate::errors::{ParquetError, Result};
+use crate::file::metadata::RowGroupMetaData;
+use arrow_array::{ArrayRef, Int64Array};
+use arrow_schema::DataType;
+use std::any::Any;
+use std::collections::VecDeque;
+use std::sync::Arc;
+
+pub(crate) struct RowNumberReader {
+row_numbers: Vec,
+row_groups: RowGroupSizeIterator,
+}
+
+impl RowNumberReader {
+pub(crate) fn try_new(row_groups: impl IntoIterator) ->
Result
+where
+I: TryInto,
+{
+let row_groups = RowGroupSizeIterator::try_new(row_groups)?;
+Ok(Self {
+row_numbers: Vec::new(),
+row_groups,
+})
+}
+}
+
+impl ArrayReader for RowNumberReader {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn get_data_type(&self) -> &DataType {
+&DataType::Int64
+}
+
+fn read_records(&mut self, batch_size: usize) -> Result {
+let read = self
+.row_groups
+.read_records(batch_size, &mut self.row_numbers);
+Ok(read)
+}
+
+fn consume_batch(&mut self) -> Result {
+Ok(Arc::new(Int64Array::from_iter(self.row_numbers.drain(..
+}
+
+fn skip_records(&mut self, num_records: usize) -> Result {
+let skipped = self.row_groups.skip_records(num_records);
+Ok(skipped)
+}
+
+fn get_def_levels(&self) -> Option<&[i16]> {
+None
+}
+
+fn get_rep_levels(&self) -> Option<&[i16]> {
+None
+}
+}
+
+struct RowGroupSizeIterator {
+row_groups: VecDeque,
+}
+
+impl RowGroupSizeIterator {
+fn try_new(row_groups: impl IntoIterator) -> Result
+where
+I: TryInto,
+{
+Ok(Self {
+row_groups: VecDeque::from(
+row_groups
+.into_iter()
+.map(TryInto::try_into)
+.collect::>>()?,
+),
+})
Review Comment:
Yes, I believe we don't have access to all row groups when creating the
array readers.
I took a quick look at the corresponding Parquet reader implementations for
Trino and parquet-java.
Trino:
* Has a boolean to include a row number column,
https://github.com/trinodb/trino/blob/a54d38a30e486a94a365c7f12a94e47beb30b0fa/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L112
* Includes this column when the boolean is set:
https://github.com/trinodb/trino/blob/a54d38a30e486a94a365c7f12a94e47beb30b0fa/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L337
* Has a special block reader for reading row indexes
https://github.com/trinodb/trino/blob/a54d38a30e486a94a365c7f12a94e47beb30b0fa/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L385-L393
I believe the positions play a similar role to our `RowSelectors`.
* Gets row indexes from `RowGroupInfo`, a pruned version of
https://github.com/trinodb/trino/blob/a54d38a30e486a94a365c7f12a94e47beb30b0fa/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L456
* Populates the `fileRowOffset` by iterating through the row groups:
https://github.com/trinodb/trino/blob/master/lib/trino-parquet/src/main/java/io/trino/parquet/metadata/ParquetMetadata.java#L107-L111
parquet-java:
* Has a method for tracking the current row index:
https://github.com/apache/parquet-java/blob/7d1fe32c8c972710a9d780ec5e7d1f95d871374d/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java#L150-L155
* This row index is based on an iterator which starts form a row group row
index,
https://github.com/apache/parquet-java/blob/7d1fe32c8c972710a9d780ec5e7d1f95d871374d/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L311-L339
* This row group row index is initialized by iterating through the row
groups:
https://github.com/apache/parquet-java/blob/7d1fe32c8c972710a9d780ec5e7d1
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
jkylling commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2019451277
##
parquet/src/file/metadata/mod.rs:
##
@@ -584,6 +585,11 @@ impl RowGroupMetaData {
self.num_rows
}
+/// Returns the first row number in this row group.
Review Comment:
Agree. Updated.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
jkylling commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2019443876
##
parquet/src/arrow/array_reader/row_number.rs:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::arrow::array_reader::ArrayReader;
+use crate::errors::{ParquetError, Result};
+use crate::file::metadata::RowGroupMetaData;
+use arrow_array::{ArrayRef, Int64Array};
+use arrow_schema::DataType;
+use std::any::Any;
+use std::collections::VecDeque;
+use std::sync::Arc;
+
+pub(crate) struct RowNumberReader {
+row_numbers: Vec,
+row_groups: RowGroupSizeIterator,
+}
+
+impl RowNumberReader {
+pub(crate) fn try_new(row_groups: impl IntoIterator) ->
Result
+where
+I: TryInto,
+{
+let row_groups = RowGroupSizeIterator::try_new(row_groups)?;
+Ok(Self {
+row_numbers: Vec::new(),
+row_groups,
+})
+}
+}
+
+impl ArrayReader for RowNumberReader {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn get_data_type(&self) -> &DataType {
+&DataType::Int64
+}
+
+fn read_records(&mut self, batch_size: usize) -> Result {
+let read = self
+.row_groups
+.read_records(batch_size, &mut self.row_numbers);
+Ok(read)
+}
+
+fn consume_batch(&mut self) -> Result {
+Ok(Arc::new(Int64Array::from_iter(self.row_numbers.drain(..
+}
+
+fn skip_records(&mut self, num_records: usize) -> Result {
+let skipped = self.row_groups.skip_records(num_records);
+Ok(skipped)
+}
+
+fn get_def_levels(&self) -> Option<&[i16]> {
+None
+}
+
+fn get_rep_levels(&self) -> Option<&[i16]> {
+None
+}
+}
+
+struct RowGroupSizeIterator {
+row_groups: VecDeque,
+}
+
+impl RowGroupSizeIterator {
+fn try_new(row_groups: impl IntoIterator) -> Result
+where
+I: TryInto,
+{
+Ok(Self {
+row_groups: VecDeque::from(
+row_groups
+.into_iter()
+.map(TryInto::try_into)
+.collect::>>()?,
+),
+})
Review Comment:
Yes, I believe we don't have access to all row groups when creating the
array readers.
I took a quick look at the corresponding Parquet reader implementations for
Trino and parquet-java.
Trino:
* Has a boolean to include a row number column,
https://github.com/trinodb/trino/blob/a54d38a30e486a94a365c7f12a94e47beb30b0fa/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L112
* Includes this column when the boolean is set:
https://github.com/trinodb/trino/blob/a54d38a30e486a94a365c7f12a94e47beb30b0fa/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L337
* Has a special block reader for reading row indexes
https://github.com/trinodb/trino/blob/a54d38a30e486a94a365c7f12a94e47beb30b0fa/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L385-L393
I believe the positions play a similar role to our `RowSelectors`.
* Gets row indexes from `RowGroupInfo`, a pruned version of
https://github.com/trinodb/trino/blob/a54d38a30e486a94a365c7f12a94e47beb30b0fa/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L456
* Populates the `fileRowOffset` by iterating through the row groups:
https://github.com/trinodb/trino/blob/master/lib/trino-parquet/src/main/java/io/trino/parquet/metadata/ParquetMetadata.java#L107-L111
parquet-java:
* Has a method for tracking the current row index:
https://github.com/apache/parquet-java/blob/7d1fe32c8c972710a9d780ec5e7d1f95d871374d/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java#L150-L155
* This row index is based on an iterator which starts form a row group row
index,
https://github.com/apache/parquet-java/blob/7d1fe32c8c972710a9d780ec5e7d1f95d871374d/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L311-L339
* This row group row index is initialized by iterating through the row
groups:
https://github.com/apache/parquet-java/blob/7d1fe32c8c972710a9d780ec5e7d1f95d
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
etseidl commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2019127891
##
parquet/src/arrow/array_reader/row_number.rs:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::arrow::array_reader::ArrayReader;
+use crate::errors::{ParquetError, Result};
+use crate::file::metadata::RowGroupMetaData;
+use arrow_array::{ArrayRef, Int64Array};
+use arrow_schema::DataType;
+use std::any::Any;
+use std::collections::VecDeque;
+use std::sync::Arc;
+
+pub(crate) struct RowNumberReader {
+row_numbers: Vec,
+row_groups: RowGroupSizeIterator,
+}
+
+impl RowNumberReader {
+pub(crate) fn try_new(row_groups: impl IntoIterator) ->
Result
+where
+I: TryInto,
+{
+let row_groups = RowGroupSizeIterator::try_new(row_groups)?;
+Ok(Self {
+row_numbers: Vec::new(),
+row_groups,
+})
+}
+}
+
+impl ArrayReader for RowNumberReader {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn get_data_type(&self) -> &DataType {
+&DataType::Int64
+}
+
+fn read_records(&mut self, batch_size: usize) -> Result {
+let read = self
+.row_groups
+.read_records(batch_size, &mut self.row_numbers);
+Ok(read)
+}
+
+fn consume_batch(&mut self) -> Result {
+Ok(Arc::new(Int64Array::from_iter(self.row_numbers.drain(..
+}
+
+fn skip_records(&mut self, num_records: usize) -> Result {
+let skipped = self.row_groups.skip_records(num_records);
+Ok(skipped)
+}
+
+fn get_def_levels(&self) -> Option<&[i16]> {
+None
+}
+
+fn get_rep_levels(&self) -> Option<&[i16]> {
+None
+}
+}
+
+struct RowGroupSizeIterator {
+row_groups: VecDeque,
+}
+
+impl RowGroupSizeIterator {
+fn try_new(row_groups: impl IntoIterator) -> Result
+where
+I: TryInto,
+{
+Ok(Self {
+row_groups: VecDeque::from(
+row_groups
+.into_iter()
+.map(TryInto::try_into)
+.collect::>>()?,
+),
+})
Review Comment:
Answered my own question...it seems there's some complexity here at least
when using the async reader.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
alamb commented on PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-2761739286 Thanks @jkylling and @etseidl -- I think we need to be very careful to balance adding new features in the parquet reader with keeping it fast and maintainable. I haven't had a chance to look at this PR yet, but I do worry about performance and complexity -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
etseidl commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2017697049
##
parquet/src/arrow/array_reader/row_number.rs:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::arrow::array_reader::ArrayReader;
+use crate::errors::{ParquetError, Result};
+use crate::file::metadata::RowGroupMetaData;
+use arrow_array::{ArrayRef, Int64Array};
+use arrow_schema::DataType;
+use std::any::Any;
+use std::collections::VecDeque;
+use std::sync::Arc;
+
+pub(crate) struct RowNumberReader {
+row_numbers: Vec,
+row_groups: RowGroupSizeIterator,
+}
+
+impl RowNumberReader {
+pub(crate) fn try_new(row_groups: impl IntoIterator) ->
Result
+where
+I: TryInto,
+{
+let row_groups = RowGroupSizeIterator::try_new(row_groups)?;
+Ok(Self {
+row_numbers: Vec::new(),
+row_groups,
+})
+}
+}
+
+impl ArrayReader for RowNumberReader {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn get_data_type(&self) -> &DataType {
+&DataType::Int64
+}
+
+fn read_records(&mut self, batch_size: usize) -> Result {
+let read = self
+.row_groups
+.read_records(batch_size, &mut self.row_numbers);
+Ok(read)
+}
+
+fn consume_batch(&mut self) -> Result {
+Ok(Arc::new(Int64Array::from_iter(self.row_numbers.drain(..
+}
+
+fn skip_records(&mut self, num_records: usize) -> Result {
+let skipped = self.row_groups.skip_records(num_records);
+Ok(skipped)
+}
+
+fn get_def_levels(&self) -> Option<&[i16]> {
+None
+}
+
+fn get_rep_levels(&self) -> Option<&[i16]> {
+None
+}
+}
+
+struct RowGroupSizeIterator {
+row_groups: VecDeque,
+}
+
+impl RowGroupSizeIterator {
+fn try_new(row_groups: impl IntoIterator) -> Result
+where
+I: TryInto,
+{
+Ok(Self {
+row_groups: VecDeque::from(
+row_groups
+.into_iter()
+.map(TryInto::try_into)
+.collect::>>()?,
+),
+})
Review Comment:
I'm finding myself a bit uneasy with adding the first row number to the
`RowGroupMetaData`. Rather than that, could this bit here instead be changed to
keep track of the first row number while populating the deque? Is there some
wrinkle I'm missing? Might the row groups be filtered before instantiating the
`RowNumberReader`?
##
parquet/src/file/metadata/mod.rs:
##
@@ -584,6 +585,11 @@ impl RowGroupMetaData {
self.num_rows
}
+/// Returns the first row number in this row group.
Review Comment:
```suggestion
/// Returns the global index number for the first row in this row group.
```
And perhaps use `first_row_index` instead? That may be clearer.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
etseidl commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2015070079
##
parquet/src/arrow/arrow_reader/mod.rs:
##
@@ -239,6 +242,16 @@ impl ArrowReaderBuilder {
..self
}
}
+
+/// Include file row numbers in the output with the given column name
+///
+/// This will add a column to the output record batch with the file row
number
+pub fn with_row_number_column(self, row_number_column: Option) ->
Self {
+Self {
+row_number_column,
Review Comment:
```suggestion
pub fn with_row_number_column(self, row_number_column: String) -> Self {
Self {
row_number_column: Some(row_number_column),
```
Since `row_number_column` defaults to `None` anyway.
##
parquet/src/errors.rs:
##
@@ -52,6 +52,9 @@ pub enum ParquetError {
/// Returned when a function needs more data to complete properly. The
`usize` field indicates
/// the total number of bytes required, not the number of additional bytes.
NeedMoreData(usize),
+/// Returned when an operation needs to know the first row number of a row
group, but the row
+/// number is unknown.
+RowGroupMetaDataMissingRowNumber,
Review Comment:
I'm not convinced we need a new error type here. I think you can just use
`General` for this purpose.
##
parquet/src/arrow/array_reader/builder.rs:
##
@@ -356,14 +392,23 @@ mod tests {
)
.unwrap();
-let array_reader = build_array_reader(fields.as_ref(), &mask,
&file_reader).unwrap();
+let array_reader = build_array_reader(
Review Comment:
I think there should be two tests. `test_create_array_reader` should just
pass `None` for the row number column and remain otherwise unchanged, and then
a second `test_create_array_reader_with_row_numbers` which would be what you
have here.
##
parquet/src/arrow/array_reader/builder.rs:
##
@@ -39,9 +39,29 @@ pub fn build_array_reader(
field: Option<&ParquetField>,
mask: &ProjectionMask,
row_groups: &dyn RowGroups,
+row_number_column: Option,
Review Comment:
I wonder if it would be better to make this `Option<&str>` and then use
`.as_deref()` rather than cloning where this (and the other `build_X`
functions) are called.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
jkylling commented on PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-2753462478 > Thanks for you submission @jkylling, I'll try to get a first pass review done this week. In the meantime please add the Apache license to row_number.rs and correct the other lint errors. 🙏 Updated. Looking forward to the first review! I was very confused as to why cargo format did not work properly, but looks like you are already aware of this (https://github.com/apache/arrow-rs/issues/6179) :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]
etseidl commented on PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-2752690645 Thanks for you submission @jkylling, I'll try to get a first pass review done this week. In the meantime please add the Apache license to row_number.rs and correct the other lint errors. 🙏 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
