Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]

2025-05-12 Thread via GitHub


alamb commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2084966343


##
parquet/src/arrow/array_reader/row_number.rs:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::arrow::array_reader::ArrayReader;
+use crate::errors::{ParquetError, Result};
+use crate::file::metadata::RowGroupMetaData;
+use arrow_array::{ArrayRef, Int64Array};
+use arrow_schema::DataType;
+use std::any::Any;
+use std::collections::VecDeque;
+use std::sync::Arc;
+
+pub(crate) struct RowNumberReader {
+row_numbers: Vec,
+row_groups: RowGroupSizeIterator,
+}
+
+impl RowNumberReader {
+pub(crate) fn try_new(row_groups: impl IntoIterator) -> 
Result
+where
+I: TryInto,
+{
+let row_groups = RowGroupSizeIterator::try_new(row_groups)?;
+Ok(Self {
+row_numbers: Vec::new(),
+row_groups,
+})
+}
+}
+
+impl ArrayReader for RowNumberReader {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn get_data_type(&self) -> &DataType {
+&DataType::Int64
+}
+
+fn read_records(&mut self, batch_size: usize) -> Result {
+let read = self
+.row_groups
+.read_records(batch_size, &mut self.row_numbers);
+Ok(read)
+}
+
+fn consume_batch(&mut self) -> Result {
+Ok(Arc::new(Int64Array::from_iter(self.row_numbers.drain(..
+}
+
+fn skip_records(&mut self, num_records: usize) -> Result {
+let skipped = self.row_groups.skip_records(num_records);
+Ok(skipped)
+}
+
+fn get_def_levels(&self) -> Option<&[i16]> {
+None
+}
+
+fn get_rep_levels(&self) -> Option<&[i16]> {
+None
+}
+}
+
+struct RowGroupSizeIterator {
+row_groups: VecDeque,
+}
+
+impl RowGroupSizeIterator {
+fn try_new(row_groups: impl IntoIterator) -> Result
+where
+I: TryInto,

Review Comment:
   FWIW @zhuqi-lucas  and I are working on improvements to the filter 
application here, which may result in some additional API churn:
   -  https://github.com/apache/arrow-rs/pull/7454



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]

2025-05-11 Thread via GitHub


jkylling commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2083606207


##
parquet/src/arrow/array_reader/row_number.rs:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::arrow::array_reader::ArrayReader;
+use crate::errors::{ParquetError, Result};
+use crate::file::metadata::RowGroupMetaData;
+use arrow_array::{ArrayRef, Int64Array};
+use arrow_schema::DataType;
+use std::any::Any;
+use std::collections::VecDeque;
+use std::sync::Arc;
+
+pub(crate) struct RowNumberReader {
+row_numbers: Vec,
+row_groups: RowGroupSizeIterator,
+}
+
+impl RowNumberReader {
+pub(crate) fn try_new(row_groups: impl IntoIterator) -> 
Result
+where
+I: TryInto,
+{
+let row_groups = RowGroupSizeIterator::try_new(row_groups)?;
+Ok(Self {
+row_numbers: Vec::new(),
+row_groups,
+})
+}
+}
+
+impl ArrayReader for RowNumberReader {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn get_data_type(&self) -> &DataType {
+&DataType::Int64
+}
+
+fn read_records(&mut self, batch_size: usize) -> Result {
+let read = self
+.row_groups
+.read_records(batch_size, &mut self.row_numbers);
+Ok(read)
+}
+
+fn consume_batch(&mut self) -> Result {
+Ok(Arc::new(Int64Array::from_iter(self.row_numbers.drain(..
+}
+
+fn skip_records(&mut self, num_records: usize) -> Result {
+let skipped = self.row_groups.skip_records(num_records);
+Ok(skipped)
+}
+
+fn get_def_levels(&self) -> Option<&[i16]> {
+None
+}
+
+fn get_rep_levels(&self) -> Option<&[i16]> {
+None
+}
+}
+
+struct RowGroupSizeIterator {
+row_groups: VecDeque,
+}
+
+impl RowGroupSizeIterator {
+fn try_new(row_groups: impl IntoIterator) -> Result
+where
+I: TryInto,

Review Comment:
   @scovich I see you are involved in the maintenance of delta-kernel-rs. If 
you are interested, I've started on an implementation of deletion vector read 
support in delta-rs in [this 
branch](https://github.com/delta-io/delta-rs/compare/main...jkylling:delta-rs:feature/deletion-vectors),
 based on a back port of an early version of this PR to arrow-54.2.1. The PR is 
still very rough, but the read path has got okay test coverage and it's able to 
read tables with deletion vectors produced by Spark correctly. The write 
support for deletion vectors is rudimentary (deletion vectors are only used for 
deletes when configured, and deleting from the same file twice is unsupported), 
and is mostly there to be able to unit test the read support. Unfortunately, 
I've not had time to wokr on this lately.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]

2025-05-11 Thread via GitHub


jkylling commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2083595646


##
parquet/src/arrow/array_reader/builder.rs:
##
@@ -52,12 +70,13 @@ fn build_reader(
 field: &ParquetField,
 mask: &ProjectionMask,
 row_groups: &dyn RowGroups,
+row_number_column: Option<&str>,

Review Comment:
   This would simplify usage of the feature. Having to keep track of the 
additional row number column is quite cumbersome in clients of this API. One 
option could be to extend 
[ParquetFieldType](https://github.com/apache/arrow-rs/blob/fc6936aff50b0b1c8bea984bba2d57dac68a98b3/parquet/src/arrow/schema/complex.rs#L84)
 with an additional row number type and add it based on the extension type in 
[ArrowReaderMetadata::with_supplied_metadata](https://github.com/apache/arrow-rs/blob/5350728bb61a3a0a93a8fe797c969a1e6d264169/parquet/src/arrow/arrow_reader/mod.rs#L452)?
 @etseidl @alamb what do you think about this approach?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]

2025-05-11 Thread via GitHub


jkylling commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2083589878


##
parquet/src/arrow/array_reader/row_number.rs:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::arrow::array_reader::ArrayReader;
+use crate::errors::{ParquetError, Result};
+use crate::file::metadata::RowGroupMetaData;
+use arrow_array::{ArrayRef, Int64Array};
+use arrow_schema::DataType;
+use std::any::Any;
+use std::collections::VecDeque;
+use std::sync::Arc;
+
+pub(crate) struct RowNumberReader {
+row_numbers: Vec,
+row_groups: RowGroupSizeIterator,
+}
+
+impl RowNumberReader {
+pub(crate) fn try_new(row_groups: impl IntoIterator) -> 
Result
+where
+I: TryInto,
+{
+let row_groups = RowGroupSizeIterator::try_new(row_groups)?;
+Ok(Self {
+row_numbers: Vec::new(),
+row_groups,
+})
+}
+}
+
+impl ArrayReader for RowNumberReader {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn get_data_type(&self) -> &DataType {
+&DataType::Int64
+}
+
+fn read_records(&mut self, batch_size: usize) -> Result {
+let read = self
+.row_groups
+.read_records(batch_size, &mut self.row_numbers);
+Ok(read)
+}
+
+fn consume_batch(&mut self) -> Result {
+Ok(Arc::new(Int64Array::from_iter(self.row_numbers.drain(..
+}
+
+fn skip_records(&mut self, num_records: usize) -> Result {
+let skipped = self.row_groups.skip_records(num_records);
+Ok(skipped)
+}
+
+fn get_def_levels(&self) -> Option<&[i16]> {
+None
+}
+
+fn get_rep_levels(&self) -> Option<&[i16]> {
+None
+}
+}
+
+struct RowGroupSizeIterator {
+row_groups: VecDeque,
+}
+
+impl RowGroupSizeIterator {
+fn try_new(row_groups: impl IntoIterator) -> Result
+where
+I: TryInto,

Review Comment:
   This is much simpler. Thank you! I suspect we are missing out on some 
performance in `skip_records` with this, but the bulk of the data pruning will 
likely have happened by pruning Parquet row groups already. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]

2025-05-06 Thread via GitHub


scovich commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2072470885


##
parquet/src/arrow/array_reader/builder.rs:
##
@@ -52,12 +70,13 @@ fn build_reader(
 field: &ParquetField,
 mask: &ProjectionMask,
 row_groups: &dyn RowGroups,
+row_number_column: Option<&str>,

Review Comment:
   Maybe a crazy idea, but wouldn't the implementation be simpler (and more 
flexible) with a `RowNumber` [extension 
type](https://arrow.apache.org/rust/arrow_schema/extension/trait.ExtensionType.html)?
 Then users could do e.g.
   ```rust
   Field::new("row_index", DataType::Int64, 
false).with_extension_type(RowNumber))
   ```
   and `build_primitive_reader` could just check for it, no matter where in the 
schema it hides, instead of implicitly adding an extra column to the schema?



##
parquet/src/arrow/array_reader/row_number.rs:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::arrow::array_reader::ArrayReader;
+use crate::errors::{ParquetError, Result};
+use crate::file::metadata::RowGroupMetaData;
+use arrow_array::{ArrayRef, Int64Array};
+use arrow_schema::DataType;
+use std::any::Any;
+use std::collections::VecDeque;
+use std::sync::Arc;
+
+pub(crate) struct RowNumberReader {
+row_numbers: Vec,
+row_groups: RowGroupSizeIterator,
+}
+
+impl RowNumberReader {
+pub(crate) fn try_new(row_groups: impl IntoIterator) -> 
Result
+where
+I: TryInto,
+{
+let row_groups = RowGroupSizeIterator::try_new(row_groups)?;
+Ok(Self {
+row_numbers: Vec::new(),
+row_groups,
+})
+}
+}
+
+impl ArrayReader for RowNumberReader {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn get_data_type(&self) -> &DataType {
+&DataType::Int64
+}
+
+fn read_records(&mut self, batch_size: usize) -> Result {
+let read = self
+.row_groups
+.read_records(batch_size, &mut self.row_numbers);
+Ok(read)
+}
+
+fn consume_batch(&mut self) -> Result {
+Ok(Arc::new(Int64Array::from_iter(self.row_numbers.drain(..
+}
+
+fn skip_records(&mut self, num_records: usize) -> Result {
+let skipped = self.row_groups.skip_records(num_records);
+Ok(skipped)
+}
+
+fn get_def_levels(&self) -> Option<&[i16]> {
+None
+}
+
+fn get_rep_levels(&self) -> Option<&[i16]> {
+None
+}
+}
+
+struct RowGroupSizeIterator {
+row_groups: VecDeque,
+}
+
+impl RowGroupSizeIterator {
+fn try_new(row_groups: impl IntoIterator) -> Result
+where
+I: TryInto,

Review Comment:
   It seems like this whole `RowGroupSizeIterator` thing is a complicated and 
error-prone way of chaining several `Range`? Can we use standard iterator 
machinery instead?
   
   ```rust
   pub(crate) struct RowNumberReader {
   buffered_row_numbers: Vec,
   remaining_row_numbers: 
std::iter::Flatten>>,
   }
   
   impl RowNumberReader {
   pub(crate) fn try_new<'a>(
   row_groups: impl Iterator,
   ) -> Result {
   let ranges = row_groups
   .map(|rg| {
   let first_row_number = 
rg.first_row_index().ok_or(ParquetError::General(
   "Row group missing row number".to_string(),
   ))?;
   Ok(first_row_number..first_row_number + rg.num_rows())
   })
   .collect::>>()?;
   Ok(Self {
   buffered_row_numbers: Vec::new(),
   remaining_row_numbers: ranges.into_iter().flatten(),
   })
   }
   
   // Use `take` on a `&mut Iterator` to consume a number of elements 
without consuming the iterator.
   fn take(&mut self, batch_size: usize) -> impl Iterator {
   (&mut self.remaining_row_numbers).take(batch_size)
   }
   }
   
   impl ArrayReader for RowNumberReader {
   fn read_records(&mut self, batch_size: usize) -> Result {
   let starting_len = self.buffered_row_numbers.len();
   self.buffered_row_numbers.extend(self.take(batch_size));
   Ok(self.buffered_row_numbers.len() - starting_len

Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]

2025-04-15 Thread via GitHub


scovich commented on PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-2808130256

   > I think we need to be very careful to balance adding new features in the 
parquet reader with keeping it fast and maintainable. I haven't had a chance to 
look at this PR yet, but I do worry about performance and complexity
   
   100% agreed that simplicity and maintainability are paramount... but row 
numbers are a pretty fundamental feature that's very hard to emulate in higher 
layers if the parquet reader doesn't support them. Back when 
https://github.com/delta-io/delta first took a dependency on row numbers, 
spark's parquet reader did not yet support them; we had to disable row group 
pruning and other optimizations in order to make it (mostly) safe to manually 
compute row numbers in the query engine. It was really painful. 
   
   AFAIK, most parquet readers now support row numbers. We can add 
[DuckDB](https://github.com/duckdb/duckdb/blob/main/extension/parquet/include/reader/row_number_column_reader.hpp)
 and 
[Iceberg](https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReaders.java#L292)
 to the ones already mentioned above. I was actually surprised to trip over 
this PR and learn that arrow-parquet does not yet support row numbers. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]

2025-04-14 Thread via GitHub


etseidl commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2043080056


##
parquet/src/arrow/array_reader/row_number.rs:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::arrow::array_reader::ArrayReader;
+use crate::errors::{ParquetError, Result};
+use crate::file::metadata::RowGroupMetaData;
+use arrow_array::{ArrayRef, Int64Array};
+use arrow_schema::DataType;
+use std::any::Any;
+use std::collections::VecDeque;
+use std::sync::Arc;
+
+pub(crate) struct RowNumberReader {
+row_numbers: Vec,
+row_groups: RowGroupSizeIterator,
+}
+
+impl RowNumberReader {
+pub(crate) fn try_new(row_groups: impl IntoIterator) -> 
Result
+where
+I: TryInto,
+{
+let row_groups = RowGroupSizeIterator::try_new(row_groups)?;
+Ok(Self {
+row_numbers: Vec::new(),
+row_groups,
+})
+}
+}
+
+impl ArrayReader for RowNumberReader {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn get_data_type(&self) -> &DataType {
+&DataType::Int64
+}
+
+fn read_records(&mut self, batch_size: usize) -> Result {
+let read = self
+.row_groups
+.read_records(batch_size, &mut self.row_numbers);
+Ok(read)
+}
+
+fn consume_batch(&mut self) -> Result {
+Ok(Arc::new(Int64Array::from_iter(self.row_numbers.drain(..
+}
+
+fn skip_records(&mut self, num_records: usize) -> Result {
+let skipped = self.row_groups.skip_records(num_records);
+Ok(skipped)
+}
+
+fn get_def_levels(&self) -> Option<&[i16]> {
+None
+}
+
+fn get_rep_levels(&self) -> Option<&[i16]> {
+None
+}
+}
+
+struct RowGroupSizeIterator {
+row_groups: VecDeque,
+}
+
+impl RowGroupSizeIterator {
+fn try_new(row_groups: impl IntoIterator) -> Result
+where
+I: TryInto,
+{
+Ok(Self {
+row_groups: VecDeque::from(
+row_groups
+.into_iter()
+.map(TryInto::try_into)
+.collect::>>()?,
+),
+})

Review Comment:
   I don't think this is necessary...the full `ParquetMetaData` is already 
available everywhere this trait is implemented, so I don't see a need to worry 
about adding another metadata structure here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]

2025-04-14 Thread via GitHub


etseidl commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2043160346


##
parquet/src/file/metadata/reader.rs:
##
@@ -767,13 +767,17 @@ impl ParquetMetaDataReader {
 file_decryptor = Some(get_file_decryptor(algo, 
file_decryption_properties)?);
 }
 
+let mut first_row_number = 0;
 let mut row_groups = Vec::new();
+t_file_metadata.row_groups.sort_by_key(|rg| rg.ordinal);

Review Comment:
   Are these sorts necessary? Would the ordinal ever be out of order? They 
shouldn't be if I understand the [encryption 
spec](https://github.com/apache/parquet-format/blob/master/Encryption.md#52-crypto-structures)
 correctly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]

2025-04-14 Thread via GitHub


etseidl commented on PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-2803285654

   Yes, a benchmark that shows minimal impact with no row numbers would be nice 
(and hopefully adding row numbers won't be bad either :smile:).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]

2025-04-12 Thread via GitHub


jkylling commented on PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-2798867336

   Thank you for the update, and totally understand other responsibilities are 
taking up your time. I'll keep on being patient, and maybe do some minor 
improvements to this PR (use a smaller struct than the full RowGroupMetadata, 
and add some benchmarks for the RowNumberReader). Just want to make sure we 
have this PR ready before the next major release approaches.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]

2025-04-10 Thread via GitHub


alamb commented on PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-2794415909

   Yeah, sorry I also have been slammed with many other projects. I'll try and 
find time to look but I suspect it may be a while


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]

2025-04-09 Thread via GitHub


etseidl commented on PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-2790791553

   > @etseidl @alamb what would it take to move this PR forward? We could add 
benchmarks to settle concerns about performance, or refactor to try to reduce 
complexity?
   
   Sorry @jkylling, things have been rather hectic lately. I'll try to give it 
another look this week, along with some benchmarking (but I don't expect any 
perf hit). I'll just note that since this is a breaking change, it won't be 
able to be merged until the next major release (July-ish IIRC), so there's 
plenty of time to get this right. Also, I'll be deferring to those with more 
project history (e.g. @alamb @tustvold) as to whether the approach here is the 
best way to achieve the goal. Thank you for your contribution and your 
patience! 😄


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]

2025-04-09 Thread via GitHub


jkylling commented on PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-2790217458

   @etseidl @alamb what would it take to move this PR forward? We could add 
benchmarks to settle concerns about performance, or refactor to try to reduce 
complexity?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]

2025-03-28 Thread via GitHub


jkylling commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2019443876


##
parquet/src/arrow/array_reader/row_number.rs:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::arrow::array_reader::ArrayReader;
+use crate::errors::{ParquetError, Result};
+use crate::file::metadata::RowGroupMetaData;
+use arrow_array::{ArrayRef, Int64Array};
+use arrow_schema::DataType;
+use std::any::Any;
+use std::collections::VecDeque;
+use std::sync::Arc;
+
+pub(crate) struct RowNumberReader {
+row_numbers: Vec,
+row_groups: RowGroupSizeIterator,
+}
+
+impl RowNumberReader {
+pub(crate) fn try_new(row_groups: impl IntoIterator) -> 
Result
+where
+I: TryInto,
+{
+let row_groups = RowGroupSizeIterator::try_new(row_groups)?;
+Ok(Self {
+row_numbers: Vec::new(),
+row_groups,
+})
+}
+}
+
+impl ArrayReader for RowNumberReader {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn get_data_type(&self) -> &DataType {
+&DataType::Int64
+}
+
+fn read_records(&mut self, batch_size: usize) -> Result {
+let read = self
+.row_groups
+.read_records(batch_size, &mut self.row_numbers);
+Ok(read)
+}
+
+fn consume_batch(&mut self) -> Result {
+Ok(Arc::new(Int64Array::from_iter(self.row_numbers.drain(..
+}
+
+fn skip_records(&mut self, num_records: usize) -> Result {
+let skipped = self.row_groups.skip_records(num_records);
+Ok(skipped)
+}
+
+fn get_def_levels(&self) -> Option<&[i16]> {
+None
+}
+
+fn get_rep_levels(&self) -> Option<&[i16]> {
+None
+}
+}
+
+struct RowGroupSizeIterator {
+row_groups: VecDeque,
+}
+
+impl RowGroupSizeIterator {
+fn try_new(row_groups: impl IntoIterator) -> Result
+where
+I: TryInto,
+{
+Ok(Self {
+row_groups: VecDeque::from(
+row_groups
+.into_iter()
+.map(TryInto::try_into)
+.collect::>>()?,
+),
+})

Review Comment:
   Yes, I believe we don't have access to all row groups when creating the 
array readers.
   
   I took a quick look at the corresponding Parquet reader implementations for 
Trino and parquet-java.
   
   Trino:
   * Has a boolean to include a row number column, 
https://github.com/trinodb/trino/blob/a54d38a30e486a94a365c7f12a94e47beb30b0fa/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L112
   * Includes this column when the boolean is set: 
https://github.com/trinodb/trino/blob/a54d38a30e486a94a365c7f12a94e47beb30b0fa/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L337
   * Has a special block reader for reading row indexes 
https://github.com/trinodb/trino/blob/a54d38a30e486a94a365c7f12a94e47beb30b0fa/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L385-L393
 I believe the positions play a similar role to our `RowSelectors`.
   * Gets row indexes from `RowGroupInfo`, a pruned version of  
https://github.com/trinodb/trino/blob/a54d38a30e486a94a365c7f12a94e47beb30b0fa/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L456
   * Populates the `fileRowOffset` by iterating through the row groups: 
https://github.com/trinodb/trino/blob/master/lib/trino-parquet/src/main/java/io/trino/parquet/metadata/ParquetMetadata.java#L107-L111
   
   parquet-java:
   * Has a method for tracking the current row index: 
https://github.com/apache/parquet-java/blob/7d1fe32c8c972710a9d780ec5e7d1f95d871374d/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java#L150-L155
   * This row index is based on an iterator which starts form a row group row 
index, 
https://github.com/apache/parquet-java/blob/7d1fe32c8c972710a9d780ec5e7d1f95d871374d/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L311-L339
   * This row group row index is initialized by iterating through the row 
groups: 
https://github.com/apache/parquet-java/blob/7d1fe32c8c972710a9d780ec5e7d1

Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]

2025-03-28 Thread via GitHub


jkylling commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2019451277


##
parquet/src/file/metadata/mod.rs:
##
@@ -584,6 +585,11 @@ impl RowGroupMetaData {
 self.num_rows
 }
 
+/// Returns the first row number in this row group.

Review Comment:
   Agree. Updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]

2025-03-28 Thread via GitHub


jkylling commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2019443876


##
parquet/src/arrow/array_reader/row_number.rs:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::arrow::array_reader::ArrayReader;
+use crate::errors::{ParquetError, Result};
+use crate::file::metadata::RowGroupMetaData;
+use arrow_array::{ArrayRef, Int64Array};
+use arrow_schema::DataType;
+use std::any::Any;
+use std::collections::VecDeque;
+use std::sync::Arc;
+
+pub(crate) struct RowNumberReader {
+row_numbers: Vec,
+row_groups: RowGroupSizeIterator,
+}
+
+impl RowNumberReader {
+pub(crate) fn try_new(row_groups: impl IntoIterator) -> 
Result
+where
+I: TryInto,
+{
+let row_groups = RowGroupSizeIterator::try_new(row_groups)?;
+Ok(Self {
+row_numbers: Vec::new(),
+row_groups,
+})
+}
+}
+
+impl ArrayReader for RowNumberReader {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn get_data_type(&self) -> &DataType {
+&DataType::Int64
+}
+
+fn read_records(&mut self, batch_size: usize) -> Result {
+let read = self
+.row_groups
+.read_records(batch_size, &mut self.row_numbers);
+Ok(read)
+}
+
+fn consume_batch(&mut self) -> Result {
+Ok(Arc::new(Int64Array::from_iter(self.row_numbers.drain(..
+}
+
+fn skip_records(&mut self, num_records: usize) -> Result {
+let skipped = self.row_groups.skip_records(num_records);
+Ok(skipped)
+}
+
+fn get_def_levels(&self) -> Option<&[i16]> {
+None
+}
+
+fn get_rep_levels(&self) -> Option<&[i16]> {
+None
+}
+}
+
+struct RowGroupSizeIterator {
+row_groups: VecDeque,
+}
+
+impl RowGroupSizeIterator {
+fn try_new(row_groups: impl IntoIterator) -> Result
+where
+I: TryInto,
+{
+Ok(Self {
+row_groups: VecDeque::from(
+row_groups
+.into_iter()
+.map(TryInto::try_into)
+.collect::>>()?,
+),
+})

Review Comment:
   Yes, I believe we don't have access to all row groups when creating the 
array readers.
   
   I took a quick look at the corresponding Parquet reader implementations for 
Trino and parquet-java.
   Trino:
   * Has a boolean to include a row number column, 
https://github.com/trinodb/trino/blob/a54d38a30e486a94a365c7f12a94e47beb30b0fa/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L112
   * Includes this column when the boolean is set: 
https://github.com/trinodb/trino/blob/a54d38a30e486a94a365c7f12a94e47beb30b0fa/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L337
   * Has a special block reader for reading row indexes 
https://github.com/trinodb/trino/blob/a54d38a30e486a94a365c7f12a94e47beb30b0fa/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L385-L393
 I believe the positions play a similar role to our `RowSelectors`.
   * Gets row indexes from `RowGroupInfo`, a pruned version of  
https://github.com/trinodb/trino/blob/a54d38a30e486a94a365c7f12a94e47beb30b0fa/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L456
   * Populates the `fileRowOffset` by iterating through the row groups: 
https://github.com/trinodb/trino/blob/master/lib/trino-parquet/src/main/java/io/trino/parquet/metadata/ParquetMetadata.java#L107-L111
   
   parquet-java:
   * Has a method for tracking the current row index: 
https://github.com/apache/parquet-java/blob/7d1fe32c8c972710a9d780ec5e7d1f95d871374d/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java#L150-L155
   * This row index is based on an iterator which starts form a row group row 
index, 
https://github.com/apache/parquet-java/blob/7d1fe32c8c972710a9d780ec5e7d1f95d871374d/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L311-L339
   * This row group row index is initialized by iterating through the row 
groups: 
https://github.com/apache/parquet-java/blob/7d1fe32c8c972710a9d780ec5e7d1f95d

Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]

2025-03-28 Thread via GitHub


etseidl commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2019127891


##
parquet/src/arrow/array_reader/row_number.rs:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::arrow::array_reader::ArrayReader;
+use crate::errors::{ParquetError, Result};
+use crate::file::metadata::RowGroupMetaData;
+use arrow_array::{ArrayRef, Int64Array};
+use arrow_schema::DataType;
+use std::any::Any;
+use std::collections::VecDeque;
+use std::sync::Arc;
+
+pub(crate) struct RowNumberReader {
+row_numbers: Vec,
+row_groups: RowGroupSizeIterator,
+}
+
+impl RowNumberReader {
+pub(crate) fn try_new(row_groups: impl IntoIterator) -> 
Result
+where
+I: TryInto,
+{
+let row_groups = RowGroupSizeIterator::try_new(row_groups)?;
+Ok(Self {
+row_numbers: Vec::new(),
+row_groups,
+})
+}
+}
+
+impl ArrayReader for RowNumberReader {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn get_data_type(&self) -> &DataType {
+&DataType::Int64
+}
+
+fn read_records(&mut self, batch_size: usize) -> Result {
+let read = self
+.row_groups
+.read_records(batch_size, &mut self.row_numbers);
+Ok(read)
+}
+
+fn consume_batch(&mut self) -> Result {
+Ok(Arc::new(Int64Array::from_iter(self.row_numbers.drain(..
+}
+
+fn skip_records(&mut self, num_records: usize) -> Result {
+let skipped = self.row_groups.skip_records(num_records);
+Ok(skipped)
+}
+
+fn get_def_levels(&self) -> Option<&[i16]> {
+None
+}
+
+fn get_rep_levels(&self) -> Option<&[i16]> {
+None
+}
+}
+
+struct RowGroupSizeIterator {
+row_groups: VecDeque,
+}
+
+impl RowGroupSizeIterator {
+fn try_new(row_groups: impl IntoIterator) -> Result
+where
+I: TryInto,
+{
+Ok(Self {
+row_groups: VecDeque::from(
+row_groups
+.into_iter()
+.map(TryInto::try_into)
+.collect::>>()?,
+),
+})

Review Comment:
   Answered my own question...it seems there's some complexity here at least 
when using the async reader.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]

2025-03-28 Thread via GitHub


alamb commented on PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-2761739286

   Thanks @jkylling and @etseidl  -- I think we need to be very careful to 
balance adding new features in the parquet reader with keeping it fast and 
maintainable. I haven't had a chance to look at this PR yet, but I do worry 
about performance and complexity


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]

2025-03-27 Thread via GitHub


etseidl commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2017697049


##
parquet/src/arrow/array_reader/row_number.rs:
##
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::arrow::array_reader::ArrayReader;
+use crate::errors::{ParquetError, Result};
+use crate::file::metadata::RowGroupMetaData;
+use arrow_array::{ArrayRef, Int64Array};
+use arrow_schema::DataType;
+use std::any::Any;
+use std::collections::VecDeque;
+use std::sync::Arc;
+
+pub(crate) struct RowNumberReader {
+row_numbers: Vec,
+row_groups: RowGroupSizeIterator,
+}
+
+impl RowNumberReader {
+pub(crate) fn try_new(row_groups: impl IntoIterator) -> 
Result
+where
+I: TryInto,
+{
+let row_groups = RowGroupSizeIterator::try_new(row_groups)?;
+Ok(Self {
+row_numbers: Vec::new(),
+row_groups,
+})
+}
+}
+
+impl ArrayReader for RowNumberReader {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn get_data_type(&self) -> &DataType {
+&DataType::Int64
+}
+
+fn read_records(&mut self, batch_size: usize) -> Result {
+let read = self
+.row_groups
+.read_records(batch_size, &mut self.row_numbers);
+Ok(read)
+}
+
+fn consume_batch(&mut self) -> Result {
+Ok(Arc::new(Int64Array::from_iter(self.row_numbers.drain(..
+}
+
+fn skip_records(&mut self, num_records: usize) -> Result {
+let skipped = self.row_groups.skip_records(num_records);
+Ok(skipped)
+}
+
+fn get_def_levels(&self) -> Option<&[i16]> {
+None
+}
+
+fn get_rep_levels(&self) -> Option<&[i16]> {
+None
+}
+}
+
+struct RowGroupSizeIterator {
+row_groups: VecDeque,
+}
+
+impl RowGroupSizeIterator {
+fn try_new(row_groups: impl IntoIterator) -> Result
+where
+I: TryInto,
+{
+Ok(Self {
+row_groups: VecDeque::from(
+row_groups
+.into_iter()
+.map(TryInto::try_into)
+.collect::>>()?,
+),
+})

Review Comment:
   I'm finding myself a bit uneasy with adding the first row number to the 
`RowGroupMetaData`. Rather than that, could this bit here instead be changed to 
keep track of the first row number while populating the deque? Is there some 
wrinkle I'm missing? Might the row groups be filtered before instantiating the 
`RowNumberReader`?



##
parquet/src/file/metadata/mod.rs:
##
@@ -584,6 +585,11 @@ impl RowGroupMetaData {
 self.num_rows
 }
 
+/// Returns the first row number in this row group.

Review Comment:
   ```suggestion
   /// Returns the global index number for the first row in this row group.
   ```
   And perhaps use `first_row_index` instead? That may be clearer.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]

2025-03-26 Thread via GitHub


etseidl commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2015070079


##
parquet/src/arrow/arrow_reader/mod.rs:
##
@@ -239,6 +242,16 @@ impl ArrowReaderBuilder {
 ..self
 }
 }
+
+/// Include file row numbers in the output with the given column name
+///
+/// This will add a column to the output record batch with the file row 
number
+pub fn with_row_number_column(self, row_number_column: Option) -> 
Self {
+Self {
+row_number_column,

Review Comment:
   ```suggestion
   pub fn with_row_number_column(self, row_number_column: String) -> Self {
   Self {
   row_number_column: Some(row_number_column),
   ```
   Since `row_number_column` defaults to `None` anyway.



##
parquet/src/errors.rs:
##
@@ -52,6 +52,9 @@ pub enum ParquetError {
 /// Returned when a function needs more data to complete properly. The 
`usize` field indicates
 /// the total number of bytes required, not the number of additional bytes.
 NeedMoreData(usize),
+/// Returned when an operation needs to know the first row number of a row 
group, but the row
+/// number is unknown.
+RowGroupMetaDataMissingRowNumber,

Review Comment:
   I'm not convinced we need a new error type here. I think you can just use 
`General` for this purpose.



##
parquet/src/arrow/array_reader/builder.rs:
##
@@ -356,14 +392,23 @@ mod tests {
 )
 .unwrap();
 
-let array_reader = build_array_reader(fields.as_ref(), &mask, 
&file_reader).unwrap();
+let array_reader = build_array_reader(

Review Comment:
   I think there should be two tests. `test_create_array_reader` should just 
pass `None` for the row number column and remain otherwise unchanged, and then 
a second `test_create_array_reader_with_row_numbers` which would be what you 
have here.



##
parquet/src/arrow/array_reader/builder.rs:
##
@@ -39,9 +39,29 @@ pub fn build_array_reader(
 field: Option<&ParquetField>,
 mask: &ProjectionMask,
 row_groups: &dyn RowGroups,
+row_number_column: Option,

Review Comment:
   I wonder if it would be better to make this `Option<&str>` and then use 
`.as_deref()` rather than cloning where this (and the other `build_X` 
functions) are called.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]

2025-03-26 Thread via GitHub


jkylling commented on PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-2753462478

   > Thanks for you submission @jkylling, I'll try to get a first pass review 
done this week. In the meantime please add the Apache license to row_number.rs 
and correct the other lint errors. 🙏
   
   Updated. Looking forward to the first review!
   
   I was very confused as to why cargo format did not work properly, but looks 
like you are already aware of this 
(https://github.com/apache/arrow-rs/issues/6179) :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Add support for file row numbers in Parquet readers [arrow-rs]

2025-03-25 Thread via GitHub


etseidl commented on PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#issuecomment-2752690645

   Thanks for you submission @jkylling, I'll try to get a first pass review 
done this week. In the meantime please add the Apache license to row_number.rs 
and correct the other lint errors. 🙏 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org