Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-16 Thread via GitHub


zhuqi-lucas commented on PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#issuecomment-2811691474

   Thank you @alamb @XiangpengHao for double checking!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-16 Thread via GitHub


XiangpengHao commented on PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#issuecomment-2809810111

   This is great work, thank you @zhuqi-lucas 
   
   > And still no performance improvement comparing the page cache PR to the main branch. I am confused why the DataFusion benchmark shows an improvement while the benchmark here does not for the page cache branch.
   
   I plan to take a closer look at this as well. Sorry I was occupied by other 
stuff recently.





Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-15 Thread via GitHub


alamb merged PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401





Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-15 Thread via GitHub


alamb commented on PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#issuecomment-2806880649

   > And still no performance improvement comparing the page cache PR to the main branch. I am confused why the DataFusion benchmark shows an improvement while the benchmark here does not for the page cache branch.
   
   I agree -- let's figure it out. 





Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-15 Thread via GitHub


alamb commented on PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#issuecomment-2802658761

   I also verified the filter patterns like this:
   Patch
   
   
   ```diff
   diff --git a/parquet/src/arrow/arrow_reader/mod.rs b/parquet/src/arrow/arrow_reader/mod.rs
   index 8bbe175daf..11ceaed569 100644
   --- a/parquet/src/arrow/arrow_reader/mod.rs
   +++ b/parquet/src/arrow/arrow_reader/mod.rs
   @@ -976,7 +976,9 @@ pub(crate) fn evaluate_predicate(
        input_selection: Option<RowSelection>,
        predicate: &mut dyn ArrowPredicate,
    ) -> Result<RowSelection> {
   +    println!("Evaluating predicate, batch_size: {batch_size}, input_selection: {:?}", input_selection);
        let reader = ParquetRecordBatchReader::new(batch_size, array_reader, input_selection.clone());
   +    let mut total_input_rows = 0;
        let mut filters = vec![];
        for maybe_batch in reader {
            let maybe_batch = maybe_batch?;
   @@ -993,9 +995,15 @@ pub(crate) fn evaluate_predicate(
                0 => filters.push(filter),
                _ => filters.push(prep_null_mask_filter(&filter)),
            };
   +        total_input_rows += input_rows;
        }
   
        let raw = RowSelection::from_filters(&filters);
   +    let selected_rows = raw.row_count();
   +    let num_selections = raw.iter().count();
   +    let selectivity = 100.0 * (selected_rows as f64 / total_input_rows as f64);
   +    println!("  Selected {selected_rows} rows in {num_selections} selections ({selectivity:.3}%)");
   +    println!("  RowSelection: {}", raw);
        Ok(match input_selection {
            Some(selection) => selection.and_then(&raw),
            None => raw,
   diff --git a/parquet/src/arrow/arrow_reader/selection.rs b/parquet/src/arrow/arrow_reader/selection.rs
   index c53d47be2e..475b06315d 100644
   --- a/parquet/src/arrow/arrow_reader/selection.rs
   +++ b/parquet/src/arrow/arrow_reader/selection.rs
   @@ -19,6 +19,7 @@ use arrow_array::{Array, BooleanArray};
    use arrow_select::filter::SlicesIterator;
    use std::cmp::Ordering;
    use std::collections::VecDeque;
   +use std::fmt::{Display, Formatter};
    use std::ops::Range;
   
    /// [`RowSelection`] is a collection of [`RowSelector`] used to skip rows when
   @@ -32,6 +33,16 @@ pub struct RowSelector {
        pub skip: bool,
    }
   
   +impl Display for RowSelector {
   +    fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
   +        if self.skip {
   +            write!(f, "skip({})", self.row_count)
   +        } else {
   +            write!(f, "select({})", self.row_count)
   +        }
   +    }
   +}
   +
    impl RowSelector {
        /// Select `row_count` rows
        pub fn select(row_count: usize) -> Self {
   @@ -101,6 +112,22 @@ pub struct RowSelection {
        selectors: Vec<RowSelector>,
    }
   
   +/// Prints a human understandable representation of the RowSelection
   +impl Display for RowSelection {
   +    fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
   +        write!(f, "[")?;
   +        let mut selectors = self.selectors.iter();
   +
   +        if let Some(first) = selectors.next() {
   +            write!(f, "{}", first)?;
   +            for selector in selectors {
   +                write!(f, " {}", selector)?;
   +            }
   +        }
   +        write!(f, "]")
   +    }
   +}
   +
    impl RowSelection {
        /// Creates a [`RowSelection`] from a slice of [`BooleanArray`]
        ///
   ```
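   As a rough illustration of what the added logging computes, here is a standalone sketch. It is a simplified model, not the parquet crate's actual `RowSelection::from_filters` implementation: it compresses a boolean filter mask into alternating skip/select runs and derives the same "Selected N rows in M selections" figures the patch prints.

   ```rust
   /// Simplified model of a `RowSelector` run (hypothetical, for illustration).
   #[derive(Debug, PartialEq)]
   struct RowSelector {
       row_count: usize,
       skip: bool,
   }

   /// Compress a boolean filter mask into alternating skip/select runs.
   fn selectors_from_mask(mask: &[bool]) -> Vec<RowSelector> {
       let mut selectors: Vec<RowSelector> = Vec::new();
       for &selected in mask {
           let skip = !selected;
           // Extend the current run when the skip/select state is unchanged
           if let Some(last) = selectors.last_mut() {
               if last.skip == skip {
                   last.row_count += 1;
                   continue;
               }
           }
           selectors.push(RowSelector { row_count: 1, skip });
       }
       selectors
   }

   fn main() {
       // 2 of 8 input rows pass the filter
       let mask = [false, false, true, false, false, false, true, false];
       let selectors = selectors_from_mask(&mask);

       let selected_rows: usize = selectors
           .iter()
           .filter(|s| !s.skip)
           .map(|s| s.row_count)
           .sum();
       let selectivity = 100.0 * (selected_rows as f64 / mask.len() as f64);

       // Mirrors the format of the logging added in the patch above
       println!(
           "Selected {selected_rows} rows in {} selections ({selectivity:.3}%)",
           selectors.len()
       );
       // prints: Selected 2 rows in 5 selections (25.000%)
   }
   ```

   The "selections" count includes both skip and select runs, matching `raw.iter().count()` in the patch, which is why highly scattered matches (many short runs) report far more selections than selected rows.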
   
   

   
   





Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-15 Thread via GitHub


zhuqi-lucas commented on PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#issuecomment-2804424885

   And still no performance improvement comparing the page cache PR to the main branch. I am confused why the DataFusion benchmark shows an improvement while the benchmark here does not for the page cache branch.





Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-14 Thread via GitHub


zhuqi-lucas commented on PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#issuecomment-2803761732

   Thanks a lot @alamb ! It looks great!
   
   I also created a follow-up ticket for the sync read for page cache:
   
   https://github.com/apache/arrow-rs/issues/7415
   
   
   
   
   





Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-14 Thread via GitHub


alamb commented on PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#issuecomment-2802674819

   
   ## `PointLookup`
   
   >   Selected 1 rows in 3 selections (0.001%)
   >   RowSelection: [skip(25159) select(1) skip(74840)]
   
   ## `SelectiveUnclustered`
   
   >  Selected 1029 rows in 2037 selections (1.029%)
   >   RowSelection: [skip(5) select(1) skip(17) select(1) skip(533) select(1) 
skip(20) select(1) skip(228) select(1) skip(107) select(1) skip(31) select(1) 
skip(10) select(1) skip(25) select(1) skip(198) select(1) skip(61) select(1) 
skip(114) select(1) skip(45) select(1) skip(115) select(1) skip(10) select(1) 
skip(97) select(1) skip(36) select(1) skip(480) select(1) skip(105) select(1) 
skip(53) select(1) skip(130) select(1) skip(29) select(1) skip(90) select(1) 
skip(125) select(1) skip(12) select(1) skip(15) select(1) skip(5) select(1) 
skip(233) select(1) skip(395) select(1) skip(89) select(1) skip(199) select(1) 
skip(139) select(1) skip(114) select(1) skip(62) select(1) skip(75) select(1) 
...
   
   ## `ModeratelySelectiveClustered`
   
   >   Selected 10000 rows in 20 selections (10.000%)
   >   RowSelection: [skip(9000) select(1000) skip(9000) select(1000) 
skip(9000) select(1000) skip(9000) select(1000) skip(9000) select(1000) 
skip(9000) select(1000) skip(9000) select(1000) skip(9000) select(1000) 
skip(9000) select(1000) skip(9000) select(1000)]
   Evaluating predicate, batch_size: 8192, input_selection: None
   





Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-14 Thread via GitHub


alamb commented on PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#issuecomment-2802552709

   Thanks @zhuqi-lucas -- I took the liberty of pushing several commits directly to this branch. I tried to keep them independent so you can see what I changed: using an in-memory buffer to avoid file IO during benchmarking, updated docs, and some refactoring.
   
   I also added benchmarks for the sync reader.





Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-13 Thread via GitHub


zhuqi-lucas commented on PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#issuecomment-2800558854

   Still can't see performance improvement, need to investigate...
   
   ```
   critcmp better_decode main
   group                                                                                  better_decode        main
   -----                                                                                  -------------        ----
   arrow_reader_row_filter/filter: float64 <= 99.0 proj: all_columns/                     1.02   2.6±0.03ms    1.00   2.5±0.20ms
   arrow_reader_row_filter/filter: float64 <= 99.0 proj: exclude_filter_column/           1.10   2.5±0.17ms    1.00   2.3±0.02ms
   arrow_reader_row_filter/filter: float64 > 99.0 AND ts >= 9000 proj: all_columns/       1.06   2.3±0.03ms    1.00   2.2±0.04ms
   arrow_reader_row_filter/filter: float64 > 99.0 AND ts >= 9000 proj: exclude_filter_column/  1.09   2.3±0.03ms    1.00   2.1±0.15ms
   arrow_reader_row_filter/filter: int64 ==  proj: all_columns/                           1.31   2.1±0.04ms    1.00   1571.4±48.50µs
   arrow_reader_row_filter/filter: int64 ==  proj: exclude_filter_column/                 1.26   2.0±0.02ms    1.00   1607.4±135.41µs
   arrow_reader_row_filter/filter: int64 > 90 proj: all_columns/                          1.06   5.1±0.24ms    1.00   4.8±0.04ms
   arrow_reader_row_filter/filter: int64 > 90 proj: exclude_filter_column/                1.05   4.4±0.04ms    1.00   4.2±0.06ms
   arrow_reader_row_filter/filter: float64 > 99.0 proj: all_columns/                      1.02   2.6±0.03ms    1.00   2.5±0.16ms
   arrow_reader_row_filter/filter: float64 > 99.0 proj: exclude_filter_column/            1.14   2.6±0.27ms    1.00   2.2±0.02ms
   arrow_reader_row_filter/filter: ts < 9000 proj: all_columns/                           1.07   2.8±0.18ms    1.00   2.6±0.04ms
   arrow_reader_row_filter/filter: ts < 9000 proj: exclude_filter_column/                 1.05   2.6±0.03ms    1.00   2.5±0.15ms
   arrow_reader_row_filter/filter: ts >= 9000 proj: all_columns/                          1.04   2.2±0.07ms    1.00   2.1±0.02ms
   arrow_reader_row_filter/filter: ts >= 9000 proj: exclude_filter_column/                1.07   2.2±0.11ms    1.00   2.0±0.15ms
   arrow_reader_row_filter/filter: utf8View <> '' proj: all_columns/                      1.00  10.5±0.13ms    1.03  10.8±0.43ms
   arrow_reader_row_filter/filter: utf8View <> '' proj: exclude_filter_column/            1.03   8.1±0.31ms    1.00   7.9±0.12ms
   ```





Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-13 Thread via GitHub


zhuqi-lucas commented on code in PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#discussion_r2041434834


##
parquet/benches/arrow_reader_row_filter.rs:
##
@@ -0,0 +1,606 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Benchmark for evaluating row filters and projections on a Parquet file.
+//!
+//! # Background:
+//!
+//! As described in [Efficient Filter Pushdown in Parquet], evaluating
+//! pushdown filters is a two-step process:
+//!
+//! 1. Build a filter mask by decoding and evaluating filter functions on
+//!the filter column(s).
+//!
+//! 2. Decode the rows that match the filter mask from the projected columns.
+//!
+//! The performance depends on factors such as the number of rows selected,
+//! the clustering of results (which affects the efficiency of the filter 
mask),
+//! and whether the same column is used for both filtering and projection.
+//!
+//! This benchmark helps measure the performance of these operations.
+//!
+//! [Efficient Filter Pushdown in Parquet]: 
https://datafusion.apache.org/blog/2025/03/21/parquet-pushdown/
+//!
+//! The benchmark creates an in-memory Parquet file with 100K rows and ten 
columns.
+//! The first four columns are:
+//!   - int64: random integers (range: 0..100) generated with a fixed seed.
+//!   - float64: random floating-point values (range: 0.0..100.0) generated 
with a fixed seed.
+//!   - utf8View: random strings with some empty values and occasional 
constant "const" values.
+//!   - ts: sequential timestamps in milliseconds.
+//!
+//! The following six columns (for filtering) are generated to mimic different
+//! filter selectivity and clustering patterns:
+//!   - pt: for Point Lookup – exactly one row is set to "unique_point", all 
others are random strings.
+//!   - sel: for Selective Unclustered – exactly 1% of rows (those with i % 
100 == 0) are "selected".
+//!   - mod_clustered: for Moderately Selective Clustered – in each 10K-row 
block, the first 10 rows are "mod_clustered".
+//!   - mod_unclustered: for Moderately Selective Unclustered – exactly 10% of 
rows (those with i % 10 == 1) are "mod_unclustered".
+//!   - unsel_unclustered: for Unselective Unclustered – exactly 99% of rows 
(those with i % 100 != 0) are "unsel_unclustered".
+//!   - unsel_clustered: for Unselective Clustered – in each 10K-row block, 
rows with an offset >= 1000 are "unsel_clustered".
+//!
+//! As a side note, an additional composite benchmark is provided which 
demonstrates
+//! the performance when applying two filters simultaneously (i.e. chaining 
row selectors).
+
+use arrow::array::{ArrayRef, BooleanArray, Float64Array, Int64Array, 
TimestampMillisecondArray};
+use arrow::compute::kernels::cmp::{eq, gt, neq};
+use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
+use arrow::record_batch::RecordBatch;
+use arrow_array::builder::StringViewBuilder;
+use arrow_array::StringViewArray;
+use arrow_cast::pretty::pretty_format_batches;
+use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
+use futures::TryStreamExt;
+use parquet::arrow::arrow_reader::{ArrowPredicateFn, ArrowReaderOptions, 
RowFilter};
+use parquet::arrow::{ArrowWriter, ParquetRecordBatchStreamBuilder, 
ProjectionMask};
+use parquet::file::properties::WriterProperties;
+use rand::{rngs::StdRng, Rng, SeedableRng};
+use std::sync::Arc;
+use tempfile::NamedTempFile;
+use tokio::fs::File;
+
+/// Generates a random string (either short: 3–11 bytes or long: 13–20 bytes) 
with 50% probability.
+/// This is used to fill non-selected rows in the filter columns.
+fn random_string(rng: &mut StdRng) -> String {
+    let charset = b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
+    let is_long = rng.random_bool(0.5);
+    let len = if is_long {
+        rng.random_range(13..21)
+    } else {
+        rng.random_range(3..12)
+    };
+    (0..len)
+        .map(|_| charset[rng.random_range(0..charset.len())] as char)
+        .collect()
+}
+
+/// Create a random array for a given field, generating data with fixed seed 
reproducibility.
+/// - For Int64, random integers in [0, 100).
+/// - For Float64, random floats in [0.0, 100.0).
+/// - For Utf8View, a mix of empty strin

Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-13 Thread via GitHub


zhuqi-lucas commented on PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#issuecomment-2800504825

   It looks like I forgot to set compression when writing the parquet file. That may be why the results do not show a performance improvement for the page cache.
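   For reference, here is a sketch of how compression could be enabled when writing the benchmark file. This is an illustrative configuration using the `parquet` crate's `WriterProperties` builder, not the PR's actual code; the function name and codec choice are assumptions.

   ```rust
   use std::sync::Arc;

   use arrow::datatypes::Schema;
   use arrow::record_batch::RecordBatch;
   use parquet::arrow::ArrowWriter;
   use parquet::basic::Compression;
   use parquet::file::properties::WriterProperties;

   /// Write `batch` into an in-memory buffer with SNAPPY compression enabled
   /// (hypothetical helper, for illustration only).
   fn write_compressed(batch: &RecordBatch, schema: Arc<Schema>) -> parquet::errors::Result<Vec<u8>> {
       let props = WriterProperties::builder()
           // Explicitly pick a codec; the default is uncompressed
           .set_compression(Compression::SNAPPY)
           .build();
       let mut buffer = Vec::new();
       let mut writer = ArrowWriter::try_new(&mut buffer, schema, Some(props))?;
       writer.write(batch)?;
       writer.close()?;
       Ok(buffer)
   }
   ```

   With compression on, the page cache has more decompression work to skip for filtered-out pages, which could make the cache's benefit visible in the benchmark.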





Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-13 Thread via GitHub


zhuqi-lucas commented on PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#issuecomment-277923

   > Thank you @zhuqi-lucas -- I am sorry for the back and forth on this PR but 
I think once we have this benchmark sorted out making the filter pushdown 
performance better will be quite easy.
   > 
   > I have a few more comments on the structure of this PR -- notably I think 
we should reduce the number of filters / columns. I am happy to make these 
changes in the PR myself too, but I wanted to ask you first.
   
   Thank you @alamb for the patient review and good suggestions! I addressed them in the latest PR.
   
   And feel free to open a PR against my branch if I am missing anything; I am happy with that, thanks!





Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-13 Thread via GitHub


zhuqi-lucas commented on code in PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#discussion_r2041140455


##
parquet/benches/arrow_reader_row_filter.rs:
##
@@ -0,0 +1,606 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Benchmark for evaluating row filters and projections on a Parquet file.
+//!
+//! # Background:
+//!
+//! As described in [Efficient Filter Pushdown in Parquet], evaluating
+//! pushdown filters is a two-step process:
+//!
+//! 1. Build a filter mask by decoding and evaluating filter functions on
+//!the filter column(s).
+//!
+//! 2. Decode the rows that match the filter mask from the projected columns.
+//!
+//! The performance depends on factors such as the number of rows selected,
+//! the clustering of results (which affects the efficiency of the filter 
mask),
+//! and whether the same column is used for both filtering and projection.
+//!
+//! This benchmark helps measure the performance of these operations.
+//!
+//! [Efficient Filter Pushdown in Parquet]: 
https://datafusion.apache.org/blog/2025/03/21/parquet-pushdown/
+//!
+//! The benchmark creates an in-memory Parquet file with 100K rows and ten 
columns.
+//! The first four columns are:
+//!   - int64: random integers (range: 0..100) generated with a fixed seed.
+//!   - float64: random floating-point values (range: 0.0..100.0) generated 
with a fixed seed.
+//!   - utf8View: random strings with some empty values and occasional 
constant "const" values.
+//!   - ts: sequential timestamps in milliseconds.

Review Comment:
   Good point @alamb , i will try to address soon.






Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-13 Thread via GitHub


zhuqi-lucas commented on code in PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#discussion_r2041140354


##
parquet/benches/arrow_reader_row_filter.rs:
##
@@ -0,0 +1,606 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Benchmark for evaluating row filters and projections on a Parquet file.
+//!
+//! # Background:
+//!
+//! As described in [Efficient Filter Pushdown in Parquet], evaluating
+//! pushdown filters is a two-step process:
+//!
+//! 1. Build a filter mask by decoding and evaluating filter functions on
+//!the filter column(s).
+//!
+//! 2. Decode the rows that match the filter mask from the projected columns.
+//!
+//! The performance depends on factors such as the number of rows selected,
+//! the clustering of results (which affects the efficiency of the filter 
mask),
+//! and whether the same column is used for both filtering and projection.
+//!
+//! This benchmark helps measure the performance of these operations.
+//!
+//! [Efficient Filter Pushdown in Parquet]: 
https://datafusion.apache.org/blog/2025/03/21/parquet-pushdown/
+//!
+//! The benchmark creates an in-memory Parquet file with 100K rows and ten 
columns.
+//! The first four columns are:
+//!   - int64: random integers (range: 0..100) generated with a fixed seed.
+//!   - float64: random floating-point values (range: 0.0..100.0) generated 
with a fixed seed.
+//!   - utf8View: random strings with some empty values and occasional 
constant "const" values.
+//!   - ts: sequential timestamps in milliseconds.
+//!
+//! The following six columns (for filtering) are generated to mimic different
+//! filter selectivity and clustering patterns:
+//!   - pt: for Point Lookup – exactly one row is set to "unique_point", all 
others are random strings.
+//!   - sel: for Selective Unclustered – exactly 1% of rows (those with i % 
100 == 0) are "selected".
+//!   - mod_clustered: for Moderately Selective Clustered – in each 10K-row 
block, the first 10 rows are "mod_clustered".
+//!   - mod_unclustered: for Moderately Selective Unclustered – exactly 10% of 
rows (those with i % 10 == 1) are "mod_unclustered".
+//!   - unsel_unclustered: for Unselective Unclustered – exactly 99% of rows 
(those with i % 100 != 0) are "unsel_unclustered".
+//!   - unsel_clustered: for Unselective Clustered – in each 10K-row block, 
rows with an offset >= 1000 are "unsel_clustered".
+//!
+//! As a side note, an additional composite benchmark is provided which 
demonstrates
+//! the performance when applying two filters simultaneously (i.e. chaining 
row selectors).
+
+use arrow::array::{ArrayRef, BooleanArray, Float64Array, Int64Array, 
TimestampMillisecondArray};
+use arrow::compute::kernels::cmp::{eq, gt, neq};
+use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
+use arrow::record_batch::RecordBatch;
+use arrow_array::builder::StringViewBuilder;
+use arrow_array::StringViewArray;
+use arrow_cast::pretty::pretty_format_batches;
+use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
+use futures::TryStreamExt;
+use parquet::arrow::arrow_reader::{ArrowPredicateFn, ArrowReaderOptions, 
RowFilter};
+use parquet::arrow::{ArrowWriter, ParquetRecordBatchStreamBuilder, 
ProjectionMask};
+use parquet::file::properties::WriterProperties;
+use rand::{rngs::StdRng, Rng, SeedableRng};
+use std::sync::Arc;
+use tempfile::NamedTempFile;
+use tokio::fs::File;
+
+/// Generates a random string (either short: 3–11 bytes or long: 13–20 bytes) 
with 50% probability.
+/// This is used to fill non-selected rows in the filter columns.
+fn random_string(rng: &mut StdRng) -> String {
+    let charset = b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
+    let is_long = rng.random_bool(0.5);
+    let len = if is_long {
+        rng.random_range(13..21)
+    } else {
+        rng.random_range(3..12)
+    };
+    (0..len)
+        .map(|_| charset[rng.random_range(0..charset.len())] as char)
+        .collect()
+}
+
+/// Create a random array for a given field, generating data with fixed seed 
reproducibility.
+/// - For Int64, random integers in [0, 100).
+/// - For Float64, random floats in [0.0, 100.0).
+/// - For Utf8View, a mix of empty strin

Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-13 Thread via GitHub


alamb commented on code in PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#discussion_r2041113079


##
parquet/benches/arrow_reader_row_filter.rs:
##
@@ -0,0 +1,606 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Benchmark for evaluating row filters and projections on a Parquet file.
+//!
+//! # Background:
+//!
+//! As described in [Efficient Filter Pushdown in Parquet], evaluating
+//! pushdown filters is a two-step process:
+//!
+//! 1. Build a filter mask by decoding and evaluating filter functions on
+//!the filter column(s).
+//!
+//! 2. Decode the rows that match the filter mask from the projected columns.
+//!
+//! The performance depends on factors such as the number of rows selected,
+//! the clustering of results (which affects the efficiency of the filter 
mask),
+//! and whether the same column is used for both filtering and projection.
+//!
+//! This benchmark helps measure the performance of these operations.
+//!
+//! [Efficient Filter Pushdown in Parquet]: 
https://datafusion.apache.org/blog/2025/03/21/parquet-pushdown/
+//!
+//! The benchmark creates an in-memory Parquet file with 100K rows and ten 
columns.
+//! The first four columns are:
+//!   - int64: random integers (range: 0..100) generated with a fixed seed.
+//!   - float64: random floating-point values (range: 0.0..100.0) generated 
with a fixed seed.
+//!   - utf8View: random strings with some empty values and occasional 
constant "const" values.
+//!   - ts: sequential timestamps in milliseconds.

Review Comment:
   Thank you very much @zhuqi-lucas  - this is looking really nice. 
   
   I am worried about a few things:
   1. The overlap / duplication of the first four column and original 
predicates: there is duplication across the cases with the specific columns
   2. The over representation of `StringView` -- in the benchmark now there are 
7 StringView columns. I think that will skew the benchmark results much more 
heavily towards string columns
   
   In order to resolve this, I suggest we try to keep the four original columns and pick predicates that implement the filter patterns in terms of that data.
   
   For example, the `point` lookup filter can be implemented by picking a single value from the `int64` column rather than creating an entirely new column.
   
   The "unsel_clustered" pattern could be modeled as a predicate on the `ts` column (we would have to update the `ts` column to hold values 0..10k repeating); then the predicate `ts >= 9000` would generate the correct pattern, I think.
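   The suggested `ts` encoding can be sanity-checked with a quick standalone sketch (assuming 100K rows with `ts = i % 10_000`, as proposed above):

   ```rust
   fn main() {
       let n: usize = 100_000;
       // ts cycles 0..10_000 within each 10K-row block
       let ts: Vec<i64> = (0..n).map(|i| (i % 10_000) as i64).collect();

       // predicate: ts >= 9000
       let selected = ts.iter().filter(|&&t| t >= 9000).count();

       // Selects the last 1000 rows of each 10K block:
       // 10 clustered runs covering 10% of the rows
       assert_eq!(selected, 10_000);
       println!("selected {selected} of {n} rows");
   }
   ```

   This matches the `skip(9000) select(1000)` pattern repeated ten times that the selectivity logging later reported for the clustered case.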
   
   



##
parquet/benches/arrow_reader_row_filter.rs:
##
@@ -0,0 +1,325 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Benchmark for evaluating row filters and projections on a Parquet file.
+//!
+//! This benchmark creates a Parquet file in memory with 100K rows and four 
columns:
+//!  - int64: sequential integers
+//!  - float64: floating-point values (derived from the integers)
+//!  - utf8View: string values where about half are non-empty,
+//!and a few rows (every 10Kth row) are the constant "const"
+//!  - ts: timestamp values (using, e.g., a millisecond epoch)
+//!
+//! It then applies several filter functions and projections, benchmarking the 
read-back speed.
+//!
+//! Filters tested:
+//!  - A string filter: `utf8View <> ''` (non-empty)
+//!  - A string filter: `utf8View = 'const'` (selective)
+//!  - An integer non-selective filter (e.g. even

Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-12 Thread via GitHub


zhuqi-lucas commented on PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#issuecomment-2799825990

   Benchmark result comparison after the fix:
   
   ```rust
   group                                                                                  better_decode          main
   -----                                                                                  -------------          ----
   arrow_reader_row_filter/filter: 1% Unclustered Filter proj: all_columns/               1.07   7.7±0.09ms      1.00   7.2±0.13ms
   arrow_reader_row_filter/filter: 1% Unclustered Filter proj: exclude_filter_column/     1.08   7.0±0.08ms      1.00   6.5±0.10ms
   arrow_reader_row_filter/filter: 10% Clustered Filter proj: all_columns/                1.12   6.8±0.15ms      1.00   6.0±0.07ms
   arrow_reader_row_filter/filter: 10% Clustered Filter proj: exclude_filter_column/      1.12   6.0±0.08ms      1.00   5.4±0.07ms
   arrow_reader_row_filter/filter: 10% Unclustered Filter proj: all_columns/              1.06  14.5±0.13ms      1.00  13.7±0.20ms
   arrow_reader_row_filter/filter: 10% Unclustered Filter proj: exclude_filter_column/    1.05  13.2±0.08ms      1.00  12.6±0.22ms
   arrow_reader_row_filter/filter: 90% Clustered Filter proj: all_columns/                1.10   9.8±0.32ms      1.00   9.0±0.10ms
   arrow_reader_row_filter/filter: 90% Clustered Filter proj: exclude_filter_column/      1.07   9.4±0.09ms      1.00   8.8±0.10ms
   arrow_reader_row_filter/filter: 99% Unclustered Filter proj: all_columns/              1.09  12.0±0.07ms      1.00  11.0±0.11ms
   arrow_reader_row_filter/filter: 99% Unclustered Filter proj: exclude_filter_column/    1.09  11.7±0.09ms      1.00  10.7±0.09ms
   arrow_reader_row_filter/filter: Point Lookup proj: all_columns/                        1.27   6.5±0.10ms      1.00   5.1±0.21ms
   arrow_reader_row_filter/filter: Point Lookup proj: exclude_filter_column/              1.31   5.9±0.07ms      1.00   4.5±0.06ms
   arrow_reader_row_filter/filter: float64 > 50.0 proj: all_columns/                      1.02  26.6±0.21ms      1.00  26.0±0.23ms
   arrow_reader_row_filter/filter: float64 > 50.0 proj: exclude_filter_column/            1.04  23.8±0.15ms      1.00  22.8±0.17ms
   arrow_reader_row_filter/filter: int64 > 0 proj: all_columns/                           1.05  10.7±0.12ms      1.00  10.1±0.18ms
   arrow_reader_row_filter/filter: int64 > 0 proj: exclude_filter_column/                 1.07  10.4±0.11ms      1.00   9.7±0.10ms
   arrow_reader_row_filter/filter: ts > 50_000 proj: all_columns/                         1.12   7.4±0.05ms      1.00   6.6±0.08ms
   arrow_reader_row_filter/filter: ts > 50_000 proj: exclude_filter_column/               1.13   7.3±0.05ms      1.00   6.5±0.08ms
   arrow_reader_row_filter/filter: utf8View <> '' proj: all_columns/                      1.00  22.9±0.21ms      1.03  23.5±0.70ms
   arrow_reader_row_filter/filter: utf8View <> '' proj: exclude_filter_column/            1.01  20.6±0.15ms      1.00  20.3±0.17ms
   arrow_reader_row_filter/filter: utf8View = 'const' proj: all_columns/                  1.04  10.3±0.21ms      1.00   9.9±0.10ms
   arrow_reader_row_filter/filter: utf8View = 'const' proj: exclude_filter_column/        1.04   9.4±0.11ms      1.00   9.0±0.09ms
   arrow_reader_row_filter/filter_case: int64 = 0 project_case: all_columns/              1.00  821.4±37.47µs    1.00  820.8±12.99µs
   arrow_reader_row_filter/filter_case: int64 = 0 project_case: exclude_filter_column/    1.00  754.9±29.52µs    1.03  779.4±8.15µs
   arrow_reader_row_filter/filt
   ```

Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-12 Thread via GitHub


zhuqi-lucas commented on PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#issuecomment-2798797706

   I created a fix; it seems to fix the problem:
   
   
https://github.com/zhuqi-lucas/arrow-rs/commit/d0ab2fe851babe158452104e823f8b57f8b3df01
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-12 Thread via GitHub


zhuqi-lucas commented on PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#issuecomment-2798744870

   I found the deadlock happens in the following code on the page cache branch:
   
   1. When we call has_next:
   ```rust
   while total_records_read < max_records && self.has_next()? {
   }
   ```
   2. And we will call read_new_page:
   
   ```rust
   #[inline]
   pub(crate) fn has_next(&mut self) -> Result<bool> {
       if self.num_buffered_values == 0 || self.num_buffered_values == self.num_decoded_values {
           // TODO: should we return false if read_new_page() = true and
           // num_buffered_values = 0?
           println!("num_buffered_values: {}, num_decoded_values: {}",
               self.num_buffered_values, self.num_decoded_values);
           if !self.read_new_page()? {
               Ok(false)
           } else {
               Ok(self.num_buffered_values != 0)
           }
       } else {
           Ok(true)
       }
   }
   ```
   
   3. We will call read_new_page, and the loop will deadlock because the `Page::DictionaryPage` arm keeps continuing:
   ```rust
   /// Reads a new page and sets up the decoders for levels, values or dictionary.
   /// Returns false if there's no page left.
   fn read_new_page(&mut self) -> Result<bool> {
       println!("GenericColumnReader read_new_page");
       loop {
           match self.page_reader.get_next_page()? {
               // No more page to read
               None => return Ok(false),
               Some(current_page) => {
                   //println!("GenericColumnReader read_new_page current_page: {:?}", current_page.page_type());
                   match current_page {
                       // 1. Dictionary page: configure dictionary for this page.
                       Page::DictionaryPage {
                           buf,
                           num_values,
                           encoding,
                           is_sorted,
                       } => {
                           self.values_decoder
                               .set_dict(buf, num_values, encoding, is_sorted)?;
                           continue;
                       }
                       // ... (data page arms elided in this quote) ...
                   }
               }
           }
       }
   }
   ```
   
   4. The root cause is that we will always get the cached dict page in the following logic; this is the corner case hit by this benchmark on the page cache branch:
   
   ```rust
   impl PageReader for CachedPageReader {
       fn get_next_page(&mut self) -> Result<Option<Page>, ParquetError> {
           //println!("CachedPageReader get next page");
           let next_page_offset = self.inner.peek_next_page_offset()?;
           //println!("CachedPageReader next page offset: {:?}", next_page_offset);
   
           let Some(offset) = next_page_offset else {
               return Ok(None);
           };
   
           let mut cache = self.cache.get();
   
           let page = cache.get_page(self.col_id, offset);
           if let Some(page) = page {
               self.inner.skip_next_page()?;
               //println!("CachedPageReader skip next page");
               Ok(Some(page))
           } else {
               //println!("CachedPageReader insert page");
               let inner_page = self.inner.get_next_page()?;
               let Some(inner_page) = inner_page else {
                   return Ok(None);
               };
               cache.insert_page(self.col_id, offset, inner_page.clone());
               Ok(Some(inner_page))
           }
       }
   }
   ```
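The failure mode described in steps 1-4 can be modeled with a tiny standalone sketch (hypothetical types, not the actual parquet-rs API): the `read_new_page` loop only terminates if each `get_next_page` call makes progress past the dictionary page, so a cache that keeps re-serving the same dictionary page spins forever:

```rust
// Minimal model of the read_new_page loop quoted above.
// `Page` and `read_new_page` here are hypothetical stand-ins, not parquet-rs code.
enum Page {
    Dictionary,
    Data,
}

/// Simulates the loop with a fuel limit: returns Some(true) when a data page
/// is found, Some(false) at end-of-pages, and None if fuel runs out (the
/// "stuck" behaviour observed on the page cache branch).
fn read_new_page(mut next: impl FnMut() -> Option<Page>, mut fuel: u32) -> Option<bool> {
    loop {
        if fuel == 0 {
            return None; // loop never terminated
        }
        fuel -= 1;
        match next() {
            None => return Some(false),
            Some(Page::Dictionary) => continue, // set_dict(...) then retry
            Some(Page::Data) => return Some(true),
        }
    }
}

fn main() {
    // Healthy reader: a dictionary page followed by a data page terminates.
    let mut pages = vec![Page::Dictionary, Page::Data].into_iter();
    assert_eq!(read_new_page(move || pages.next(), 10), Some(true));

    // Buggy cache: the same dictionary page is served on every call, so the
    // `continue` branch never makes progress and the loop spins forever.
    assert_eq!(read_new_page(|| Some(Page::Dictionary), 10), None);
}
```

This suggests why any fix needs to ensure the cached reader advances past (or stops re-serving) an already-consumed dictionary page.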





Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-11 Thread via GitHub


alamb commented on PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#issuecomment-2797389687

   Thanks again @zhuqi-lucas  -- pleasant dreams!





Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-11 Thread via GitHub


alamb commented on PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#issuecomment-2797389267

   > After 1 done, i need to continue debug the cache page branch, investigate 
why it stuck for the benchmark testing.
   
   FWIW I think I saw something similar when I was testing the branch in 
datafusion -- so in other words I think it is a real bug





Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-11 Thread via GitHub


alamb commented on code in PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#discussion_r2039657338


##
parquet/benches/arrow_reader_row_filter.rs:
##
@@ -0,0 +1,325 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Benchmark for evaluating row filters and projections on a Parquet file.
+//!
+//! This benchmark creates a Parquet file in memory with 100K rows and four 
columns:
+//!  - int64: sequential integers
+//!  - float64: floating-point values (derived from the integers)
+//!  - utf8View: string values where about half are non-empty,
+//!and a few rows (every 10Kth row) are the constant "const"
+//!  - ts: timestamp values (using, e.g., a millisecond epoch)
+//!
+//! It then applies several filter functions and projections, benchmarking the 
read-back speed.
+//!
+//! Filters tested:
+//!  - A string filter: `utf8View <> ''` (non-empty)
+//!  - A string filter: `utf8View = 'const'` (selective)
+//!  - An integer non-selective filter (e.g. even numbers)
+//!  - An integer selective filter (e.g. `int64 = 0`)
+//!  - A timestamp filter (e.g. `ts > threshold`)
+//!
+//! Projections tested:
+//!  - All 4 columns.
+//!  - All columns except the one used for the filter.
+//!
+//! To run the benchmark, use `cargo bench --bench bench_filter_projection`.
+
+use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
+use std::sync::Arc;
+use tempfile::NamedTempFile;
+
+use arrow::array::{
+ArrayRef, BooleanArray, BooleanBuilder, Float64Array, Int64Array, 
TimestampMillisecondArray,
+};
+use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
+use arrow::record_batch::RecordBatch;
+use arrow_array::builder::StringViewBuilder;
+use arrow_array::{Array, StringViewArray};
+use criterion::async_executor::FuturesExecutor;
+use futures::TryStreamExt;
+use parquet::arrow::arrow_reader::{ArrowPredicateFn, ArrowReaderOptions, 
RowFilter};
+use parquet::arrow::{ArrowWriter, ParquetRecordBatchStreamBuilder, 
ProjectionMask};
+use parquet::file::properties::WriterProperties;
+use tokio::fs::File;
+use tokio::runtime::Runtime;
+
+/// Create a RecordBatch with 100K rows and four columns.
+fn make_record_batch() -> RecordBatch {
+let num_rows = 100_000;
+
+// int64 column: sequential numbers 0..num_rows
+    let int_values: Vec<i64> = (0..num_rows as i64).collect();
+let int_array = Arc::new(Int64Array::from(int_values)) as ArrayRef;
+
+// float64 column: derived from int64 (e.g., multiplied by 0.1)
+    let float_values: Vec<f64> = (0..num_rows).map(|i| i as f64 * 0.1).collect();
+let float_array = Arc::new(Float64Array::from(float_values)) as ArrayRef;
+
+// utf8View column: even rows get non-empty strings; odd rows get an empty 
string;
+// every 10Kth even row is "const" to be selective.
+let mut string_view_builder = StringViewBuilder::with_capacity(100_000);
+for i in 0..num_rows {
+if i % 2 == 0 {
+if i % 10_000 == 0 {
+string_view_builder.append_value("const");
+} else {
+string_view_builder.append_value("nonempty");
+}
+} else {
+string_view_builder.append_value("");
+}
+}
+let utf8_view_array = Arc::new(string_view_builder.finish()) as ArrayRef;
+
+// Timestamp column: using milliseconds from an epoch (simply using the 
row index)
+    let ts_values: Vec<i64> = (0..num_rows as i64).collect();
+    let ts_array = Arc::new(TimestampMillisecondArray::from(ts_values)) as ArrayRef;
+
+let schema = Arc::new(Schema::new(vec![
+Field::new("int64", DataType::Int64, false),
+Field::new("float64", DataType::Float64, false),
+Field::new("utf8View", DataType::Utf8View, false),
+Field::new(
+"ts",
+DataType::Timestamp(TimeUnit::Millisecond, None),
+false,
+),
+]));
+
+RecordBatch::try_new(
+schema,
+vec![int_array, float_array, utf8_view_array, ts_array],
+)
+.unwrap()
+}
+
+/// Writes the record batch to a temporary Parquet file.
+fn write_parquet_file() -> NamedTempFile {
+let batch = make_record_batch();
+let schema = batch.schema();
+let props = WriterP

Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-11 Thread via GitHub


zhuqi-lucas commented on PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#issuecomment-2797205577

   End of today, tomorrow plan:
   
   1. Add more cases to cover all the 6 more fine-grained testing, and make it 
working well for main branch.
   2. After 1 done, i need to continue debug the cache page branch, investigate 
why it stuck for the benchmark testing.
   





Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-11 Thread via GitHub


zhuqi-lucas commented on PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#issuecomment-2797187180

   > BTW if you have a moment to review (and hopefully merge) 
[zhuqi-lucas#1](https://github.com/zhuqi-lucas/arrow-rs/pull/1) into this PR I 
think it helps provide some additional backstory
   
   Thank you @alamb !





Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-11 Thread via GitHub


alamb commented on PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#issuecomment-2797175368

   BTW if you have a moment to review (and hopefully merge) 
https://github.com/zhuqi-lucas/arrow-rs/pull/1 into this PR I think it helps 
provide some additional backstory





Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-11 Thread via GitHub


zhuqi-lucas commented on code in PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#discussion_r2039726637



Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-11 Thread via GitHub


zhuqi-lucas commented on code in PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#discussion_r2039672581



Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-11 Thread via GitHub


alamb commented on code in PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#discussion_r2039656300



Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-11 Thread via GitHub


alamb commented on PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#issuecomment-2797052875

   @zhuqi-lucas  here is a proposed addition to this PR to improve the comments:
   - https://github.com/zhuqi-lucas/arrow-rs/pull/1
   
   Also after thinking about this more today, I think we should have 6 
predicates that generate filters in particular patterns, as described here: 
https://github.com/apache/arrow-rs/issues/7363#issuecomment-2797040089
   
   I also have some smaller code suggestions I will leave inline
   
   Again, thank you for your work on this PR
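The filter patterns referred to here appear by name in the benchmark results earlier in this thread (Point Lookup, 1%/10%/99% Unclustered, 10%/90% Clustered). A rough sketch of how such selection masks could be generated (my own illustration inferred from the benchmark names; the `ramp` parameter and exact layouts are assumptions, not code from this PR or the linked issue):

```rust
// Hypothetical generators for the selection patterns named in the benchmarks.
fn point_lookup(n: usize, hit: usize) -> Vec<bool> {
    // Exactly one row selected.
    (0..n).map(|i| i == hit).collect()
}

fn unclustered(n: usize, pct: usize) -> Vec<bool> {
    // Roughly pct% of rows selected, scattered evenly through the file.
    (0..n).map(|i| i % 100 < pct).collect()
}

fn clustered(n: usize, pct: usize, ramp: usize) -> Vec<bool> {
    // pct% of rows selected in one contiguous run at the end of each ramp.
    let cutoff = ramp * (100 - pct) / 100;
    (0..n).map(|i| i % ramp >= cutoff).collect()
}

fn main() {
    let n = 100_000;
    let count = |v: &[bool]| v.iter().filter(|&&b| b).count();
    assert_eq!(count(&point_lookup(n, 42)), 1); // "Point Lookup"
    assert_eq!(count(&unclustered(n, 1)), n / 100); // "1% Unclustered"
    assert_eq!(count(&unclustered(n, 99)), n * 99 / 100); // "99% Unclustered"
    assert_eq!(count(&clustered(n, 10, 10_000)), n / 10); // "10% Clustered"
}
```

The clustered and unclustered variants select the same fraction of rows; what differs is whether the selected rows form a few long runs or are scattered, which is exactly what stresses different parts of the row-filter machinery.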





Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-11 Thread via GitHub


alamb commented on PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#issuecomment-2796900273

   THANK YOU!





Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-11 Thread via GitHub


zhuqi-lucas commented on PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#issuecomment-2796667149

   > Thank you very much @zhuqi-lucas -- this is a great start and very much appreciated. I left some comments on the structure so far, and I am going to spend some time now writing up details of what we are testing. I'll have a proposal soon for your consideration
   > 
   > Thank you so much again. This project is really important, but I think requires focus and determination
   
   Thank you @alamb for the review and good suggestions! I will address them soon. Meanwhile, the benchmark works well on the main branch, but it seems to get stuck when switched to the page cache branch; I am trying to debug it.
   
   I agree. Once we have better testing, we can improve the code to make the performance better. I am looking forward to helping with it!





Re: [PR] Add benchmark for parquet reader with row_filter and project settings [arrow-rs]

2025-04-11 Thread via GitHub


alamb commented on code in PR #7401:
URL: https://github.com/apache/arrow-rs/pull/7401#discussion_r2039337556


##
parquet/benches/arrow_reader_row_filter.rs:
##
@@ -0,0 +1,325 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Benchmark for evaluating row filters and projections on a Parquet file.
+//!
+//! This benchmark creates a Parquet file in memory with 100K rows and four columns:
+//!  - int64: sequential integers
+//!  - float64: floating-point values (derived from the integers)
+//!  - utf8View: string values where about half are non-empty,
+//!    and a few rows (every 10Kth row) are the constant "const"
+//!  - ts: timestamp values (using, e.g., a millisecond epoch)
+//!
+//! It then applies several filter functions and projections, benchmarking the read-back speed.
+//!
+//! Filters tested:
+//!  - A string filter: `utf8View <> ''` (non-empty)
+//!  - A string filter: `utf8View = 'const'` (selective)
+//!  - An integer non-selective filter (e.g. even numbers)
+//!  - An integer selective filter (e.g. `int64 = 0`)
+//!  - A timestamp filter (e.g. `ts > threshold`)
+//!
+//! Projections tested:
+//!  - All 4 columns.
+//!  - All columns except the one used for the filter.
+//!
+//! To run the benchmark, use `cargo bench --bench arrow_reader_row_filter`.
+
+use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
+use std::sync::Arc;
+use tempfile::NamedTempFile;
+
+use arrow::array::{
+ArrayRef, BooleanArray, BooleanBuilder, Float64Array, Int64Array, 
TimestampMillisecondArray,
+};
+use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
+use arrow::record_batch::RecordBatch;
+use arrow_array::builder::StringViewBuilder;
+use arrow_array::{Array, StringViewArray};
+use criterion::async_executor::FuturesExecutor;
+use futures::TryStreamExt;
+use parquet::arrow::arrow_reader::{ArrowPredicateFn, ArrowReaderOptions, RowFilter};
+use parquet::arrow::{ArrowWriter, ParquetRecordBatchStreamBuilder, ProjectionMask};
+use parquet::file::properties::WriterProperties;
+use tokio::fs::File;
+use tokio::runtime::Runtime;
+
+/// Create a RecordBatch with 100K rows and four columns.
+fn make_record_batch() -> RecordBatch {
+    let num_rows = 100_000;
+
+    // int64 column: sequential numbers 0..num_rows
+    let int_values: Vec<i64> = (0..num_rows as i64).collect();

Review Comment:
   I think it is more common to use fixed seeded random values when creating test data, to avoid artifacts that such regular patterns may introduce.
   
   There are some good examples here: https://github.com/apache/arrow-rs/blob/d0260fcffa07a4cb8650cc290ab29027a3a8e65c/parquet/benches/arrow_writer.rs#L101-L100
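   For reference, the reproducibility property such benchmarks rely on is simply that a fixed seed yields a fixed sequence. A dependency-free sketch of the idea, using SplitMix64 here as a stand-in for the seeded `rand` generator the linked code uses (the struct and method names below are illustrative, not from arrow-rs):

```rust
/// SplitMix64: a tiny deterministic PRNG. The same seed always produces the
/// same sequence, so generated benchmark columns are irregular (no sequential
/// patterns) yet identical across runs.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

fn main() {
    // Two generators seeded identically produce identical "random" columns,
    // so benchmark inputs are reproducible across runs and machines.
    let mut a = SplitMix64::new(42);
    let mut b = SplitMix64::new(42);
    let col_a: Vec<i64> = (0..1000).map(|_| a.next_u64() as i64).collect();
    let col_b: Vec<i64> = (0..1000).map(|_| b.next_u64() as i64).collect();
    println!("reproducible: {}", col_a == col_b); // prints "reproducible: true"
}
```

   In the benchmark itself, using `rand` with `StdRng::seed_from_u64` as in the linked `arrow_writer.rs` examples is the more idiomatic choice; the point is only that the seed is fixed.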
   
   



##
parquet/benches/arrow_reader_row_filter.rs:
##
@@ -0,0 +1,325 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Benchmark for evaluating row filters and projections on a Parquet file.
+//!
+//! This benchmark creates a Parquet file in memory with 100K rows and four columns:
+//!  - int64: sequential integers
+//!  - float64: floating-point values (derived from the integers)
+//!  - utf8View: string values where about half are non-empty,
+//!    and a few rows (every 10Kth row) are the constant "const"
+//!  - ts: timestamp values (using, e.g., a millisecond epoch)
+//!
+//! It then applies several filter functions and projections, benchmarking the read-back speed.
+//!
+//! Filters tested:
+//!  - A string filter: `utf8View <> ''` (non