Re: [PR] Use `any` instead of `for_each` [datafusion]
berkaysynnada commented on PR #15289: URL: https://github.com/apache/datafusion/pull/15289#issuecomment-2732082181 > > Instead of either "try for all" or "skip at all", isn't better to only go over the columns which has statistics.is_some() ? > > Do you mean the following? yep, even withing a single iteration after zipping -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Use `any` instead of `for_each` [datafusion]
xudong963 commented on code in PR #15289:
URL: https://github.com/apache/datafusion/pull/15289#discussion_r2007912619
##
datafusion/datasource-parquet/src/file_format.rs:
##
@@ -839,9 +839,10 @@ pub fn statistics_from_parquet_meta_calc(
total_byte_size += row_group_meta.total_byte_size() as usize;
if !has_statistics {
-row_group_meta.columns().iter().for_each(|column| {
Review Comment:
Yes, because `even if a column in table_schema doesn't have the
corresponding statistics in row_group_meta, it also may need to go
summarize_min_max_null_counts to set statistics with null.`, I think it isn't
worth trying.
In fact, I have tried it, and it doesn't make much sense, they
(row_group_meta.columns() and table_schema.fields())are not always aligned.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Use `any` instead of `for_each` [datafusion]
xudong963 commented on code in PR #15289:
URL: https://github.com/apache/datafusion/pull/15289#discussion_r2007912619
##
datafusion/datasource-parquet/src/file_format.rs:
##
@@ -839,9 +839,10 @@ pub fn statistics_from_parquet_meta_calc(
total_byte_size += row_group_meta.total_byte_size() as usize;
if !has_statistics {
-row_group_meta.columns().iter().for_each(|column| {
Review Comment:
Because `even if a column in table_schema doesn't have the corresponding
statistics in row_group_meta, it also may need to go
summarize_min_max_null_counts to set statistics with null.`, I think it isn't
worth trying. In fact, I have tried it, and it doesn't make much sense, they
(row_group_meta.columns() and table_schema.fields())are not always aligned.
##
datafusion/datasource-parquet/src/file_format.rs:
##
@@ -839,9 +839,10 @@ pub fn statistics_from_parquet_meta_calc(
total_byte_size += row_group_meta.total_byte_size() as usize;
if !has_statistics {
-row_group_meta.columns().iter().for_each(|column| {
Review Comment:
Yes, because `even if a column in table_schema doesn't have the
corresponding statistics in row_group_meta, it also may need to go
summarize_min_max_null_counts to set statistics with null.`, I think it isn't
worth trying. In fact, I have tried it, and it doesn't make much sense, they
(row_group_meta.columns() and table_schema.fields())are not always aligned.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Use `any` instead of `for_each` [datafusion]
berkaysynnada commented on code in PR #15289:
URL: https://github.com/apache/datafusion/pull/15289#discussion_r2007300555
##
datafusion/datasource-parquet/src/file_format.rs:
##
@@ -839,9 +839,10 @@ pub fn statistics_from_parquet_meta_calc(
total_byte_size += row_group_meta.total_byte_size() as usize;
if !has_statistics {
-row_group_meta.columns().iter().for_each(|column| {
Review Comment:
You are right, I wasn't oppose to your change. I was just thinking a further
refactor of unifying these two iterations:
```rust
row_group_meta.columns().iter().for_each(|column| {
has_statistics = column.statistics().is_some();
});
```
and
```rust
table_schema
.fields()
.iter()
.enumerate()
.for_each(|(idx, field)| {
...
```
but maybe not worth to try
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Use `any` instead of `for_each` [datafusion]
xudong963 commented on code in PR #15289:
URL: https://github.com/apache/datafusion/pull/15289#discussion_r2007191299
##
datafusion/datasource-parquet/src/file_format.rs:
##
@@ -839,9 +839,10 @@ pub fn statistics_from_parquet_meta_calc(
total_byte_size += row_group_meta.total_byte_size() as usize;
if !has_statistics {
-row_group_meta.columns().iter().for_each(|column| {
Review Comment:
The key issue is how `has_statistics` is assigned within the `for_each`
loop. During iteration, the value of `has_statistics` is overwritten by each
column's statistics status.
### Example Scenario
Assume there are 2 columns with the following statistics status:
- Column 1: Has statistics (statistics is Some)
- Column 2: No statistics (statistics is None)
Execution process:
- Processing Column 1: has_statistics becomes true
- Processing Column 2: has_statistics becomes false
if the last column processed has no statistics, the final value of
has_statistics would be false.
if the goal is to check "whether any column has statistics", a more suitable
approach would be to use `any`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Use `any` instead of `for_each` [datafusion]
xudong963 merged PR #15289: URL: https://github.com/apache/datafusion/pull/15289 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Use `any` instead of `for_each` [datafusion]
xudong963 commented on PR #15289: URL: https://github.com/apache/datafusion/pull/15289#issuecomment-2743953225 Thanks for your review @berkaysynnada @alamb -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Use `any` instead of `for_each` [datafusion]
xudong963 commented on PR #15289:
URL: https://github.com/apache/datafusion/pull/15289#issuecomment-2732041851
> Instead of either "try for all" or "skip at all", isn't better to only go
over the columns which has statistics.is_some() ?
Do you mean the following?
```rust
// Instead of a single boolean flag
let columns_with_statistics: Vec<_> = row_group_meta.columns().iter()
.enumerate()
.filter_map(|(idx, column)| {
if column.statistics().is_some() {
Some(idx)
} else {
None
}
})
.collect();
// Then later, when processing column statistics
for (col_idx, field) in table_schema.fields().iter().enumerate() {
if columns_with_statistics.contains(&col_idx) {
// Process only columns that have statistics
// Use the statistics converter and update min/max accumulators
}
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
