efredine commented on code in PR #11200: URL: https://github.com/apache/datafusion/pull/11200#discussion_r1663308953
########## datafusion/core/src/datasource/physical_plan/parquet/statistics.rs: ########## @@ -903,6 +917,21 @@ macro_rules! get_data_page_statistics { new_empty_array(&DataType::Time64(unit.clone())) } }) + }, + Some(DataType::FixedSizeBinary(size)) => { + Ok(Arc::new( + FixedSizeBinaryArray::from( + [<$stat_type_prefix FixedLenByteArrayDataPageStatsIterator>]::new($iterator).map(|x| { + x.into_iter().filter_map(|x| x).map(|x| { + if x.len().try_into() == Ok(*size) { + Some(x.data().to_vec()) + } else { + None + } + }) + }).flatten().collect::<Vec<_>>() + ) + )) Review Comment: So, looking even further - this loop is just replicating the logic in `try_from_iter`. So this can be further simplified to: ```Rust Some(DataType::FixedSizeBinary(size)) => { Ok(Arc::new( FixedSizeBinaryArray::try_from_iter( [<$stat_type_prefix FixedLenByteArrayDataPageStatsIterator>]::new($iterator) .flat_map(|x| x.into_iter()) .filter_map(|x| x) ).unwrap_or_else(|e| { log::debug!("FixedSizeBinary statistics is invalid: {}", e); FixedSizeBinaryArray::new(*size, vec![].into(), None) }) )) }, ``` Which will also be more efficient because it's not performing the same check twice. And we should probably apply this change to the RowGroup statistics accumulator as well? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org