feniljain opened a new issue, #19334: URL: https://github.com/apache/datafusion/issues/19334
### Describe the bug

Hey everyone! 👋🏻

I was reading through how metrics are exported from `arrow-rs` to `datafusion`. It seems to happen with the help of [ArrowReaderMetrics](https://github.com/apache/arrow-rs/blob/e49c2edbde46c09cf19d2be344a841a041d416f0/parquet/src/arrow/arrow_reader/metrics.rs#L31-L38). We [init](https://github.com/apache/datafusion/blob/899a762230d0abc705482d8898d3793763b6afb4/datafusion/datasource-parquet/src/opener.rs#L498) this struct in `ParquetOpener` and then update the metric with a [map](https://github.com/apache/datafusion/blob/899a762230d0abc705482d8898d3793763b6afb4/datafusion/datasource-parquet/src/opener.rs#L526) over the stream, fetching the value from the `ArrowReaderMetrics` struct and [incrementing](https://github.com/apache/datafusion/blob/899a762230d0abc705482d8898d3793763b6afb4/datafusion/datasource-parquet/src/opener.rs#L580) the DataFusion metric by it.

The problem I see with this flow is that `map` runs every time the stream yields an item. The value in `ArrowReaderMetrics` is cumulative, so if the stream is polled 3 times, we read an ever-increasing value 3 times and add it to the DataFusion metric each time.

### To Reproduce

I noticed this while implementing a different metric. It's on an internal fork, but I can try to make a reproducer if this is hard to follow.

### Expected behavior

For example, suppose the flow looks like this:

start: there are 3 record batches worth of data in this file, and df_predicate_cache_inner_records = 0

- 1st poll: return 1st record batch, predicate_cache_inner_records: 5, adding to df_predicate_cache_inner_records: 5
- 2nd poll: return 2nd record batch, predicate_cache_inner_records: 10, adding to df_predicate_cache_inner_records: 15
- 3rd poll: return 3rd record batch, predicate_cache_inner_records: 15, adding to df_predicate_cache_inner_records: 30

So instead of the DataFusion metric having a value of 15, it ends up at 30.
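The arithmetic in the three polls above can be reproduced with a small standalone sketch. This is not the DataFusion or arrow-rs code: `ReaderCounter`, `naive_total`, and `delta_total` are hypothetical names made up for illustration, and the sketch assumes each record batch adds 5 to a cumulative reader-side counter. The delta-based variant is only one possible way to avoid the double counting; reading the counter once at the end of the stream would give the same result.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for a cumulative, reader-side counter
/// (it only ever grows while the file is being read).
struct ReaderCounter(AtomicUsize);

impl ReaderCounter {
    fn add(&self, n: usize) {
        self.0.fetch_add(n, Ordering::Relaxed);
    }

    fn get(&self) -> usize {
        self.0.load(Ordering::Relaxed)
    }
}

/// Pattern described in the issue: on every poll, add the full
/// cumulative value to the downstream metric.
fn naive_total(batches: &[usize]) -> usize {
    let reader = ReaderCounter(AtomicUsize::new(0));
    let mut df_metric = 0;
    for &rows in batches {
        reader.add(rows); // reader processes one batch
        df_metric += reader.get(); // adds 5, then 10, then 15
    }
    df_metric
}

/// One possible alternative: add only the increase observed
/// since the previous poll.
fn delta_total(batches: &[usize]) -> usize {
    let reader = ReaderCounter(AtomicUsize::new(0));
    let mut df_metric = 0;
    let mut last_seen = 0;
    for &rows in batches {
        reader.add(rows);
        let current = reader.get();
        df_metric += current - last_seen; // adds 5 on each poll
        last_seen = current;
    }
    df_metric
}
```

Calling `naive_total(&[5, 5, 5])` returns 30, matching the compounded value in the example, while `delta_total(&[5, 5, 5])` returns 15.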
In short, we want to record the metric once, at the end of the stream, but because this is implemented with `map` we get a compounded value of the metric over time.

### Additional context

Do correct me if I am misunderstanding this. If it does sound like a bug, I would love to make a fix for it :)
