gene-bordegaray opened a new issue, #19973: URL: https://github.com/apache/datafusion/issues/19973
### Is your feature request related to a problem or challenge?

Right now DataFusion relies heavily on row count and total size statistics to estimate costs during planning. These statistics come at two levels:

1. Relational (modeled by `Statistics`): contains `num_rows` (number of rows to be scanned) and `total_byte_size` (total output bytes)
2. Column (modeled by `ColumnStatistics`): contains `byte_size` (estimated size of the column in bytes)

While these are useful metrics for stable operations, like reading files, they become unstable and inaccurate during query execution. To show how, let's walk through an example:

```text
A query initially reads 1 million rows from a Parquet file. On the initial read it finds that the
`byte_size` for col_a is 1GB. Now a filter reduces the row count to 10 rows. DataFusion has to
guess the new `byte_size`, and because it lacks a stable metric it often either loses this
information entirely or falls back to a naive linear scaling.

This issue can be solved by introducing an average row size statistic. On a file read we record
the average row size for the column, and that value remains stable no matter how the column is
filtered, joined, etc. In our example, knowing that we have 10 rows after the filter, we can
simply recompute the column data size via `num_rows * avg_byte_size`.
```

### Describe the solution you'd like

I would like to introduce a new `avg_byte_size: Precision<usize>` field to the `ColumnStatistics` struct which tracks the average size of a row in a given column. This field would give DataFusion a new kind of statistic, "data density", to complement the existing "data totals". It would allow every operator in the plan to recalculate its output size accurately via:

```math
\text{Estimated Output Bytes} = \text{Row Count} \times \sum_{\text{columns}} \text{avg\_byte\_size}
```

The first step is simply adding the `avg_byte_size` field to `ColumnStatistics` along with basic supporting functionality: merging partitions' `avg_byte_size` fields, reconstructing a relation's total byte size from `avg_byte_size`, and updating the Parquet data source to derive and populate this field (see the sketches at the end of this issue for a rough illustration). Follow-up work will propagate and make use of these statistics throughout the planner.

### Describe alternatives you've considered

I considered just using `byte_size / num_rows` because it's "the same thing", but that is not true. While this ratio can be computed at file-scan time, it is a fragile value that is easily corrupted:

- When a filter reduces the row count, we have to guess how `byte_size` will scale. Since uniformity usually cannot be guaranteed, it is often set to `Absent`. Once `byte_size` is lost there is no recovery, and we cannot derive these statistics for subsequent operators.
- Re-deriving the average from two unstable totals at every operator is redundant and, more importantly, introduces cascading rounding errors.

### Additional context

These statistics will be impactful for both single-node and distributed DataFusion.

Single-node: we will be able to calculate data sizes more reliably and consistently and get more accurate results, with rippling effects on how queries are planned.

Distributed: this reaps all the benefits of better single-node statistics and planning, and also opens the door to better network cost decisions.
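As a rough illustration (not the real `datafusion_common` API), here is a minimal, self-contained sketch. The `Precision` enum here is a stand-in that mirrors the shape of `datafusion_common::stats::Precision`; the `avg_byte_size` field and the `estimate_output_bytes` helper are hypothetical names for the proposed addition.

```rust
// Stand-in for datafusion_common::stats::Precision (same three variants).
#[derive(Clone, Copy, Debug)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

impl<T: Copy> Precision<T> {
    fn value(&self) -> Option<T> {
        match self {
            Precision::Exact(v) | Precision::Inexact(v) => Some(*v),
            Precision::Absent => None,
        }
    }
}

// Hypothetical extension of ColumnStatistics with the proposed field.
#[derive(Clone, Copy, Debug)]
struct ColumnStatistics {
    /// Proposed "data density" statistic: average bytes per row for this column.
    /// Stays stable under filters/joins, unlike a total byte count.
    avg_byte_size: Precision<usize>,
}

/// Estimated Output Bytes = Row Count * Sum(avg_byte_size).
/// Returns `None` as soon as any input is `Absent`.
fn estimate_output_bytes(
    num_rows: Precision<usize>,
    columns: &[ColumnStatistics],
) -> Option<usize> {
    let rows = num_rows.value()?;
    let bytes_per_row: usize = columns
        .iter()
        .map(|c| c.avg_byte_size.value())
        .sum::<Option<usize>>()?;
    Some(rows * bytes_per_row)
}

fn main() {
    // The running example: col_a averages ~1 KiB per row on the initial scan.
    let col_a = ColumnStatistics { avg_byte_size: Precision::Exact(1024) };

    // After a filter keeps only 10 rows, the density is unchanged, so the
    // output size is simply recomputed from the new row count.
    let filtered = estimate_output_bytes(Precision::Inexact(10), &[col_a]);
    assert_eq!(filtered, Some(10 * 1024));
    println!("estimated output bytes: {:?}", filtered);
}
```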
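Similarly, a hedged sketch of the partition-merge piece, assuming the stand-in types above: the merged `avg_byte_size` is a row-count-weighted average, so combining partitions of different sizes does not skew the density, and any `Absent` input makes the merged value `Absent`.

```rust
/// Merge per-partition (num_rows, avg_byte_size) pairs into one avg_byte_size.
/// e.g. merge_avg_byte_size(&[(1_000, Precision::Exact(100)),
///                            (3_000, Precision::Exact(200))])
///      yields Precision::Inexact(175).
fn merge_avg_byte_size(parts: &[(usize, Precision<usize>)]) -> Precision<usize> {
    let mut total_rows = 0usize;
    let mut total_bytes = 0usize;
    for &(rows, avg) in parts {
        match avg.value() {
            Some(avg_bytes) => {
                total_rows += rows;
                total_bytes += rows * avg_bytes;
            }
            // One unknown partition poisons the merged estimate.
            None => return Precision::Absent,
        }
    }
    if total_rows == 0 {
        return Precision::Absent;
    }
    // Merging already-estimated values, so the result is at best inexact.
    Precision::Inexact(total_bytes / total_rows)
}
```

A weighted average (rather than a simple mean of the per-partition averages) keeps the reconstructed total, `num_rows * avg_byte_size`, consistent with the sum of the partitions' byte counts.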
