gene-bordegaray opened a new issue, #19973:
URL: https://github.com/apache/datafusion/issues/19973

   ### Is your feature request related to a problem or challenge?
   
   Right now DataFusion relies heavily on row count and total size statistics 
to estimate costs during planning. These statistics come at two levels: 
   1. Relational (modeled by `Statistics`): contains `num_rows` (the number of 
rows to be scanned) and `total_byte_size` (total output bytes)
   2. Column (modeled by `ColumnStatistics`): contains `byte_size` (the 
estimated size of the column in bytes)
   
   While these are useful metrics for stable operations, like reading files, 
they become unstable and inaccurate during query execution.
   
   To show how, let's walk through an example. 
   
   ```text
   A query initially reads 1 million rows from a Parquet file. On the initial 
read it finds that the `byte_size` for col_a is 1GB.
   
   Now a filter reduces the row count to 10 rows. DataFusion has to guess the 
new `byte_size`, and because it lacks a stable per-row metric it often either 
drops the information entirely or falls back to a crude linear scaling.
   
   This can be solved by introducing an average row size statistic: on a file 
read we record the average row size for the column, and that value remains 
stable no matter how the column is filtered, joined, etc. In our example, 
knowing that 10 rows remain after the filter, we can calculate the column's 
data size via `num_rows * avg_byte_size`.
   ```
   
   ### Describe the solution you'd like
   
   I would like to introduce a new `avg_byte_size: Precision<usize>` field on 
the `ColumnStatistics` struct which tracks the average size of a row's value in 
a given column. This field would give DataFusion a new class of statistic, 
"data density", alongside the existing "data totals".
   
   This field would allow every operator in the plan to recalculate output size 
accurately via:
   ```math
   \text{Estimated Output Bytes} = \text{Row Count} \times \sum_{\text{columns}} \text{avg\_byte\_size}
   ```
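
   To make the calculation concrete, here is a minimal, self-contained sketch 
of the idea. The names are illustrative, not the actual DataFusion API, and 
`Precision` below is a simplified stand-in for the real `Precision` type:
   
   ```rust
   // Sketch only: illustrative names, not the real DataFusion API.
   // `Precision` is a simplified stand-in for DataFusion's statistics wrapper.
   #[derive(Clone, Copy, Debug)]
   enum Precision<T> {
       Exact(T),
       Inexact(T),
       Absent,
   }
   
   struct ColumnStats {
       /// Proposed field: average bytes per row for this column.
       avg_byte_size: Precision<usize>,
   }
   
   /// Estimated output bytes = row count * sum of per-column average row sizes.
   fn estimate_output_bytes(num_rows: usize, columns: &[ColumnStats]) -> Precision<usize> {
       let mut sum_avg = 0usize;
       let mut exact = true;
       for col in columns {
           match col.avg_byte_size {
               Precision::Exact(b) => sum_avg += b,
               Precision::Inexact(b) => {
                   sum_avg += b;
                   exact = false;
               }
               // If any column's average is unknown, the whole estimate is unknown.
               Precision::Absent => return Precision::Absent,
           }
       }
       if exact {
           Precision::Exact(num_rows * sum_avg)
       } else {
           Precision::Inexact(num_rows * sum_avg)
       }
   }
   
   fn main() {
       // The example above: ~1 GiB / 1M rows recorded at scan time,
       // 10 rows survive the filter.
       let cols = [ColumnStats {
           avg_byte_size: Precision::Inexact(1_073_741_824 / 1_000_000),
       }];
       println!("{:?}", estimate_output_bytes(10, &cols));
   }
   ```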
   
   The first step in tackling this is just adding the `avg_byte_size` field to 
`ColumnStatistics` along with basic supporting methods and functionality: 
merging partitions' `avg_byte_size` values, reconstructing a relation's total 
byte size from `avg_byte_size`, and updating the Parquet data sources to derive 
and populate this field. A sketch of the merge step follows below.
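
   As a rough sketch of that merge step (the helper name and signature here are 
hypothetical, not an existing DataFusion API), combining partitions' averages 
needs to be weighted by each partition's row count:
   
   ```rust
   // Hypothetical helper: merging two partitions' per-column averages must be
   // weighted by row count, otherwise a small partition skews the result.
   fn merge_avg_byte_size(
       rows_a: usize,
       avg_a: usize,
       rows_b: usize,
       avg_b: usize,
   ) -> Option<usize> {
       let total_rows = rows_a + rows_b;
       if total_rows == 0 {
           return None;
       }
       // Weighted average of the two partitions' averages.
       Some((rows_a * avg_a + rows_b * avg_b) / total_rows)
   }
   
   fn main() {
       // Partition A: 1_000 rows averaging 100 B/row; partition B: 9_000 rows
       // averaging 20 B/row => combined average of 28 B/row.
       assert_eq!(merge_avg_byte_size(1_000, 100, 9_000, 20), Some(28));
   }
   ```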
   
   There will be follow-up work to propagate and utilize these statistics.
   
   ### Describe alternatives you've considered
   
   I considered just using `byte_size / num_rows` because it is "the same 
thing", but this is not true. While that ratio can be computed on a file scan, 
it is a fragile value that is easily corrupted. To show why, here are some 
examples (a small standalone illustration follows the list):
   - When a filter reduces the row count, we have to "guess" how `byte_size` 
will scale. Often the planner cannot guarantee uniform scaling and sets it to 
`Absent`. Once `byte_size` is lost there is no recovery, and the average cannot 
be derived for any subsequent operator.
   - Re-calculating the average from two unstable totals at every operator is 
redundant and, more importantly, introduces cascading rounding errors.
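
   As a standalone numeric illustration of the first point (plain Rust, not 
DataFusion code):
   
   ```rust
   // Why deriving the average from totals is fragile: once a filter drops
   // `byte_size` to unknown, the ratio is gone, whereas an average recorded
   // at scan time survives.
   fn main() {
       // After the scan both totals are known, so the ratio works exactly once.
       let (num_rows, byte_size) = (1_000_000u64, 1_073_741_824u64); // 1M rows, 1 GiB
       let scan_avg = byte_size / num_rows; // ~1073 bytes per row
       println!("average derived at scan time: {scan_avg} B/row");
   
       // After a selective filter the row count is re-estimated (10 rows), but
       // the planner cannot assume `byte_size` scales linearly, so it becomes
       // unknown (i.e. `Absent`), modeled here as `None`.
       let filtered_rows = 10u64;
       let filtered_byte_size: Option<u64> = None;
       assert!(filtered_byte_size.is_none()); // nothing left to divide
   
       // A stored `avg_byte_size` carried through the filter still yields an
       // estimate for every downstream operator.
       println!("estimated output bytes: {}", filtered_rows * scan_avg);
   }
   ```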
   
   ### Additional context
   
   This statistic would be impactful for both single-node DataFusion and 
distributed DataFusion.
   
   Single-node: we will be able to calculate data sizes more reliably and 
consistently and get more accurate estimates, which will have ripple effects on 
how queries are planned.
   
   Distributed: this reaps all the benefits of better single-node statistics 
and planning, and it also opens the door to better network cost decisions.

