Rachelint commented on PR #11261:
URL: https://github.com/apache/datafusion/pull/11261#issuecomment-2211608009
> Thank you @Rachelint ❤️
>
> > It is mainly due to AggregateUDFImpl is defined in expr crate, and the
needed PhysicalExpr is defined in datafusion-physical-expr-common crate which
will refer exps a dependency. And if we import PhysicalExpr in expr, that will
lead to cyclic dependecies...
>
> This makes sense -- I agree the ` AggregateUDFImpl` API should not be in
terms of `PhysicalExpr` (for the dependency reasons you describe)
>
> One way I could think to get around this would be to figure out the column
statistics in the physical planner and pass it directly (rather than making the
UDF figure them out). Something like
>
> ```rust
> pub trait AggregateUDFImpl {
> ...
> /// If the output of this aggregate function can be determined
> /// only from input statistics (e.g. COUNT(..) with 0 rows, is always 0)
> /// return that value
> ///
> /// # Arguments
> /// stats: Overall statistics for the input (including row count)
> /// arg_stats: Statistics, if known, for each input argument
> fn output_from_stats(
> &self,
> stats: &Statistics,
> arg_stats: &[Option<&ColumnStatistics>]
> ) -> Option<ScalarValue> {
> None
> }
> ...
> }
> ```
>
> ANother option might be to use the narrower API described on #11153
>
> ```rust
> impl AggregateExpr {
>
> /// Return the value of the aggregate function, if known, given the
number of input rows.
> ///
> /// Return None if the value can not be determined solely from the input.
> ///
> /// # Examples
> /// * The `COUNT` aggregate would return `Some(11)` given `num_rows = 11`
> /// * The `MIN` aggregate would return `Some(Null)` given `num_rows = 0
> /// * The `MIN` aggregate would return `None` given num_rows = 11
> fn output_from_rows(&self, num_rows: usize) -> Option<ScalarValue> { None
}
> ...
> }
> ```
>
> Though I think your idea is better (pass in statistics in generla)
> Thank you @Rachelint ❤️
>
> > It is mainly due to AggregateUDFImpl is defined in expr crate, and the
needed PhysicalExpr is defined in datafusion-physical-expr-common crate which
will refer exps a dependency. And if we import PhysicalExpr in expr, that will
lead to cyclic dependecies...
>
> This makes sense -- I agree the ` AggregateUDFImpl` API should not be in
terms of `PhysicalExpr` (for the dependency reasons you describe)
>
> One way I could think to get around this would be to figure out the column
statistics in the physical planner and pass it directly (rather than making the
UDF figure them out). Something like
>
> ```rust
> pub trait AggregateUDFImpl {
> ...
> /// If the output of this aggregate function can be determined
> /// only from input statistics (e.g. COUNT(..) with 0 rows, is always 0)
> /// return that value
> ///
> /// # Arguments
> /// stats: Overall statistics for the input (including row count)
> /// arg_stats: Statistics, if known, for each input argument
> fn output_from_stats(
> &self,
> stats: &Statistics,
> arg_stats: &[Option<&ColumnStatistics>]
> ) -> Option<ScalarValue> {
> None
> }
> ...
> }
> ```
>
> ANother option might be to use the narrower API described on #11153
>
> ```rust
> impl AggregateExpr {
>
> /// Return the value of the aggregate function, if known, given the
number of input rows.
> ///
> /// Return None if the value can not be determined solely from the input.
> ///
> /// # Examples
> /// * The `COUNT` aggregate would return `Some(11)` given `num_rows = 11`
> /// * The `MIN` aggregate would return `Some(Null)` given `num_rows = 0
> /// * The `MIN` aggregate would return `None` given num_rows = 11
> fn output_from_rows(&self, num_rows: usize) -> Option<ScalarValue> { None
}
> ...
> }
> ```
>
> Though I think your idea is better (pass in statistics in generla)
One thing I worry about the narrow api is that, it seems can't be used to
support the origin optimization of min/max?
https://github.com/apache/datafusion/blob/7df000a333a7d4018e1446ef900f652288b1f104/datafusion/core/src/physical_optimizer/aggregate_statistics.rs#L195-L221
Maybe I misunderstand about it?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]