Re: [I] The `limit` info lost in the AggregateExec when ser/deser the physical plan [datafusion]

via GitHub Mon, 27 May 2024 02:58:26 -0700


liukun4515 commented on issue #10630:
URL: https://github.com/apache/datafusion/issues/10630#issuecomment-2133119074


   I think PR https://github.com/apache/datafusion/pull/7192 introduced the 
top-k aggregation, backed by a priority queue, in `AggregateExec`. It is used to 
optimize queries that follow the pattern below:
   
   ```
   select column, sum(xx) from table group by column order by column
   ```
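
   To make the priority-queue idea concrete, here is a minimal sketch, not the code from that PR. It assumes the query also carries a `LIMIT k` (which the pattern above does not show), and `top_k_group_sums` is a hypothetical name. It keeps only the `k` smallest group keys while summing, so groups that can never appear in the output are dropped early:

```rust
use std::collections::BTreeMap;

/// Hypothetical sketch, not DataFusion code: computes SUM(value) per group
/// while retaining only the `k` smallest group keys, i.e. the result of
/// `GROUP BY key ORDER BY key ASC LIMIT k`.
fn top_k_group_sums(rows: &[(&str, i64)], k: usize) -> Vec<(String, i64)> {
    // The BTreeMap keeps keys ordered, so it acts as a bounded priority queue.
    let mut groups: BTreeMap<String, i64> = BTreeMap::new();
    if k == 0 {
        return Vec::new();
    }
    for &(key, value) in rows {
        // Key already retained: just update its running sum.
        if let Some(sum) = groups.get_mut(key) {
            *sum += value;
            continue;
        }
        if groups.len() < k {
            groups.insert(key.to_string(), value);
        } else {
            // The map is full. A new key is admitted only if it sorts before
            // the largest retained key, which is then evicted. The cutoff only
            // ever shrinks, so an evicted or rejected key never comes back.
            let largest = groups.keys().next_back().cloned().unwrap();
            if key < largest.as_str() {
                groups.remove(&largest);
                groups.insert(key.to_string(), value);
            }
        }
    }
    // Iterating a BTreeMap yields the groups in ascending key order.
    groups.into_iter().collect()
}
```

   Memory stays bounded by `k` groups instead of the number of distinct keys in the input.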
   
   PR https://github.com/apache/datafusion/pull/8038 then introduced a new 
rule that pushes the limit down for distinct columns. It uses 
`is_unordered_unfiltered_group_by_distinct` to check whether the aggregation 
qualifies, without requiring a `sort` in the plan. This rule is used to optimize queries like:
   
   ```
   select distinct column from table
   select column from table group by column
   ```
   
   However, PR https://github.com/apache/datafusion/pull/8038 cannot actually 
reduce the output of `AggregateExec` in those cases, because 
`GroupedHashAggregateStream` cannot handle a `limit` on its 
output. I think we can implement this feature.
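
   As a rough illustration of what such a feature could look like, here is a minimal sketch. It is not `GroupedHashAggregateStream` itself: `distinct_groups_with_limit` is a hypothetical helper, and plain `Vec<String>` batches stand in for Arrow record batches. For a distinct-only group by with no ordering, the stream can stop reading input as soon as it has seen `limit` distinct groups:

```rust
use std::collections::HashSet;

/// Hypothetical helper, not a DataFusion API: collects distinct group keys
/// from a sequence of input batches and stops as soon as `limit` groups
/// have been produced. Returns the groups and the number of batches read.
fn distinct_groups_with_limit<I>(batches: I, limit: Option<usize>) -> (Vec<String>, usize)
where
    I: IntoIterator<Item = Vec<String>>,
{
    let mut seen: HashSet<String> = HashSet::new();
    let mut output: Vec<String> = Vec::new();
    let mut batches_read = 0;

    // A limit of zero needs no input at all.
    if limit == Some(0) {
        return (output, batches_read);
    }

    for batch in batches {
        batches_read += 1;
        for key in batch {
            // `insert` returns true only the first time a key is seen.
            if seen.insert(key.clone()) {
                output.push(key);
                // Early exit: there is no ordering or aggregate to finish,
                // so any `limit` distinct groups form a valid answer.
                if limit.map_or(false, |l| output.len() >= l) {
                    return (output, batches_read);
                }
            }
        }
    }
    (output, batches_read)
}
```

   With `limit = None` the helper behaves like a plain distinct aggregation and reads every batch. With a limit it can return without consuming the rest of the input.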


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
