avantgardnerio commented on issue #7198: URL: https://github.com/apache/arrow-datafusion/issues/7198#issuecomment-1666485998
> if we can get a general purpose optimization that also handles queries with different aggregates I don't think #7192 can handle your example. It works by "evicting" (nice term @tustvold ) groups from the accumulator unless they are the current min/max. So if we run your example: ``` SELECT tag, field, max(time), min(other_field) FROM t GROUP BY tag ORDER BY max(time) DESC LIMIT 10 ``` on: ``` +-----+-------+-------+-------------+ | tag | time | field | other_field | +-----+-------+-------+-------------+ | 1 | 01:00 | | 0 | | 2 | 02:00 | | 1 | | 3 | 03:00 | | 1 | | 4 | 04:00 | | 1 | | 5 | 05:00 | | 1 | | 6 | 06:00 | | 1 | | 7 | 07:00 | | 1 | | 8 | 08:00 | | 1 | | 9 | 09:00 | | 1 | | 10 | 10:00 | | 1 | | 11 | 11:00 | | 1 | | 1 | 12:00 | | 999 | +-----+-------+-------+-------------+ ``` We will: 1. accumulate 10 tag-groups 2. run into tag-group 11 with a greater `time` that group-tag 1 (`11:00` vs `01:00`) 3. evict tag-group 1 from our accumulator, along with it's `other_field=0` value 4. run into tag-group 1 again with a greater `time` than tag-group 2 (`12:00` vs `02:00`) 5. evict tag-group 2, replace it with tag-group 1, and it's new "min" of `999` Hopefully this example makes it clear why we can only accumulate values present in the `order by` clause, given the approach in #7192 based on the functional requirements (not-sorting) of #7191 . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
