ctsk commented on issue #13729: URL: https://github.com/apache/datafusion/issues/13729#issuecomment-2594124886
I checked out your second point and was able to replicate the difference in
speed between the two queries. I don't yet see how a TopK operator would help
the non-ordered query.
When comparing the query plans, it seems like the "limit" is not propagated
to AggregateExec:
Query with OrderBy:
| SortPreservingMergeExec: [min(hits.AdvEngineID)@1 ASC NULLS LAST],
fetch=10
| SortExec: TopK(fetch=10), expr=[min(hits.AdvEngineID)@1 ASC NULLS LAST],
preserve_partitioning=[true]
| AggregateExec: mode=FinalPartitioned, gby=[UserID@0 as UserID],
aggr=[min(hits.AdvEngineID)], lim=[10]
| CoalesceBatchesExec: target_batch_size=8192
| RepartitionExec: partitioning=Hash([UserID@0], 8),
input_partitions=8
| AggregateExec: mode=Partial, gby=[UserID@0 as UserID],
aggr=[min(hits.AdvEngineID)], lim=[10]
Query without OrderBy:
| GlobalLimitExec: skip=0, fetch=10
| CoalescePartitionsExec
| AggregateExec: mode=FinalPartitioned, gby=[UserID@0 as UserID],
aggr=[min(hits.AdvEngineID)]
| CoalesceBatchesExec: target_batch_size=8192
| RepartitionExec: partitioning=Hash([UserID@0], 8),
input_partitions=8
| AggregateExec: mode=Partial, gby=[UserID@0 as UserID],
aggr=[min(hits.AdvEngineID)]
After hackily adding the limit to the aggregation, the query was faster
without the orderby (1.4 seconds vs 1.55 seconds).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
