pepijnve commented on PR #16196:
URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2945139509
Changing hats to DataFusion user mode: I need to make sure that the end
users of our system can press 'cancel' at any time and that it works as
expected. From that perspective, here's a possibly useful test case (and maybe
an illustration of a more general problem). Suppose you have a query like
`select sum(size) as sum from t group by name order by sum` that produces a
large number of distinct groups. The query plan for this today is:
```
SortPreservingMergeExec: [sum@0 ASC NULLS LAST]
  SortExec: expr=[sum@0 ASC NULLS LAST], preserve_partitioning=[true]
    ProjectionExec: expr=[sum(t.size)@1 as sum]
      AggregateExec: mode=FinalPartitioned, gby=[name@0 as name], aggr=[sum(t.size)]
        CoalesceBatchesExec: target_batch_size=8192
          RepartitionExec: partitioning=Hash([name@0], 10), input_partitions=10
            AggregateExec: mode=Partial, gby=[name@0 as name], aggr=[sum(t.size)]
              RepartitionExec: partitioning=RoundRobinBatch(10), input_partitions=1
                DataSourceExec: file_groups={1 group: [[<file name>]]}, projection=[name, size], file_type=...
```
If I'm reading the code correctly, once the `FinalPartitioned` aggregation
has drained the original input it may switch over to reading back spill files.
At that point the original input (and the yield exec wrapping it) is taken out
of the picture. Unless I'm mistaken, once the query hits that phase it may no
longer be interruptible unless some yield guarantee is injected again.
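To make the concern concrete, here is a minimal std-only sketch of the shape of
the problem. It is not DataFusion code: `BatchSource`, `YieldEvery` and
`SpillingAggregate` are hypothetical names, the "stream" is a simplified poll
function without a waker, and the "spill file" is just a `Vec`. It only shows
that a yield wrapper placed around the input stops having any effect once the
operator switches to a source it created itself.

```rust
use std::task::Poll;

/// Simplified stand-in for a record batch stream (no `Context` or waker).
trait BatchSource {
    fn poll_next(&mut self) -> Poll<Option<u64>>;
}

/// A source that is always ready, like a fast scan or a spill file reader.
struct VecSource(std::vec::IntoIter<u64>);

impl BatchSource for VecSource {
    fn poll_next(&mut self) -> Poll<Option<u64>> {
        Poll::Ready(self.0.next())
    }
}

/// Hypothetical yield wrapper: returns `Pending` once after every `every` polls
/// that were forwarded to the inner source.
struct YieldEvery<S> {
    inner: S,
    every: usize,
    since_yield: usize,
}

impl<S: BatchSource> BatchSource for YieldEvery<S> {
    fn poll_next(&mut self) -> Poll<Option<u64>> {
        if self.since_yield >= self.every {
            self.since_yield = 0;
            return Poll::Pending;
        }
        self.since_yield += 1;
        self.inner.poll_next()
    }
}

/// Toy operator: phase 1 drains `input` into `spilled`, phase 2 reads it back.
struct SpillingAggregate {
    input: Option<Box<dyn BatchSource>>,
    spilled: Vec<u64>,
    readback: std::vec::IntoIter<u64>,
}

impl BatchSource for SpillingAggregate {
    fn poll_next(&mut self) -> Poll<Option<u64>> {
        // Phase 1: consume the (wrapped) input; its `Pending` propagates upward.
        while let Some(input) = self.input.as_mut() {
            match input.poll_next() {
                Poll::Pending => return Poll::Pending,
                Poll::Ready(Some(v)) => self.spilled.push(v),
                Poll::Ready(None) => {
                    // Input exhausted: drop it (and its wrapper) and switch
                    // to a source that was never seen at planning time.
                    self.input = None;
                    self.readback = std::mem::take(&mut self.spilled).into_iter();
                }
            }
        }
        // Phase 2: nothing on this path ever returns `Pending`.
        Poll::Ready(self.readback.next())
    }
}

/// Drives the operator to completion and returns
/// (yields seen in phase 1, yields seen in phase 2, rows emitted).
fn pending_by_phase(rows: u64, every: usize) -> (usize, usize, usize) {
    let input = YieldEvery {
        inner: VecSource((0..rows).collect::<Vec<_>>().into_iter()),
        every,
        since_yield: 0,
    };
    let mut op = SpillingAggregate {
        input: Some(Box::new(input)),
        spilled: Vec::new(),
        readback: Vec::new().into_iter(),
    };
    let (mut in_input, mut in_spill, mut emitted) = (0usize, 0usize, 0usize);
    loop {
        let spill_phase = op.input.is_none();
        match op.poll_next() {
            Poll::Pending if spill_phase => in_spill += 1,
            Poll::Pending => in_input += 1,
            Poll::Ready(Some(_)) => emitted += 1,
            Poll::Ready(None) => return (in_input, in_spill, emitted),
        }
    }
}
```

With 10 input rows and a yield every 4 polls, `pending_by_phase(10, 4)` sees
two yields while the input is being drained and none during read-back, no
matter how many rows the read-back phase emits.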
I don't have a good idea for how to write a practical test case for this
though. You would have to drive a sufficiently large query all the way to this
point to be able to observe the behavior.
I wonder if this illustrates that only analyzing the static picture of the
query at planning time is insufficient because it does not (and probably
cannot) take the dynamic behavior of the query into account. The actual tree of
streams and the points where you might need yield wrappers can change as the
query is executing.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]