pepijnve commented on PR #18152: URL: https://github.com/apache/datafusion/pull/18152#issuecomment-3448639483
> Your solution **assume** that case expression evaluation are cheaper than copy record batch, right?

I don't understand what you mean; could you clarify where you see that assumption? The current code on `main` already copies record batches on every `evaluate_selection` call: `evaluate_selection(rb, selection)` is essentially `scatter(evaluate(filter_record_batch(rb, selection)), selection)`.

What I'm trying to do here is actually to reduce the amount of data that's processed. The implementation on `main` always starts from the full input record batch, while the implementation here shrinks the record batch as it moves through the case branches. #18275 takes this one step further by projecting away (and as a consequence not filtering) unused columns.

Additionally, on the result-processing side, the current implementation zips arrays of length `record_batch.num_rows()` for each branch. The merge operation reduces that to a single pass, which is even avoided entirely when possible.

> in case a IS NULL filtered 10% for example, do you evaluate a > 1 for the remaining 90% or 100%?

90%. 100% would not work in general. There were already SLTs covering lazy evaluation of the 'then' expressions; I've added a couple extra for the 'when' expressions/predicates as well.

See the second diagram in https://github.com/apache/datafusion/pull/18152#issuecomment-3447841401 for a worked example of the exact evaluation strategy.
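To make the filter/scatter decomposition concrete, here is a simplified sketch over plain vectors instead of Arrow arrays. The names `filter`, `scatter`, and `evaluate_selection` mirror the roles described above but are illustrative stand-ins, not DataFusion's actual API:

```rust
// Hypothetical model: evaluate an expression only on selected rows,
// then scatter the dense result back to the full batch length.

/// Keep only the values whose selection bit is true (dense output).
fn filter(values: &[i64], selection: &[bool]) -> Vec<i64> {
    values
        .iter()
        .zip(selection)
        .filter(|(_, &s)| s)
        .map(|(&v, _)| v)
        .collect()
}

/// Spread a dense result back out to full length, with None for
/// unselected positions.
fn scatter(dense: &[i64], selection: &[bool]) -> Vec<Option<i64>> {
    let mut it = dense.iter();
    selection
        .iter()
        .map(|&s| if s { it.next().copied() } else { None })
        .collect()
}

/// evaluate_selection(rb, selection) ~= scatter(evaluate(filter(rb, selection)), selection)
fn evaluate_selection(
    values: &[i64],
    selection: &[bool],
    expr: impl Fn(i64) -> i64,
) -> Vec<Option<i64>> {
    let dense: Vec<i64> = filter(values, selection).into_iter().map(&expr).collect();
    scatter(&dense, selection)
}

fn main() {
    let values = [1, 2, 3, 4];
    let sel = [true, false, true, false];
    let out = evaluate_selection(&values, &sel, |v| v * 10);
    println!("{:?}", out); // [Some(10), None, Some(30), None]
}
```

Note that both `filter` and `scatter` allocate a new buffer, which is the copying referred to above: it happens on every `evaluate_selection` call regardless of which approach is used.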

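The "90%, not 100%" answer can be sketched as follows: each branch's 'when' predicate only runs on rows no earlier branch matched, and the 'then' expression only runs on rows that branch did match. This is a toy model over plain slices (function names and the scalar row representation are illustrative, not DataFusion's implementation):

```rust
// Illustrative sketch of lazy CASE branch evaluation: the set of
// remaining rows shrinks after every branch, so later predicates
// never see rows an earlier branch already claimed.

fn case_eval(
    values: &[i64],
    branches: &[(fn(i64) -> bool, fn(i64) -> i64)],
) -> Vec<Option<i64>> {
    let mut result: Vec<Option<i64>> = vec![None; values.len()];
    // Indices of rows not yet matched by any branch.
    let mut remaining: Vec<usize> = (0..values.len()).collect();
    for (when, then) in branches {
        let mut still_unmatched = Vec::new();
        for &i in &remaining {
            if when(values[i]) {
                // 'then' is evaluated only for rows this branch matched.
                result[i] = Some(then(values[i]));
            } else {
                still_unmatched.push(i);
            }
        }
        remaining = still_unmatched;
        if remaining.is_empty() {
            break; // nothing left: later branches are skipped entirely
        }
    }
    result
}

fn main() {
    // Roughly: CASE WHEN v < 0 THEN 0 WHEN v > 1 THEN v * 2 END
    // Row 0 matches branch 1, so branch 2's predicate never sees it.
    let values = [-1, 0, 2, 3];
    let out = case_eval(&values, &[(|v| v < 0, |_| 0), (|v| v > 1, |v| v * 2)]);
    println!("{:?}", out); // [Some(0), None, Some(4), Some(6)]
}
```

If the first predicate filters out 10% of the rows, the second predicate is evaluated against the remaining 90%, which is both the lazy-evaluation behaviour the SLTs check and the source of the data-size reduction described above.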