cloud-fan commented on PR #42223: URL: https://github.com/apache/spark/pull/42223#issuecomment-1664916455

When merging `func1(...) ... WHERE cond1` and `func2(...) ... WHERE cond2`, we get

```
func1(...) FILTER cond1, func2(...) FILTER cond2 ... WHERE cond1 OR cond2
```

Assume there is no overlapping scan (so almost no benefit) and the table has N rows. Previously, `cond1` and `cond2` each got evaluated at most N times (the scan gets pruned). If they are partition filters, they are evaluated 0 times. Now, they get evaluated at most 2N times, and likely more than twice as often as before, since we prune less data from the scan. The worst case is partition filters: before, they were evaluated 0 times; now they are evaluated once per row matching `cond1 OR cond2`, plus some extra evaluations in the aggregate filters.

I don't think putting the predicates in a `Project` helps, as the problem comes from scan pruning. Given that it's hard to estimate the overlapping scan size, the only idea I can think of is to estimate the cost of the predicates. We only do the merge if the predicates are cheap to evaluate: they contain only simple comparisons and the expression tree size is small.

We can also define some patterns where the overlapping scan size must be large, e.g. one side has no filter, or the filters in the two aggregates are the same. For these cases, we can always apply the merge.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
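To make the evaluation-count argument concrete, here is a toy per-predicate count under the worst-case assumptions stated above (no scan overlap, no pruning, and every row surviving the combined scan filter). The row count `N` is arbitrary and only illustrative:

```python
# Toy model of worst-case per-predicate evaluation counts before and
# after merging two filtered aggregates over the same table.
# Assumptions (not measured Spark behavior): no scan overlap, no
# partition/file pruning, and every row matches cond1 OR cond2.
N = 1_000_000  # illustrative table row count

# Before merging: each query scans the table once and evaluates only
# its own predicate, so each predicate runs at most N times.
before_per_predicate = N

# After merging: the single scan's filter (cond1 OR cond2) evaluates
# each predicate up to N times, and each aggregate's FILTER clause
# evaluates the same predicate again on every surviving row (up to N).
after_per_predicate = N + N

print(before_per_predicate, after_per_predicate)  # 1000000 2000000
```

With partition filters the "before" count drops to 0 while the "after" count stays proportional to the matching rows, which is the worst case the comment calls out.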
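The "cheap predicate" heuristic could be sketched roughly as below. This is a hypothetical stand-alone model, not Spark's actual Catalyst API; all class names, the `is_cheap_predicate` helper, and the size threshold are made up for illustration:

```python
# Sketch of the proposed merge gate: only merge when every filter
# predicate is "cheap" -- built solely from simple comparisons over
# column references and literals, with a small expression tree.
# The expression classes below are hypothetical stand-ins for
# Catalyst expressions.
from dataclasses import dataclass

@dataclass
class Column:
    name: str

@dataclass
class Literal:
    value: object

@dataclass
class Comparison:  # e.g. price > 100
    op: str        # one of: = != < <= > >=
    left: object
    right: object

@dataclass
class And:
    left: object
    right: object

@dataclass
class Or:
    left: object
    right: object

def tree_size(e):
    """Count the nodes in the expression tree."""
    if isinstance(e, (Column, Literal)):
        return 1
    if isinstance(e, (Comparison, And, Or)):
        return 1 + tree_size(e.left) + tree_size(e.right)
    return 1

def is_simple(e):
    """True if e is only comparisons of columns/literals joined by AND/OR."""
    if isinstance(e, (Column, Literal)):
        return True
    if isinstance(e, Comparison):
        return isinstance(e.left, (Column, Literal)) and \
               isinstance(e.right, (Column, Literal))
    if isinstance(e, (And, Or)):
        return is_simple(e.left) and is_simple(e.right)
    return False  # UDFs, regexes, subqueries, etc. are not cheap

MAX_TREE_SIZE = 10  # arbitrary threshold, purely for illustration

def is_cheap_predicate(e):
    return is_simple(e) and tree_size(e) <= MAX_TREE_SIZE

# cond1: price > 100 -- cheap
cond1 = Comparison(">", Column("price"), Literal(100))
# cond2: category = 'books' AND price <= 50 -- cheap
cond2 = And(Comparison("=", Column("category"), Literal("books")),
            Comparison("<=", Column("price"), Literal(50)))
print(is_cheap_predicate(cond1), is_cheap_predicate(cond2))  # True True
```

A real implementation would also special-case the always-merge patterns mentioned above (no filter on one side, or identical filters) before consulting the cost check at all.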
For merging `func1(...) ... WHERE cond1` and `func2(...) ... WHERE cond2`, we got ``` func1(...) FILTER cond1, func2(...) FILTER cond2 ... WHERE cond1 OR cond2 ``` Assuming there is no overlapped scan (so almost no benefit), and the table has N rows. Previously, both `cond1` and `cond2` get evaluated at most N times (the scan gets prunned). If they are partition filters, then they are evaluated 0 times. Now, they get evaluated at most 2N times. It's likely more than 2 times than before as we prune less data from the scan. The worst case is partition filters. Before, they get evaluated 0 times, now they get evaluated (numRows matching `cond1 OR cond2`) times, plus some extra evaluation in the aggregate filter. I don't think putting the predicates in a `Project` helps as the problem is from scan prunning. Given that it's hard to estimate the overlapped scan size, the only idea I can think of is to estimate the cost of the predicate. We only do the merge if the predicate is cheap to evaluate: only contains simple comparison and the expression tree size is small. We can define some patterns when the overlapped scan size must be large, e.g. one side has no filter, or the filters in two aggregates are the same. For these cases, we can always apply the merge. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org