cloud-fan commented on PR #42223:
URL: https://github.com/apache/spark/pull/42223#issuecomment-1664916455

   For merging `func1(...) ... WHERE cond1` and `func2(...) ... WHERE cond2`, 
we get
   ```
   func1(...) FILTER cond1, func2(...) FILTER cond2 ... WHERE cond1 OR cond2
   ```
   
   Assume the two scans do not overlap (so merging brings almost no benefit), 
and the table has N rows. Previously, `cond1` and `cond2` were each evaluated at 
most N times (and the scan gets pruned). If they are partition filters, they are 
evaluated 0 times. Now they get evaluated at most 2N times in the merged scan 
filter, and likely more than twice as often as before, since we prune less data 
from the scan. The worst case is partition filters: before, they were evaluated 
0 times; now they are evaluated (number of rows matching `cond1 OR cond2`) 
times, plus some extra evaluations in the aggregate filters. I don't think 
putting the predicates in a `Project` helps, as the problem comes from scan 
pruning.
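   To make the evaluation-count argument concrete, here is a minimal sketch 
(not Spark code; the row counts, selectivities `s1`/`s2`, and the assumption of 
disjoint conditions are all hypothetical inputs for illustration):

```python
def eval_counts(n, s1, s2):
    """Rough predicate-evaluation counts before and after merging two
    aggregates over a table of n rows, where cond1 matches a fraction s1
    of rows, cond2 matches s2, and the two conditions are disjoint."""
    # Before merging: each query's scan evaluates its own condition at most
    # n times (0 times if it is a partition filter).
    before = n + n
    # After merging: the combined scan filter `cond1 OR cond2` evaluates both
    # conditions at most n times each, and every surviving row is then
    # re-checked by both aggregate FILTER clauses.
    kept = n * (s1 + s2)  # rows surviving the OR, assuming disjoint conditions
    after = 2 * n + 2 * kept
    return before, after

# Example: 1M rows, each condition matching 25% of them.
before, after = eval_counts(1_000_000, 0.25, 0.25)
```

Under these toy assumptions the merged plan does strictly more predicate work 
than the two separate plans, which matches the concern above.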
   
   Given that it's hard to estimate the overlapped scan size, the only idea I 
can think of is to estimate the cost of the predicates: we only do the merge if 
the predicates are cheap to evaluate, i.e. they contain only simple comparisons 
and the expression tree is small.
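   A hedged sketch of such a "cheap predicate" check, on a toy tuple-based 
expression tree rather than Spark's Catalyst API (the operator set and the size 
threshold are made-up placeholders):

```python
# Operators we consider cheap: comparisons, boolean connectives, and leaves.
CHEAP_OPS = {"=", "<", "<=", ">", ">=", "AND", "OR", "NOT", "attr", "lit"}

def tree_size(expr):
    """Count nodes in a tuple-encoded expression tree: (op, child1, ...)."""
    return 1 + sum(tree_size(c) for c in expr[1:] if isinstance(c, tuple))

def is_cheap(expr, max_size=10):
    """A predicate qualifies for merging only if every node is a simple
    comparison/connective/leaf and the whole tree is small."""
    if expr[0] not in CHEAP_OPS:
        return False  # e.g. a UDF call or regex match would be rejected
    children_ok = all(is_cheap(c, max_size)
                      for c in expr[1:] if isinstance(c, tuple))
    return children_ok and tree_size(expr) <= max_size

# cond1: a < 10 AND b = 'x'  -- simple comparisons, small tree => cheap
cond1 = ("AND",
         ("<", ("attr", "a"), ("lit", 10)),
         ("=", ("attr", "b"), ("lit", "x")))
```

The real rule would walk Catalyst expressions instead, but the shape of the 
check (whitelist of node types plus a size bound) would be the same.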
   
   We can also define patterns where the overlapped scan size must be large, 
e.g. one side has no filter, or the two aggregates have identical filters. In 
these cases, we can always apply the merge.
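   On the same toy representation, the "always safe to merge" patterns reduce 
to a trivial check (again a sketch, not the actual rule):

```python
def always_merge(filter1, filter2):
    """Merge unconditionally when one side has no filter (its scan covers
    everything, so the other scan is fully contained) or when both aggregates
    carry the same filter (the scans are identical)."""
    return filter1 is None or filter2 is None or filter1 == filter2

cond = ("<", ("attr", "a"), ("lit", 10))
```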


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

