alamb opened a new issue #217:
URL: https://github.com/apache/arrow-datafusion/issues/217


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   As @Dandandan  and I were discussing on 
https://github.com/apache/arrow-datafusion/issues/78#issuecomment-827528710
   
   The good news is that after #78, DataFusion can run TPCH Q19 🎉 . The 
downside is that Q19 currently has abysmal (basically it will never finish) 
because DataFusion plans it as a CROSS JOIN followed by filter. A more optimal 
plan would recognize a join predicate (so INNER JOIN) can be used, as well as 
several "single column predicates" and "single table predicates" which could be 
pushed down to the scans (aka applied prior to the joins)
   
   For reference, TPCH Q19 looks like this
   
   ```sql
   select
   sum(l_extendedprice * (1 - l_discount) ) as revenue
   from
   lineitem,
   part
   where
   (
   p_partkey = l_partkey
   and p_brand = ‘[BRAND1]’
   and p_container in ( ‘SM CASE’, ‘SM BOX’, ‘SM PACK’, ‘SM PKG’)
   and l_quantity >= [QUANTITY1] and l_quantity <= [QUANTITY1] + 10
   and p_size between 1 and 5
   and l_shipmode in (‘AIR’, ‘AIR REG’)
   and l_shipinstruct = ‘DELIVER IN PERSON’
   )
   or
   (
   p_partkey = l_partkey
   and p_brand = ‘[BRAND2]’
   and p_container in (‘MED BAG’, ‘MED BOX’, ‘MED PKG’, ‘MED PACK’)
   and l_quantity >= [QUANTITY2] and l_quantity <= [QUANTITY2] + 10
   and p_size between 1 and 10
   and l_shipmode in (‘AIR’, ‘AIR REG’)
   and l_shipinstruct = ‘DELIVER IN PERSON’
   )
   or
   (
   p_partkey = l_partkey
   and p_brand = ‘[BRAND3]’
   and p_container in ( ‘LG CASE’, ‘LG BOX’, ‘LG PACK’, ‘LG PKG’)
   and l_quantity >= [QUANTITY3] and l_quantity <= [QUANTITY3] + 10
   and p_size between 1 and 15
   and l_shipmode in (‘AIR’, ‘AIR REG’)
   and l_shipinstruct = ‘DELIVER IN PERSON’
   );
   ```
   
   Note that while the predicate is one big `OR`, it can be rewritten like:
   ```
   where
     p_partkey = l_partkey
     and l_shipmode in (‘AIR’, ‘AIR REG’)
     and l_shipinstruct = ‘DELIVER IN PERSON’
     and (
       (
         and p_brand = ‘[BRAND1]’
         and p_container in ( ‘SM CASE’, ‘SM BOX’, ‘SM PACK’, ‘SM PKG’)
         and l_quantity >= [QUANTITY1] and l_quantity <= [QUANTITY1] + 10
         and p_size between 1 and 5
       )
       or
       (
         and p_brand = ‘[BRAND2]’
         and p_container in (‘MED BAG’, ‘MED BOX’, ‘MED PKG’, ‘MED PACK’)
         and l_quantity >= [QUANTITY2] and l_quantity <= [QUANTITY2] + 10
         and p_size between 1 and 10
       )
       or
       (
         and p_brand = ‘[BRAND3]’
         and p_container in ( ‘LG CASE’, ‘LG BOX’, ‘LG PACK’, ‘LG PKG’)
         and l_quantity >= [QUANTITY3] and l_quantity <= [QUANTITY3] + 10
         and p_size between 1 and 15
       )
   )
   ```
   in which case the input cardinality into the join would be much lower. 
   
   Note there are further rewrites possible (aka introducing additional single 
table predicates like `p_size between 1 and 15` that can filter the input to 
the joins even further (although the final filter is also still needed). 
   
   **Describe the solution you'd like**
   The "classic" way to implement this is as a "predicate rewrite" pass that 
rearranges predicates for further downstream operations
   
   The goal is basically to get the predicate into a form of 
   
   `good_predicate1` AND `good_predicate2` AND ...
   
   Where `good_predicate` means the predicate has special support in the 
execution engine. 
   
   Since OR is not typically handled specially, rewrites to AND are helpful. 
Some common rewrites:
   * Rewrite 1 (needed for TPCH 19) : (p and q1) OR (p and q2) OR (p and ..) 
==>  p AND (q1 or q2)
   * Rewrite 2 (not in Q19, but useful elsewhere) : (col1 = A) OR (col1 = B) OR 
(col1 = C) ==> col1 IN (A, B, C)
   
   Which then the execution engine can treat like a single column predicate 
(push down to scan) and build a hash table for `(A, B, C)` and do fast 
filtering. 
   
   This kind of rewrite can get all sorts of fancy and sometimes needs a cost 
model (to estimate, for example, if redundantly applying a filter during scan 
and after a join is worthwhile).  It probably makes sense to implement a basic 
rewrite pass with the single table predicate extraction first, and then make it 
fancier from there
   
   
   **Describe alternatives you've considered**
   A clear and concise description of any alternative solutions or features 
you've considered.
   
   **Additional context**
   Add any other context or screenshots about the feature request here.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to