alamb opened a new issue #217: URL: https://github.com/apache/arrow-datafusion/issues/217
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.** As @Dandandan and I were discussing on https://github.com/apache/arrow-datafusion/issues/78#issuecomment-827528710 The good news is that after #78, DataFusion can run TPCH Q19 🎉 . The downside is that Q19 currently has abysmal (basically it will never finish) because DataFusion plans it as a CROSS JOIN followed by filter. A more optimal plan would recognize a join predicate (so INNER JOIN) can be used, as well as several "single column predicates" and "single table predicates" which could be pushed down to the scans (aka applied prior to the joins) For reference, TPCH Q19 looks like this ```sql select sum(l_extendedprice * (1 - l_discount) ) as revenue from lineitem, part where ( p_partkey = l_partkey and p_brand = ‘[BRAND1]’ and p_container in ( ‘SM CASE’, ‘SM BOX’, ‘SM PACK’, ‘SM PKG’) and l_quantity >= [QUANTITY1] and l_quantity <= [QUANTITY1] + 10 and p_size between 1 and 5 and l_shipmode in (‘AIR’, ‘AIR REG’) and l_shipinstruct = ‘DELIVER IN PERSON’ ) or ( p_partkey = l_partkey and p_brand = ‘[BRAND2]’ and p_container in (‘MED BAG’, ‘MED BOX’, ‘MED PKG’, ‘MED PACK’) and l_quantity >= [QUANTITY2] and l_quantity <= [QUANTITY2] + 10 and p_size between 1 and 10 and l_shipmode in (‘AIR’, ‘AIR REG’) and l_shipinstruct = ‘DELIVER IN PERSON’ ) or ( p_partkey = l_partkey and p_brand = ‘[BRAND3]’ and p_container in ( ‘LG CASE’, ‘LG BOX’, ‘LG PACK’, ‘LG PKG’) and l_quantity >= [QUANTITY3] and l_quantity <= [QUANTITY3] + 10 and p_size between 1 and 15 and l_shipmode in (‘AIR’, ‘AIR REG’) and l_shipinstruct = ‘DELIVER IN PERSON’ ); ``` Note that while the predicate is one big `OR`, it can be rewritten like: ``` where p_partkey = l_partkey and l_shipmode in (‘AIR’, ‘AIR REG’) and l_shipinstruct = ‘DELIVER IN PERSON’ and ( ( and p_brand = ‘[BRAND1]’ and p_container in ( ‘SM CASE’, ‘SM BOX’, ‘SM PACK’, ‘SM PKG’) and l_quantity >= [QUANTITY1] and l_quantity <= [QUANTITY1] + 10 and p_size between 1 and 5 ) or ( and p_brand = ‘[BRAND2]’ and p_container in (‘MED BAG’, ‘MED BOX’, ‘MED PKG’, ‘MED PACK’) and l_quantity >= [QUANTITY2] and l_quantity <= [QUANTITY2] + 10 and p_size between 1 and 10 ) or ( and p_brand = ‘[BRAND3]’ and p_container in ( ‘LG CASE’, ‘LG BOX’, ‘LG PACK’, ‘LG PKG’) and l_quantity >= [QUANTITY3] and l_quantity <= [QUANTITY3] + 10 and p_size between 1 and 15 ) ) ``` in which case the input cardinality into the join would be much lower. Note there are further rewrites possible (aka introducing additional single table predicates like `p_size between 1 and 15` that can filter the input to the joins even further (although the final filter is also still needed). **Describe the solution you'd like** The "classic" way to implement this is as a "predicate rewrite" pass that rearranges predicates for further downstream operations The goal is basically to get the predicate into a form of `good_predicate1` AND `good_predicate2` AND ... Where `good_predicate` means the predicate has special support in the execution engine. Since OR is not typically handled specially, rewrites to AND are helpful. Some common rewrites: * Rewrite 1 (needed for TPCH 19) : (p and q1) OR (p and q2) OR (p and ..) ==> p AND (q1 or q2) * Rewrite 2 (not in Q19, but useful elsewhere) : (col1 = A) OR (col1 = B) OR (col1 = C) ==> col1 IN (A, B, C) Which then the execution engine can treat like a single column predicate (push down to scan) and build a hash table for `(A, B, C)` and do fast filtering. This kind of rewrite can get all sorts of fancy and sometimes needs a cost model (to estimate, for example, if redundantly applying a filter during scan and after a join is worthwhile). It probably makes sense to implement a basic rewrite pass with the single table predicate extraction first, and then make it fancier from there **Describe alternatives you've considered** A clear and concise description of any alternative solutions or features you've considered. **Additional context** Add any other context or screenshots about the feature request here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
