Re: NOT IN subquery optimization

Li, Zheng Tue, 24 Mar 2020 15:33:25 -0700

Hi Tom,

Thanks for the feedback.
    
    * I find it entirely unacceptable to stick some planner temporary
    fields into struct SubLink.  If you need that storage you'll have
    to find some other place to put it.  But in point of fact I don't
    think you need it; it doesn't look to me to be critical to generate
    the subquery's plan any earlier than make_subplan would have done it.
    Moreover, you should really strive to *not* do that, because it's
    likely to get in the way of other future optimizations.  As the
    existing comment in make_subplan already suggests, we might want to
    delay subplan planning even further than that in future.


  The reason for calling make_subplan this early is that we want to
Call subplan_is_hashable(plan), to decide whether to proceed with the proposed
transformation. We try to stick with the fast hashed subplan when possible to 
avoid
potential performance degradation from the transformation which may restrict the
planner to choose Nested Loop Anti Join in order to handle Null properly,
the following is an example from subselect.out:
explain (costs false) select * from s where n not in (select u from l);
                  QUERY PLAN
-----------------------------------------------
 Nested Loop Anti Join
   InitPlan 1 (returns $0)
     ->  Seq Scan on l l_1
   ->  Seq Scan on s
         Filter: ((n IS NOT NULL) OR (NOT $0))
   ->  Index Only Scan using l_u on l
         Index Cond: (u = s.n)

However, if the subplan is not hashable, the above Nested Loop Anti Join is
actually faster.
    
    * I'm also not too happy with the (undocumented) rearrangement of
    reduce_outer_joins.  There's a specific sequence of processing that
    that's involved in, as documented at the top of prepjointree.c, and
    I doubt that you can just randomly call it from other places and expect
    good results.  In particular, since JOIN alias var flattening won't have
    happened yet when this code is being run from pull_up_sublinks, it's
    unlikely that reduce_outer_joins will reliably get the same answers it
    would get normally.  I also wonder whether it's safe to make the
    parsetree changes it makes earlier than normal, and whether it will be
    problematic to run it twice on the same tree, and whether its rather
    indirect connection to distribute_qual_to_rels is going to misbehave.

  The rearrangement of  reduce_outer_joins was to make the null test function
is_node_nonnullable() more accurate. Later we added strict predicates logic in
is_node_nonnullable(), so I think we can get rid of the rearrangement of
reduce_outer_joins now without losing accuracy.
    
    * The proposed test additions seem to about triple the runtime of
    subselect.sql.  This seems excessive.  I also wonder why it's necessary
    for this test to build its own large tables; couldn't it re-use ones
    that already exist in the regression database?

  I added a lot of test cases. But yes, I can reuse the existing large table if
there is one that doesn't fit in 64KB work_mem.
    
    * Not really sure that we need a new planner GUC for this, but if we
    do, it needs to be documented.

  The new GUC is just in case if anything goes wrong, the user can
easily turn it off.

Regards,
Zheng

Re: NOT IN subquery optimization

Reply via email to