My apologies, I'm sure this question has been asked before but I couldn't
find anything on the list that meant anything to me.

We have a table "contexts" with 1.6 million rows, and a table "articles"
with 1.4 million rows, where an "article" is a particular kind of
"context".  We want to select from a join on those two tables like this

    SELECT COUNT(*)
    FROM contexts
    JOIN articles ON (articles.context_key=contexts.context_key)
    WHERE contexts.context_key IN (...);
    /* and some combination of columns from articles and contexts */

If "IN(...)" is a query, then this guy does a seq scan on the contexts
table, even if the subquery is "select col_a from kgtest" where kgtest has
one row.  If however I read the ids beforehand and write them into the
query, a la "IN (111,222,333...)", then the everything is happy, up to at
least 20,000 values written into the sql, at which point smaller machines
will take 2-5 minutes to parse the query.

I can certainly write the ids inline into the SQL, but when I do that I get
the distinct impression that I'm Doing It Wrong.  Is this expected
behavior?  It seems surprising to me.


To demonstrate:

    /* nothing up my sleeve */
    # select * from kgtest;
      cola
    ---------
     1652729
    (1 row)

    /* inline, good query plan */
# explain (analyze, buffers) select count(*) from contexts JOIN articles ON
(articles.context_key=contexts.context_key) where contexts.context_key in
(1652729);
                                                             QUERY
PLAN
----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3.82..3.83 rows=1 width=0) (actual time=0.188..0.189
rows=1 loops=1)
   Buffers: shared hit=7
   ->  Nested Loop  (cost=0.00..3.81 rows=1 width=0) (actual
time=0.181..0.181 rows=0 loops=1)
         Buffers: shared hit=7
         ->  Index Scan using contexts_pkey on contexts  (cost=0.00..1.90
rows=1 width=4) (actual time=0.109..0.112 ro
               Index Cond: (context_key = 1652729)
               Buffers: shared hit=4
         ->  Index Scan using articles_pkey on articles  (cost=0.00..1.90
rows=1 width=4) (actual time=0.060..0.060 ro
               Index Cond: (articles.context_key = 1652729)
               Buffers: shared hit=3
 Total runtime: 0.324 ms
(11 rows)

  /* subselect, query plan does seq scan on contexts */


# explain (analyze, buffers) select count(*)from contexts JOIN articles ON
(articles.context_key=contexts.context_key) where contexts.context_key in
(select cola from kgtest);
                                                                   QUERY
PLAN
----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=118505.72..118505.73 rows=1 width=0) (actual
time=0.274..0.275 rows=1 loops=1)
   Buffers: shared hit=5
   ->  Hash Join  (cost=12512.61..116661.91 rows=737524 width=0) (actual
time=0.269..0.269 rows=0 loops=1)
         Hash Cond: (contexts.context_key = articles.context_key)
         Buffers: shared hit=5
         ->  Seq Scan on contexts  (cost=0.00..64533.03 rows=1648203
width=4) (actual time=0.009..0.009 rows=1 loops=1
               Buffers: shared hit=1
         ->  Hash  (cost=412.56..412.56 rows=737524 width=8) (actual
time=0.110..0.110 rows=0 loops=1)
               Buckets: 4096  Batches: 32  Memory Usage: 0kB
               Buffers: shared hit=4
               ->  Nested Loop  (cost=40.00..412.56 rows=737524 width=8)
(actual time=0.107..0.107 rows=0 loops=1)
                     Buffers: shared hit=4
                     ->  HashAggregate  (cost=40.00..42.00 rows=200
width=4) (actual time=0.069..0.071 rows=1 loops=1)
                           Buffers: shared hit=1
                           ->  Seq Scan on kgtest  (cost=0.00..34.00
rows=2400 width=4) (actual time=0.048..0.050 rows
                                 Buffers: shared hit=1
                     ->  Index Scan using articles_pkey on articles
(cost=0.00..1.84 rows=1 width=4) (actual time=0.0
                           Index Cond: (articles.context_key = kgtest.cola)
                           Buffers: shared hit=3
 Total runtime: 0.442 ms

-- 
Kevin M. Goess
Software Engineer
Berkeley Electronic Press
kgo...@bepress.com

510-665-1200 x179
www.bepress.com

bepress: sustainable scholarly publishing

Reply via email to