Re: [GENERAL] Help with sql

Steve Crawford Fri, 06 Jul 2012 17:57:38 -0700

On 07/06/2012 02:34 PM, Perry Smith wrote:

Hi Guys,


This isn't a PostgreSQL specific question but just a SQL question.  If this is 
not an appropriate question for this list, please let me know.

It is also, perhaps, a really silly question.

This query (without the 'explain' keyword) , when executed takes forever and a 
day:

condor_development=> explain select id from filesets where id not in ( select 
fileset_id from service_pack_fileset_maps );
                                           QUERY PLAN
----------------------------------------------------------------------------------------------
  Seq Scan on filesets  (cost=0.00..71937742.00 rows=26088 width=4)
    Filter: (NOT (SubPlan 1))
    SubPlan 1
      ->  Materialize  (cost=0.00..2517.78 rows=95852 width=4)
            ->  Seq Scan on service_pack_fileset_maps  (cost=0.00..1663.52 
rows=95852 width=4)
(5 rows)

This query returns within a second:

condor_development=> explain select id from filesets where id not in ( select 
distinct fileset_id from service_pack_fileset_maps );
                                           QUERY PLAN
----------------------------------------------------------------------------------------------
  Seq Scan on filesets  (cost=2102.31..3153.53 rows=26088 width=4)
    Filter: (NOT (hashed SubPlan 1))
    SubPlan 1
      ->  HashAggregate  (cost=1903.15..2062.48 rows=15933 width=4)
            ->  Seq Scan on service_pack_fileset_maps  (cost=0.00..1663.52 
rows=95852 width=4)
(5 rows)

The difference is the "distinct" keyword in the inner select.

What I'm confused about is why isn't the "distinct" implicit?  I thought the construct "blah 
in ( select ... )" was using "sets" and an item (I thought) can not be in a set more than once.

Perhaps my question is the opposite really?  Why would you not always use "distinct" in the inner 
select when the operator is "in" or "not in" ?

And if I can throw in another question on top: is there a different method other than 
"not in" that would work better?

Actually it is *very* PostgreSQL specific. In fact, it may even bePostgreSQL *version* specific as you are delving into how the plannerdecides how to handle a query.

It appears that the planner is assuming, based on collected stats andavailable indexes, that there will be roughly 1/6 the records returnedby the "distinct" query and thus chose a different method to join therecords. One useful piece of information would be the indexes on the twotables.


As to other methods, you can use:

... where not exists (select 1 from service_pack_fileset_maps wherefileset_id = filesets.id)...(Note: as alluded to above, ...not in... works better in some releasesand ...not exists... better in others due to improvements over time.)


Still another method:

select id from filesets except select fileset_id fromservice_pack_fileset_maps;


Cheers,
Steve


--
Sent via pgsql-general mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Help with sql

Reply via email to