Re: [HACKERS] SQL 'in' vs join.
Don Baccus [EMAIL PROTECTED] writes: The optimizer should do a better job on your first query, sure, but why don't you like writing joins? The join wouldn't give quite the same answers. If there are multiple rows in table2 matching a particular table1 row, then a join would give multiple copies of the table1 row, whereas the WHERE foo IN (sub-select) way would give only one copy. SELECT DISTINCT can't be used to fix this, because that would eliminate legitimate duplicates from identical table1 rows. Now that the executor understands about multiple join rules (for OUTER JOIN support), I've been thinking about inventing a new join rule that says "at most one output row per left-hand row" --- this'd be sort of the opposite of the LEFT OUTER JOIN rule, "at least one output row per left-hand row" --- and then transforming IN (sub-select) clauses that appear at the top level of WHERE into this kind of join. Won't happen for 7.1, though. Of course, we will have the query tree redesign for 7.2, right, make that unnecessary. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026
RE: [HACKERS] SQL 'in' vs join.
Now, given the two components, each with very low costs, it chooses to do a sequential scan on the table. I don't get it. Read the FAQ? http://www.postgresql.org/docs/faq-english.html#4.23 "4.23) Why are my subqueries using IN so slow?") - Andrew
Re: [HACKERS] SQL 'in' vs join.
Hannu Krosing wrote: mlw wrote: Why is a "select * from table1 where field in (select field from table2 where condition )" is so dramatically bad compared to: "select * from table1, table2 where table1.field = table2.field and condition" I can't understand why the first query isn't optimized better than the second one. The 'in' query forces a full table scan (it shouldn't) and the second one uses the indexes. Does anyone know why? Its not done yet, and probably hsomewhat difficult to do in a general fashion I know I am no SQL guru, but my gut tells me that the 'in' operator should be far more efficient than a join. Here are the actual queries: cdinfo=# explain select trackid from zsong where muzenbr in (select muzenbr from ztitles where title = 'Mulan') ; try explain select trackid from zsong where muzenbr in ( select muzenbr from ztitles where title = 'Mulan' and ztitles.muzenbr=zsong.muzenbr ); this should hint the current optimizer to do the right thing; - Hannu Nope: cdinfo=# explain cdinfo-# select trackid cdinfo-#from zsong cdinfo-# where muzenbr in ( cdinfo(# select muzenbr cdinfo(#from ztitles cdinfo(# where title = 'Mulan' cdinfo(# and ztitles.muzenbr=zsong.muzenbr cdinfo(# ); NOTICE: QUERY PLAN: Seq Scan on zsong (cost=1.00..104474515.18 rows=2193213 width=4) SubPlan - Index Scan using ztitles_pkey on ztitles (cost=0.00..4.05 rows=1 width=4) But what I also find odd is, look at the components: cdinfo=# explain select muzenbr from ztitles where title = 'Mulan' ; NOTICE: QUERY PLAN: Index Scan using ztitles_title_ndx on ztitles (cost=0.00..7.08 rows=1 width=4) cdinfo=# explain select trackid from zsong where muzenbr in ( 1,2,3,4,5) ; NOTICE: QUERY PLAN: Index Scan using zsong_muzenbr_ndx, zsong_muzenbr_ndx, zsong_muzenbr_ndx, zsong_muzenbr_ndx, zsong_muzenbr_ndx on zsong (cost=0.00..392.66 rows=102 width=4) Now, given the two components, each with very low costs, it chooses to do a sequential scan on the table. I don't get it. I have have been having no end of problems with Postgres' optimizer. It just seems to be brain dead at times. It is a huge point of frustration to me. I am tied to postgres in my current project, and I fear that I will not be able to implement certain features because of this sort of behavior. -- http://www.mohawksoft.com
Re: [HACKERS] SQL 'in' vs join.
mlw wrote: Hannu Krosing wrote: mlw wrote: Why is a "select * from table1 where field in (select field from table2 where condition )" is so dramatically bad compared to: "select * from table1, table2 where table1.field = table2.field and condition" I can't understand why the first query isn't optimized better than the second one. The 'in' query forces a full table scan (it shouldn't) and the second one uses the indexes. Does anyone know why? Its not done yet, and probably hsomewhat difficult to do in a general fashion I know I am no SQL guru, but my gut tells me that the 'in' operator should be far more efficient than a join. Here are the actual queries: cdinfo=# explain select trackid from zsong where muzenbr in (select muzenbr from ztitles where title = 'Mulan') ; try explain select trackid from zsong where muzenbr in ( select muzenbr from ztitles where title = 'Mulan' and ztitles.muzenbr=zsong.muzenbr ); this should hint the current optimizer to do the right thing; - Hannu did you have indexes on both ztitles.muzenbr and zsong.muzenbr ? -- Hannu
Re: [HACKERS] SQL 'in' vs join.
At 08:37 AM 11/30/00 -0500, mlw wrote: mlw wrote: Why is a "select * from table1 where field in (select field from table2 where condition )" is so dramatically bad compared to: "select * from table1, table2 where table1.field = table2.field and condition" Now, given the two components, each with very low costs, it chooses to do a sequential scan on the table. I don't get it. I have have been having no end of problems with Postgres' optimizer. It just seems to be brain dead at times. It is a huge point of frustration to me. I am tied to postgres in my current project, and I fear that I will not be able to implement certain features because of this sort of behavior. But but but ... Not only is the join faster, but it is more readable and cleaner SQL as well. I would never write the query in its first form. I'd change the second one slightly to "select table1.* from ...", though, since those are apparently the only fields you want. The optimizer should do a better job on your first query, sure, but why don't you like writing joins? - Don Baccus, Portland OR [EMAIL PROTECTED] Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Service and other goodies at http://donb.photo.net.
Re: [HACKERS] SQL 'in' vs join.
Don Baccus [EMAIL PROTECTED] writes: The optimizer should do a better job on your first query, sure, but why don't you like writing joins? The join wouldn't give quite the same answers. If there are multiple rows in table2 matching a particular table1 row, then a join would give multiple copies of the table1 row, whereas the WHERE foo IN (sub-select) way would give only one copy. SELECT DISTINCT can't be used to fix this, because that would eliminate legitimate duplicates from identical table1 rows. Now that the executor understands about multiple join rules (for OUTER JOIN support), I've been thinking about inventing a new join rule that says "at most one output row per left-hand row" --- this'd be sort of the opposite of the LEFT OUTER JOIN rule, "at least one output row per left-hand row" --- and then transforming IN (sub-select) clauses that appear at the top level of WHERE into this kind of join. Won't happen for 7.1, though. regards, tom lane
Re: [HACKERS] SQL 'in' vs join.
At 10:52 AM 11/30/00 -0500, Tom Lane wrote: Don Baccus [EMAIL PROTECTED] writes: The optimizer should do a better job on your first query, sure, but why don't you like writing joins? The join wouldn't give quite the same answers. If there are multiple rows in table2 matching a particular table1 row, then a join would give multiple copies of the table1 row, whereas the WHERE foo IN (sub-select) way would give only one copy. SELECT DISTINCT can't be used to fix this, because that would eliminate legitimate duplicates from identical table1 rows. Hmmm...I was presuming that "field" was a primary key of table1, so such duplicates wouldn't exist (and SELECT DISTINCT would weed out duplicates from table2 if "field" isn't a primary key of table2, i.e. if table2 has a many-to-one relationship to table1). For many-to-many relationships yes, you're right, the "in" version returns a different result. Now that the executor understands about multiple join rules (for OUTER JOIN support), I've been thinking about inventing a new join rule that says "at most one output row per left-hand row" --- this'd be sort of the opposite of the LEFT OUTER JOIN rule, "at least one output row per left-hand row" --- and then transforming IN (sub-select) clauses that appear at the top level of WHERE into this kind of join. Won't happen for 7.1, though. Same trick could be used for some classes of queries which do a SELECT DISTINCT on the results of a join, too ... - Don Baccus, Portland OR [EMAIL PROTECTED] Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Service and other goodies at http://donb.photo.net.