Re: [PERFORM] hash semi join caused by IN (select ...)

2011-05-18 Thread Clemens Eisserer
Hi,

Does anybody know why the planner treats = ANY(ARRAY(select ...))
differently than IN(select ...)?
Which one is preferable, when I already have a lot of joins?

Thanks, Clemens

2011/5/17 Clemens Eisserer linuxhi...@gmail.com:
 Hi,

 select  from t1 left join t2  WHERE id IN (select )

 Does it work as expected with one less join?  If so, try increasing
 join_collapse_limit ...

 That did the trick - thanks a lot. I only had to increase
 join_collapse_limit a bit and now get an almost perfect plan.
 Instead of hash-joining all the data, the planner generates
 nested-loop-joins with index only on the few rows I fetch.

 Using = ANY(array(select... )) also seems to work, I wonder which one
 works better. Does ANY(ARRAY(...)) force the optimizer to plan the
 subquery seperated from the main query?

 Thanks, Clemens


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] hash semi join caused by IN (select ...)

2011-05-18 Thread Dave Johansen
On Wed, May 18, 2011 at 1:46 AM, Clemens Eisserer linuxhi...@gmail.comwrote:

 Hi,

 Does anybody know why the planner treats = ANY(ARRAY(select ...))
 differently than IN(select ...)?
 Which one is preferable, when I already have a lot of joins?

 Thanks, Clemens

 2011/5/17 Clemens Eisserer linuxhi...@gmail.com:
  Hi,
 
  select  from t1 left join t2  WHERE id IN (select )
 
  Does it work as expected with one less join?  If so, try increasing
  join_collapse_limit ...
 
  That did the trick - thanks a lot. I only had to increase
  join_collapse_limit a bit and now get an almost perfect plan.
  Instead of hash-joining all the data, the planner generates
  nested-loop-joins with index only on the few rows I fetch.
 
  Using = ANY(array(select... )) also seems to work, I wonder which one
  works better. Does ANY(ARRAY(...)) force the optimizer to plan the
  subquery seperated from the main query?
 
  Thanks, Clemens
 

 --
 Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-performance



I'm just a user so I don't have definitive knowledge of this, but my
experience seems to indicate that the = ANY(ARRAY(SELECT ...)) does the
select and turns it into an array and then uses that in the where clause in
a manner similar to a hard coded list of values, like IN (1, 2, 3, ...). In
theory, the planner could do the same sort of things with the IN (SELECT
...) but my experience seems to indicate that in some cases it decides not
to use an index that it could.

One specific example I know of is that at least in PostgreSQL 8.3, a view
with a UNION/UNION ALL will push the = ANY(ARRAY(SELECT ...)) down into the
two sub-queries, but the IN (SELECT ...) will be applied after the UNION
ALL.

Dave


[PERFORM] LIMIT and UNION ALL

2011-05-18 Thread Dave Johansen
I am using Postgres 8.3.3 and I have a VIEW which is a UNION ALL of two
tables but when I do a select on the view using a LIMIT, it scans the entire
tables and takes significantly longer than writing out the query with the
LIMITs in the sub-queries themselves. Is there a solution to get the view to
perform like the query with the LIMIT explicitly placed in the sub-queries?

I noticed a similar question in this post
http://archives.postgresql.org/pgsql-general/2005-06/msg00488.php but I
wasn't able to find an answer.

Thanks,
Dave


Re: [PERFORM] LIMIT and UNION ALL

2011-05-18 Thread Robert Klemme
On Wed, May 18, 2011 at 5:26 PM, Dave Johansen davejohan...@gmail.com wrote:
 I am using Postgres 8.3.3 and I have a VIEW which is a UNION ALL of two
 tables but when I do a select on the view using a LIMIT, it scans the entire
 tables and takes significantly longer than writing out the query with the
 LIMITs in the sub-queries themselves. Is there a solution to get the view to
 perform like the query with the LIMIT explicitly placed in the sub-queries?

Can you show DDL and queries?

The query with the LIMIT on the subqueries and the one with the LIMIT
on the overall query are not semantically equivalent.  Since you can
have an ORDER BY before the LIMIT on the query with the limit on the
view the database must have all the rows before it can apply the
ordering and properly determine the limit.  Although it might be
possible to determine under particular circumstances that only one of
the tables needs to be queried or tables need only be queried
partially I deem that quite complex. I do not know whether Postgres
can do such optimizations but for that we would certainly need to see
the concrete example (including constraint and indexes).

Kind regards

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] LIMIT and UNION ALL

2011-05-18 Thread Pavel Stehule
Hello


2011/5/18 Dave Johansen davejohan...@gmail.com:
 I am using Postgres 8.3.3 and I have a VIEW which is a UNION ALL of two
 tables but when I do a select on the view using a LIMIT, it scans the entire
 tables and takes significantly longer than writing out the query with the
 LIMITs in the sub-queries themselves. Is there a solution to get the view to
 perform like the query with the LIMIT explicitly placed in the sub-queries?

 I noticed a similar question in this post
 http://archives.postgresql.org/pgsql-general/2005-06/msg00488.php but I
 wasn't able to find an answer.

maybe

SELECT *
FROM (SELECT * FROM tab1 LIMIT n) s1
UNION ALL
  SELECT *
FROM (SELECT * FROM tab2 LIMIT n) s2
LIMIT n

Regards

Pavel Stehule


 Thanks,
 Dave

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] hash semi join caused by IN (select ...)

2011-05-18 Thread Scott Carey


On 5/17/11 12:38 AM, Clemens Eisserer linuxhi...@gmail.com wrote:

Hi,

 select  from t1 left join t2  WHERE id IN (select )

 Does it work as expected with one less join?  If so, try increasing
 join_collapse_limit ...

That did the trick - thanks a lot. I only had to increase
join_collapse_limit a bit and now get an almost perfect plan.
Instead of hash-joining all the data, the planner generates
nested-loop-joins with index only on the few rows I fetch.

Using = ANY(array(select... )) also seems to work, I wonder which one
works better. Does ANY(ARRAY(...)) force the optimizer to plan the
subquery seperated from the main query?


I'm not sure exactly what happens with ANY(ARRAY()).

I am fairly confident that the planner simply transforms an IN(select ...)
to a join, since they are equivalent.

Because foo IN (select ...) is just a join, it counts towards
join_collapse_limit.



Thanks, Clemens

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] reducing random_page_cost from 4 to 2 to force index scan

2011-05-18 Thread Greg Smith

Jim Nasby wrote:

I think the challenge there would be how to define the scope of the hot-spot. 
Is it the last X pages? Last X serial values? Something like correlation?

Hmm... it would be interesting if we had average relation access times for each 
stats bucket on a per-column basis; that would give the planner a better idea 
of how much IO overhead there would be for a given WHERE clause


You've already given one reasonable first answer to your question here.  
If you defined a usage counter for each histogram bucket, and 
incremented that each time something from it was touched, that could 
lead to a very rough way to determine access distribution.  Compute a 
ratio of the counts in those buckets, then have an estimate of the total 
cached percentage; multiplying the two will give you an idea how much of 
that specific bucket might be in memory.  It's not perfect, and you need 
to incorporate some sort of aging method to it (probably weighted 
average based), but the basic idea could work.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance