Re: [PERFORM] Unexpected sequential scan on an indexed column

2009-11-16 Thread Eddy Escardo-Raffo
OK, I think that after reading this
dochttp://www.postgresql.org/files/developer/optimizer.pdf (which
I hadn't encountered before) about the optimizer, something clicked in my
brain and I think I can answer my own question. I was basically thinking
from my own perspective rather than from the query planner's perspective:
- From my perspective I know that the subselect will return very few values,
so naively I expected that the planner would be able to do a bitmap index
scan with the small set of values returned, without needing to do a join
(such as the nested loop join it ended up choosing).
- However (and this is probably obvious to all of you), the query
planner doesn't really know for a fact that a sub-select will result in a
small number of rows, so it guesses based on its statistics what the best
kind of join would be. A 'bitmap index scan' is not one of the choices for a
join, I'm guessing because a 'nested loop join with inner index scan' is a
more generally applicable strategy that can get the same order of magnitude
of performance in restriction cases that end up being as simple as an IN
(list) restriction. However, there are more competing possibilities for
picking an appropriate join strategy than for picking a strategy to apply an
IN (list) restriction, so the planner may not pick the 'nested loop join
with inner index scan' if the ANALYZE statistics don't guide it that way,
even if that would be the best strategy in the end.
I guess the only way I can think of to make a generic planner that would
have performend well even in the lopsided statistics case is to create some
plan nodes with contingency conditions. E.g.:

Plan: Nested loop join with sequential scan
Assumption: all table values are the same
Contingency plan: nested loop join with index scan

Then, if the assumption for the plan is violated early enough while
executing the plan, the query executor would abort that plan node execution
and start over with the contingency plan.

I guess implementing this kind of system in a generic way could get pretty
hairy, and given my limited experience I don't know if the proportion of
query plans that would be improved by having these kinds of contingency
plans is significant enough to warrant the cost of developing this system,
but I'm gathering that most query planners (including the postgres planner)
don't do this kind of contingency planning :)

Thanks!
Eddy
On Sun, Nov 15, 2009 at 5:46 PM, Eddy Escardo-Raffo eesca...@kikini.comwrote:

 I was using VALUES in my examples to more closely mirror the results of a
 sub-select (I abstracted everything else away and noticed that even just
 using VALUES syntax instead of a sub-select, the performance was bad). The
 full statement I had that led me into this more narrow investigation in the
 first place looks more like:

 explain analyze SELECT u.userid FROM users u, (SELECT locid FROM locations
 WHERE ...) l WHERE u.location = l.locid LIMIT 10;

 Based on the investigation so far, it seems like this kind of statement
 will perform well when the users.location distribution is not overwhelmingly
 lopsided, but not otherwise. However, using the IN (list) notation with a
 list of integer literals seems to perform well no matter what is the
 specific distribution of values in the users.location column.

 I would like to understand why this is so, to help me write better queries
 in the future.

 Thanks,
 Eddy
   On Sun, Nov 15, 2009 at 5:23 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Eddy Escardo-Raffo eesca...@kikini.com writes:
  For C, the planner estimated 10 thousand rows. For D, the planner
 estimated
  100 thousand rows, yet for E the planner estimated only 1 row, which is
 the
  closest to reality. So, is there any way to specify a query that has a
  SUB-SELECT that returns a small set of values so that the planner treats
 it
  similar to how it treats statement E, or does statement E get its
 additional
  edge precisely from the fact that the restriction is defined by integer
  literals?

 Currently there is no attempt to look at the exact contents of a VALUES
 construct for planning purposes.  For the examples you're showing it
 seems like the IN (list) notation is more compact and more widely used,
 so improving the VALUES alternative doesn't seem that exciting.

regards, tom lane





Re: [PERFORM] Unexpected sequential scan on an indexed column

2009-11-16 Thread Dave Crooke
Hi Eddy

Perhaps a slightly naive suggestion  have you considered
converting the query to a small stored procedure ('function' in
Postgres speak)? You can pull the location values, and then iterate
over a query like this:

select userid from users where location=:x

which is more-or-less guaranteed to use the index.


I had a somewhat similar situation recently, where I was passing in a
list of id's (from outwith Postgres) and it would on occasion avoid
the index in favour of a full table scan  I changed this to
iterate over the id's with separate queries (in Java, but using a
function will achieve the same thing) and went from one 5 minute query
doing full table scan to a handful of queries doing sub-millisecond
direct index lookups.

Cheers
Dave

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Unexpected sequential scan on an indexed column

2009-11-16 Thread Eddy Escardo-Raffo
Yeah this kind of thing would probably work. Doing this in java with
separate queries would be easy to code but require multiple round trips.
Doing it as a stored procedure would be nicer but I'd have to think a little
more about how to refactor the java code around the query to make this
happen. Thanks for the suggestion.

Eddy

On Mon, Nov 16, 2009 at 9:44 AM, Dave Crooke dcro...@gmail.com wrote:

 Hi Eddy

 Perhaps a slightly naive suggestion  have you considered
 converting the query to a small stored procedure ('function' in
 Postgres speak)? You can pull the location values, and then iterate
 over a query like this:

 select userid from users where location=:x

 which is more-or-less guaranteed to use the index.


 I had a somewhat similar situation recently, where I was passing in a
 list of id's (from outwith Postgres) and it would on occasion avoid
 the index in favour of a full table scan  I changed this to
 iterate over the id's with separate queries (in Java, but using a
 function will achieve the same thing) and went from one 5 minute query
 doing full table scan to a handful of queries doing sub-millisecond
 direct index lookups.

 Cheers
 Dave



Re: [PERFORM] Unexpected sequential scan on an indexed column

2009-11-16 Thread Kenneth Marshall
On Mon, Nov 16, 2009 at 12:45:46PM -0800, Eddy Escardo-Raffo wrote:
 Yeah this kind of thing would probably work. Doing this in java with
 separate queries would be easy to code but require multiple round trips.
 Doing it as a stored procedure would be nicer but I'd have to think a little
 more about how to refactor the java code around the query to make this
 happen. Thanks for the suggestion.
 
 Eddy
 

Hi Eddy,

Here is a lookup wrapper that is used in DSPAM to work around
a similar problem. Maybe you can use it as a template for your
function:

create function lookup_tokens(integer,bigint[])
  returns setof dspam_token_data
  language plpgsql stable
  as '
declare
  v_rec record;
begin
  for v_rec in select * from dspam_token_data
where uid=$1
  and token in (select $2[i]
from generate_series(array_lower($2,1),array_upper($2,1)) s(i))
  loop
return next v_rec;
  end loop;
  return;
end;';

Regards,
Ken

 On Mon, Nov 16, 2009 at 9:44 AM, Dave Crooke dcro...@gmail.com wrote:
 
  Hi Eddy
 
  Perhaps a slightly naive suggestion  have you considered
  converting the query to a small stored procedure ('function' in
  Postgres speak)? You can pull the location values, and then iterate
  over a query like this:
 
  select userid from users where location=:x
 
  which is more-or-less guaranteed to use the index.
 
 
  I had a somewhat similar situation recently, where I was passing in a
  list of id's (from outwith Postgres) and it would on occasion avoid
  the index in favour of a full table scan  I changed this to
  iterate over the id's with separate queries (in Java, but using a
  function will achieve the same thing) and went from one 5 minute query
  doing full table scan to a handful of queries doing sub-millisecond
  direct index lookups.
 
  Cheers
  Dave
 

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Unexpected sequential scan on an indexed column

2009-11-16 Thread Eddy Escardo-Raffo
This is incredibly helpful, Kenneth. I didn't know about the SETOF syntax at
all. This could help minimize the amount of refactoring I need to do.

Thanks!
Eddy

On Mon, Nov 16, 2009 at 12:55 PM, Kenneth Marshall k...@rice.edu wrote:

 On Mon, Nov 16, 2009 at 12:45:46PM -0800, Eddy Escardo-Raffo wrote:
  Yeah this kind of thing would probably work. Doing this in java with
  separate queries would be easy to code but require multiple round trips.
  Doing it as a stored procedure would be nicer but I'd have to think a
 little
  more about how to refactor the java code around the query to make this
  happen. Thanks for the suggestion.
 
  Eddy
 

 Hi Eddy,

 Here is a lookup wrapper that is used in DSPAM to work around
 a similar problem. Maybe you can use it as a template for your
 function:

 create function lookup_tokens(integer,bigint[])
  returns setof dspam_token_data
  language plpgsql stable
  as '
 declare
  v_rec record;
 begin
  for v_rec in select * from dspam_token_data
where uid=$1
  and token in (select $2[i]
from generate_series(array_lower($2,1),array_upper($2,1)) s(i))
  loop
return next v_rec;
  end loop;
  return;
 end;';

 Regards,
 Ken

  On Mon, Nov 16, 2009 at 9:44 AM, Dave Crooke dcro...@gmail.com wrote:
 
   Hi Eddy
  
   Perhaps a slightly naive suggestion  have you considered
   converting the query to a small stored procedure ('function' in
   Postgres speak)? You can pull the location values, and then iterate
   over a query like this:
  
   select userid from users where location=:x
  
   which is more-or-less guaranteed to use the index.
  
  
   I had a somewhat similar situation recently, where I was passing in a
   list of id's (from outwith Postgres) and it would on occasion avoid
   the index in favour of a full table scan  I changed this to
   iterate over the id's with separate queries (in Java, but using a
   function will achieve the same thing) and went from one 5 minute query
   doing full table scan to a handful of queries doing sub-millisecond
   direct index lookups.
  
   Cheers
   Dave
  



Re: [PERFORM] Unexpected sequential scan on an indexed column

2009-11-16 Thread Dave Crooke
With Postgres, you can transparently replace a regular select with a
function that takes the same types and returns a record iterator with the
same columns. The only change needed is the SQL used to invoke it, you won't
need any logic changes in your app code (Java or whatever), e.g.

*select  where x=:x ..(select .. where . y=:y)
*
Becomes

*select myfunction(:x, :y)
*
On Mon, Nov 16, 2009 at 2:45 PM, Eddy Escardo-Raffo eesca...@kikini.comwrote:

 Yeah this kind of thing would probably work. Doing this in java with
 separate queries would be easy to code but require multiple round trips.
 Doing it as a stored procedure would be nicer but I'd have to think a little
 more about how to refactor the java code around the query to make this
 happen. Thanks for the suggestion.

 Eddy

 On Mon, Nov 16, 2009 at 9:44 AM, Dave Crooke dcro...@gmail.com wrote:

 Hi Eddy

 Perhaps a slightly naive suggestion  have you considered
 converting the query to a small stored procedure ('function' in
 Postgres speak)? You can pull the location values, and then iterate
 over a query like this:

 select userid from users where location=:x

 which is more-or-less guaranteed to use the index.


 I had a somewhat similar situation recently, where I was passing in a
 list of id's (from outwith Postgres) and it would on occasion avoid
 the index in favour of a full table scan  I changed this to
 iterate over the id's with separate queries (in Java, but using a
 function will achieve the same thing) and went from one 5 minute query
 doing full table scan to a handful of queries doing sub-millisecond
 direct index lookups.

 Cheers
 Dave





Re: [PERFORM] Unexpected sequential scan on an indexed column

2009-11-16 Thread Eddy Escardo-Raffo
Thanks, Dave.
Eddy

On Mon, Nov 16, 2009 at 1:52 PM, Dave Crooke dcro...@gmail.com wrote:

 With Postgres, you can transparently replace a regular select with a
 function that takes the same types and returns a record iterator with the
 same columns. The only change needed is the SQL used to invoke it, you won't
 need any logic changes in your app code (Java or whatever), e.g.

 *select  where x=:x ..(select .. where . y=:y)
 *
 Becomes

 *select myfunction(:x, :y)
 *

 On Mon, Nov 16, 2009 at 2:45 PM, Eddy Escardo-Raffo 
 eesca...@kikini.comwrote:

 Yeah this kind of thing would probably work. Doing this in java with
 separate queries would be easy to code but require multiple round trips.
 Doing it as a stored procedure would be nicer but I'd have to think a little
 more about how to refactor the java code around the query to make this
 happen. Thanks for the suggestion.

 Eddy

   On Mon, Nov 16, 2009 at 9:44 AM, Dave Crooke dcro...@gmail.com wrote:

 Hi Eddy

 Perhaps a slightly naive suggestion  have you considered
 converting the query to a small stored procedure ('function' in
 Postgres speak)? You can pull the location values, and then iterate
 over a query like this:

 select userid from users where location=:x

 which is more-or-less guaranteed to use the index.


 I had a somewhat similar situation recently, where I was passing in a
 list of id's (from outwith Postgres) and it would on occasion avoid
 the index in favour of a full table scan  I changed this to
 iterate over the id's with separate queries (in Java, but using a
 function will achieve the same thing) and went from one 5 minute query
 doing full table scan to a handful of queries doing sub-millisecond
 direct index lookups.

 Cheers
 Dave






Re: [PERFORM] Unexpected sequential scan on an indexed column

2009-11-15 Thread Tom Lane
Eddy Escardo-Raffo eesca...@kikini.com writes:
 Do you guys have any idea why this is not working as I expect?

Datatype issue maybe?  When I try what seems to be the same case here
I get the expected indexscan, so I'm thinking the problem is that
the comparison isn't indexable, which is a possibility if the location
column isn't actually integer.

The fact that it's estimating 100 rows out is also extremely
suspicious --- it might or might not get the exact 2 estimate,
but I'd sure expect it to know that the majority of rows don't match.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Unexpected sequential scan on an indexed column

2009-11-15 Thread Tom Lane
Eddy Escardo-Raffo eesca...@kikini.com writes:
 The table used in this query is called users, and it has columns userid
 (primary key) and location.
 The location column is indexed.
 The users table has 1 million rows, and all rows have integer typed value
 '-1' for  location column, except for 2 rows that have the integer value
 '76543'.

Oh, after poking at it a bit more, I realize the problem: the planner
doesn't want to use an indexscan because it assumes there's a
significant probability that the search will be for -1 (in which case
the indexscan would be slower than a seqscan, as indeed your results
prove).  Even though it could know in this particular case that the
comparison value isn't -1, I doubt that teaching it that would help your
real queries where it will probably be impossible to determine the
comparison values in advance.

I would suggest considering using NULL rather than inventing a dummy
value for unknown locations.  The estimation heuristics will play a
lot nicer with that choice.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Unexpected sequential scan on an indexed column

2009-11-15 Thread Eddy Escardo-Raffo
Thanks, Tom. I had discarded the possibility of data type mismatch already,
which was your first guess, but was wondering if the lopsided distribution
of location values would lead the planner to make a decision that is good on
average but bad for this particular query, as you point out in your second
guess.

I'll try populating the test users with a more evenly distributed location
field, which will be more realistic anyway, and see if that works out
better.

BTW, the -1 is not really a dummy value, but it's just a value that we have
been using in tests for fake test location ID. I just started performance
measurement for my application and so far had measured performance with
every user being in the same default location and things seemed to be going
well, so I tried to switch a couple users to a different location and see
what happened, and that made performance drop significantly.
(even more detail: my queries also limit results to 10 approx, so DB quickly
found 10 rows that match location -1, but it took a while to discover there
weren't more than 2 rows with the other value).

Thanks!
Eddy

On Sun, Nov 15, 2009 at 3:33 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Eddy Escardo-Raffo eesca...@kikini.com writes:
  The table used in this query is called users, and it has columns
 userid
  (primary key) and location.
  The location column is indexed.
  The users table has 1 million rows, and all rows have integer typed value
  '-1' for  location column, except for 2 rows that have the integer
 value
  '76543'.

 Oh, after poking at it a bit more, I realize the problem: the planner
 doesn't want to use an indexscan because it assumes there's a
 significant probability that the search will be for -1 (in which case
 the indexscan would be slower than a seqscan, as indeed your results
 prove).  Even though it could know in this particular case that the
 comparison value isn't -1, I doubt that teaching it that would help your
 real queries where it will probably be impossible to determine the
 comparison values in advance.

 I would suggest considering using NULL rather than inventing a dummy
 value for unknown locations.  The estimation heuristics will play a
 lot nicer with that choice.

regards, tom lane



Re: [PERFORM] Unexpected sequential scan on an indexed column

2009-11-15 Thread Tom Lane
Eddy Escardo-Raffo eesca...@kikini.com writes:
 For C, the planner estimated 10 thousand rows. For D, the planner estimated
 100 thousand rows, yet for E the planner estimated only 1 row, which is the
 closest to reality. So, is there any way to specify a query that has a
 SUB-SELECT that returns a small set of values so that the planner treats it
 similar to how it treats statement E, or does statement E get its additional
 edge precisely from the fact that the restriction is defined by integer
 literals?

Currently there is no attempt to look at the exact contents of a VALUES
construct for planning purposes.  For the examples you're showing it
seems like the IN (list) notation is more compact and more widely used,
so improving the VALUES alternative doesn't seem that exciting.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance