Re: [PERFORM] Unexpected sequential scan on an indexed column
OK, I think that after reading this doc about the optimizer, http://www.postgresql.org/files/developer/optimizer.pdf (which I hadn't encountered before), something clicked in my brain and I think I can answer my own question. I was basically thinking from my own perspective rather than from the query planner's perspective:

- From my perspective, I know that the sub-select will return very few values, so naively I expected that the planner would be able to do a bitmap index scan with the small set of values returned, without needing to do a join (such as the nested loop join it ended up choosing).

- However (and this is probably obvious to all of you), the query planner doesn't know for a fact that a sub-select will return a small number of rows, so it guesses the best kind of join based on its statistics. A 'bitmap index scan' is not one of the choices for a join, I'm guessing because a 'nested loop join with inner index scan' is a more generally applicable strategy that can reach the same order of magnitude of performance in restriction cases that end up being as simple as an IN (list) restriction. However, there are more competing possibilities when picking a join strategy than when picking a strategy to apply an IN (list) restriction, so the planner may not pick the 'nested loop join with inner index scan' if the ANALYZE statistics don't guide it that way, even when that would be the best strategy in the end.

The only way I can think of to make a generic planner that would have performed well even in the lopsided-statistics case is to create plan nodes with contingency conditions. E.g.:

Plan: nested loop join with sequential scan
Assumption: all table values are the same
Contingency plan: nested loop join with index scan

Then, if the assumption for the plan is violated early enough while executing the plan, the query executor would abort that plan node's execution and start over with the contingency plan.
I guess implementing this kind of system in a generic way could get pretty hairy, and given my limited experience I don't know whether the proportion of query plans that would be improved by these kinds of contingency plans is large enough to warrant the cost of developing the system, but I gather that most query planners (including the Postgres planner) don't do this kind of contingency planning :)

Thanks!
Eddy

On Sun, Nov 15, 2009 at 5:46 PM, Eddy Escardo-Raffo eesca...@kikini.com wrote:

I was using VALUES in my examples to more closely mirror the results of a sub-select (I abstracted everything else away and noticed that even just using VALUES syntax instead of a sub-select, the performance was bad). The full statement that led me into this narrower investigation in the first place looks more like:

explain analyze SELECT u.userid FROM users u, (SELECT locid FROM locations WHERE ...) l WHERE u.location = l.locid LIMIT 10;

Based on the investigation so far, it seems like this kind of statement will perform well when the users.location distribution is not overwhelmingly lopsided, but not otherwise. However, using the IN (list) notation with a list of integer literals seems to perform well no matter what the specific distribution of values in the users.location column is. I would like to understand why this is so, to help me write better queries in the future.

Thanks, Eddy

On Sun, Nov 15, 2009 at 5:23 PM, Tom Lane t...@sss.pgh.pa.us wrote:

Eddy Escardo-Raffo eesca...@kikini.com writes: For C, the planner estimated 10 thousand rows. For D, the planner estimated 100 thousand rows, yet for E the planner estimated only 1 row, which is the closest to reality. So, is there any way to specify a query that has a SUB-SELECT that returns a small set of values so that the planner treats it similar to how it treats statement E, or does statement E get its additional edge precisely from the fact that the restriction is defined by integer literals?
Currently there is no attempt to look at the exact contents of a VALUES construct for planning purposes. For the examples you're showing it seems like the IN (list) notation is more compact and more widely used, so improving the VALUES alternative doesn't seem that exciting. regards, tom lane
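To make the two notations Tom compares concrete, here is a sketch of both forms (table and column names are from the earlier examples in the thread; the literal location values are made up):

```sql
-- IN (list): the planner inspects the literal values directly
-- and can estimate selectivity from the column statistics.
SELECT u.userid
FROM users u
WHERE u.location IN (76543, 76544);

-- VALUES join: per Tom's note, the planner does not look at the
-- contents of the VALUES construct, so its row estimates for the
-- join can be far off when the column distribution is lopsided.
SELECT u.userid
FROM users u
JOIN (VALUES (76543), (76544)) AS l(locid)
  ON u.location = l.locid;
```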
Re: [PERFORM] Unexpected sequential scan on an indexed column
Hi Eddy

Perhaps a slightly naive suggestion: have you considered converting the query to a small stored procedure ('function' in Postgres speak)? You can pull the location values, and then iterate over a query like this:

select userid from users where location=:x

which is more-or-less guaranteed to use the index.

I had a somewhat similar situation recently, where I was passing in a list of ids (from outwith Postgres) and it would on occasion avoid the index in favour of a full table scan. I changed this to iterate over the ids with separate queries (in Java, but using a function will achieve the same thing) and went from one 5-minute query doing a full table scan to a handful of queries doing sub-millisecond direct index lookups.

Cheers
Dave

-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
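A minimal sketch of this suggestion as a PL/pgSQL function. The users/locations tables and location/locid/userid columns are from the thread; the function name, the parameter, and the `region = p_region` restriction are invented stand-ins for the WHERE clause that was elided earlier:

```sql
-- Hypothetical function: pull the location values first, then run one
-- simple equality query per location, each of which is a plain
-- indexable lookup on users.location.
CREATE OR REPLACE FUNCTION users_in_locations(p_region integer)
RETURNS SETOF integer
LANGUAGE plpgsql STABLE AS $$
DECLARE
    v_loc integer;
BEGIN
    FOR v_loc IN SELECT locid FROM locations WHERE region = p_region LOOP
        -- One equality predicate per iteration: more-or-less
        -- guaranteed to use the index on users.location.
        RETURN QUERY SELECT userid FROM users WHERE location = v_loc;
    END LOOP;
END;
$$;

-- Usage:
-- SELECT * FROM users_in_locations(42);
```

This keeps everything in one round trip, unlike issuing the per-location queries from Java.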
Re: [PERFORM] Unexpected sequential scan on an indexed column
Yeah, this kind of thing would probably work. Doing this in Java with separate queries would be easy to code but would require multiple round trips. Doing it as a stored procedure would be nicer, but I'd have to think a little more about how to refactor the Java code around the query to make this happen.

Thanks for the suggestion.
Eddy
Re: [PERFORM] Unexpected sequential scan on an indexed column
Hi Eddy,

Here is a lookup wrapper that is used in DSPAM to work around a similar problem. Maybe you can use it as a template for your function:

create function lookup_tokens(integer, bigint[])
returns setof dspam_token_data
language plpgsql stable as '
declare
    v_rec record;
begin
    for v_rec in
        select * from dspam_token_data
        where uid = $1
          and token in (select $2[i]
                        from generate_series(array_lower($2,1),
                                             array_upper($2,1)) s(i))
    loop
        return next v_rec;
    end loop;
    return;
end;';

Regards,
Ken
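For reference, a SETOF function like Ken's is called in the FROM clause and behaves like a table (the uid and token values below are made up for illustration):

```sql
-- Hypothetical call: uid 1 and three made-up token values.
-- The result set has the same columns as dspam_token_data.
SELECT * FROM lookup_tokens(1, ARRAY[101, 202, 303]::bigint[]);
```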
Re: [PERFORM] Unexpected sequential scan on an indexed column
This is incredibly helpful, Kenneth. I didn't know about the SETOF syntax at all. This could help minimize the amount of refactoring I need to do.

Thanks!
Eddy
Re: [PERFORM] Unexpected sequential scan on an indexed column
With Postgres, you can transparently replace a regular select with a function that takes the same types and returns a record iterator with the same columns. The only change needed is the SQL used to invoke it; you won't need any logic changes in your app code (Java or whatever), e.g.

select ... where x = :x ... (select ... where ... y = :y)

becomes

select myfunction(:x, :y)
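A sketch of the swap Dave describes, assuming the users/locations schema from this thread; `myfunction` is Dave's placeholder name and the `region = :y` restriction is invented for illustration:

```sql
-- Before: a join whose plan depends on the planner's row estimates.
SELECT u.userid
FROM users u,
     (SELECT locid FROM locations WHERE region = :y) l
WHERE u.location = l.locid;

-- After: a set-returning function with the same output columns.
-- Only the SQL string changes; the app's result-handling code,
-- column names and types stay identical.
SELECT * FROM myfunction(:y);
```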
Re: [PERFORM] Unexpected sequential scan on an indexed column
Thanks, Dave.

Eddy
Re: [PERFORM] Unexpected sequential scan on an indexed column
Eddy Escardo-Raffo eesca...@kikini.com writes: Do you guys have any idea why this is not working as I expect?

Datatype issue maybe? When I try what seems to be the same case here I get the expected indexscan, so I'm thinking the problem is that the comparison isn't indexable, which is a possibility if the location column isn't actually integer. The fact that it's estimating 100 rows out is also extremely suspicious --- it might or might not get the exact 2 estimate, but I'd sure expect it to know that the majority of rows don't match.

regards, tom lane
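One quick way to check the datatype possibility Tom raises, using standard PostgreSQL catalogs (table and column names from the thread; pg_typeof requires 8.4 or later):

```sql
-- Show the type the expression actually has at runtime (8.4+).
SELECT pg_typeof(location) FROM users LIMIT 1;

-- Or inspect the declared column type from the catalogs.
SELECT a.attname, format_type(a.atttypid, a.atttypmod)
FROM pg_attribute a
WHERE a.attrelid = 'users'::regclass
  AND a.attname = 'location';
```

If the column turns out to be, say, text or a domain type, an equality comparison against an integer literal may not be able to use the index.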
Re: [PERFORM] Unexpected sequential scan on an indexed column
Eddy Escardo-Raffo eesca...@kikini.com writes: The table used in this query is called users, and it has columns userid (primary key) and location. The location column is indexed. The users table has 1 million rows, and all rows have integer typed value '-1' for location column, except for 2 rows that have the integer value '76543'.

Oh, after poking at it a bit more, I realize the problem: the planner doesn't want to use an indexscan because it assumes there's a significant probability that the search will be for -1 (in which case the indexscan would be slower than a seqscan, as indeed your results prove). Even though it could know in this particular case that the comparison value isn't -1, I doubt that teaching it that would help your real queries, where it will probably be impossible to determine the comparison values in advance.

I would suggest considering using NULL rather than inventing a dummy value for unknown locations. The estimation heuristics will play a lot nicer with that choice.

regards, tom lane
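A sketch of Tom's NULL suggestion applied to the schema described above (the UPDATE is illustrative; in practice the application would insert NULL from the start):

```sql
-- Replace the dummy value with NULL so the column statistics
-- reflect only the genuinely known locations.
UPDATE users SET location = NULL WHERE location = -1;

-- Equality never matches NULL, so the planner's frequency
-- statistics for lookups like this are no longer dominated
-- by the dummy value:
SELECT userid FROM users WHERE location = 76543;
```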
Re: [PERFORM] Unexpected sequential scan on an indexed column
Thanks, Tom. I had already discarded the possibility of a data type mismatch, which was your first guess, but was wondering if the lopsided distribution of location values would lead the planner to make a decision that is good on average but bad for this particular query, as you point out in your second guess. I'll try populating the test users with a more evenly distributed location field, which will be more realistic anyway, and see if that works out better.

BTW, the -1 is not really a dummy value; it's just a value that we have been using in tests as a fake test location ID. I just started performance measurement for my application. So far I had measured performance with every user in the same default location and things seemed to be going well, so I switched a couple of users to a different location to see what happened, and that made performance drop significantly. (Even more detail: my queries also limit results to approximately 10, so the DB quickly found 10 rows matching location -1, but it took a while to discover there weren't more than 2 rows with the other value.)

Thanks!
Eddy