On 09/14/2011 01:10 AM, Alban Hertroys wrote:
On 13 Sep 2011, at 23:44, Brian Fehrle wrote:

These queries basically do a "select max(primary_key_column) from table group by
column1, column2". Because of the GROUP BY, we end up with a sequential
scan of the entire table, which proves to be costly.
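For reference, the query shape under discussion looks roughly like the sketch below (table and column names are placeholders, not the actual schema). PostgreSQL's DISTINCT ON is one alternative way to express "latest row per group":

```sql
-- hypothetical schema: main_table(pkey_sid, column1, column2, ...)
SELECT max(pkey_sid)
FROM main_table
GROUP BY column1, column2;

-- the same "row with the max key per group", which can walk an index
-- on (column1, column2, pkey_sid DESC) instead of scanning everything:
SELECT DISTINCT ON (column1, column2) pkey_sid, column1, column2
FROM main_table
ORDER BY column1, column2, pkey_sid DESC;
```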
That seems to suggest that the row whose primary key has the max value is "special" in
some way. Making such rows more easily distinguishable from "normal" rows seems like a good
idea here.

Since the table has a ton of columns, I set up a smaller table that will house
a copy of some of the data that the query uses: the primary key column and the
two columns I do my GROUP BY on.
That's one way to distinguish these special rows from the rest. You could also 
mark them as special using an extra column and/or create an expression-based 
index over just those rows.
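One way to sketch the "mark them as special" idea is a flag column plus a partial index (column and index names here are assumptions, not from the original thread):

```sql
-- hypothetical: flag the max-pkey row for each (column1, column2) group
ALTER TABLE main_table ADD COLUMN is_latest boolean NOT NULL DEFAULT false;

-- a partial index covers only the flagged rows, so it stays small and
-- the planner can use it when the query filters on is_latest
CREATE INDEX main_table_latest_idx
    ON main_table (column1, column2)
    WHERE is_latest;
```

Note the flag would need maintenance (for example, by a trigger) whenever a newer row arrives for a group.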

However, especially with the below section in mind, it would appear your data 
could be normalised a bit more (your splitting off that shadow table is a step 
in doing so, in fact).

I'm also wondering, does your primary key have actual meaning? It would appear 
to just indicate the order in which the records were created (I'm assuming it's 
a serial type surrogate PK, and not a natural one).

It isn't a serial type, and the id increment is handled by the application.
This shadow table will also contain only one row for every (column1, column2)
combination (due to the GROUP BY), and each of those rows will hold the max of the
primary key. Even so, the shadow table will have about 14 million
rows, compared to the 15 million in the main table.
Don't (column1, column2) make up a key then? I get the feeling you should split
your table into 3 sections:
Table 1: main lookup (PK: pkey_sid)
Table 2: Variation lookup (PK: (column1, column2), FK: pkey_sid)
Table 3: Data (FK: the above)
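The three-way split above could be sketched as DDL along these lines (all names and types are assumptions for illustration; the real schema isn't shown in the thread):

```sql
-- Table 1: main lookup, keyed by the surrogate id
CREATE TABLE main_lookup (
    pkey_sid bigint PRIMARY KEY
);

-- Table 2: one row per (column1, column2) variation, pointing at
-- the relevant main_lookup row
CREATE TABLE variation_lookup (
    column1  integer NOT NULL,
    column2  integer NOT NULL,
    pkey_sid bigint  NOT NULL REFERENCES main_lookup,
    PRIMARY KEY (column1, column2)
);

-- Table 3: the wide data rows, referencing the variation they belong to
CREATE TABLE data (
    column1 integer NOT NULL,
    column2 integer NOT NULL,
    payload text,  -- stand-in for the many remaining columns
    FOREIGN KEY (column1, column2) REFERENCES variation_lookup
);
```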

A (column1, column2) combination can occur in multiple rows, such as 4 rows where column1 = 54 and column2 = 86. In those cases I just want the row with max(primary_key).

I'm looking into options like this, but at the moment changing the base table structure is out of the question; adding tables alongside it to try to speed things up is OK, though. I'm trying to avoid causing changes in the application.
So the issue here comes in retrieving the needed data from my main table. The
result is estimated at 409,600 rows, and retrieving the primary
keys associated with those rows is actually really easy. However,
when we take those 409,600 keys back to the main table to retrieve the other
columns I need, the planner just does a sequential scan, as that is most likely
going to be faster than hitting the index and then retrieving the columns I need
for all 400K+ rows.
Is that estimate accurate? If not, see Ondrej's suggestions.

That is only about 1/30th of your table. I don't think a seqscan makes sense 
here unless your data is distributed badly.

Yeah, the more I look at it, the more I think it's Postgres _thinking_ that a sequential scan is faster. I'll be playing with the random_page_cost that Ondrej suggested, and will schedule a time when I can run some EXPLAIN ANALYZEs (production server and all).
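A low-risk way to experiment is to change random_page_cost for the current session only and compare plans (query and table names below are placeholders):

```sql
-- default random_page_cost is 4.0; lowering it makes index scans
-- look cheaper to the planner relative to sequential scans
SET random_page_cost = 2.0;

EXPLAIN ANALYZE
SELECT max(pkey_sid)
FROM main_table
GROUP BY column1, column2;

-- revert to the server default for this session
RESET random_page_cost;
```

Since SET only affects the current session, this can be tried on a production server without changing behavior for other connections.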
Things to note:
1. If I reduce my where clause's range, then the sequential scan turns into an 
index scan, but sadly this can't always be done.
Does it make sense to CLUSTER your data in some sense? That would improve the 
data distribution issue and would probably push the threshold for a seqscan up 
some.
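A minimal sketch of the CLUSTER idea, assuming an index on the group-by columns (index and table names are hypothetical):

```sql
-- physically reorder the table to match an index on the GROUP BY columns,
-- so rows for the same (column1, column2) sit on adjacent pages
CREATE INDEX main_table_c1_c2_idx ON main_table (column1, column2);
CLUSTER main_table USING main_table_c1_c2_idx;

-- refresh planner statistics after the reorder
ANALYZE main_table;
```

Caveats: CLUSTER takes an exclusive lock for the duration of the rewrite, and it is a one-time reorder; later inserts and updates are not kept in clustered order.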

Cheers,

Alban Hertroys

Thanks, I'll be reporting back in with my next findings.

- Brian F
--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
