Thanks a lot, all of you - this is excellent advice. With the data clustered and statistics at a more reasonable value of 100, it now reproducibly takes even less time - 20-57 ms per query.

        1000x speedup with proper tuning - always impressive, lol.
        IO seeks are always your worst enemy.

After reading the section on "Statistics Used by the Planner" in the manual, I was a little concerned that, while the statistics sped up the queries I tried dramatically, the speedup was coming from the most_common_vals array, and that values which didn't fit in that array wouldn't be sped up. Though I couldn't offhand find an example where this occurred, the clustering approach intuitively seems like a much more complete and scalable solution, at least for a read-only table like this.

Actually, with the statistics target set to 100, up to 100 values will be stored in most_common_vals. This means that any value not in most_common_vals has less than 1% frequency, and probably much less than that. The choice of plan for these rare values is pretty simple.
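
For reference, raising the target and checking what the planner now knows looks roughly like this. The table and column names are only guesses based on the index name gene_prediction_view_gene_ref_key mentioned in this thread, so substitute your own:

    -- Raise the per-column statistics target and refresh the stats.
    ALTER TABLE gene_prediction_view ALTER COLUMN gene_ref SET STATISTICS 100;
    ANALYZE gene_prediction_view;

    -- most_common_vals / most_common_freqs is what the planner consults
    -- for frequent values; everything else falls back to the histogram.
    SELECT most_common_vals, most_common_freqs, n_distinct
    FROM pg_stats
    WHERE tablename = 'gene_prediction_view' AND attname = 'gene_ref';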

With two columns, "interesting" stuff can happen: if you have col1 in [1...10] and col2 in [1...10] and use a condition like col1=const AND col2=const, the selectivity of the result depends not only on the distributions of col1 and col2 but also on their correlation.
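
A quick way to see the effect with a throwaway demo table where col2 is perfectly correlated with col1 (names and data made up purely for illustration):

    -- 100,000 rows where col2 always equals col1, each in 1..10.
    CREATE TABLE corr_demo AS
        SELECT (i % 10) + 1 AS col1, (i % 10) + 1 AS col2
        FROM generate_series(1, 100000) AS s(i);
    ANALYZE corr_demo;

    -- The planner assumes the conditions are independent and multiplies
    -- their selectivities (1/10 * 1/10), so it estimates about 1,000
    -- rows, while the correlated data actually returns about 10,000.
    EXPLAIN ANALYZE SELECT * FROM corr_demo WHERE col1 = 5 AND col2 = 5;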

As for the tests you did, it's hard to say without seeing the EXPLAIN ANALYZE outputs. If you change the stats but the plan choice (EXPLAIN) stays the same, and you use the same values in your query, any difference in timing comes from caching, since postgres is executing the same plan and therefore doing exactly the same thing. Caching (in PG and in the OS) can make the timings vary a lot.
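
Something along these lines would show it, again assuming the table/column names inferred from the index name and an arbitrary constant: run the same statement twice and compare the plan shape and the timings.

    -- First run: shows the plan choice plus cold-cache timing.
    EXPLAIN ANALYZE
        SELECT * FROM gene_prediction_view WHERE gene_ref = 12345;

    -- Second run: if the plan is identical, any timing difference is
    -- pure cache warm-up (shared buffers + OS disk cache).
    EXPLAIN ANALYZE
        SELECT * FROM gene_prediction_view WHERE gene_ref = 12345;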

- Trying the same constant a second time gave an instantaneous result, I'm guessing because of query/result caching.

PG does not cache queries or results. It caches data & index pages in its shared buffers, and the OS adds another layer with the usual disk cache. A simple query like selecting one row by primary key takes about 60 microseconds of CPU time, but if it needs one seek for the index and one for the data it may take 20 ms waiting for the moving parts to move... Hence, CLUSTER is a very useful tool.
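
One rough way to watch that cache at work (table name assumed, as above): the statistics views count pages found in shared buffers separately from pages that had to be requested from the OS or disk.

    SELECT relname,
           heap_blks_read, heap_blks_hit,   -- table pages: OS/disk vs shared buffers
           idx_blks_read,  idx_blks_hit     -- index pages
    FROM pg_statio_user_tables
    WHERE relname = 'gene_prediction_view';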

Bitmap index scans love clustered tables because all the interesting rows end up grouped together, so far fewer pages need to be visited.
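
For completeness, the clustering step itself looks roughly like this, assuming the table is named gene_prediction_view (only the index name is given in this thread); older releases spell it CLUSTER indexname ON tablename instead:

    -- Rewrite the table in index order, then refresh the statistics.
    -- CLUSTER does not maintain the ordering afterwards, so re-run it
    -- after large data loads.
    CLUSTER gene_prediction_view USING gene_prediction_view_gene_ref_key;
    ANALYZE gene_prediction_view;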

- I didn't try decreasing the statistics back to 10 before I ran the CLUSTER command, so I can't show the search times going up because of that. But I tried killing the 500 MB process. The new process uses less than 5 MB of RAM, and still reproducibly returns a result in less than 60 ms. Again, this is with a statistics value of 100 and the data clustered by gene_prediction_view_gene_ref_key.

        Killing it, or just restarting postgres?
If you let postgres run (not idle) for a while, it will naturally fill RAM up to the shared_buffers setting you specified in the configuration file. This is good, since grabbing data from postgres' own cache is faster than having to make a syscall to the OS to get it from the OS disk cache (or from disk). This isn't bloat. What the 500 MB versus 6 MB shows is that before, postgres had to read a lot of data for your query, so that data stayed in the cache; after tuning it needs to read much less data (thanks to CLUSTER), so the cache stays empty.
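
If you want to see what ceiling the backend is filling up towards, just ask the server; the value itself is whatever you set in postgresql.conf:

    -- Upper bound on postgres' own page cache; resident memory of a busy
    -- backend will grow towards roughly this value over time.
    SHOW shared_buffers;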

