Thanks a lot, all of you - this is excellent advice. With the data clustered and statistics at a more reasonable value of 100, it now reproducibly takes even less time - 20-57 ms per query.

        1000x speedup with proper tuning - always impressive, lol.
        IO seeks are always your worst enemy.

After reading the section on "Statistics Used by the Planner" in the manual, I was a little concerned that, while the statistics sped up the queries I tried dramatically, the speedup was coming from the most_common_vals array, and that values which didn't fit in that array wouldn't be sped up. Though I couldn't offhand find an example where this occurred, the clustering approach intuitively seems like a much more complete and scalable solution, at least for a read-only table like this.

Actually, with the statistics target set to 100, up to 100 values will be stored in most_common_vals. This means that any value not in most_common_vals has less than 1% frequency, and probably much less than that. The choice of plan for these rare values is pretty simple.
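
For reference, raising the target and checking what the planner now knows looks roughly like this. The table and column names are only guesses based on the index name gene_prediction_view_gene_ref_key mentioned in this thread, so substitute your own:

    -- Raise the per-column statistics target and refresh the stats.
    ALTER TABLE gene_prediction_view ALTER COLUMN gene_ref SET STATISTICS 100;
    ANALYZE gene_prediction_view;

    -- most_common_vals / most_common_freqs is what the planner consults
    -- for frequent values; everything else falls back to the histogram.
    SELECT most_common_vals, most_common_freqs, n_distinct
    FROM pg_stats
    WHERE tablename = 'gene_prediction_view' AND attname = 'gene_ref';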

With two columns, "interesting" stuff can happen: if you have col1 in [1...10] and col2 in [1...10] and use a condition like col1=const AND col2=const, the selectivity of the result depends not only on the distributions of col1 and col2 but also on their correlation.
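
A quick way to see the effect with a throwaway demo table where col2 is perfectly correlated with col1 (names and data made up purely for illustration):

    -- 100,000 rows where col2 always equals col1, each in 1..10.
    CREATE TABLE corr_demo AS
        SELECT (i % 10) + 1 AS col1, (i % 10) + 1 AS col2
        FROM generate_series(1, 100000) AS s(i);
    ANALYZE corr_demo;

    -- The planner assumes the conditions are independent and multiplies
    -- their selectivities (1/10 * 1/10), so it estimates about 1,000
    -- rows, while the correlated data actually returns about 10,000.
    EXPLAIN ANALYZE SELECT * FROM corr_demo WHERE col1 = 5 AND col2 = 5;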

As for the tests you did, it's hard to say without seeing the EXPLAIN ANALYZE outputs. If you change the stats but the plan choice (EXPLAIN) stays the same, and you use the same values in your query, any difference in timing comes from caching, since postgres is executing the same plan and therefore doing exactly the same thing. Caching (in PG and in the OS) can make the timings vary a lot.
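
Something along these lines would show it, again assuming the table/column names inferred from the index name and an arbitrary constant: run the same statement twice and compare the plan shape and the timings.

    -- First run: shows the plan choice plus cold-cache timing.
    EXPLAIN ANALYZE
        SELECT * FROM gene_prediction_view WHERE gene_ref = 12345;

    -- Second run: if the plan is identical, any timing difference is
    -- pure cache warm-up (shared buffers + OS disk cache).
    EXPLAIN ANALYZE
        SELECT * FROM gene_prediction_view WHERE gene_ref = 12345;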

- Trying the same constant a second time gave an instantaneous result, I'm guessing because of query/result caching.

PG does not cache queries or results. It caches data & index pages in its shared buffers, and the OS adds another layer with the usual disk cache. A simple query like selecting one row by primary key takes about 60 microseconds of CPU time, but if it needs one seek for the index and one for the data it may take 20 ms waiting for the moving parts to move... Hence, CLUSTER is a very useful tool.
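
One rough way to watch that cache at work (table name assumed, as above): the statistics views count pages found in shared buffers separately from pages that had to be requested from the OS or disk.

    SELECT relname,
           heap_blks_read, heap_blks_hit,   -- table pages: OS/disk vs shared buffers
           idx_blks_read,  idx_blks_hit     -- index pages
    FROM pg_statio_user_tables
    WHERE relname = 'gene_prediction_view';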

Bitmap index scans love clustered tables because all the interesting rows end up grouped together, so far fewer pages need to be visited.
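
For completeness, the clustering step itself looks roughly like this, assuming the table is named gene_prediction_view (only the index name is given in this thread); older releases spell it CLUSTER indexname ON tablename instead:

    -- Rewrite the table in index order, then refresh the statistics.
    -- CLUSTER does not maintain the ordering afterwards, so re-run it
    -- after large data loads.
    CLUSTER gene_prediction_view USING gene_prediction_view_gene_ref_key;
    ANALYZE gene_prediction_view;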

- I didn't try decreasing the statistics back to 10 before I ran the CLUSTER command, so I can't show the search times going up because of that. But I tried killing the 500 MB process. The new process uses less than 5 MB of RAM, and still reproducibly returns a result in less than 60 ms. Again, this is with a statistics value of 100 and the data clustered by gene_prediction_view_gene_ref_key.

        Killing it, or just restarting postgres?
If you let postgres run (not idle) for a while, it will naturally fill RAM up to the shared_buffers setting you specified in the configuration file. This is good, since grabbing data from postgres' own cache is faster than having to make a syscall to the OS to get it from the OS disk cache (or from disk). This isn't bloat. What the 500 MB versus 6 MB shows is that before, postgres had to read a lot of data for your query, so that data stayed in the cache; after tuning it needs to read much less data (thanks to CLUSTER), so the cache stays empty.
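
If you want to see what ceiling the backend is filling up towards, just ask the server; the value itself is whatever you set in postgresql.conf:

    -- Upper bound on postgres' own page cache; resident memory of a busy
    -- backend will grow towards roughly this value over time.
    SHOW shared_buffers;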

