People:

Transparent "query caching" is the "industry standard" for how these things 
are handled.   However, Postgres' lack of this feature has made me consider 
other approaches, and I'm starting to wonder if the "standard" query caching 
-- where a materialized query result, or some reduction thereof, is cached in 
database memory -- isn't the best way to cache things.  I'm going to 
abbreviate it "SQC" for the rest of this e-mail.

Obviously, the draw of SQC is its transparency to developers.   With it, the 
Java/Perl/PHP programmers and the DBA don't have to communicate at all -- you 
set it up, give it some RAM, and it "just works".   As someone who frequently 
has to consult based on limited knowledge, I can understand the appeal.

However, one of the problems with SQC, aside from the ones already mentioned 
of stale data and/or cache-clearing, is that (at least in applications like 
MySQL's) it is indiscriminate and caches, at least breifly, unique queries as 
readily as common ones.   Possibly Oracle's implementation is more 
sophisticated; I've not had an opportunity.    

The other half of that problem is that an entire query is cached, rather than 
just the relevant data to uniquely identify the request to the application.    
This is bad in two respects; one that the entire query needs to be parsed to 
see if a new query is materially equivalent, and that two materially 
different queries which could utilize overlapping ranges of the same 
underlying result set must instead cache their results seperately, eating up 
yet more memory.

To explain what I'm talking about, let me give you a counter-example of 
another approach.

I have a data-warehousing application with a web front-end.    The data in the 
application is quite extensive and complex, and only a summary is presented 
to the public users -- but that summary is a query involving about 30 lines 
and 16 joins.  This summary information is available in 3 slightly different 
forms.  Further, the client has indicated that an up to 1/2 hour delay in 
data "freshness" is acceptable.

The first step is forcing that "materialized" view of the data into memory.  
Right now I'm working on a reliable way to do that without using Memcached, 
which won't install on our Solaris servers.  Temporary tables have the 
annoying property of being per-connection, which doesn't work in a pool of 60 
connections.

The second step, which I completed first due to the lack of technical 
obstacles, is to replace all queries against this data with calls to a 
Set-Returning Function (SRF).   This allowed me to re-direct where the data 
was coming from -- presumably the same thing could be done through RULES, but 
it would have been considerably harder to implement.

The first thing the SRF does is check the criteria passed to it against a set 
of cached (in a table) criteria with that user's permission level which is < 
1/2 hour old.   If the same criteria are found, then the SRF is returned a 
set of row identifiers for the materialized view (MV), and looks up the rows 
in the MV and returns those to the web client.   

If no identical set of criteria are found, then the query is run to get a set 
of identifiers which are then cached, and the SRF returns the queried rows.

Once I surmount the problem of storing all the caching information in 
protected memory, the advantages of this approach over SQC are several:

1) The materialized data is available in 3 different forms; a list, a detail 
view, and a spreadsheet.  Each form as somewhat different columns and 
different rules about ordering, which would likely confuse an SQC planner.   
In this implementation, all 3 forms are able to share the same cache.

2) The application is comparing only sets of unambguous criteria rather than 
long queries which would need to be compared in planner form in order to 
determine query equivalence. 

3) With the identifier sets, we are able to cache other information as well, 
such as a count of rows, further limiting the number of queries we must run.

4) This approach is ideally suited to the pagination and re-sorting common to 
a web result set.  As only the identifiers are cached, the results can be 
re-sorted and broken in to pages after the cache read, a fast, all-in-memory 
operation.

In conclusion, what I'm saying is that while forms of transparent query 
caching (plan, materialized or whatever) may be desirable for other reasons, 
it's quite possible to acheive a superior level of "query caching" through 
tight integration with the front-end application.   

If people are interested in this, I'd love to see some suggestions on ways to 
force the materialized view into dedicated memory.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

               http://archives.postgresql.org

Reply via email to