On 11/06/2011 01:08 PM, Tom Lane wrote:
Peter Geoghegan<pe...@2ndquadrant.com>  writes:
... It also does things
like intelligently distinguishing between queries with different
limit/offset constant values, as these constants are deemed to be
differentiators of queries for our purposes. A guiding principal that
I've followed is that anything that could result in a different plan
is a differentiator of queries.
This claim seems like bunk, unless you're hashing the plan tree,
in which case it's tautological.

Peter's patch adds a planner hook and hashes at that level. Since this cat is rapidly escaping its bag and impacting other contributors, we might as well share the work in progress early if anyone has a burning desire to look at the code. A diff against the version I've been internally reviewing in prep for community submission is at https://github.com/greg2ndQuadrant/postgres/compare/master...pg_stat_statements_norm Easier to browse than ask questions probing about it I think. Apologies to Peter for sharing his work before he was completely ready; there is a list of known problems with it. I also regret the thread hijack here.

The first chunk of code is a Python based regression test program, and an open item here is the best way to turn that into a standard regression test set.

There will be additional infrastructure added to the parser to support
normalisation of query strings for the patch I'll be submitting (that
obviously won't be supported in the version that builds against
existing Postgres versions that I'll make available). Essentially,
I'll be adding a length field to certain nodes,
This seems like a good way to get your patch rejected: adding overhead
to the core system for the benefit of a feature that not everyone cares
about is problematic.

Struggling around this area is the main reason this code hasn't been submitted yet. To step back for a moment, there are basically three options here that any code like this can take, in regards to storing the processed query name used as the key:

1) Use the first instance of that query seen as the "reference" version
2) Use the last instance seen
3) Transform the text of the query in a way that's stable across all duplicates of that statement, and output that

Downstream tools operating on this data, things that will sample pg_stat_statements multiple times, need some sort of stable query text they can operate on. It really needs to survive server restart too. Neither (1) nor (2) seem like usable solutions. Both have been implemented already in Peter's patch, with (2) being what's in the current patch. How best to do (3) instead is not obvious though.

Doing the query matching operating at the planner level seems effective at side-stepping the problem of needing to parse the SQL, which is where most implementations of this idea get stuck doing fragile transformations. My own first try at this back in September and Tomas's patch both fall into that category. That part of Peter's work seems to work as expected. That still leaves the problem of "parsed query -> stable normalized string". I think that is an easier one to solve than directly taking on the entirety of "query text -> stable normalized string" though. Peter has an idea he's exploring for how to implement that, and any amount of overhead it adds to people who don't use this feature is an obvious concern. There are certainly use cases that don't care about this problem, ones that would happily take (1) or (2). I'd rather not ship yet another not quite right for everyone version of pg_stat_statements again though; only solving too limited of a use case is the big problem with the one that's already there.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to