Re: [HACKERS] Hash id in pg_stat_statements

2012-11-15 Thread Magnus Hagander
On Tue, Oct 2, 2012 at 8:22 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 On 2 October 2012 18:16, Tom Lane t...@sss.pgh.pa.us wrote
 1. Why isn't something like md5() on the reported query text an equally
 good solution for users who want a query hash?

 Because that does not uniquely identify the entry. The very first
 thing that the docs say on search_path is Qualified names are tedious
 to write, and it's often best not to wire a particular schema name
 into applications anyway. Presumably, the reason it's best not to
 wire schema names into apps is because it might be useful to modify
 search_path in a way that dynamically made the same queries in some
 application reference what are technically distinct relations. If
 anyone does this, and it seems likely that many do for various
 reasons, they will be out of luck when using some kind of
 pg_stat_statements aggregation.

 This was the behaviour that I intended for pg_stat_statements all
 along, and I think it's better than a solution that matches query
 strings.

 2. If people are going to accumulate stats on queries over a long period
 of time, is a 32-bit hash really good enough for the purpose?  If I'm
 doing the math right, the chance of collision is already greater than 1%
 at 1 queries, and rises to about 70% for 10 queries; see
 http://en.wikipedia.org/wiki/Birthday_paradox
 We discussed this issue and decided it was okay for pg_stat_statements's
 internal hash table, but it's not at all clear to me that it's sensible
 to use 32-bit hashes for external accumulation of query stats.

 Well, forgive me for pointing this out, but I did propose that the
 hash be a 64-bit value (which would have necessitated adopting
 hash_any() to produce 64-bit values), but you rejected the proposal. I
 arrived at the same probability for a collision as you did and posted
 in to the list, in discussion shortly after the normalisation stuff
 was committed.

What was the argument for rejecting it? Just that it required
hash_any() to be adapted?

Now that we have a very clear use case where this would help, perhaps
it's time to re-visit this proposal?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash id in pg_stat_statements

2012-11-15 Thread Peter Geoghegan
On 15 November 2012 13:10, Magnus Hagander mag...@hagander.net wrote:
 Well, forgive me for pointing this out, but I did propose that the
 hash be a 64-bit value (which would have necessitated adopting
 hash_any() to produce 64-bit values), but you rejected the proposal. I
 arrived at the same probability for a collision as you did and posted
 in to the list, in discussion shortly after the normalisation stuff
 was committed.

 What was the argument for rejecting it? Just that it required
 hash_any() to be adapted?

I think so, yes. It just wasn't deemed necessary. Note that hash_any()
has a comment that says something like if you want to adapt this so
that the Datum returned is a 64-bit integer, the way to do that
is I followed this advice in a revision of the pg_stat_statements
normalisation patch, and preserved compatibility by using a macro,
which Tom objected to.

 Now that we have a very clear use case where this would help, perhaps
 it's time to re-visit this proposal?

Perhaps. I think that the issue of whether or not a 64-bit integer is
necessary is totally orthogonal to the question of whether or not we
should expose the hash. The need to aggregate historical statistics
just doesn't appreciably alter things here, I feel. The number of
discrete queries that an application will execute in a week just isn't
that different from the number that it will ever execute, I suspect.

-- 
Peter Geoghegan   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash id in pg_stat_statements

2012-11-15 Thread Dimitri Fontaine
Peter Geoghegan pe...@2ndquadrant.com writes:
 should expose the hash. The need to aggregate historical statistics
 just doesn't appreciably alter things here, I feel. The number of
 discrete queries that an application will execute in a week just isn't
 that different from the number that it will ever execute, I suspect.

Please don't forget that some people won't write the most effective SQL
from the get go and will rewrite problematic queries. Some people will
even rollout new features in their applications, with new queries to
implement them. Or new applications on top of the existing database.

So I don't think that the query list is that static. That said, I don't
have any idea at all about the impact of what I'm saying to your
analysis…

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash id in pg_stat_statements

2012-10-15 Thread Peter Geoghegan
On 3 October 2012 19:04, Tom Lane t...@sss.pgh.pa.us wrote:
 Daniel Farina dan...@heroku.com writes:
 Instead, I think it makes sense to assign a number -- arbitrarily, but
 uniquely -- to the generation of a new row in pg_stat_statements, and,
 on the flip side, whenever a row is retired its number should be
 eliminated, practically, for-ever.  This way re-introductions between
 two samplings of pg_stat_statements cannot be confused for a
 contiguously maintained statistic on a query.

 This argument seems sensible to me.

Daniel: Could you please submit the patch that you were working on
that does this to the next commitfest?

-- 
Peter Geoghegan   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash id in pg_stat_statements

2012-10-15 Thread Daniel Farina
On Mon, Oct 15, 2012 at 7:35 AM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 On 3 October 2012 19:04, Tom Lane t...@sss.pgh.pa.us wrote:
 Daniel Farina dan...@heroku.com writes:
 Instead, I think it makes sense to assign a number -- arbitrarily, but
 uniquely -- to the generation of a new row in pg_stat_statements, and,
 on the flip side, whenever a row is retired its number should be
 eliminated, practically, for-ever.  This way re-introductions between
 two samplings of pg_stat_statements cannot be confused for a
 contiguously maintained statistic on a query.

 This argument seems sensible to me.

 Daniel: Could you please submit the patch that you were working on
 that does this to the next commitfest?

Yes. Sorry I haven't done that already.  I'll clean it up and send it
out Real Soon Now, thanks for the expression of interest.

-- 
fdr


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash id in pg_stat_statements

2012-10-10 Thread Peter Geoghegan
On 3 October 2012 19:54, Peter Geoghegan pe...@2ndquadrant.com wrote:
 On 3 October 2012 19:04, Tom Lane t...@sss.pgh.pa.us wrote:
 This argument seems sensible to me.  Is there any use-case where the
 proposed counter wouldn't do what people wished to do with an exposed
 hash value?

 Yes. The hash could be used to aggregate query execution costs across
 entire WAL-based replication clusters. I'm not opposed to Daniel's
 suggestion, though.

Could we please try and reach a consensus here? If you're still dead
set against exposing the hash value, I think that just following what
Daniel has suggested is a fair compromise.

-- 
Peter Geoghegan   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash id in pg_stat_statements

2012-10-03 Thread Tom Lane
Daniel Farina dan...@heroku.com writes:
 Instead, I think it makes sense to assign a number -- arbitrarily, but
 uniquely -- to the generation of a new row in pg_stat_statements, and,
 on the flip side, whenever a row is retired its number should be
 eliminated, practically, for-ever.  This way re-introductions between
 two samplings of pg_stat_statements cannot be confused for a
 contiguously maintained statistic on a query.

This argument seems sensible to me.  Is there any use-case where the
proposed counter wouldn't do what people wished to do with an exposed
hash value?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash id in pg_stat_statements

2012-10-03 Thread Peter Geoghegan
On 3 October 2012 19:04, Tom Lane t...@sss.pgh.pa.us wrote:
 Daniel Farina dan...@heroku.com writes:
 Instead, I think it makes sense to assign a number -- arbitrarily, but
 uniquely -- to the generation of a new row in pg_stat_statements, and,
 on the flip side, whenever a row is retired its number should be
 eliminated, practically, for-ever.  This way re-introductions between
 two samplings of pg_stat_statements cannot be confused for a
 contiguously maintained statistic on a query.

 This argument seems sensible to me.  Is there any use-case where the
 proposed counter wouldn't do what people wished to do with an exposed
 hash value?

Yes. The hash could be used to aggregate query execution costs across
entire WAL-based replication clusters. I'm not opposed to Daniel's
suggestion, though.

-- 
Peter Geoghegan   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash id in pg_stat_statements

2012-10-03 Thread Daniel Farina
On Wed, Oct 3, 2012 at 11:04 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Daniel Farina dan...@heroku.com writes:
 Instead, I think it makes sense to assign a number -- arbitrarily, but
 uniquely -- to the generation of a new row in pg_stat_statements, and,
 on the flip side, whenever a row is retired its number should be
 eliminated, practically, for-ever.  This way re-introductions between
 two samplings of pg_stat_statements cannot be confused for a
 contiguously maintained statistic on a query.

 This argument seems sensible to me.  Is there any use-case where the
 proposed counter wouldn't do what people wished to do with an exposed
 hash value?

Yes: unless schemas are included into the canonicalized query text,
two identical texts can actually mean different things.  This is the
only corner case (besides collision) I can think of.

I also referenced an idea in a similar vein to the counter/arbitrary
number: instead, one can perform a kind of error propagation from
evicted entries in the hash table.  This might avoid having to even
bother saving a counter, which feels rather nice to me.  I have a
sketch (I now know of two stupid bugs in this) in implementation and
it comes out very neatly so far.

I got this idea from a paper that I saw implemented once, with
strikingly good effect:
http://www.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf

Here they have a very specific rendition that is, for a couple of
reasons, not quite what we want, I think.  But the part I found very
interesting was the simple propagation of error that made the
technique possible.  Inspired by this, I gave every hash entry a
maximum-under-estimation error.  When members of the hash table are
evicted, the minimum of the metric gathered plus its already
accumulated under-estimation error are propagated to the new entry.

The general property is that hash table members who are frequently
evicted will accumulate huge amounts of error.  Those that are never
evicted (thanks to constant use) never accumulate any error.

A tool reading the table can take into account the error accumulation
to determine if there was an eviction between two samplings, as well
as bounding the error accrued between two snapshots.  I think there is
a tiny bit of room for some sort of ambiguity in a corner case, but
I'd have to think more about it.

-- 
fdr


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash id in pg_stat_statements

2012-10-02 Thread Peter Geoghegan
On 1 October 2012 18:05, Stephen Frost sfr...@snowman.net wrote:
 * Peter Geoghegan (pe...@2ndquadrant.com) wrote:
 That won't really help matters. There'd still be duplicate entries,
 from before and after the change, even if we make it immediately
 obvious which is which. The only reasonable solution in that scenario
 is to bump PGSS_FILE_HEADER, which will cause all existing entries to
 be invalidated.

 You're going to have to help me here, 'cause I don't see how there can
 be duplicates if we include the PGSS_FILE_HEADER as part of the hash,
 unless we're planning to keep PGSS_FILE_HEADER constant while we change
 what the hash value is for a given query, yet that goes against the
 assumptions that were laid out, aiui.

Well, they wouldn't be duplicates if you think that the fact that one
query was executed before some point release and another after ought
to differentiate queries. I do not.

 If there's a change that results in a given query X no longer hashing to
 a value A, we need to change PGSS_FILE_HEADER to invalidate statistics
 which were collected for value A (or else we risk an independent query Y
 hashing to value A and ending up with completely invalid stats..).
 Provided we apply that consistently and don't reuse PGSS_FILE_HEADER
 values along the way, a combination of PGSS_FILE_HEADER and the hash
 value for a given query should be unique over time.

 We do need to document that the hash value for a given query may
 change..

By invalidate, I mean that when we go to open the saved file, if the
header doesn't match, the file is considered corrupt, and we simply
log that the file could not be read, before unlinking it. This would
be necessary in the unlikely event of there being some substantive
change in the representation of query trees in a point release. I am
not aware of any precedent for this, though Tom said that there was
one.

Tom: Could you please describe the precedent you had in mind? I would
like to better understand this risk.

I don't want to get too hung up on what we'd do if this problem
actually occurred, because that isn't what this thread is about.
Exposing the hash is a completely unrelated question, except that to
do so might make pg_stat_statements (including its limitations) better
understood. In my opinion, the various log parsing tools have taught
people to think about this in the wrong way - the query string is just
a representation of a query (and an imperfect one at that).

There are other, similar tools that exist in proprietary databases.
They expose a hash value, which is subject to exactly the same caveats
as our own. They explicitly encourage the type of aggregation by
third-party tools that I anticipate will happen with
pg_stat_statements. I want to facilitate this, but I'm confident that
the use of (dbid, userid, querytext) as a candidate key by these
tools is going to cause confusion, in subtle ways. By using the hash
value, those tools are exactly as well off as pg_stat_statements is,
which is to say, as well off as possible.

I simply do not understand objections to the proposal. Have I missed something?

-- 
Peter Geoghegan   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash id in pg_stat_statements

2012-10-02 Thread Euler Taveira
On 02-10-2012 10:15, Peter Geoghegan wrote:
 There are other, similar tools that exist in proprietary databases.
 They expose a hash value, which is subject to exactly the same caveats
 as our own. They explicitly encourage the type of aggregation by
 third-party tools that I anticipate will happen with
 pg_stat_statements. I want to facilitate this, but I'm confident that
 the use of (dbid, userid, querytext) as a candidate key by these
 tools is going to cause confusion, in subtle ways. By using the hash
 value, those tools are exactly as well off as pg_stat_statements is,
 which is to say, as well off as possible.
 
It depends on how you will use this information. If it is for a rapid
analysis, it is fine to use a hash because the object internals won't change
(I hope not). However, if you want to analyze query statistics for a long
period of time (say a few months) or your environment is distributed, you
can't use the hash. There isn't a perfect solution but I see both cases being
useful for analysis tools.


-- 
   Euler Taveira de Oliveira - Timbira   http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash id in pg_stat_statements

2012-10-02 Thread Magnus Hagander
P
On Oct 2, 2012 5:04 PM, Euler Taveira eu...@timbira.com wrote:

 On 02-10-2012 10:15, Peter Geoghegan wrote:
  There are other, similar tools that exist in proprietary databases.
  They expose a hash value, which is subject to exactly the same caveats
  as our own. They explicitly encourage the type of aggregation by
  third-party tools that I anticipate will happen with
  pg_stat_statements. I want to facilitate this, but I'm confident that
  the use of (dbid, userid, querytext) as a candidate key by these
  tools is going to cause confusion, in subtle ways. By using the hash
  value, those tools are exactly as well off as pg_stat_statements is,
  which is to say, as well off as possible.
 
 It depends on how you will use this information. If it is for a rapid
 analysis, it is fine to use a hash because the object internals won't
change
 (I hope not). However, if you want to analyze query statistics for a long
 period of time (say a few months) or your environment is distributed, you
 can't use the hash. There isn't a perfect solution but I see both cases
being
 useful for analysis tools.

Having the hash available in no way prevents you from using the query
string ad your key. Not having the hash certainly prevents you from using
it.

/Magnus


Re: [HACKERS] Hash id in pg_stat_statements

2012-10-02 Thread Stephen Frost
* Peter Geoghegan (pe...@2ndquadrant.com) wrote:
 On 1 October 2012 18:05, Stephen Frost sfr...@snowman.net wrote:
  You're going to have to help me here, 'cause I don't see how there can
  be duplicates if we include the PGSS_FILE_HEADER as part of the hash,
  unless we're planning to keep PGSS_FILE_HEADER constant while we change
  what the hash value is for a given query, yet that goes against the
  assumptions that were laid out, aiui.
 
 Well, they wouldn't be duplicates if you think that the fact that one
 query was executed before some point release and another after ought
 to differentiate queries. I do not.

This would only be if we happened to change what hash was generated for
a given query during such a point release, where I share your feeling
that it aught to be quite rare.  I'm not suggestion we do this for every
point release...

 By invalidate, I mean that when we go to open the saved file, if the
 header doesn't match, the file is considered corrupt, and we simply
 log that the file could not be read, before unlinking it. This would
 be necessary in the unlikely event of there being some substantive
 change in the representation of query trees in a point release. I am
 not aware of any precedent for this, though Tom said that there was
 one.

Right, and that's all I'm trying to address here- how do we provide a
value for a given query which can be relied upon by outside sources,
even in the face of a point release which changes what our internal hash
value for a given query is.

 I don't want to get too hung up on what we'd do if this problem
 actually occurred, because that isn't what this thread is about.

[...]

 I simply do not understand objections to the proposal. Have I missed 
 something?

It was my impression that the concern is the stability of the hash value
and ensuring that tools which operate on it don't mistakenly lump two
different queries into one because they had the same hash value (caused
by a change in our hashing algorithm or input into it over time, eg a
point release).  I was hoping to address that to allow this proposal to
move forward..

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Hash id in pg_stat_statements

2012-10-02 Thread Tom Lane
Stephen Frost sfr...@snowman.net writes:
 * Peter Geoghegan (pe...@2ndquadrant.com) wrote:
 I simply do not understand objections to the proposal. Have I missed 
 something?

 It was my impression that the concern is the stability of the hash value
 and ensuring that tools which operate on it don't mistakenly lump two
 different queries into one because they had the same hash value (caused
 by a change in our hashing algorithm or input into it over time, eg a
 point release).  I was hoping to address that to allow this proposal to
 move forward..

I think there are at least two questions that ought to be answered:

1. Why isn't something like md5() on the reported query text an equally
good solution for users who want a query hash?

2. If people are going to accumulate stats on queries over a long period
of time, is a 32-bit hash really good enough for the purpose?  If I'm
doing the math right, the chance of collision is already greater than 1%
at 1 queries, and rises to about 70% for 10 queries; see
http://en.wikipedia.org/wiki/Birthday_paradox
We discussed this issue and decided it was okay for pg_stat_statements's
internal hash table, but it's not at all clear to me that it's sensible
to use 32-bit hashes for external accumulation of query stats.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash id in pg_stat_statements

2012-10-02 Thread Peter Geoghegan
On 2 October 2012 18:16, Tom Lane t...@sss.pgh.pa.us wrote
 1. Why isn't something like md5() on the reported query text an equally
 good solution for users who want a query hash?

Because that does not uniquely identify the entry. The very first
thing that the docs say on search_path is Qualified names are tedious
to write, and it's often best not to wire a particular schema name
into applications anyway. Presumably, the reason it's best not to
wire schema names into apps is because it might be useful to modify
search_path in a way that dynamically made the same queries in some
application reference what are technically distinct relations. If
anyone does this, and it seems likely that many do for various
reasons, they will be out of luck when using some kind of
pg_stat_statements aggregation.

This was the behaviour that I intended for pg_stat_statements all
along, and I think it's better than a solution that matches query
strings.

 2. If people are going to accumulate stats on queries over a long period
 of time, is a 32-bit hash really good enough for the purpose?  If I'm
 doing the math right, the chance of collision is already greater than 1%
 at 1 queries, and rises to about 70% for 10 queries; see
 http://en.wikipedia.org/wiki/Birthday_paradox
 We discussed this issue and decided it was okay for pg_stat_statements's
 internal hash table, but it's not at all clear to me that it's sensible
 to use 32-bit hashes for external accumulation of query stats.

Well, forgive me for pointing this out, but I did propose that the
hash be a 64-bit value (which would have necessitated adopting
hash_any() to produce 64-bit values), but you rejected the proposal. I
arrived at the same probability for a collision as you did and posted
in to the list, in discussion shortly after the normalisation stuff
was committed.

A more sensible way of assessing the risk of a collision would be to
try and come up with the probability of a collision that someone
actually ends up caring about, which is considerably less than the 1%
for 10,000 entries. I'm not being glib - people are very used to the
idea that aggregating information on query costs is a lossy process.
Prior to 9.2, the only way execution costs could reasonably be
measured on at the query granularity on a busy system was to set
log_min_duration_statement to something like 1 second.

I am also unconvinced by the idea that aggregating historical data
(with the same hash value) in a separate application is likely to make
the collision situation appreciably worse. People are going to be
using something like an RRD circular buffer to aggregate the
information, and I can't see anyone caring about detailed information
that is more than a couple of weeks in the past. The point of
aggregation isn't to store more queries, it's to construct time-series
data from snapshots. Besides, do most applications really even have
more than 10,000 distinct queries?

-- 
Peter Geoghegan   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash id in pg_stat_statements

2012-10-02 Thread Peter Geoghegan
On 2 October 2012 17:58, Stephen Frost sfr...@snowman.net wrote:
 Right, and that's all I'm trying to address here- how do we provide a
 value for a given query which can be relied upon by outside sources,
 even in the face of a point release which changes what our internal hash
 value for a given query is.

I don't know of a way. Presumably, we'd hope to avoid this, and would
look for alternatives to anything that would necessitate bumping
PGSS_FILE_HEADER, while not going so far as to let pg_stat_statements
contort things in the core system. If I was aware of a case where this
would have come up had pg_stat_statements fingerprinting been around
at the time, perhaps I could give a better answer than that.

-- 
Peter Geoghegan   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash id in pg_stat_statements

2012-10-02 Thread Martijn van Oosterhout
On Tue, Oct 02, 2012 at 12:58:15PM -0400, Stephen Frost wrote:
  I simply do not understand objections to the proposal. Have I missed 
  something?
 
 It was my impression that the concern is the stability of the hash value
 and ensuring that tools which operate on it don't mistakenly lump two
 different queries into one because they had the same hash value (caused
 by a change in our hashing algorithm or input into it over time, eg a
 point release).  I was hoping to address that to allow this proposal to
 move forward..

That makes no sense though. The moment you talk about hash you
consider the possibility of lumping together things that aren't the
same.  Any tools using these hashes must have realised this.

Fortunatly, the statistics are better than the birthday paradox. The
chances that the two most important queries in your system end up
having the same hash is miniscule.

Like mentioned elsewhere, a system with more than 10,000 different
queries sounds rare to me (once you exclude query parameters ofcourse).

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] Hash id in pg_stat_statements

2012-10-02 Thread Daniel Farina
On Mon, Oct 1, 2012 at 12:57 AM, Magnus Hagander mag...@hagander.net wrote:
 Can we please expose the internal hash id of the statements in
 pg_stat_statements?

 I know there was discussions about it earlier, and it wasn't done with
 an argument of it not being stable between releases (IIRC). I think we
 can live with that drawback, assuming of course that we document this
 properly.

 I've now run into multiple customer installations where it would be
 very useful to have. The usecase is mainly storing snapshots of the
 pg_stat_statements output over time and analyzing those. Weird things
 happen for example when the query text is the same, but the hash is
 different (which can happen for example when a table is dropped and
 recreated). And even without that, in order to do anything useful with
 it, you end up hashing the query text anyway - so using the already
 existing hash would be easier and more useful.

I have a similar problem, however, I am not sure if the hash generated
is ideal.  Putting aside the number of mechanical, versioning,
shut-down/stats files issues, etc reasons given in the main branch of
the thread, I also have this feeling that it is not what I want.
Consider the following case:

SELECT * FROM users WHERE id = ?

this query isn't seen for a while

SELECT * FROM users WHERE id = ?

In the intervening time, an equivalent hash could still be evicted and
reintroduced and the statistics silently reset, and that'll befuddle
principled tools.  This is worse than merely less-useful, because it
can lead to drastic underestimations that otherwise pass inspection.

Instead, I think it makes sense to assign a number -- arbitrarily, but
uniquely -- to the generation of a new row in pg_stat_statements, and,
on the flip side, whenever a row is retired its number should be
eliminated, practically, for-ever.  This way re-introductions between
two samplings of pg_stat_statements cannot be confused for a
contiguously maintained statistic on a query.

-- 
fdr


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Hash id in pg_stat_statements

2012-10-01 Thread Magnus Hagander
Can we please expose the internal hash id of the statements in
pg_stat_statements?

I know there was discussions about it earlier, and it wasn't done with
an argument of it not being stable between releases (IIRC). I think we
can live with that drawback, assuming of course that we document this
properly.

I've now run into multiple customer installations where it would be
very useful to have. The usecase is mainly storing snapshots of the
pg_stat_statements output over time and analyzing those. Weird things
happen for example when the query text is the same, but the hash is
different (which can happen for example when a table is dropped and
recreated). And even without that, in order to do anything useful with
it, you end up hashing the query text anyway - so using the already
existing hash would be easier and more useful.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash id in pg_stat_statements

2012-10-01 Thread Peter Geoghegan
On 1 October 2012 08:57, Magnus Hagander mag...@hagander.net wrote:
 I know there was discussions about it earlier, and it wasn't done with
 an argument of it not being stable between releases (IIRC). I think we
 can live with that drawback, assuming of course that we document this
 properly.

Well, I'll point out once again that the argument about its stability
is invalid, because we serialise the entries to disk. If a point
release changes the representation of the query tree such that the
hash values won't match, then we have no recourse but to bump
pg_stat_statements version number, and invalidate all existing
entries. I credit our users with the intelligence to not jump to any
rash conclusions about the hash if directly exposed, such as assuming
that it has any particular degree of stability with respect to the
queries that are fingerprinted, or any degree greater than the
self-evident bare minimum.

I'm pretty sure that the stability among point releases in the face
of potential minor changes to query tree representation thing was
something that I imagined as a reason for the proposal being rejected,
when I tried to read between the lines of a flat rejection. Perhaps I
should have asked for clarification on that point. Now that I think
about it, I'm pretty sure that the need to bump catversion, as a
result of any change in the way dumping the query tree struct into
stored rules needs to happen, will preclude that problem in practice.

 I've now run into multiple customer installations where it would be
 very useful to have. The usecase is mainly storing snapshots of the
 pg_stat_statements output over time and analyzing those. Weird things
 happen for example when the query text is the same, but the hash is
 different (which can happen for example when a table is dropped and
 recreated). And even without that, in order to do anything useful with
 it, you end up hashing the query text anyway - so using the already
 existing hash would be easier and more useful.

Yes, these are all arguments that I'm familiar with :-)  . I've always
thought of pg_stat_statements as a low-level statistical view that
people would naturally want to store snapshots of for analysis, in
much the same way as many do now with things like pg_stat_bgwriter
using tools like Munin. Who wouldn't want to know what queries were
running a half an hour ago when the database server seemed slower than
usual, for example? Such tools should naturally have access to the
same candidate key for entries, rather than adding a subtle
impedance mismatch by using the string. That reminds me - when are you
writing the pg_stat_statements Postgres plugin for Munin?

I was disappointed that my proposal was shot down, despite the fact
that I independently raised it on list at least twice, and pushed as
hard as I felt that I could at the time.

--
Peter Geoghegan   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash id in pg_stat_statements

2012-10-01 Thread Tom Lane
Magnus Hagander mag...@hagander.net writes:
 Can we please expose the internal hash id of the statements in
 pg_stat_statements?

 I know there was discussions about it earlier, and it wasn't done with
 an argument of it not being stable between releases (IIRC).

Worse than that: it could change across a minor version update.  These
are internal data structures we're hashing, and we've been known to
have to change them for bug-fix purposes.

 I've now run into multiple customer installations where it would be
 very useful to have. The usecase is mainly storing snapshots of the
 pg_stat_statements output over time and analyzing those.

Doesn't that immediately refute your claim that unstable hash values
would be okay?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash id in pg_stat_statements

2012-10-01 Thread Magnus Hagander
On Mon, Oct 1, 2012 at 4:10 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Magnus Hagander mag...@hagander.net writes:
 Can we please expose the internal hash id of the statements in
 pg_stat_statements?

 I know there was discussions about it earlier, and it wasn't done with
 an argument of it not being stable between releases (IIRC).

 Worse than that: it could change across a minor version update.  These
 are internal data structures we're hashing, and we've been known to
 have to change them for bug-fix purposes.

As Peter pointed out, when we do that we have to change the file
format anyway, and wipe the statistics. So chaning that is something
we'd have to *note*.


 I've now run into multiple customer installations where it would be
 very useful to have. The usecase is mainly storing snapshots of the
 pg_stat_statements output over time and analyzing those.

 Doesn't that immediately refute your claim that unstable hash values
 would be okay?

No. It means they need to know when it changes, if it changes in a
minor release. Which would obviously have to go in the release notes,
since it would also invalidate any stored statistics in the *general*
view.

As long as we *tell* them under what conditions it might change, I
think it's perfectly fine. Particularly those who are likely to use
this functionality should certainly be capable of understanding that.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash id in pg_stat_statements

2012-10-01 Thread Euler Taveira
On 01-10-2012 11:22, Magnus Hagander wrote:
 As long as we *tell* them under what conditions it might change, I
 think it's perfectly fine. Particularly those who are likely to use
 this functionality should certainly be capable of understanding that.
 
Even if we do that it is too much work for those statistics tools, isn't it?
Instead of relying on internal structures hash, why don't you expose the query
text hash to such tools? If you solve the space normalizations, it is an
almost perfect solution for your use case.


-- 
   Euler Taveira de Oliveira - Timbira   http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash id in pg_stat_statements

2012-10-01 Thread Peter Geoghegan
On 1 October 2012 15:22, Magnus Hagander mag...@hagander.net wrote:
 On Mon, Oct 1, 2012 at 4:10 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Worse than that: it could change across a minor version update.  These
 are internal data structures we're hashing, and we've been known to
 have to change them for bug-fix purposes.

 As Peter pointed out, when we do that we have to change the file
 format anyway, and wipe the statistics. So chaning that is something
 we'd have to *note*.

I think the need to not break stored rules pins down the actual
representation of the Query struct in release branches. I'm not
prepared to say that it does so absolutely, since of course certain
logic could be altered in a way that resulted in different values
within the struct or its children, but I'm curious as to when this
actually occurred in the past. Discussion about exposing this value
aside, it'd obviously be a bit disappointing if we had to invalidate
existing statistics in a point release. Still, that situation isn't
made any worse by exposing the value, and in fact doing so could aid
in helping users to understand the issues involved.

-- 
Peter Geoghegan   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash id in pg_stat_statements

2012-10-01 Thread Stephen Frost
Peter, all,

* Peter Geoghegan (pe...@2ndquadrant.com) wrote:
 Well, I'll point out once again that the argument about its stability
 is invalid, because we serialise the entries to disk. If a point
 release changes the representation of the query tree such that the
 hash values won't match, then we have no recourse but to bump
 pg_stat_statements version number, and invalidate all existing
 entries.

What if we simply included the pg_stat_statements version number in
what's shown to the user as the 'hash'?  ver#.hash ?

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Hash id in pg_stat_statements

2012-10-01 Thread Peter Geoghegan
On 1 October 2012 17:12, Stephen Frost sfr...@snowman.net wrote:
 Peter, all,

 * Peter Geoghegan (pe...@2ndquadrant.com) wrote:
 Well, I'll point out once again that the argument about its stability
 is invalid, because we serialise the entries to disk. If a point
 release changes the representation of the query tree such that the
 hash values won't match, then we have no recourse but to bump
 pg_stat_statements version number, and invalidate all existing
 entries.

 What if we simply included the pg_stat_statements version number in
 what's shown to the user as the 'hash'?  ver#.hash ?

That won't really help matters. There'd still be duplicate entries,
from before and after the change, even if we make it immediately
obvious which is which. The only reasonable solution in that scenario
is to bump PGSS_FILE_HEADER, which will cause all existing entries to
be invalidated.

This is a hypothetical scenario, and concerns about it are totally
orthogonal to the actual question of whether or not the queryid should
be exposed...

-- 
Peter Geoghegan   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash id in pg_stat_statements

2012-10-01 Thread Stephen Frost
Peter,

* Peter Geoghegan (pe...@2ndquadrant.com) wrote:
 That won't really help matters. There'd still be duplicate entries,
 from before and after the change, even if we make it immediately
 obvious which is which. The only reasonable solution in that scenario
 is to bump PGSS_FILE_HEADER, which will cause all existing entries to
 be invalidated.

You're going to have to help me here, 'cause I don't see how there can
be duplicates if we include the PGSS_FILE_HEADER as part of the hash,
unless we're planning to keep PGSS_FILE_HEADER constant while we change
what the hash value is for a given query, yet that goes against the
assumptions that were laid out, aiui.

If there's a change that results in a given query X no longer hashing to
a value A, we need to change PGSS_FILE_HEADER to invalidate statistics
which were collected for value A (or else we risk an independent query Y
hashing to value A and ending up with completely invalid stats..).
Provided we apply that consistently and don't reuse PGSS_FILE_HEADER
values along the way, a combination of PGSS_FILE_HEADER and the hash
value for a given query should be unique over time.

We do need to document that the hash value for a given query may
change..

Thanks,

Stephen


signature.asc
Description: Digital signature