Re: [HACKERS] Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

2015-12-30 Thread Bruce Momjian
On Mon, Nov 30, 2015 at 02:41:02PM +0100, Shulgin, Oleksandr wrote:
> On Wed, Nov 25, 2015 at 9:13 AM, Lukas Fittl  wrote:
> 
> On Mon, Nov 23, 2015 at 11:53 PM, Peter Geoghegan  wrote:
> 
> One specific justification he gave for not using pg_stat_statements
> was:
> 
> "Doesn’t merge bind vars in IN()" (See slide #11)
> 
> 
> I wonder:
> 
> * How do other people feel about this? Personally, I've seen enough
> problems of this kind in the field that "slippery slope" arguments
> against this don't seem very compelling.
> 
> 
> As someone who runs a little monitoring service thats solely based on
> pg_stat_statements data, ignoring IN list length would certainly be a good
> change.
> 
> We currently do this in post-processing, together with other data cleanup
> (e.g. ignoring the length of a VALUES list in an INSERT statement).
> 
> Given the fact that pgss data is normalized & you don't know which plan 
> was
> chosen, I don't think distinguishing based on the length of the list helps
> anyone really.
> 
> I see pg_stat_statements as a high-level overview of which queries have
> run, and which ones you might want to look into closer using e.g.
> auto_explain.
> 
> 
> I still have the plans to try to marry pg_stat_statements and auto_explain for
> the next iteration of "online query plans" extension I was proposing a few
> months ago, and the first thing I was going to look into is rectifying this
> problem with IN() jumbling.  So, have a +1 from me.

Is this a TODO?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

2015-12-30 Thread Peter Geoghegan
On Wed, Dec 30, 2015 at 5:20 PM, Bruce Momjian  wrote:
> Is this a TODO?

It's on my (very long) personal TODO list. It would be great if
someone else took it, though. So, yes.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

2015-12-30 Thread Bruce Momjian
On Wed, Dec 30, 2015 at 05:21:05PM -0800, Peter Geoghegan wrote:
> On Wed, Dec 30, 2015 at 5:20 PM, Bruce Momjian  wrote:
> > Is this a TODO?
> 
> It's on my (very long) personal TODO list. It would be great if
> someone else took it, though. So, yes.

TODO added.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

2015-11-30 Thread Shulgin, Oleksandr
On Wed, Nov 25, 2015 at 9:13 AM, Lukas Fittl  wrote:

> On Mon, Nov 23, 2015 at 11:53 PM, Peter Geoghegan  wrote:
>
>> One specific justification he gave for not using pg_stat_statements was:
>>
>> "Doesn’t merge bind vars in IN()" (See slide #11)
>>
>> I wonder:
>>
>> * How do other people feel about this? Personally, I've seen enough
>> problems of this kind in the field that "slippery slope" arguments
>> against this don't seem very compelling.
>>
>
> As someone who runs a little monitoring service thats solely based on
> pg_stat_statements data, ignoring IN list length would certainly be a good
> change.
>
> We currently do this in post-processing, together with other data cleanup
> (e.g. ignoring the length of a VALUES list in an INSERT statement).
>
> Given the fact that pgss data is normalized & you don't know which plan
> was chosen, I don't think distinguishing based on the length of the list
> helps anyone really.
>
> I see pg_stat_statements as a high-level overview of which queries have
> run, and which ones you might want to look into closer using e.g.
> auto_explain.
>

I still have the plans to try to marry pg_stat_statements and auto_explain
for the next iteration of "online query plans" extension I was proposing a
few months ago, and the first thing I was going to look into is rectifying
this problem with IN() jumbling.  So, have a +1 from me.

--
Alex


Re: [HACKERS] Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

2015-11-25 Thread Lukas Fittl
On Mon, Nov 23, 2015 at 11:53 PM, Peter Geoghegan  wrote:

> One specific justification he gave for not using pg_stat_statements was:
>
> "Doesn’t merge bind vars in IN()" (See slide #11)
>
> I wonder:
>
> * How do other people feel about this? Personally, I've seen enough
> problems of this kind in the field that "slippery slope" arguments
> against this don't seem very compelling.
>

As someone who runs a little monitoring service thats solely based on
pg_stat_statements data, ignoring IN list length would certainly be a good
change.

We currently do this in post-processing, together with other data cleanup
(e.g. ignoring the length of a VALUES list in an INSERT statement).

Given the fact that pgss data is normalized & you don't know which plan was
chosen, I don't think distinguishing based on the length of the list helps
anyone really.

I see pg_stat_statements as a high-level overview of which queries have
run, and which ones you might want to look into closer using e.g.
auto_explain.

Best,
Lukas

-- 
Lukas Fittl

Skype: lfittl
Phone: +1 415 321 0630


Re: [HACKERS] Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

2015-11-24 Thread Peter Geoghegan
On Tue, Nov 24, 2015 at 10:55 PM, Jim Nasby  wrote:
> I'm not following your point. Obviously you can't compare int to text that
> doesn't convert back to an int, but that's not what I was talking about.

I didn't see what else you could have meant. In any case, the type
text has no involvement in your example.

pg_stat_statements sees the following SQL queries as equivalent:

select integer '5';

select 6;

select 7::int4;

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

2015-11-24 Thread Tom Lane
Peter Geoghegan  writes:
> On Tue, Dec 10, 2013 at 1:30 AM, Peter Geoghegan  wrote:
>> pg_stat_statements' fingerprinting logic considers the following two
>> statements as distinct:
>> 
>> select 1 in (1, 2, 3);
>> select 1 in (1, 2, 3, 4);
>> 
>> This is because the ArrayExpr jumble case jumbles any ArrayExpr's list
>> of elements recursively. In this case it's a list of Const nodes, and
>> the fingerprinting logic jumbles those nodes indifferently.

I think this is a vastly oversimplified explanation of the problem.
In particular, because the planner will flatten an ArrayExpr containing
only Const nodes to an array constant (see eval_const_expressions),
I don't believe the case ever arises in exactly the form you posit here.

A portion of the problem is possibly due to the heuristics in
parse_expr.c's transformAExprIn():

 * We try to generate a ScalarArrayOpExpr from IN/NOT IN, but this is only
 * possible if there is a suitable array type available.  If not, we fall
 * back to a boolean condition tree with multiple copies of the lefthand
 * expression.  Also, any IN-list items that contain Vars are handled as
 * separate boolean conditions, because that gives the planner more scope
 * for optimization on such clauses.

If the original text actually involves a variable number of Vars, then you
will end up with a boolean expression with a varying number of OR arms,
even if the Vars later get flattened to constants.  However, it's not
clear to me that anyone would expect such cases to be treated as
identical.  Another possibility is a type clash, for example
"x IN (42, 44.1)" will end up as a boolean tree for lack of a common
type for the would-be array elements.  That case might possibly be an
issue in practice.

But what seems more likely to be annoying people is cases in which the
original text contains a varying number of Param markers.  Those might or
might not get folded to constants during planning depending on context,
so that they might or might not look different to pg_stat_statements.

So I suspect the real problem here is that we might want all of these
things to look identical to pg_stat_statements:

   ARRAY[$1, $2, 42]
   ARRAY[$1, $2, $3, 47]
   '{1,2,3,47}'::int[]

Don't see a very clean way to do that ...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

2015-11-24 Thread Jim Nasby

On 11/24/15 1:29 PM, Tom Lane wrote:

So I suspect the real problem here is that we might want all of these
things to look identical to pg_stat_statements:

ARRAY[$1, $2, 42]
ARRAY[$1, $2, $3, 47]
'{1,2,3,47}'::int[]

Don't see a very clean way to do that ...


Another not-uncommon case is IN ( '1', '2', ... , '2342' ); in other 
words, treating an integer as text. A lot of frameworks like to do that 
and just push the problem onto the database. I'm not sure what 
pg_stat_statements would ultimately see in that case..


Since there's a few different things people might want, maybe a good 
first step is to allow extending/changing the jumbling decision at the C 
level. That would make it easy for a knowledgeable enough person to come 
up with an alternative as a plugin that regular users could use.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

2015-11-24 Thread Peter Geoghegan
On Tue, Nov 24, 2015 at 5:39 PM, Jim Nasby  wrote:
> Another not-uncommon case is IN ( '1', '2', ... , '2342' ); in other words,
> treating an integer as text. A lot of frameworks like to do that and just
> push the problem onto the database. I'm not sure what pg_stat_statements
> would ultimately see in that case..

They do?

postgres=# select 5::int4 in ('5');
 ?column?
──
 t
(1 row)

postgres=# select 5::int4 in ('5a');
ERROR:  22P02: invalid input syntax for integer: "5a"
LINE 1: select 5::int4 in ('5a');
   ^
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

2015-11-24 Thread Tom Lane
I wrote:
> Peter Geoghegan  writes:
>> This is because the ArrayExpr jumble case jumbles any ArrayExpr's list
>> of elements recursively. In this case it's a list of Const nodes, and
>> the fingerprinting logic jumbles those nodes indifferently.

> I think this is a vastly oversimplified explanation of the problem.
> In particular, because the planner will flatten an ArrayExpr containing
> only Const nodes to an array constant (see eval_const_expressions),
> I don't believe the case ever arises in exactly the form you posit here.

Um ... disregard that.  For some reason I was thinking that
pg_stat_statements looks at plan trees, but of course what it looks at
is the query tree immediately post-parse-analysis.  So the behavior of
the const-folding pass is not relevant.

The heuristics in transformAExprIn() are still relevant, though, and
I suspect that the question of whether Params should be considered
the same as Consts is also highly relevant.

I wonder whether we could improve this by arranging things so that both
Consts and Params contribute zero to the jumble hash, and a list of these
things also contributes zero, regardless of the length of the list.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

2015-11-24 Thread Jim Nasby

On 11/24/15 7:46 PM, Peter Geoghegan wrote:

On Tue, Nov 24, 2015 at 5:39 PM, Jim Nasby  wrote:

Another not-uncommon case is IN ( '1', '2', ... , '2342' ); in other words,
treating an integer as text. A lot of frameworks like to do that and just
push the problem onto the database. I'm not sure what pg_stat_statements
would ultimately see in that case..


They do?

postgres=# select 5::int4 in ('5');
  ?column?
──
  t
(1 row)

postgres=# select 5::int4 in ('5a');
ERROR:  22P02: invalid input syntax for integer: "5a"
LINE 1: select 5::int4 in ('5a');
^


I'm not following your point. Obviously you can't compare int to text 
that doesn't convert back to an int, but that's not what I was talking 
about.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

2015-11-23 Thread Peter Geoghegan
On Tue, Dec 10, 2013 at 1:30 AM, Peter Geoghegan  wrote:
> pg_stat_statements' fingerprinting logic considers the following two
> statements as distinct:
>
> select 1 in (1, 2, 3);
> select 1 in (1, 2, 3, 4);
>
> This is because the ArrayExpr jumble case jumbles any ArrayExpr's list
> of elements recursively. In this case it's a list of Const nodes, and
> the fingerprinting logic jumbles those nodes indifferently.
>
> Somebody told me that they think that pg_stat_statements should not do
> that. This person felt that it would be preferable for such
> expressions to be normalized without regard to the number of distinct
> Const elements. I suppose that that would work by determing if the
> ArrayExpr elements list was a list of Const nodes and only const
> nodes. Iff that turned out to be the case, something else would be
> jumbled (something other than the list) that would essentially be a
> representation of "some list of zero or more (or maybe one or more)
> Const nodes with consttype of, in this example, 23". I think that this
> would make at least one person happy, because of course the two
> statements above would have their costs aggregated within a single
> pg_stat_statements entry.

Baron Schwartz recently gave a talk at PGConf Silicon Valley about the
proprietary query instrumentation tool, VividCortex. The slides are
available from:

http://info.citusdata.com/rs/235-CNE-301/images/Analyzing_PostgreSQL_Network_Traffic_with_vc-pgsql-sniffer_-_Baron_Schwartz.pdf

One specific justification he gave for not using pg_stat_statements was:

"Doesn’t merge bind vars in IN()" (See slide #11)

His theory is that you should allow a proprietary binary to run with
root permissions, a binary that sniffs the wire protocol, mostly
because pg_stat_statements has this limitation (all other
pg_stat_statements limitations listed are pretty unconvincing, IMV).
That doesn't seem like a great plan to me, but I think he has a point
about pg_stat_statements. It's about time that we fixed this -- it
isn't realistic to imagine that people are going to know to use an
array constant like "= any ('{1,2,3}')" -- even a major contributor to
Django that I talked to about this issue a couple of years ago didn't
know about that. It also isn't portable across database systems.

I wonder:

* How do other people feel about this? Personally, I've seen enough
problems of this kind in the field that "slippery slope" arguments
against this don't seem very compelling.

* How might altering the jumbling logic to make it recognize a
variable number of constants as equivalent work in practice? Perhaps
we should do something to flatten the representation based on which
powers of two the number of constants is between. There are still some
details to work out there, but that's my first idea. That seems like a
good compromise between the current behavior, and completely
disregarding the number of constants.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers