Re: [HACKERS] Add min and max execute statement time in pg_stat_statement

2015-01-21 Thread Arne Scheffer



On Wed, 21 Jan 2015, Andrew Dunstan wrote:



On 01/21/2015 09:27 AM, Arne Scheffer wrote:

Sorry, corrected second try because of copypaste mistakes:
VlG-Arne


Comments appreciated.
Definition var_samp = Sum of squared differences /n-1
Definition stddev_samp = sqrt(var_samp)
Example N=4
1.) Sum of squared differences
   1_4Sum(Xi-XM4)²
=
2.) adding nothing
   1_4Sum(Xi-XM4)²
  +0
  +0
  +0
=
3.) nothing changed
  1_4Sum(Xi-XM4)²
  +(-1_3Sum(Xi-XM3)²+1_3Sum(Xi-XM3)²)
  +(-1_2Sum(Xi-XM2)²+1_2Sum(Xi-XM2)²)
  +(-1_1Sum(Xi-XM1)²+1_1Sum(Xi-XM1)²)
=
4.) parts reordered
   (1_4Sum(Xi-XM4)²-1_3Sum(Xi-XM3)²)
  +(1_3Sum(Xi-XM3)²-1_2Sum(Xi-XM2)²)
  +(1_2Sum(Xi-XM2)²-1_1Sum(Xi-XM1)²)
  +1_1Sum(X1-XM1)²
=
5.)
   (X4-XM4)(X4-XM3)
+ (X3-XM3)(X3-XM2)
+ (X2-XM2)(X2-XM1)
+ (X1-XM1)²
=
6.) XM1=X1 = There it is - The iteration part of Welfords Algorithm
(in
reverse order)
   (X4-XM4)(X4-XM3)
+ (X3-XM3)(X3-XM2)
+ (X2-XM2)(X2-X1)
+ 0
The missing piece is 4.) to 5.)
it's algebra, look at e.g.:
http://jonisalonen.com/2013/deriving-welfords-method-for-computing-variance/







I have no idea what you are saying here.


I'm sorry for that statistics stuff, 
my attempt was only to visualize in detail 
the mathematical reason for 
the iterating part of Welfords algorithm

being computing the current sum of squared differences in every step

- therefore it's in my opinion better to call the variable sum_of_squared_diffs
  (every statistician will be confused bei sum_of_variances,
   because:  sample variance = sum_of_squared_diffs / n-1,
   have a look at Mr. Cooks explanation)

- therefore deviding by n-1 is the unbiased estimator by definition.
  (have a look at Mr. Cooks explanation)

- therefore I suggested (as a minor nomenclature issue) to call the 
column/description
  stdev_samp (PostgreSQL-nomenclature) / sample_ to indicate that 
information.
  (have a look at the PostgreSQL aggregate functions, it's doing that the same 
way)



Here are comments in email to me from the author of 
http://www.johndcook.com/blog/standard_deviation regarding the divisor 
used:


  My code is using the unbiased form of the sample variance, dividing
  by n-1.



I am relieved, now we are at least two persons saying that. :-)
Insert into the commonly known definition


Definition stddev_samp = sqrt(var_samp)


from above, and it's exactly my point.

Maybe I should add that in the code comments. Otherwise, I don't think we 
need a change.


Huh?

Why is it a bad thing to call the column stddev_samp analog to the
aggregate function or make a note in the documentation, 
that the sample stddev is used to compute the solution?


I really think it not a good strategy having the user to make a test or dive
into the source code to determine the divisor used.

E.g. David expected stdev_pop, so there is a need for documentation for cases 
with a small sample.

VlG-Arne

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add min and max execute statement time in pg_stat_statement

2015-01-21 Thread Arne Scheffer


Andrew Dunstan schrieb am 2015-01-21:

 On 01/21/2015 11:21 AM, Arne Scheffer wrote:



 Why is it a bad thing to call the column stddev_samp analog to the
 aggregate function or make a note in the documentation, that the
 sample stddev is used to compute the solution?


 I think you are making a mountain out of a molehill, frankly. These
 stats are not intended as anything other than a pretty indication of
 the
 shape, to see if they are significantly influenced by outliers. For
 any
 significantly large sample size the difference will be negligible.

You're right, I maybe exaggerated the statistics part a bit.
I wanted to help, because the patch is of interest for us.
I will try to keep focus in the future.


 But I will add a note to the documentation, that seems reasonable.


*happy*

Thx

Arne



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add min and max execute statement time in pg_stat_statement

2015-01-21 Thread Arne Scheffer


David G Johnston schrieb am 2015-01-21:
 Andrew Dunstan wrote
  On 01/20/2015 01:26 PM, Arne Scheffer wrote:

  And a very minor aspect:
  The term standard deviation in your code stands for
  (corrected) sample standard deviation, I think,
  because you devide by n-1 instead of n to keep the
  estimator unbiased.
  How about mentioning the prefix sample
  to indicate this beiing the estimator?


  I don't understand. I'm following pretty exactly the calculations
  stated
  at lt;http://www.johndcook.com/blog/standard_deviation/gt;


  I'm not a statistician. Perhaps others who are more literate in
  statistics can comment on this paragraph.

 I'm largely in the same boat as Andrew but...

 I take it that Arne is referring to:

 http://en.wikipedia.org/wiki/Bessel's_correction

Yes, it is.

 but the mere presence of an (n-1) divisor does not mean that is what
 is
 happening.  In this particular situation I believe the (n-1) simply
 is a
 necessary part of the recurrence formula and not any attempt to
 correct for
 sampling bias when estimating a population's variance.

That's wrong, it's applied in the end to the sum of squared differences
and therefore per definition the corrected sample standard deviation
estimator.

 In fact, as
 far as
 the database knows, the values provided to this function do represent
 an
 entire population and such a correction would be unnecessary.  I

That would probably be an exotic assumption in a working database
and it is not, what is computed here!

 guess it
 boils down to whether future queries are considered part of the
 population
 or whether the population changes upon each query being run and thus
 we are
 calculating the ever-changing population variance.

Yes, indeed correct.
And exactly to avoid that misunderstanding, I suggested to
use the sample term.
To speak in Postgresql terms; applied in Andrews/Welfords algorithm
is stddev_samp(le), not stddev_pop(ulation).
Therefore stddev in Postgres is only kept for historical reasons, look at
http://www.postgresql.org/docs/9.4/static/functions-aggregate.html
Table 9-43.

VlG-Arne

 Note point 3 in
 the
 linked Wikipedia article.





 David J.



 --
 View this message in context:
 http://postgresql.nabble.com/Add-min-and-max-execute-statement-time-in-pg-stat-statement-tp5774989p5834805.html
 Sent from the PostgreSQL - hackers mailing list archive at
 Nabble.com.


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add min and max execute statement time in pg_stat_statement

2015-01-21 Thread Arne Scheffer
 I don't understand. I'm following pretty exactly the calculations
 stated
 at lt;http://www.johndcook.com/blog/standard_deviation/gt;

 I'm not a statistician. Perhaps others who are more literate in

Maybe I'm mistaken here,
but I think, the algorithm is not that complicated.
I try to explain it further:

Comments appreciated.

Definition var_samp = Sum of squared differences /n-1
Definition stddev_samp = sqrt(var_samp)

Example N=4

1.) Sum of squared differences
  1_4Sum(Xi-XM4)²
=
2.) adding nothing
  1_4Sum(Xi-XM4)²
 +0
 +0
 +0
=
3.) nothing changed
 1_4Sum(Xi-XM4)²
 +(-1_3Sum(Xi-XM3)²+1_3Sum(Xi-XM3)²)
 +(-1_2Sum(Xi-XM2)²+1_2Sum(Xi-XM3)²)
 +(-1_1Sum(Xi-XM2)²+1_1Sum(Xi-XM3)²)

=
4.) parts reordered
  (1_4Sum(Xi-XM4)²-1_3Sum(Xi-XM3)²)
 +(1_3Sum(Xi-XM3)²-1_2Sum(Xi-XM2)²)
 +(1_2Sum(Xi-XM2)²-1_1Sum(Xi-XM2)²)
 +1_1Sum(X1-XM1)²
=
5.)
  (X4-XM4)(X4-XM3)
+ (X3-XM3)(X3-XM2)
+ (X2-XM2)(X2-XM1)
+ (X1-XM1)²
=
6.) XM1=X1 = There it is - The iteration part of Welfords Algorithm (in
reverse order)
  (X4-XM4)(X4-XM3)
+ (X3-XM3)(X3-XM2)
+ (X2-XM2)(X2-X1)
+ 0

The missing piece is 4.) to 5.)
it's algebra, look at e.g.:
http://jonisalonen.com/2013/deriving-welfords-method-for-computing-variance/

 Thanks. Still not quite sure what to do, though :-) I guess in the
 end we want the answer to come up with similar results to the builtin
 stddev SQL function. I'll try to set up a test program, to see if we do.

If you want to go this way:
Maybe this is one of the very few times, you have to use a small sample
;-)

VlG-Arne


 cheers

 andrew


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add min and max execute statement time in pg_stat_statement

2015-01-21 Thread Arne Scheffer
Sorry, corrected second try because of copypaste mistakes:
VlG-Arne

 Comments appreciated.

 Definition var_samp = Sum of squared differences /n-1
 Definition stddev_samp = sqrt(var_samp)

 Example N=4

 1.) Sum of squared differences
   1_4Sum(Xi-XM4)²
 =
 2.) adding nothing
   1_4Sum(Xi-XM4)²
  +0
  +0
  +0
 =
 3.) nothing changed
  1_4Sum(Xi-XM4)²
  +(-1_3Sum(Xi-XM3)²+1_3Sum(Xi-XM3)²)
  +(-1_2Sum(Xi-XM2)²+1_2Sum(Xi-XM2)²)
  +(-1_1Sum(Xi-XM1)²+1_1Sum(Xi-XM1)²)

 =
 4.) parts reordered
   (1_4Sum(Xi-XM4)²-1_3Sum(Xi-XM3)²)
  +(1_3Sum(Xi-XM3)²-1_2Sum(Xi-XM2)²)
  +(1_2Sum(Xi-XM2)²-1_1Sum(Xi-XM1)²)
  +1_1Sum(X1-XM1)²
 =
 5.)
   (X4-XM4)(X4-XM3)
 + (X3-XM3)(X3-XM2)
 + (X2-XM2)(X2-XM1)
 + (X1-XM1)²
 =
 6.) XM1=X1 = There it is - The iteration part of Welfords Algorithm
 (in
 reverse order)
   (X4-XM4)(X4-XM3)
 + (X3-XM3)(X3-XM2)
 + (X2-XM2)(X2-X1)
 + 0

 The missing piece is 4.) to 5.)
 it's algebra, look at e.g.:
 http://jonisalonen.com/2013/deriving-welfords-method-for-computing-variance/




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add min and max execute statement time in pg_stat_statement

2015-01-20 Thread Arne Scheffer


Andrew Dunstan schrieb am 2015-01-20:

 On 01/20/2015 01:26 PM, Arne Scheffer wrote:
 Interesting patch.
 I did a quick review looking only into the patch file.

 The sum of variances variable contains
 the sum of squared differences instead, I think.

 Umm, no. It's not.

Umm, yes, i think, it is ;-)

   e-counters.sum_var_time +=
   (total_time - old_mean) * (total_time - e-counters.mean_time);
 This is not a square that's being added.

That's correct.
Nevertheless it's the difference between the computed sum of squared
differences
and the preceeding one, added in every step.

 old_mean is not the same as
 e-counters.mean_time.

 Since the variance is this value divided by (n - 1), AIUI, I think
 sum
 of variances isn't a bad description. I'm open to alternative
 suggestions.

 And a very minor aspect:
 The term standard deviation in your code stands for
 (corrected) sample standard deviation, I think,
 because you devide by n-1 instead of n to keep the
 estimator unbiased.
 How about mentioning the prefix sample
 to indicate this beiing the estimator?

 I don't understand. I'm following pretty exactly the calculations
 stated
 at http://www.johndcook.com/blog/standard_deviation/

(There is nothing bad about that calculations, Welford's algorithm
 is simply sequently adding the differences mentioned above.)

VlG-Arne


 I'm not a statistician. Perhaps others who are more literate in
 statistics can comment on this paragraph.

 And I'm sure I'm missing C specifics (again)
 (or it's the reduced patch file scope),
 but you introduce sqrtd, but sqrt is called?


 Good catch. Will fix.

 cheers

 andrew


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add min and max execute statement time in pg_stat_statement

2015-01-20 Thread Arne Scheffer
Interesting patch.
I did a quick review looking only into the patch file.

The sum of variances variable contains
the sum of squared differences instead, I think.

And a very minor aspect:
The term standard deviation in your code stands for
(corrected) sample standard deviation, I think,
because you devide by n-1 instead of n to keep the
estimator unbiased.
How about mentioning the prefix sample
to indicate this beiing the estimator?

And I'm sure I'm missing C specifics (again)
(or it's the reduced patch file scope),
but you introduce sqrtd, but sqrt is called?

VlG

Arne












-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] explain sortorder

2015-01-14 Thread Arne Scheffer

Hi,

we will also remove the following is lc_collate hint in the next version, 
showing only mandatory info as suggested.

/* for those who use COLLATE although their default is already 
the wanted */
if (strcmp(collname, localeptr) == 0)
{
appendStringInfo(sortorderInformation,  (%s is 
LC_COLLATE), collname);
}

Anybody insisting on that?

Arne

Note: I see, at the moment we use the wrong default for DESC. We'll fix that.

On Wed, 14 Jan 2015, Heikki Linnakangas wrote:


On 01/14/2015 05:26 PM, Timmer, Marius wrote:

Hello Heikki,

abbreviated version:
Sorry, the problem is only the unhandy patch text format, not different 
opinions how to proceed.


Long version:
The v7 patch file already addressed your suggestions,
but the file contained serveral (old) local commits,
the new ones at the end of the patch text/file.


Ah, missed that. I stopped reading when I saw the old stuff there :-).


v7.1 is attached and addresses this issue providing a clean patch file.


Ok, thanks, will take a look.


V8 will - as mentioned - add missing docs and regression tests,


Great!

- Heikki





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] explain sortorder

2014-12-26 Thread Arne Scheffer
Heikki Linnakangas hlinnakangas(at)vmware(dot)com writes:
 I would suggest just adding the information to the Sort Key line. As
 long as you don't print the modifiers when they are defaults (ASC and
 NULLS LAST), we could print the information even in non-VERBOSE mode.

+1.  I had assumed without looking that that was what it did already,
else I'd have complained too.

   regards, tom lane

We will change the patch according to Heikkis suggestions.

A nice Christmas  all the best in the New Year

Arne Scheffer

http://www.uni-muenster.de/ZIV/Mitarbeiter/ArneScheffer.shtml


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers