Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
On Wed, 21 Jan 2015, Andrew Dunstan wrote: On 01/21/2015 09:27 AM, Arne Scheffer wrote: Sorry, corrected second try because of copypaste mistakes: VlG-Arne Comments appreciated. Definition var_samp = Sum of squared differences /n-1 Definition stddev_samp = sqrt(var_samp) Example N=4 1.) Sum of squared differences 1_4Sum(Xi-XM4)² = 2.) adding nothing 1_4Sum(Xi-XM4)² +0 +0 +0 = 3.) nothing changed 1_4Sum(Xi-XM4)² +(-1_3Sum(Xi-XM3)²+1_3Sum(Xi-XM3)²) +(-1_2Sum(Xi-XM2)²+1_2Sum(Xi-XM2)²) +(-1_1Sum(Xi-XM1)²+1_1Sum(Xi-XM1)²) = 4.) parts reordered (1_4Sum(Xi-XM4)²-1_3Sum(Xi-XM3)²) +(1_3Sum(Xi-XM3)²-1_2Sum(Xi-XM2)²) +(1_2Sum(Xi-XM2)²-1_1Sum(Xi-XM1)²) +1_1Sum(X1-XM1)² = 5.) (X4-XM4)(X4-XM3) + (X3-XM3)(X3-XM2) + (X2-XM2)(X2-XM1) + (X1-XM1)² = 6.) XM1=X1 = There it is - The iteration part of Welfords Algorithm (in reverse order) (X4-XM4)(X4-XM3) + (X3-XM3)(X3-XM2) + (X2-XM2)(X2-X1) + 0 The missing piece is 4.) to 5.) it's algebra, look at e.g.: http://jonisalonen.com/2013/deriving-welfords-method-for-computing-variance/ I have no idea what you are saying here. I'm sorry for that statistics stuff, my attempt was only to visualize in detail the mathematical reason for the iterating part of Welfords algorithm being computing the current sum of squared differences in every step - therefore it's in my opinion better to call the variable sum_of_squared_diffs (every statistician will be confused bei sum_of_variances, because: sample variance = sum_of_squared_diffs / n-1, have a look at Mr. Cooks explanation) - therefore deviding by n-1 is the unbiased estimator by definition. (have a look at Mr. Cooks explanation) - therefore I suggested (as a minor nomenclature issue) to call the column/description stdev_samp (PostgreSQL-nomenclature) / sample_ to indicate that information. (have a look at the PostgreSQL aggregate functions, it's doing that the same way) Here are comments in email to me from the author of http://www.johndcook.com/blog/standard_deviation regarding the divisor used: My code is using the unbiased form of the sample variance, dividing by n-1. I am relieved, now we are at least two persons saying that. :-) Insert into the commonly known definition Definition stddev_samp = sqrt(var_samp) from above, and it's exactly my point. Maybe I should add that in the code comments. Otherwise, I don't think we need a change. Huh? Why is it a bad thing to call the column stddev_samp analog to the aggregate function or make a note in the documentation, that the sample stddev is used to compute the solution? I really think it not a good strategy having the user to make a test or dive into the source code to determine the divisor used. E.g. David expected stdev_pop, so there is a need for documentation for cases with a small sample. VlG-Arne -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
Andrew Dunstan schrieb am 2015-01-21: On 01/21/2015 11:21 AM, Arne Scheffer wrote: Why is it a bad thing to call the column stddev_samp analog to the aggregate function or make a note in the documentation, that the sample stddev is used to compute the solution? I think you are making a mountain out of a molehill, frankly. These stats are not intended as anything other than a pretty indication of the shape, to see if they are significantly influenced by outliers. For any significantly large sample size the difference will be negligible. You're right, I maybe exaggerated the statistics part a bit. I wanted to help, because the patch is of interest for us. I will try to keep focus in the future. But I will add a note to the documentation, that seems reasonable. *happy* Thx Arne -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
David G Johnston schrieb am 2015-01-21: Andrew Dunstan wrote On 01/20/2015 01:26 PM, Arne Scheffer wrote: And a very minor aspect: The term standard deviation in your code stands for (corrected) sample standard deviation, I think, because you devide by n-1 instead of n to keep the estimator unbiased. How about mentioning the prefix sample to indicate this beiing the estimator? I don't understand. I'm following pretty exactly the calculations stated at lt;http://www.johndcook.com/blog/standard_deviation/gt; I'm not a statistician. Perhaps others who are more literate in statistics can comment on this paragraph. I'm largely in the same boat as Andrew but... I take it that Arne is referring to: http://en.wikipedia.org/wiki/Bessel's_correction Yes, it is. but the mere presence of an (n-1) divisor does not mean that is what is happening. In this particular situation I believe the (n-1) simply is a necessary part of the recurrence formula and not any attempt to correct for sampling bias when estimating a population's variance. That's wrong, it's applied in the end to the sum of squared differences and therefore per definition the corrected sample standard deviation estimator. In fact, as far as the database knows, the values provided to this function do represent an entire population and such a correction would be unnecessary. I That would probably be an exotic assumption in a working database and it is not, what is computed here! guess it boils down to whether future queries are considered part of the population or whether the population changes upon each query being run and thus we are calculating the ever-changing population variance. Yes, indeed correct. And exactly to avoid that misunderstanding, I suggested to use the sample term. To speak in Postgresql terms; applied in Andrews/Welfords algorithm is stddev_samp(le), not stddev_pop(ulation). Therefore stddev in Postgres is only kept for historical reasons, look at http://www.postgresql.org/docs/9.4/static/functions-aggregate.html Table 9-43. VlG-Arne Note point 3 in the linked Wikipedia article. David J. -- View this message in context: http://postgresql.nabble.com/Add-min-and-max-execute-statement-time-in-pg-stat-statement-tp5774989p5834805.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
I don't understand. I'm following pretty exactly the calculations stated at lt;http://www.johndcook.com/blog/standard_deviation/gt; I'm not a statistician. Perhaps others who are more literate in Maybe I'm mistaken here, but I think, the algorithm is not that complicated. I try to explain it further: Comments appreciated. Definition var_samp = Sum of squared differences /n-1 Definition stddev_samp = sqrt(var_samp) Example N=4 1.) Sum of squared differences 1_4Sum(Xi-XM4)² = 2.) adding nothing 1_4Sum(Xi-XM4)² +0 +0 +0 = 3.) nothing changed 1_4Sum(Xi-XM4)² +(-1_3Sum(Xi-XM3)²+1_3Sum(Xi-XM3)²) +(-1_2Sum(Xi-XM2)²+1_2Sum(Xi-XM3)²) +(-1_1Sum(Xi-XM2)²+1_1Sum(Xi-XM3)²) = 4.) parts reordered (1_4Sum(Xi-XM4)²-1_3Sum(Xi-XM3)²) +(1_3Sum(Xi-XM3)²-1_2Sum(Xi-XM2)²) +(1_2Sum(Xi-XM2)²-1_1Sum(Xi-XM2)²) +1_1Sum(X1-XM1)² = 5.) (X4-XM4)(X4-XM3) + (X3-XM3)(X3-XM2) + (X2-XM2)(X2-XM1) + (X1-XM1)² = 6.) XM1=X1 = There it is - The iteration part of Welfords Algorithm (in reverse order) (X4-XM4)(X4-XM3) + (X3-XM3)(X3-XM2) + (X2-XM2)(X2-X1) + 0 The missing piece is 4.) to 5.) it's algebra, look at e.g.: http://jonisalonen.com/2013/deriving-welfords-method-for-computing-variance/ Thanks. Still not quite sure what to do, though :-) I guess in the end we want the answer to come up with similar results to the builtin stddev SQL function. I'll try to set up a test program, to see if we do. If you want to go this way: Maybe this is one of the very few times, you have to use a small sample ;-) VlG-Arne cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
Sorry, corrected second try because of copypaste mistakes: VlG-Arne Comments appreciated. Definition var_samp = Sum of squared differences /n-1 Definition stddev_samp = sqrt(var_samp) Example N=4 1.) Sum of squared differences 1_4Sum(Xi-XM4)² = 2.) adding nothing 1_4Sum(Xi-XM4)² +0 +0 +0 = 3.) nothing changed 1_4Sum(Xi-XM4)² +(-1_3Sum(Xi-XM3)²+1_3Sum(Xi-XM3)²) +(-1_2Sum(Xi-XM2)²+1_2Sum(Xi-XM2)²) +(-1_1Sum(Xi-XM1)²+1_1Sum(Xi-XM1)²) = 4.) parts reordered (1_4Sum(Xi-XM4)²-1_3Sum(Xi-XM3)²) +(1_3Sum(Xi-XM3)²-1_2Sum(Xi-XM2)²) +(1_2Sum(Xi-XM2)²-1_1Sum(Xi-XM1)²) +1_1Sum(X1-XM1)² = 5.) (X4-XM4)(X4-XM3) + (X3-XM3)(X3-XM2) + (X2-XM2)(X2-XM1) + (X1-XM1)² = 6.) XM1=X1 = There it is - The iteration part of Welfords Algorithm (in reverse order) (X4-XM4)(X4-XM3) + (X3-XM3)(X3-XM2) + (X2-XM2)(X2-X1) + 0 The missing piece is 4.) to 5.) it's algebra, look at e.g.: http://jonisalonen.com/2013/deriving-welfords-method-for-computing-variance/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
Andrew Dunstan schrieb am 2015-01-20: On 01/20/2015 01:26 PM, Arne Scheffer wrote: Interesting patch. I did a quick review looking only into the patch file. The sum of variances variable contains the sum of squared differences instead, I think. Umm, no. It's not. Umm, yes, i think, it is ;-) e-counters.sum_var_time += (total_time - old_mean) * (total_time - e-counters.mean_time); This is not a square that's being added. That's correct. Nevertheless it's the difference between the computed sum of squared differences and the preceeding one, added in every step. old_mean is not the same as e-counters.mean_time. Since the variance is this value divided by (n - 1), AIUI, I think sum of variances isn't a bad description. I'm open to alternative suggestions. And a very minor aspect: The term standard deviation in your code stands for (corrected) sample standard deviation, I think, because you devide by n-1 instead of n to keep the estimator unbiased. How about mentioning the prefix sample to indicate this beiing the estimator? I don't understand. I'm following pretty exactly the calculations stated at http://www.johndcook.com/blog/standard_deviation/ (There is nothing bad about that calculations, Welford's algorithm is simply sequently adding the differences mentioned above.) VlG-Arne I'm not a statistician. Perhaps others who are more literate in statistics can comment on this paragraph. And I'm sure I'm missing C specifics (again) (or it's the reduced patch file scope), but you introduce sqrtd, but sqrt is called? Good catch. Will fix. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
Interesting patch. I did a quick review looking only into the patch file. The sum of variances variable contains the sum of squared differences instead, I think. And a very minor aspect: The term standard deviation in your code stands for (corrected) sample standard deviation, I think, because you devide by n-1 instead of n to keep the estimator unbiased. How about mentioning the prefix sample to indicate this beiing the estimator? And I'm sure I'm missing C specifics (again) (or it's the reduced patch file scope), but you introduce sqrtd, but sqrt is called? VlG Arne -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] explain sortorder
Hi, we will also remove the following is lc_collate hint in the next version, showing only mandatory info as suggested. /* for those who use COLLATE although their default is already the wanted */ if (strcmp(collname, localeptr) == 0) { appendStringInfo(sortorderInformation, (%s is LC_COLLATE), collname); } Anybody insisting on that? Arne Note: I see, at the moment we use the wrong default for DESC. We'll fix that. On Wed, 14 Jan 2015, Heikki Linnakangas wrote: On 01/14/2015 05:26 PM, Timmer, Marius wrote: Hello Heikki, abbreviated version: Sorry, the problem is only the unhandy patch text format, not different opinions how to proceed. Long version: The v7 patch file already addressed your suggestions, but the file contained serveral (old) local commits, the new ones at the end of the patch text/file. Ah, missed that. I stopped reading when I saw the old stuff there :-). v7.1 is attached and addresses this issue providing a clean patch file. Ok, thanks, will take a look. V8 will - as mentioned - add missing docs and regression tests, Great! - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] explain sortorder
Heikki Linnakangas hlinnakangas(at)vmware(dot)com writes: I would suggest just adding the information to the Sort Key line. As long as you don't print the modifiers when they are defaults (ASC and NULLS LAST), we could print the information even in non-VERBOSE mode. +1. I had assumed without looking that that was what it did already, else I'd have complained too. regards, tom lane We will change the patch according to Heikkis suggestions. A nice Christmas all the best in the New Year Arne Scheffer http://www.uni-muenster.de/ZIV/Mitarbeiter/ArneScheffer.shtml -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers