Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-10-01 Thread Daniel Gustafsson
> On 06 Sep 2017, at 08:42, Fabien COELHO  wrote:
> 
> Hello Alik,
> 
> Applies, compiles, works for me.
> 
> Some minor comments and suggestions.
> 
> Two typos:
> - "usinng" -> "using"
> - "a rejection method used" -> "a rejection method is used"
> 
> I'm not sure of "least_recently_used_i", this naming style is not used in 
> pgbench. "least_recently_used" would be ok.
> 
> "..nb_cells.. != ZIPF_CACHE_SIZE", ISTM that "<" is more logical,
> even if the result is the same?
> 
> I would put the parameter value check in getZipfianRand, so that if someone 
> reuses the function elsewhere the check is also performed. That would also 
> simplify a bit the already very large expression evaluation function.
> 
> When/if the pgbench tap test patch gets through, coverage tests should
> be added.
> 
> Maybe the cache overflow could be counted, to allow for a possible warning 
> message in the final report?

Since this patch has been Waiting for author and no update on this patch has
been posted during the commitfest, it is Returned with feedback.  When you have
a new version of the patch, please re-submit to a future commitfest.

cheers ./daniel



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-09-06 Thread Fabien COELHO


Hello Alik,

Applies, compiles, works for me.

Some minor comments and suggestions.

Two typos:
 - "usinng" -> "using"
 - "a rejection method used" -> "a rejection method is used"

I'm not sure of "least_recently_used_i", this naming style is not used in 
pgbench. "least_recently_used" would be ok.


"..nb_cells.. != ZIPF_CACHE_SIZE", ISTM that "<" is more logical,
even if the result is the same?

I would put the parameter value check in getZipfianRand, so that if 
someone reuses the function elsewhere the check is also performed. That 
would also simplify a bit the already very large expression evaluation 
function.


When/if the pgbench tap test patch gets through, coverage tests should
be added.

Maybe the cache overflow could be counted, to allow for a possible warning 
message in the final report?


--
Fabien.




Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-09-02 Thread Alik Khilazhev
Hello Fabien,

Thank you for the detailed review. I hope I have fixed all the issues you mentioned 
in your letter.



pgbench-zipf-08v.patch
Description: Binary data

—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-08-23 Thread Fabien COELHO


Hello Alik,


I am attaching patch v7.


Patch generates multiple warnings with "git apply", apparently because of 
end-of-line spaces, and fails:


  pgbench-zipf-07v.patch:52: trailing whitespace.
{
  pgbench-zipf-07v.patch:53: trailing whitespace.
"random_zipfian", 3, PGBENCH_RANDOM_ZIPFIAN
  pgbench-zipf-07v.patch:54: trailing whitespace.
},
  pgbench-zipf-07v.patch:66: trailing whitespace.
  #define ZIPF_CACHE_SIZE 15  /* cache cells number */
  pgbench-zipf-07v.patch:67: trailing whitespace.

  error: patch failed: doc/src/sgml/ref/pgbench.sgml:1013
  error: doc/src/sgml/ref/pgbench.sgml: patch does not apply
  error: patch failed: src/bin/pgbench/exprparse.y:191
  error: src/bin/pgbench/exprparse.y: patch does not apply
  error: patch failed: src/bin/pgbench/pgbench.c:93
  error: src/bin/pgbench/pgbench.c: patch does not apply
  error: patch failed: src/bin/pgbench/pgbench.h:75
  error: src/bin/pgbench/pgbench.h: patch does not apply

It also complains with "patch", although it seems to work:

 (Stripping trailing CRs from patch; use --binary to disable.)
 patching file doc/src/sgml/ref/pgbench.sgml
 (Stripping trailing CRs from patch; use --binary to disable.)
 patching file src/bin/pgbench/exprparse.y
 (Stripping trailing CRs from patch; use --binary to disable.)
 patching file src/bin/pgbench/pgbench.c
 (Stripping trailing CRs from patch; use --binary to disable.)
 patching file src/bin/pgbench/pgbench.h
 patch unexpectedly ends in middle of line
 patch unexpectedly ends in middle of line

Please do not put spaces at the end of lines, eg there is a lonely tab on 
the second line of "computeHarmonicZipfian".


It seems that "CR" characters are used:

  sh> sha1sum ~/pgbench-zipf-07v.patch
  439ad1de0a960b3b415ff6de9412b3cc4d6922e7

  sh> file ~/pgbench-zipf-07v.patch
  pgbench-zipf-07v.patch: unified diff output, ASCII text, with CRLF line 
terminators

Compilation issues one warning:

 pgbench.c: In function ‘computeHarmonicZipfian’:
 pgbench.c:894:2: warning: ISO C90 forbids mixed declarations and code 
[-Wdeclaration-after-statement]
double  uniform = pg_erand48(thread->random_state);

Please conform to pg standard for portability, even if it is a very old one:-)
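
That is, for the reported line, something along these lines (just an illustration
of the declaration-before-statement style, not a patch):

  double      uniform;

  uniform = pg_erand48(thread->random_state);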


About the documentation:

I would suggest to replace 0.5 in the first example by 1.5, as a zipfian 
distribution with a lower-than-one parameter is not that well defined.


I'm fine with using references to papers or books for the algorithm.

The documentation lines in the sgml file are too long. At least
limit text paragraphs to maybe 80 columns as done elsewhere.

typo: " Zipfian" (remove space)

typo: "used(described" (missing space)

typo: "numbers(algorithm" (again)

The sentence "It's recommended to pick parameter in range 
(0, 1) for better accuracy." does not make sense with the updated 
generation algorithm.


The text would benefit from a review by a native English speaker, who I am 
not. Anyway here is an attempt at improving/simplifying the text (I have 
removed sgml formatting). If some native English speaker could review it, 
that would be great.


"""
  "random_zipfian" generates an approximated bounded zipfian distribution.
  For parameter in (0,1), an approximated algorithm is taken from
  "Quickly Generating Billion-Record Synthetic Databases", Jim Gray et al, 
SIGMOD 1994.
  For parameter in (1, 1000), a rejection method is used, based on
  "Non-Uniform Random Variate Generation", Luc Devroye, p. 550-551, Springer 
1986.
  The distribution is not defined when the parameter's value is 1.0.
  The drawing performance is poor for parameter values close and above 1.0
  and on a small range.

  "parameter" defines how skewed the distribution is.
  The larger the "parameter", the more frequently values close to the beginning
  of the interval are drawn.
  The closer to 0 "parameter" is, the flatter (more uniform) the access 
distribution.
"""


English in comments and code:

"inited" is not English, I think you want "initialized". Maybe "nb_cells" 
would do.


"it is not exist" (multiple instances) -> "it does not exist".

I think that the references to the paper/book should appear as comments in 
the code, for instance in front of the functions which implement the 
algorithm.


"current_time", "last_touch_time" are counter based, they are not "time". 
I suggest to update the name to "current" and "last_touch" or "last_used".


"time when cell was used last time" -> "last used logical time"


About the code:

I would prefer that you use "1.0" instead of "1." for double constants.

I think that getZipfianRand should be symmetric. Maybe a direct formula 
could do:


  return min - 1 + ((s > 1) ? xxx() : yyy());

or at least use an explicit "else" for symmetry.
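
For instance, a hypothetical shape for the function (the names come from this
thread, but the signature and the check are guesses, not the actual patch code):

  static int64
  getZipfianRand(TState *thread, int64 min, int64 max, double s)
  {
      int64   n = max - min + 1;

      /* checking the parameter here also covers any future callers */
      Assert(s > 0.0 && s != 1.0);

      return min - 1 + ((s > 1.0)
                        ? computeIterativeZipfian(thread, n, s)
                        : computeHarmonicZipfian(thread, n, s));
  }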

Function "zipfn" asks for s and n as arguments, but ISTM that they are 
already include as fields in cell, so this seems redundant. Moreover I do 
not think that the zipfn function is that useful. I would suggest to 

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-08-22 Thread Alik Khilazhev
Hello, Fabien

I am attaching patch v7.

> Yes, I agree. a >= 1 does not make much sense... If you want uniform you 
> should use random(), not call random_zipfian with a = 1. Basically it 
> suggests that too large values of "a" should be rejected. Not sure where to 
> put the limit, though.

I set the upper bound for the parameter to 1000.
> 
> Yes, as a general principle I think that the documentation should reflect the 
> implementation.

Documentation has been updated; I have removed the algorithm descriptions and put 
references to them there.



pgbench-zipf-07v.patch
Description: Binary data
—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-08-13 Thread Fabien COELHO


Hello Alik,


Now "a" does not have an upper bound; that's why, when using the iterative algorithm
with a >= 1, the program will get stuck in an infinite loop because of the following line of code:
double b = pow(2.0, s - 1.0);
After overflow, "b" becomes "+Inf".


Yep, overflow can happen.


So should an upper bound for "a" be set?


Yes, I agree. a >= 1 does not make much sense... If you want uniform 
you should use random(), not call random_zipfian with a = 1. Basically 
it suggests that too large values of "a" should be rejected. Not sure 
where to put the limit, though.


Should I mention in the docs that two algorithms are used 
depending on the value of a (s/theta)?


Yes, as a general principle I think that the documentation should reflect 
the implementation.


In the attached patch, I have added the computeIterativeZipfian method and its 
usage in getZipfianRand. Is it better to move the code that computes via the 
cache to a new method, so that getZipfianRand will contain only 2 
computeXXXZipfian method calls?


I have not looked in detail, but from what you say I would agree that the 
implementation should be symmetric, so having one function calling one 
method or the other sounds good.


--
Fabien.


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-08-13 Thread Alik Khilazhev
Hello Fabien,

> 
> I think that this method should be used for a>1, and the other very rough one 
> can be kept for parameter a in [0, 1), a case which does not make much sense 
> to a mathematician as it diverges if unbounded.

Now "a" does not have an upper bound; that's why, when using the iterative algorithm 
with a >= 1, the program will get stuck in an infinite loop because of the following 
line of code:
double b = pow(2.0, s - 1.0);
After overflow, "b" becomes "+Inf".
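
For reference, the threshold for that overflow follows from the double range:
2^(s - 1) exceeds DBL_MAX (about 1.8e308) as soon as s - 1 >= 1024, so
pow(2.0, s - 1.0) only becomes "+Inf" once s reaches about 1025.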

So should an upper bound for "a" be set? 

Should I mention in the docs that two algorithms are used depending on 
the value of a (s/theta)?

In the attached patch, I have added the computeIterativeZipfian method and its usage 
in getZipfianRand.
Is it better to move the code that computes via the cache to a new method, so that 
getZipfianRand will contain only 2 computeXXXZipfian method calls?


pgbench-zipf-06v.patch
Description: Binary data
—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-08-06 Thread Alik Khilazhev
Hello Fabien,


> On 5 Aug 2017, at 12:15, Fabien COELHO  wrote:
> 
> 
> Hello Alik,
> 
> I've done some math investigations, which consisted in spending one hour with 
> Christian, a statistician colleague of mine. He took an old book out of a 
> shelf, opened it to page 550 (roughly in the middle), and explained to me how 
> to build a real zipfian distribution random generator.
> 
> The iterative method is for parameter a>1 and works for unbounded values. It 
> is simple to add a bound. In practice the iterative method is quite 
> effective, i.e. number of iterations is typically small, at least if the 
> bound is large and if parameter a is not too close to 1.

> I've attached a python3 script which implements the algorithm. It looks like 
> magic. Beware that a C implementation should take care of float and int 
> overflows.
> 
Thank you for the script. I will rewrite it in C and add it to the patch soon. 

—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company






Re: LP_DEAD hinting and not holding on to a buffer pin on leaf page (Was: [HACKERS] [WIP] Zipfian distribution in pgbench)

2017-08-05 Thread Peter Geoghegan
On Fri, Aug 4, 2017 at 3:30 PM, Peter Geoghegan  wrote:
> Yura Sokolov of Postgres Pro performed this benchmark at my request.
> He took the 9.5 commit immediately preceding 2ed5b87f9 as a baseline.

I attach a simple patch that comments out the release of the buffer
pin for logged tables where an MVCC snapshot is used for a scan that
is not an index-only scan. This is the simplest way of evaluating the
effects of disabling 2ed5b87f9. Yura or others may find this helpful.

To be clear, I am certainly not suggesting that we should revert
2ed5b87f9. I do think that we need to give serious thought to fixing
the regression that 2ed5b87f9 introduced for LP_DEAD setting by index
scans, though.

-- 
Peter Geoghegan
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
new file mode 100644
index 642c894..696beda
*** a/src/backend/access/nbtree/nbtsearch.c
--- b/src/backend/access/nbtree/nbtsearch.c
*** _bt_drop_lock_and_maybe_pin(IndexScanDes
*** 58,63 
--- 58,64 
  {
  	LockBuffer(sp->buf, BUFFER_LOCK_UNLOCK);
  
+ #if 0
  	if (IsMVCCSnapshot(scan->xs_snapshot) &&
  		RelationNeedsWAL(scan->indexRelation) &&
  		!scan->xs_want_itup)
*** _bt_drop_lock_and_maybe_pin(IndexScanDes
*** 65,70 
--- 66,72 
  		ReleaseBuffer(sp->buf);
  		sp->buf = InvalidBuffer;
  	}
+ #endif
  }
  
  



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-08-05 Thread Fabien COELHO


Hello Alik,

So I would be in favor of expanding the documentation but not 
restricting the parameter beyond avoiding value 1.0.


I have removed the restriction and expanded the documentation in the attached patch v5.


I've done some math investigations, which consisted in spending one hour 
with Christian, a statistician colleague of mine. He took an old book out 
of a shelf, opened it to page 550 (roughly in the middle), and explained 
to me how to build a real zipfian distribution random generator.


The iterative method is for parameter a>1 and works for unbounded values. 
It is simple to add a bound. In practice the iterative method is quite 
effective, i.e. number of iterations is typically small, at least if the 
bound is large and if parameter a is not too close to 1.


I've attached a python3 script which implements the algorithm. It looks 
like magic. Beware that a C implementation should take care of float and 
int overflows.


  # usage: a, #values, #tests

  sh> zipf.py 1.5 1000 100
  # after 1.7 seconds
  c = [391586, 138668, 75525, 49339, 35222, 26621, ...
   ... 11, 13, 12, 11, 16] (1.338591 iterations per draw)

  sh> zipf.py 1.1 1000 100
  # after 3.1 seconds
  c = [179302, 83927, 53104, 39015, 30557, 25164, ...
   ... 82, 95, 93, 81, 80] (2.681451 iterations per draw)

I think that this method should be used for a>1, and the other very rough 
one can be kept for parameter a in [0, 1), a case which does not make much 
sense to a mathematician as it diverges if unbounded.


--
Fabien.

#! /usr/bin/env python3
#
# generate Zipf distribution
#
# method taken from:
#   Luc Devroye,
#  "Non-Uniform Random Variate Generation"
#  p. 550-551.
#  Springer 1986
#
# the method works for an infinite bound, the finite bound condition has been
# added.

a = 1.1
N = 100
M = 1

import sys
if len(sys.argv) >= 3:
    a = float(sys.argv[1])
    N = int(sys.argv[2])
if len(sys.argv) >= 4:
    M = int(sys.argv[3])

# beware, a close to 1 and n small (eg 100) leads to large number of iterations
# i.e. rejection probability is high when a -> 1
# - 1.001: 280
# - 1.002: 139.2
# - 1.005:  55.9
# - 1.010:  28.4
# - 1.020:  14.8
# - 1.050:   6.2
# - 1.100:   3.5
# however if n is larger the number of iterations decreases significantly

from random import random
from math import exp

def zipfgen(a, N):
    assert a > 1.0, "a must be greater than 1"
    b = 2.0 ** (a - 1.0)
    i = 0  # count iterations
    while True:
        i += 1
        u, v = random(), random()
        try:
            x = int(u ** (- 1.0 / (a - 1.0)))
            t = (1.0 + 1.0 / x) ** (a - 1.0)
            # reject if too large or out of bound
            if v * x * (t - 1.0) / (b - 1.0) <= t / b and x <= N:
                break
        except OverflowError:  # on u ** ...
            pass
    return (x, i)

if M == 1:
    x, i = zipfgen(a, N)
    print("X = %d (%d)" % (x, i))
else:
    c = [0 for i in range(0, N)]
    cost = 0
    for i in range(0, M):
        x, i = zipfgen(a, N)
        # assert 1 <= x and x <= N, "x = %d" % x
        cost += i
        c[x-1] += 1
    print("c = %s (%f iterations per draw)" % (c, cost/M))



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-08-05 Thread Fabien COELHO


Hello Peter,

I think that it would also be nice if there was an option to make 
functions like random_zipfian() actually return a value that has 
undergone perfect hashing. When this option is used, any given value 
that the function returns would actually be taken from a random mapping 
to some other value in the same range. So, you can potentially get a 
Zipfian distribution without the locality.


I definitely agree. This is a standard problem with all non-uniform random 
generators in pgbench, namely random_{gaussian,exponential}.


However hashing is not a good solution on a finite domain because of the 
significant collision rate, so that typically 1/3 of values are empty and 
collisions cause spikes. Also, collisions would break PKs.
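
For reference, the "1/3 of values are empty" figure is the classical
balls-into-bins estimate: hashing n keys uniformly onto n values leaves any
given value unused with probability

  (1 - 1/n)^n  ->  1/e ~ 0.37   as n grows,

so roughly a third of the target values are never produced, while values hit
by several keys get correspondingly inflated frequencies (the spikes).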


The solution is to provide a (good) pseudo-random parametric permutation, 
which is non trivial especially for non powers of two, so ISTM that it 
should be a patch on its own.
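
To make the idea concrete, one toy construction (purely illustrative, and much
weaker than what an actual patch would need): an affine bijection modulo the
next power of two, with "cycle walking" to handle sizes that are not powers of
two.

  #include <stdint.h>

  /* Toy parametric permutation of [0, n); illustrative sketch only. */
  static uint64_t
  toy_permute(uint64_t x, uint64_t n, uint64_t seed)
  {
      uint64_t    m = 1;

      while (m < n)               /* next power of two >= n */
          m <<= 1;

      do
      {
          /* odd multiplier => bijection on [0, m) */
          x = ((2 * seed + 1) * x + UINT64_C(0x9E3779B97F4A7C15)) & (m - 1);
      } while (x >= n);           /* walk the cycle back into [0, n) */

      return x;
  }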


The good news is that it is on my todo list and I have some ideas on how 
to do it.


The bad news is that given the rate at which I succeed in getting things 
committed in pgbench, it might take some years:-( For instance, simplistic 
functions and operators to extend the current set have been in the pipe 
for 15 months and missed pg10.


--
Fabien.




Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-08-04 Thread Peter Geoghegan
On Fri, Jul 21, 2017 at 4:51 AM, Alik Khilazhev
 wrote:
> (Latest version of pgbench Zipfian patch)

While I'm +1 on this idea, I think that it would also be nice if there
was an option to make functions like random_zipfian() actually return
a value that has undergone perfect hashing. When this option is used,
any given value that the function returns would actually be taken from
a random mapping to some other value in the same range. So, you can
potentially get a Zipfian distribution without the locality.

While I think that Zipfian distributions are common, it's probably
less common for the popular items to be clustered together within the
primary key, unless the PK has multiple attributes, and the first is
the "popular attribute". For example, there would definitely be a lot
of interest among contiguous items within an index on "(artist,
album)" where "artist" is a popular artist, which is certainly quite
possible. But, if "album" is the primary key, and it's a SERIAL PK,
then there will only be weak temporal locality within the PK, and only
because today's fads are more popular than the fads from previous
years.

Just another idea, that doesn't have to hold this patch up.

-- 
Peter Geoghegan




Re: LP_DEAD hinting and not holding on to a buffer pin on leaf page (Was: [HACKERS] [WIP] Zipfian distribution in pgbench)

2017-08-04 Thread Peter Geoghegan
On Mon, Jul 31, 2017 at 10:54 AM, Peter Geoghegan  wrote:
> Let's wait to see what difference it makes if Alik's zipfian
> distribution pgbench test case uses unlogged tables. That may give us a
> good sense of the problem for cases with contention/concurrency.

Yura Sokolov of Postgres Pro performed this benchmark at my request.
He took the 9.5 commit immediately preceding 2ed5b87f9 as a baseline.
In all cases, logged tables were used. Note that this is not the most
effective benchmark for showing the regression, because he didn't
replace the PK with a non-unique index, though that is planned as
follow-up; I wanted to stick with Alik's Zipfian test case for the
time being.

(Using a unique index is not the most effective thing for showing the
regression because unique indexes have most LP_DEAD setting done in
_bt_check_unique(), so only SELECTs will do less LP_DEAD cleanup
there; SELECTs are 50% of all queries.)

His results with 10 minute pgbench runs:

Logged
clients | 8217fb14 | 2ed5b87f | master | hashsnap | hashsnap_lwlock
--------+----------+----------+--------+----------+-----------------
 10 |   201569 |   204154 | 201095 |   201793 |  206111
 20 |   328180 |   58 | 334858 |   336721 |  370769
 40 |   196352 |   194692 | 232413 |   231968 |  393947
 70 |   121083 |   116881 | 148066 |   148443 |  224162
110 |77989 |73414 |  99305 |   111404 |  161900
160 |48958 |45830 |  65830 |82340 |  115678
230 |27699 |25510 |  38186 |57617 |   80575
310 |16369 |15137 |  21571 |39745 |   56819
400 |10327 | 9486 |  13157 |27318 |   40859
500 | 6920 | 6496 |   8638 |18677 |   29429
650 | 4315 | 3971 |   5196 |11762 |   17883
800 | 2850 | 2772 |   3523 | 7733 |   10977

Note that you also see numbers from various patches from Yura, and the
master branch mixed in here, but 8217fb14 (before) and 2ed5b87f
(after) are the interesting numbers as far as this regression goes.

There is an appreciable reduction in TPS here, though this workload is
not as bad by that measure as first thought. There is a roughly 5%
regression here past 40 clients or so. The regression in the
*consistency* of transaction *throughput* is far more interesting,
though. I've been doing analysis of this by drilling down to
individual test cases with vimdiff, as follows:

$ vimdiff test_8217fb14_logged_1_pgbench_40.out
test_2ed5b87f_logged_1_pgbench_40.out

(I attach these two files as a single example. I can provide the full
results to those that ask for them privately; it's too much data to
attach to an e-mail to the list.)

You can see in this example that for most 5 second slices of the 10
minute benchmark, commit 2ed5b87f actually increases TPS somewhat,
which is good. But there are also occasional *big* drops in TPS,
sometimes by as much as 50% over a single 5 second period (when
ANALYZE runs, adding random I/O while holding an exclusive buffer
lock [1]?). When this slowdown happens, latency can be over 3 times
higher, too.

We see much more consistent performance without the B-Tree buffer pin
VACUUM optimization, where there is no discernible pattern of
performance dropping. The headline regression of 4% or 5% is not the
main problem here, it seems.

In summary, commit 2ed5b87f makes the workload have increased
throughput most of the time, but occasionally sharply reduces
throughput, which averages out to TPS being 4% or 5% lower overall. I
think we'll find that there are bigger problems TPS-wise with
non-unique indexes when that other test is performed by Yura; let's
wait for those results to come back.

Finally, I find it interesting that when Yura did the same benchmark,
but with 5% SELECTs + 95% UPDATEs, rather than 50% SELECTs + 50%
UPDATEs as above, the overall impact was surprisingly similar. His
results:

clients | 8217fb14 | 2ed5b87f | master | hashsnap | hashsnap_lwlock
--------+----------+----------+--------+----------+-----------------
 20 |   187697 |   187335 | 217558 |   215059 |  266894
 50 |81272 |78784 |  97948 |97659 |  157710
 85 |49446 |47683 |  64597 |70814 |  107748
130 |32358 |30393 |  42216 |50531 |   75001
190 |19403 |17569 |  25704 |35506 |   51292
270 |10803 | 9878 |  14166 |23851 |   35257
370 | 6108 | 5645 |   7684 |15390 |   23659
500 | 3649 | 3297 |   4225 | 9172 |   14643
650 | 2239 | 2049 |   2635 | 5887 |8588
800 | 1475 | 1424 |   1804 | 4035 |5611

If nothing else, this shows how generally reliant these kinds of
workload can be on LP_DEAD setting. And, there is one difference: The
regression is seen here at *all* client 

Re: LP_DEAD hinting and not holding on to a buffer pin on leaf page (Was: [HACKERS] [WIP] Zipfian distribution in pgbench)

2017-07-31 Thread Peter Geoghegan

Robert Haas  wrote:

On Mon, Jul 31, 2017 at 1:54 PM, Peter Geoghegan  wrote:

That is hard to justify. I don't think that failing to set LP_DEAD hints
is the cost that must be paid to realize a benefit elsewhere, though. I
don't see much problem with having both benefits consistently. It's
actually very unlikely that VACUUM will run, and a TID will be recycled
at exactly the wrong time. We could probably come up with a more
discriminating way of detecting that that may have happened, at least
for Postgres 11. We'd continue to use LSN; the slow path would be taken
when the LSN changed, but we do not give up on setting LP_DEAD bits. I
think we can justify going to the heap again in this slow path, if
that's what it takes.


Yeah, that might possibly be a good approach.


I also wonder if it's worth teaching lazy_scan_heap() to keep around a
list of TIDs that can at least have their LP_DEAD bit set within their
index page, for use during subsequent index vacuuming. Doing at least
this much for TIDs from heap pages that are skipped due to some other
session concurrently holding a pin on the heap page ("pinskipped_pages"
pages) could help some cases, and seems doable. VACUUM does not need an
extra interlock against another VACUUM (such as a buffer pin) here, of
course.

I wouldn't expect this to help very much on many workloads, including
Alik's Zipfian workload, but it might be useful to have a real guarantee
about how long it can be, in VACUUM cycles, before a dead index tuple at
least has its LP_DEAD bit set. LP_DEAD will only be set when an index
scan goes to the heap, and it's not hard to imagine workloads where no
index scan ever does that with dead tuples whose heap TIDs were killed
only very recently.

Unlike with heap pruning, setting the LP_DEAD bit of all dead index
tuples on a leaf page is pretty much as good as a full VACUUM of the
page. The only thing that makes it not quite as good is that you cannot
assume that it's safe to kill the heap TIDs afterwards.

--
Peter Geoghegan




Re: LP_DEAD hinting and not holding on to a buffer pin on leaf page (Was: [HACKERS] [WIP] Zipfian distribution in pgbench)

2017-07-31 Thread Robert Haas
On Mon, Jul 31, 2017 at 1:54 PM, Peter Geoghegan  wrote:
> That is hard to justify. I don't think that failing to set LP_DEAD hints
> is the cost that must be paid to realize a benefit elsewhere, though. I
> don't see much problem with having both benefits consistently. It's
> actually very unlikely that VACUUM will run, and a TID will be recycled
> at exactly the wrong time. We could probably come up with a more
> discriminating way of detecting that that may have happened, at least
> for Postgres 11. We'd continue to use LSN; the slow path would be taken
> when the LSN changed, but we do not give up on setting LP_DEAD bits. I
> think we can justify going to the heap again in this slow path, if
> that's what it takes.

Yeah, that might possibly be a good approach.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: LP_DEAD hinting and not holding on to a buffer pin on leaf page (Was: [HACKERS] [WIP] Zipfian distribution in pgbench)

2017-07-31 Thread Peter Geoghegan

Robert Haas  wrote:

On Thu, Jul 27, 2017 at 10:05 PM, Peter Geoghegan  wrote:

I really don't know if that would be worthwhile. It would certainly fix
the regression shown in my test case, but that might not go far enough.
I strongly suspect that there are more complicated workloads where
LP_DEAD cleanup from SELECT statements matters, which is prevented by
the LSN check thing, just because there are always other sessions that
modify the page concurrently. This might be true of Alik's Zipfian test
case, for example.


I haven't studied the test case, but I think as a general principle it
makes sense to be happy when someone comes up with an algorithm that
holds a lock for a shorter period of time (and buffer pins are a type
of lock).  There are a number of places (fast-path locking, for
example, or vacuum skipping pinned heap pages) where we have
fast-paths that get taken most of the time and slow paths that get
used when concurrent activity happens; empirically, such things often
work out to a win.  I think it's disturbing that this code seems to be
taking the slow-path (which, in context, means skipping LP_DEAD
cleanup) even when there is no concurrent activity.  That's hard to
justify.


That is hard to justify. I don't think that failing to set LP_DEAD hints
is the cost that must be paid to realize a benefit elsewhere, though. I
don't see much problem with having both benefits consistently. It's
actually very unlikely that VACUUM will run, and a TID will be recycled
at exactly the wrong time. We could probably come up with a more
discriminating way of detecting that that may have happened, at least
for Postgres 11. We'd continue to use LSN; the slow path would be taken
when the LSN changed, but we do not give up on setting LP_DEAD bits. I
think we can justify going to the heap again in this slow path, if
that's what it takes.


But the fact that it is taking the slow-path when there *is*
concurrent activity is harder to complain about.  That might win or it
might lose; the non-concurrent case only loses.


Let's wait to see what difference it makes if Alik's zipfian
distribution pgbench test case uses unlogged tables. That may give us a
good sense of the problem for cases with contention/concurrency.

--
Peter Geoghegan




Re: LP_DEAD hinting and not holding on to a buffer pin on leaf page (Was: [HACKERS] [WIP] Zipfian distribution in pgbench)

2017-07-31 Thread Robert Haas
On Thu, Jul 27, 2017 at 10:05 PM, Peter Geoghegan  wrote:
> I really don't know if that would be worthwhile. It would certainly fix
> the regression shown in my test case, but that might not go far enough.
> I strongly suspect that there are more complicated workloads where
> LP_DEAD cleanup from SELECT statements matters, which is prevented by
> the LSN check thing, just because there are always other sessions that
> modify the page concurrently. This might be true of Alik's Zipfian test
> case, for example.

I haven't studied the test case, but I think as a general principle it
makes sense to be happy when someone comes up with an algorithm that
holds a lock for a shorter period of time (and buffer pins are a type
of lock).  There are a number of places (fast-path locking, for
example, or vacuum skipping pinned heap pages) where we have
fast-paths that get taken most of the time and slow paths that get
used when concurrent activity happens; empirically, such things often
work out to a win.  I think it's disturbing that this code seems to be
taking the slow-path (which, in context, means skipping LP_DEAD
cleanup) even when there is no concurrent activity.  That's hard to
justify.  But the fact that it is taking the slow-path when there *is*
concurrent activity is harder to complain about.  That might win or it
might lose; the non-concurrent case only loses.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: LP_DEAD hinting and not holding on to a buffer pin on leaf page (Was: [HACKERS] [WIP] Zipfian distribution in pgbench)

2017-07-28 Thread Peter Geoghegan

Amit Kapila  wrote:

Isn't it possible to confirm if the problem is due to commit
2ed5b87f9?  Basically, if we have unlogged tables, then it won't
release the pin.  So if the commit in question is the culprit, then
the same workload should not lead to bloat.


That's a great idea. Alik?

--
Peter Geoghegan




Re: LP_DEAD hinting and not holding on to a buffer pin on leaf page (Was: [HACKERS] [WIP] Zipfian distribution in pgbench)

2017-07-28 Thread Amit Kapila
On Wed, Jul 26, 2017 at 3:32 AM, Peter Geoghegan  wrote:
> On Fri, Jul 14, 2017 at 5:06 PM, Peter Geoghegan  wrote:
>> I think that what this probably comes down to, more than anything
>> else, is that you have leftmost hot/bloated leaf pages like this:
>>
>>
>>           idx          | level | l_item | blkno | btpo_prev | btpo_next | btpo_flags | type | live_items | dead_items | avg_item_size | page_size | free_size |         highkey
>> -----------------------+-------+--------+-------+-----------+-----------+------------+------+------------+------------+---------------+-----------+-----------+-------------------------
>>  ...
>>  pgbench_accounts_pkey |     0 |      1 |     1 |         0 |      2751 |         65 | l    |        100 |         41 |            16 |      8192 |      5328 | 11 00 00 00 00 00 00 00
>>  pgbench_accounts_pkey |     0 |      2 |  2751 |         1 |      2746 |         65 | l    |         48 |         90 |            16 |      8192 |      5388 | 32 00 00 00 00 00 00 00
>> ...
>>
>> The high key for the leftmost shows that only values below 0x11 belong
>> on the first page. This is about 16 or 17 possible distinct values,
>> and yet the page has 100 live items, and 41 dead items; in total,
>> there is room for 367 items. It's only slightly better with other
>> nearby pages. This is especially bad because once the keyspace gets
>> split up this finely, it's *impossible* to reverse it -- it's more or
>> less a permanent problem, at least until a REINDEX.
>
> I've been thinking about this a lot, because this really does look
> like a pathological case to me. I think that this workload is very
> sensitive to how effective kill_prior_tuples/LP_DEAD hinting is. Or at
> least, I can imagine that mechanism doing a lot better than it
> actually manages to do here. I wonder if it's possible that commit
> 2ed5b87f9, which let MVCC snapshots not hold on to a pin on leaf
> pages, should have considered workloads like this.
>

Isn't it possible to confirm if the problem is due to commit
2ed5b87f9?  Basically, if we have unlogged tables, then it won't
release the pin.  So if the commit in question is the culprit, then
the same workload should not lead to bloat.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: LP_DEAD hinting and not holding on to a buffer pin on leaf page (Was: [HACKERS] [WIP] Zipfian distribution in pgbench)

2017-07-27 Thread Peter Geoghegan

Peter Geoghegan  wrote:

In Alik's workload, there are two queries: One UPDATE, one SELECT.  Even
though the bloated index was a unique index, and so still gets
_bt_check_unique() item killing, the regression is still going to block
LP_DEAD cleanup by the SELECTs, which seems like it might be quite
important there.  After all, _bt_check_unique() cleanup has to happen
with an exclusive buffer lock held, whereas the kill_prior_tuple stuff
happens with only a shared buffer lock held.  It's not hard to imagine
that there will be a whole lot less LP_DEAD setting, overall.


Actually, there is a much bigger reason why SELECT statement LP_DEAD
killing matters more to Alik's workload than _bt_check_unique() LP_DEAD
killing: The UPDATE query itself does not affect indexed columns. Most
UPDATEs are actually HOT-safe (despite the degree of index bloat we
see), and so most UPDATEs will never reach _bt_check_unique().

--
Peter Geoghegan




Re: LP_DEAD hinting and not holding on to a buffer pin on leaf page (Was: [HACKERS] [WIP] Zipfian distribution in pgbench)

2017-07-27 Thread Peter Geoghegan

Robert Haas  wrote:

We now see that no update ever kills items within _bt_killitems(),
because our own update to the index leaf page itself nullifies our
ability to kill anything, by changing the page LSN from the one
stashed in the index scan state variable. Fortunately, we are not
really "self-blocking" dead item cleanup here, because the
_bt_check_unique() logic for killing all_dead index entries saves the
day by not caring about LSNs. However, that only happens because the
index on "aid" is a unique index.


This seems like an oversight.  If we modify the page ourselves, could
we check whether the saved LSN is still current just before, and if
so, update the saved LSN just after?


I really don't know if that would be worthwhile. It would certainly fix
the regression shown in my test case, but that might not go far enough.
I strongly suspect that there are more complicated workloads where
LP_DEAD cleanup from SELECT statements matters, which is prevented by
the LSN check thing, just because there are always other sessions that
modify the page concurrently. This might be true of Alik's Zipfian test
case, for example.

In Alik's workload, there are two queries: One UPDATE, one SELECT.  Even
though the bloated index was a unique index, and so still gets
_bt_check_unique() item killing, the regression is still going to block
LP_DEAD cleanup by the SELECTs, which seems like it might be quite
important there.  After all, _bt_check_unique() cleanup has to happen
with an exclusive buffer lock held, whereas the kill_prior_tuple stuff
happens with only a shared buffer lock held.  It's not hard to imagine
that there will be a whole lot less LP_DEAD setting, overall.

Alik's workload was one with a high degree of contention on just a few
leaf pages, pages that become very bloated. It's not fair to blame the
bloat we saw there on this regression, but I have to wonder how much it
may have contributed.

--
Peter Geoghegan




Re: LP_DEAD hinting and not holding on to a buffer pin on leaf page (Was: [HACKERS] [WIP] Zipfian distribution in pgbench)

2017-07-27 Thread Robert Haas
On Tue, Jul 25, 2017 at 11:02 PM, Peter Geoghegan  wrote:
> While the benchmark Alik came up with is non-trivial to reproduce, I
> can show a consistent regression for a simple case with only one
> active backend.

That's not good.

> We now see that no update ever kills items within _bt_killitems(),
> because our own update to the index leaf page itself nullifies our
> ability to kill anything, by changing the page LSN from the one
> stashed in the index scan state variable. Fortunately, we are not
> really "self-blocking" dead item cleanup here, because the
> _bt_check_unique() logic for killing all_dead index entries saves the
> day by not caring about LSNs. However, that only happens because the
> index on "aid" is a unique index.

This seems like an oversight.  If we modify the page ourselves, could
we check whether the saved LSN is still current just before, and if
so, update the saved LSN just after?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: LP_DEAD hinting and not holding on to a buffer pin on leaf page (Was: [HACKERS] [WIP] Zipfian distribution in pgbench)

2017-07-25 Thread Peter Geoghegan
On Tue, Jul 25, 2017 at 8:02 PM, Peter Geoghegan  wrote:
> Setup:
>
> Initialize pgbench (any scale factor).
> create index on pgbench_accounts (aid);

That "create index" was meant to be on "abalance", to make the UPDATE
queries always HOT-unsafe. (You'll want to *also* create this index
once the PK is dropped, to show how killing items on recent Postgres
versions hinges upon this being a unique index).

-- 
Peter Geoghegan




Re: LP_DEAD hinting and not holding on to a buffer pin on leaf page (Was: [HACKERS] [WIP] Zipfian distribution in pgbench)

2017-07-25 Thread Peter Geoghegan
On Tue, Jul 25, 2017 at 3:02 PM, Peter Geoghegan  wrote:
> I've been thinking about this a lot, because this really does look
> like a pathological case to me. I think that this workload is very
> sensitive to how effective kill_prior_tuples/LP_DEAD hinting is. Or at
> least, I can imagine that mechanism doing a lot better than it
> actually manages to do here. I wonder if it's possible that commit
> 2ed5b87f9, which let MVCC snapshots not hold on to a pin on leaf
> pages, should have considered workloads like this.

While the benchmark Alik came up with is non-trivial to reproduce, I
can show a consistent regression for a simple case with only one
active backend. I'm not sure whether or not this is considered an
acceptable trade-off -- I didn't look through the archives from around
March of 2015 just yet.

Setup:

Initialize pgbench (any scale factor).
create index on pgbench_accounts (aid);

The point of this example is to show a simple query that is never
quite HOT-safe, where we need to create a new index entry in an index
for that reason alone. Theoretically, only one index needs to be
updated, not all indexes, because only one index covers attributes
that have truly changed. For example, if we had WARM, I think that
many of these theoretically unnecessary index tuple insertions would
not happen. When they do happen, because we don't have something like
WARM, it seems important that the kill_prior_tuples/LP_DEAD stuff does
its job. I don't think that it will consistently do that, though.

Steps:

(Set debugger breakpoint from within _bt_killitems())

Test queries (run these from a single interactive psql session):

update pgbench_accounts set abalance = abalance + 1 where aid = 763;
update pgbench_accounts set abalance = abalance + 1 where aid = 763;
update pgbench_accounts set abalance = abalance + 1 where aid = 763;

We now see that no update ever kills items within _bt_killitems(),
because our own update to the index leaf page itself nullifies our
ability to kill anything, by changing the page LSN from the one
stashed in the index scan state variable. Fortunately, we are not
really "self-blocking" dead item cleanup here, because the
_bt_check_unique() logic for killing all_dead index entries saves the
day by not caring about LSNs. However, that only happens because the
index on "aid" is a unique index.

If we drop the primary key, and create a regular index on "aid" in its
place, then the same UPDATE queries really do self-block from killing
index tuples due to changes in 2ed5b87f9, which could be pretty bad if
there wasn't some selects to do the kill_prior_tuple stuff instead. I
verified that this regression against 9.4 exists, just to be sure that
the problem wasn't somehow there all along -- the regression is real.
 :-(

-- 
Peter Geoghegan




LP_DEAD hinting and not holding on to a buffer pin on leaf page (Was: [HACKERS] [WIP] Zipfian distribution in pgbench)

2017-07-25 Thread Peter Geoghegan
On Fri, Jul 14, 2017 at 5:06 PM, Peter Geoghegan  wrote:
> I think that what this probably comes down to, more than anything
> else, is that you have leftmost hot/bloated leaf pages like this:
>
>
>           idx          | level | l_item | blkno | btpo_prev | btpo_next | btpo_flags | type | live_items | dead_items | avg_item_size | page_size | free_size |         highkey
> -----------------------+-------+--------+-------+-----------+-----------+------------+------+------------+------------+---------------+-----------+-----------+-------------------------
>  ...
>  pgbench_accounts_pkey |     0 |      1 |     1 |         0 |      2751 |         65 | l    |        100 |         41 |            16 |      8192 |      5328 | 11 00 00 00 00 00 00 00
>  pgbench_accounts_pkey |     0 |      2 |  2751 |         1 |      2746 |         65 | l    |         48 |         90 |            16 |      8192 |      5388 | 32 00 00 00 00 00 00 00
> ...
>
> The high key for the leftmost shows that only values below 0x11 belong
> on the first page. This is about 16 or 17 possible distinct values,
> and yet the page has 100 live items, and 41 dead items; in total,
> there is room for 367 items. It's only slightly better with other
> nearby pages. This is especially bad because once the keyspace gets
> split up this finely, it's *impossible* to reverse it -- it's more or
> less a permanent problem, at least until a REINDEX.

I've been thinking about this a lot, because this really does look
like a pathological case to me. I think that this workload is very
sensitive to how effective kill_prior_tuples/LP_DEAD hinting is. Or at
least, I can imagine that mechanism doing a lot better than it
actually manages to do here. I wonder if it's possible that commit
2ed5b87f9, which let MVCC snapshots not hold on to a pin on leaf
pages, should have considered workloads like this.

The whole point of that commit was to reduce blocking by VACUUM on
index scans, and I think that it was effective in doing that in many
important cases. My concern is that the commit added this, which could
make LP_DEAD hinting occur less frequently:

+ * If the pin was released after reading the page, then we re-read it.  If it
+ * has been modified since we read it (as determined by the LSN), we dare not
+ * flag any entries because it is possible that the old entry was vacuumed
+ * away and the TID was re-used by a completely different heap tuple.
  */
 void
-_bt_killitems(IndexScanDesc scan, bool haveLock)
+_bt_killitems(IndexScanDesc scan)

Even if I'm basically right about this making LP_DEAD hinting occur
less frequently in cases like the one from Alik, it might have
actually been worth it even from the point of view of index bloat,
because VACUUM operations complete more frequently, and so there is an
indirect boost. It's not as if we've been that great at assessing
problems like this before now, and so it bears considering if we could
do better here without introducing much additional complexity. I just
don't know.

I would like to hear opinions on:

How much of a difference might this have actually made?

If it's a significant factor for any important workload, what can be
done about it? Does anyone have any ideas about a less restrictive
strategy for avoiding spuriously killing an index tuple whose heap TID
was just recycled?

ISTM that by doing this LSN check, _bt_killitems() is least likely to
work when and where it is most needed.

The most obvious way to correctly proceed even when LSN has changed
would be to revisit the heap page when the leaf page LSN check doesn't
work out, and revalidate the safety of proceeding with killing tuples.
At least with unique indexes. It's probably extremely unlikely that
the problem that we must avoid here will actually be observed in the
real world, so it seems likely that an approach like this will be
worth it.

-- 
Peter Geoghegan




Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-21 Thread Alik Khilazhev
Hello!

I realized that I was sending emails as HTML and the latest patch is not visible in 
the archive now.
That’s why I am attaching it again.

I am sorry for that.



pgbench-zipf-05v.patch
Description: Binary data

—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-21 Thread Alik Khilazhev
> Hmmm. On second thought, maybe one or the other is enough, either restrict
> the parameter to values where the approximation is good, or put out a clear
> documentation about when the approximation is not very good, but it may be
> still useful even if not precise.
>
> So I would be in favor of expanding the documentation but not restricting
> the parameter beyond avoiding value 1.0.

I have removed the restriction and expanded the documentation in the attached
patch v5. Also I have recorded the patch in CF 2017-09:
https://commitfest.postgresql.org/14/1206/.

pgbench-zipf-05v.patch
Description: Binary data
—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-20 Thread Alik Khilazhev



> I think that developing a test would be much simpler with the improved tap 
> test infrastructure, so I would suggest to wait to know the result of the 
> corresponding patch.

Ok, I will wait then.

> Also, could you record the patch to CF 2017-09?
> https://commitfest.postgresql.org/14/ 

Yea, I will send it.

—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-20 Thread Fabien COELHO


Hello Alik,

About the maths: As already said, I'm not at ease with a random_zipfian 
function which does not display a (good) zipfian distribution. At the 
minimum the documentation should be clear about the approximations 
implied depending on the parameter value.


I added one more sentence to the documentation to emphasize that the degree of 
proximity depends on the parameter.
And also I made a restriction on the 
parameter; now it can only be in the range (0; 1).


Hmmm. On second thought, maybe one or the other is enough, either restrict 
the parameter to values where the approximation is good, or put out a 
clear documentation about when the approximation is not very good, but it 
may be still useful even if not precise.


So I would be in favor of expanding the documentation but not restricting 
the parameter beyond avoiding value 1.0.



[...]
Now it prints a warning message if the array overflows. To print the message only 
once, it uses a global flag which is available to all threads. And 
theoretically the message can be printed more than once. [...]

So, should I spend time on solving this issue?


No. Just write a comment.

If the zipf cache is constant size, there is no point in using dynamic 
allocation, just declare an array…


Fixed. Is ZIPF_CACHE_SIZE = 15 ok?


My guess is yes, till someone complains that it is not the case;-)


There should be non-regression tests somehow. If the "improve pgbench
tap test infrastructure" patch gets through, things should be added there.


I will send tests later, as a separate patch.


I think that developing a test would be much simpler with the improved 
tap test infrastructure, so I would suggest to wait to know the result of 
the corresponding patch.


Also, could you record the patch to CF 2017-09?
https://commitfest.postgresql.org/14/

--
Fabien.


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-20 Thread Alik Khilazhev
Hello Fabien,

I am attaching patch v4.

> On 19 Jul 2017, at 17:21, Fabien COELHO  wrote:
>
> About the maths: As already said, I'm not at ease with a random_zipfian
> function which does not display a (good) zipfian distribution. At the minimum
> the documentation should be clear about the approximations implied depending
> on the parameter value.

I added one more sentence to the documentation to emphasize that the degree of
proximity depends on the parameter. And also I made a restriction on the
parameter; now it can only be in the range (0; 1).

> In the literature the theta parameter seems to be often called alpha or s
> (eg see https://en.wikipedia.org/wiki/Zipf%27s_law). I would suggest to
> stick to "s" instead of "theta"?

I have renamed it to "s".

> Function zipfZeta(n, theta) does not really compute the zeta(n) function,
> so I think that a better name should be chosen. It seems to compute
> H_{n,theta}, the generalized harmonic number. Idem "thetan" field in struct.

Renamed zipfZeta to zipfGeneralizedHarmonic, zetan to harmonicn.

> The handling of cache overflow by randomly removing one entry looks like
> a strange idea. Rather remove the oldest entry?

Replaced with a Least Recently Used replacement algorithm.

> ISTM that it should print a warning once if the cache array overflows as
> performance would drop heavily.

Now it prints a warning message if the array overflows. To print the message
only once, it uses a global flag which is available to all threads, so
theoretically the message can still be printed more than once. It could be
solved easily using pg_atomic_test_set_flag() from src/include/port/atomics.h,
but that can not be used in pgbench because of the following lines of code
there:

#ifdef FRONTEND
#error "atomics.h may not be included from frontend code"
#endif

Or it could be fixed by using mutexes from pthread, but I think the code
becomes less readable and more complex in this case.

So, should I spend time on solving this issue?

> If the zipf cache is constant size, there is no point in using dynamic
> allocation, just declare an array…

Fixed. Is ZIPF_CACHE_SIZE = 15 ok?

> There should be non-regression tests somehow. If the "improve pgbench
> tap test infrastructure" patch gets through, things should be added there.

I will send tests later, as a separate patch.

pgbench-zipf-04v.patch
Description: Binary data
—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-19 Thread Fabien COELHO


Hello Alik,


I am attaching patch v3.


Among other things I fixed a small typo in the description of the 
random_exponential function in the pgbench.sgml file.


Ok. Probably this typo should be committed separately and independently.

A few comments about v3:

Patch applies cleanly, compiles, works.

About the maths: As already said, I'm not at ease with a random_zipfian 
function which does not display a (good) zipfian distribution. At the 
minimum the documentation should be clear about the approximations implied 
depending on the parameter value.


In the literature the theta parameter seems to be often called alpha
or s (eg see https://en.wikipedia.org/wiki/Zipf%27s_law). I would suggest to
stick to "s" instead of "theta"?

About the code: looks simpler than the previous version, which is good.

Double constants should be used when the underlying type is a double,
instead of relying on implicit int to double promotion (0 -> 0.0, 1 -> 1.0).

Function zipfZeta(n, theta) does not really compute the zeta(n) function,
so I think that a better name should be chosen. It seems to compute
H_{n,theta}, the generalized harmonic number. Idem "thetan" field in struct.
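
For reference, the quantity in question and its role in the distribution:

  H_{n,s} = sum_{k = 1..n} 1 / k^s,      P(X = k) = 1 / (k^s * H_{n,s})  for k = 1..n,

i.e. it is the normalization constant of the bounded zipfian distribution,
whereas zeta(s) would be the corresponding infinite sum.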

The handling of cache overflow by randomly removing one entry looks like
a strange idea. Rather remove the oldest entry?

ISTM that it should print a warning once if the cache array overflows as 
performance would drop heavily.


If the zipf cache is constant size, there is no point in using dynamic 
allocation, just declare an array...


Comments need updating: eg "theta parameter of previous execution" which
dates back when there was only one value.

There should be a comment to explain that the structure is in the thread
for thread safety.

There should be non regression tests somehow. If the "improve pgbench
tap test infrastructure" get through, things should be added there.

--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-17 Thread Fabien COELHO


Hello,

Is this bias expected from the drawing method, say because it is 
approximated and the approximation is weak at some points, or is there 
an issue with its implementation, says some shift which gets smoothed 
down for higher indexes?


I have checked paper where such implementation was proposed and there 
theta allowed only on range between 0 and 1. It seems like it is not 
guaranteed that it should work well when theta is more than 1.


Ok.

I see a significant issue with having a random_zipfian function which does 
not really return a zipfian distribution under some parameter values. If 
there is no better alternative, I would suggest to restrict the parameter 
for values between 0 and 1, or to find a better approximation for theta >= 
0.



I am attaching paper, see page 23.


Thanks for the paper. It reminds me that I intended to propose a 
parametric pseudo-random permutation for pgbench, some day.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-17 Thread Alik Khilazhev

> On 17 Jul 2017, at 13:51, Fabien COELHO  wrote:
> 
> 
> Is this bias expected from the drawing method, say because it is approximated 
> and the approximation is weak at some points, or is there an issue with its 
> implementation, says some shift which gets smoothed down for higher indexes?
> 

I have checked the paper where this implementation was proposed, and there theta
is allowed only in the range between 0 and 1. It seems like it is not guaranteed
that it should work well when theta is more than 1.

I am attaching paper, see page 23.


syntheticdatagen.pdf
Description: Adobe PDF document

—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-17 Thread Fabien COELHO



Ok, so you did not get the large bias for i=3. Strange.


I got large bias for i=3 and theta > 1 even with a million outcomes,


Ok. So this is similar to what I got.

Is this bias expected from the drawing method, say because it is 
approximated and the approximation is weak at some points, or is there an 
issue with its implementation, says some shift which gets smoothed down 
for higher indexes?


I am attaching patch v3. Among other things I fixed small typo in 
description of random_exponential function in pgbench.sgml file.


I'll look into that.

--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-17 Thread Alik Khilazhev
Hello Fabien,

> On 14 Jul 2017, at 17:51, Fabien COELHO  wrote:
>
> Ok, so you did not get the large bias for i=3. Strange.

I got a large bias for i=3 and theta > 1 even with a million outcomes, but for
theta < 1 (I have tested with theta = 0.1 and 0.3) it showed quite good results.

I am attaching patch v3. Among other things I fixed a small typo in the
description of the random_exponential function in the pgbench.sgml file.

pgbench-zipf-03v.patch
Description: Binary data
—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-14 Thread Peter Geoghegan
On Fri, Jul 14, 2017 at 6:37 AM, Alik Khilazhev
 wrote:
> I am attaching results of tests for 32 and 128 clients that were running for 
> 10 minutes, and TPS remains 305 and 115 ktps respectively.
>
> Tests was executed with configuration set for YCSB. And there is very 
> aggressively autovacuum, this can be reason why there is no decline appears 
> with increasing working time.
> Autovacuum config:
>
> autovacuum_max_workers = 8
> autovacuum_naptime = 10s
> autovacuum_vacuum_scale_factor = 0.1
> autovacuum_vacuum_cost_delay = 0ms
> autovacuum_vacuum_cost_limit = 1

I think that what this probably comes down to, more than anything
else, is that you have leftmost hot/bloated leaf pages like this:


          idx          | level | l_item | blkno | btpo_prev | btpo_next | btpo_flags | type | live_items | dead_items | avg_item_size | page_size | free_size |         highkey
-----------------------+-------+--------+-------+-----------+-----------+------------+------+------------+------------+---------------+-----------+-----------+-------------------------
 ...
 pgbench_accounts_pkey |     0 |      1 |     1 |         0 |      2751 |         65 | l    |        100 |         41 |            16 |      8192 |      5328 | 11 00 00 00 00 00 00 00
 pgbench_accounts_pkey |     0 |      2 |  2751 |         1 |      2746 |         65 | l    |         48 |         90 |            16 |      8192 |      5388 | 32 00 00 00 00 00 00 00
 ...

The high key for the leftmost shows that only values below 0x11 belong
on the first page. This is about 16 or 17 possible distinct values,
and yet the page has 100 live items, and 41 dead items; in total,
there is room for 367 items. It's only slightly better with other
nearby pages. This is especially bad because once the keyspace gets
split up this finely, it's *impossible* to reverse it -- it's more or
less a permanent problem, at least until a REINDEX. You cannot merge
pages back together until one is completely empty, which in this case
and many cases will in fact never happen. Aggressive vacuuming is
probably helpful in part because it prevents the problem from ever
getting completely out of hand. That doesn't seem like a very
practical solution, though.

We should probably be treating unique indexes in a special way, since
inserting a new conflicting tuple necessarily supersedes whatever it
conflicted with. Postgres holds on to the old tuple today, but only
because the transaction might abort, or because an old snapshot might
be interested in old tuple versions. However, the general pattern with
unique indexes is that there must be one tuple visible to new
snapshots, and old versions are almost certainly going to became
garbage very quickly. Unique indexes really are quite special, which
nbtree doesn't take advantage of at all. If you read the coverage of
B-Trees within "Transaction Processing: Concepts and Techniques", and
many other publications, the general trend seems to be that unique
indexes have truly unique keys, based only on the user-visible key
values. They make a sharp distinction between primary and secondary
indexes, which doesn't really exist in Postgres (at least, not at the
access method level).

I suspect that the best long term solution is to add GIN-style
duplicate handling within nbtree for unique indexes, with special
pruning style optimizations to the posting list structure. This would
probably only happen with unique indexes. The useful thing about this
approach is it separates these two problems:

1. Representing what values are in the index for lookup, and their
latest row version.

2. Finding old row versions, in the less common case where you have an
old snapshot.

With a design like that, nbtree would never "cut up the keyspace too
finely" because of a temporary surge of UPDATE insertions. You still
get bloat, but you add it to a place where garbage collection can be
much better targeted. Under this scheme, it doesn't really matter so
much if garbage collection doesn't happen very frequently, because
there could be LP_DEAD style hints for the auxiliary posting list
structure. That could probably work based on atomic ops, and would
greatly reduce the number of pages that UPDATE workloads like this
dirtied.

It probably wouldn't make sense to add things like GIN's compression
of item pointers, since most data within the auxiliary posting list
structure is removed very quickly. I wouldn't expect btree_gin to be
faster for this workload today, because it doesn't have
kill_prior_tuple/LP_DEAD support, and because it doesn't support
unique indexes, and so cannot exploit the special situation that
exists with unique indexes.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-14 Thread Fabien COELHO


Algorithm works with theta less than 1. The only problem here is that 
theta can not be 1, because of next line of code


cell->alpha = 1. / (1 - theta);

That’s why I put such restriction. Now I see 2 possible solutions for that:
1) Exclude 1, and allow everything in range (0;+∞).


Yep.


2) Or just increase/decrease theta by very small number if it is 1.


Nope, this seems quite arbitrary.

I've executed scripts that you attached with different theta and number 
of outcomes(not n, n remains the same = 100) and I found out that for 
theta = 0.1 and big number of outcomes it gives distribution very 
similar to zipfian(for number of outcomes = 100 000, bias -6% to 8% in 
whole range and for NOO = 1000 000, bias is -2% to 2%).


Ok, so you did not get the large bias for i=3. Strange.

By, number of outcomes(NOO) I mean how many times random_zipfian was 
called. For example: pgbench -f compte_bench.sql -t 10 So, I think 
it works but works worse for small number of outcomes. And also we need 
to find optimal theta for better results.


Hmmm. I've run one million outcomes each time.

I'll check on the next version.

--
Fabien.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-14 Thread Alik Khilazhev

> On 13 Jul 2017, at 23:13, Peter Geoghegan  wrote:
> 
> I just figured out that "result.txt" is only a 60 second pgbench run.
> Is the same true for other tests?

Yes, other tests ran 60 seconds too.

> 
> It would be much more interesting to see runs that lasted 10 minutes
> or more.


I am attaching results of tests for 32 and 128 clients that were running for 10 
minutes, and TPS remains 305 and 115 ktps respectively. 

Tests were executed with the configuration set for YCSB. Autovacuum is configured
very aggressively, which can be the reason why no decline appears with increasing
running time.
Autovacuum config:

autovacuum_max_workers = 8 
autovacuum_naptime = 10s
autovacuum_vacuum_scale_factor = 0.1
autovacuum_vacuum_cost_delay = 0ms
autovacuum_vacuum_cost_limit = 1

  idx  | level | l_item | blkno | btpo_prev | btpo_next | 
btpo_flags | type | live_items | dead_items | avg_item_size | page_size | 
free_size | highkey 
---+---++---+---+---++--+++---+---+---+-
 pgbench_accounts_pkey | 2 |  1 |   290 | 0 | 0 |   
   2 | r| 10 |  0 |15 |  8192 |  7956 | 
 pgbench_accounts_pkey | 1 |  1 | 3 | 0 |   289 |   
   0 | i|296 |  0 |15 |  8192 |  2236 | 
09 96 01 00 00 00 00 00
 pgbench_accounts_pkey | 1 |  2 |   289 | 3 |   575 |   
   0 | i|285 |  0 |15 |  8192 |  2456 | 
11 2c 03 00 00 00 00 00
 pgbench_accounts_pkey | 1 |  3 |   575 |   289 |   860 |   
   0 | i|285 |  0 |15 |  8192 |  2456 | 
19 c2 04 00 00 00 00 00
 pgbench_accounts_pkey | 1 |  4 |   860 |   575 |  1145 |   
   0 | i|285 |  0 |15 |  8192 |  2456 | 
21 58 06 00 00 00 00 00
 pgbench_accounts_pkey | 1 |  5 |  1145 |   860 |  1430 |   
   0 | i|285 |  0 |15 |  8192 |  2456 | 
29 ee 07 00 00 00 00 00
 pgbench_accounts_pkey | 1 |  6 |  1430 |  1145 |  1715 |   
   0 | i|285 |  0 |15 |  8192 |  2456 | 
31 84 09 00 00 00 00 00
 pgbench_accounts_pkey | 1 |  7 |  1715 |  1430 |  2000 |   
   0 | i|285 |  0 |15 |  8192 |  2456 | 
39 1a 0b 00 00 00 00 00
 pgbench_accounts_pkey | 1 |  8 |  2000 |  1715 |  2285 |   
   0 | i|285 |  0 |15 |  8192 |  2456 | 
41 b0 0c 00 00 00 00 00
 pgbench_accounts_pkey | 1 |  9 |  2285 |  2000 |  2570 |   
   0 | i|285 |  0 |15 |  8192 |  2456 | 
49 46 0e 00 00 00 00 00
 pgbench_accounts_pkey | 1 | 10 |  2570 |  2285 | 0 |   
   0 | i|177 |  0 |15 |  8192 |  4616 | 
 pgbench_accounts_pkey | 0 |  1 | 1 | 0 |  2751 |   
  65 | l|100 | 41 |16 |  8192 |  5328 | 
11 00 00 00 00 00 00 00
 pgbench_accounts_pkey | 0 |  2 |  2751 | 1 |  2746 |   
  65 | l| 48 | 90 |16 |  8192 |  5388 | 
32 00 00 00 00 00 00 00
 pgbench_accounts_pkey | 0 |  3 |  2746 |  2751 |  2749 |   
  65 | l| 47 |  6 |16 |  8192 |  7088 | 
5f 00 00 00 00 00 00 00
 pgbench_accounts_pkey | 0 |  4 |  2749 |  2746 |  2745 |   
  65 | l| 57 |  5 |16 |  8192 |  6908 | 
97 00 00 00 00 00 00 00
 pgbench_accounts_pkey | 0 |  5 |  2745 |  2749 |  2748 |   
  65 | l| 91 | 11 |16 |  8192 |  6108 | 
eb 00 00 00 00 00 00 00
 pgbench_accounts_pkey | 0 |  6 |  2748 |  2745 |  2752 |   
   1 | l| 65 |  0 |16 |  8192 |  6848 | 
28 01 00 00 00 00 00 00
 pgbench_accounts_pkey | 0 |  7 |  2752 |  2748 | 2 |   
  65 | l| 73 |  1 |16 |  8192 |  6668 | 
6f 01 00 00 00 00 00 00
 pgbench_accounts_pkey | 0 |  8 | 2 |  2752 |  2753 |   
  65 | l| 69 |  5 |16 |  8192 |  6668 | 
b3 01 00 00 00 00 00 00
 pgbench_accounts_pkey | 0 |  9 |  2753 | 2 |  2747 |   
  65 | l| 91 |  5 |16 |  8192 |  6228 | 
0a 02 00 00 00 00 00 00
 pgbench_accounts_pkey | 0 | 10 |  2747 |  2753 |  2754 |   
  65 | l| 87 |

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-14 Thread Alik Khilazhev

> On 13 Jul 2017, at 19:14, Fabien COELHO  wrote:
> 
> Documentation says that the closer theta is from 0 the flatter the 
> distribution
> but the implementation requires at least 1, including strange error messages:
> 
>  zipfian parameter must be greater than 1.00 (not 1.00)
> 
> Could theta be allowed between 0 & 1 ? I've tried forcing with theta = 0.1
> and it worked well, so I'm not sure that I understand the restriction.
> I also tried with theta=0.001 but it seemed less good.

The algorithm works with theta less than 1. The only problem here is that theta
cannot be 1, because of the next line of code:

cell->alpha = 1. / (1 - theta);

That’s why I put in such a restriction. Now I see 2 possible solutions for that:
1) Exclude 1, and allow everything in the range (0;+∞).
2) Or just increase/decrease theta by a very small number if it is 1.
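Out of curiosity, here is a rough standalone sketch of the kind of draw that
alpha feeds into, in the style of the paper attached elsewhere in the thread
(illustrative names only, not the patch's actual code; s plays the role of
theta and must not be 1, "uniform" is in [0, 1), and the harmonic numbers
H_{n,s} and H_{2,s} would normally be cached):

#include <math.h>

static long
zipf_draw(double uniform, long n, double s,
          double harmonic_n, double harmonic_2)
{
    double alpha = 1.0 / (1.0 - s);
    double beta  = pow(0.5, s);
    double eta   = (1.0 - pow(2.0 / n, 1.0 - s))
                   / (1.0 - harmonic_2 / harmonic_n);
    double uz    = uniform * harmonic_n;

    if (uz < 1.0)
        return 1;
    if (uz < 1.0 + beta)
        return 2;
    return 1 + (long) (n * pow(eta * uniform - eta + 1.0, alpha));
}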

> I have also tried to check the distribution wrt the explanations, with the 
> attached scripts, n=100, theta=1.01/1.5/3.0: It does not seem to work, 
> there is repeatable +15% bias on i=3 and repeatable -3% to -30% bias for 
> values in i=10-100, this for different values of theta (1.01,1.5, 3.0).
> 
> If you try the script, beware to set parameters (theta, n) consistently.

I've executed the scripts that you attached with different theta and numbers of
outcomes (not n; n remains the same = 100) and I found out that for theta = 0.1
and a big number of outcomes it gives a distribution very similar to zipfian (for
number of outcomes = 100 000, the bias is -6% to 8% over the whole range, and for
NOO = 1 000 000, the bias is -2% to 2%).

By number of outcomes (NOO) I mean how many times random_zipfian was called.
For example:
pgbench -f compte_bench.sql -t 10

So, I think it works, but it works worse for a small number of outcomes. And we
also need to find the optimal theta for better results.

— 
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com 
The Russian Postgres Company

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-13 Thread Peter Geoghegan
On Thu, Jul 13, 2017 at 12:49 PM, Peter Geoghegan  wrote:
> To reiterate what I say above:
>
> The number of leaf pages with dead items is 20 with this most recent
> run (128 clients, patched + unpatched). The leftmost internal page one
> level up from the leaf level contains 289 items. Whereas last time it
> was 353 items.
>
> That's a difference between having 20 hot/bloated leaf pages, and
> probably 84 hot/bloated pages, which I infer must have been the total
> number of bloated leaf pages within "result.txt". I think that
> something about all the "pgbench_index_*txt" tests are very different
> to what we see within "result.txt". It's as if there was a problem
> when "result.txt" ran, but that problem somehow didn't come up when
> you did new tests.

I just figured out that "result.txt" is only a 60 second pgbench run.
Is the same true for other tests?

It would be much more interesting to see runs that lasted 10 minutes
or more. Maybe with pgbench-tools. I would expect that a decline in
performance due to bloating the index could happen only after a
certain threshold was crossed. Things like HOT pruning and LP_DEAD
cleanup could be pretty effective, until finally a tipping point is
reached and they're much less effective.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-13 Thread Peter Geoghegan
On Thu, Jul 13, 2017 at 10:02 AM, Peter Geoghegan  wrote:
> The number of leaf pages at the left hand side of the leaf level seems
> to be ~50 less than the unpatched 128 client case was the first time
> around, which seems like a significant difference. I wonder why. Maybe
> autovacuum ran at the right/wrong time this time around?

To reiterate what I say above:

The number of leaf pages with dead items is 20 with this most recent
run (128 clients, patched + unpatched). The leftmost internal page one
level up from the leaf level contains 289 items. Whereas last time it
was 353 items.

That's a difference between having 20 hot/bloated leaf pages, and
probably 84 hot/bloated pages, which I infer must have been the total
number of bloated leaf pages within "result.txt". I think that
something about all the "pgbench_index_*txt" tests are very different
to what we see within "result.txt". It's as if there was a problem
when "result.txt" ran, but that problem somehow didn't come up when
you did new tests.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-13 Thread Peter Geoghegan
On Thu, Jul 13, 2017 at 4:38 AM, Alik Khilazhev
 wrote:
> I am attaching results of test for 32 and 128 clients for original and
> patched(_bt_doinsert) variants.

Thanks.

The number of leaf pages at the left hand side of the leaf level seems
to be ~50 less than the unpatched 128 client case was the first time
around, which seems like a significant difference. I wonder why. Maybe
autovacuum ran at the right/wrong time this time around?

Did you happen to record TPS for this most recent set of tests?

I notice one possibly interesting thing from these new numbers: the 32
client case is slightly more bloated when unique index enforcement is
removed.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-13 Thread Fabien COELHO


Hello Alik,

A few comments about the patch v2.

Patch applies and compiles.

Documentation says that the closer theta is to 0 the flatter the distribution,
but the implementation requires at least 1, including strange error messages:

  zipfian parameter must be greater than 1.00 (not 1.00)

Could theta be allowed between 0 & 1 ? I've tried forcing with theta = 0.1
and it worked well, so I'm not sure that I understand the restriction.
I also tried with theta=0.001 but it seemed less good.

I have also tried to check the distribution wrt the explanations, with the 
attached scripts, n=100, theta=1.01/1.5/3.0: It does not seem to work, 
there is repeatable +15% bias on i=3 and repeatable -3% to -30% bias for 
values in i=10-100, this for different values of theta (1.01,1.5, 
3.0).


If you try the script, beware to set parameters (theta, n) consistently.
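For reference, the comparison itself is easy to redo in SQL; a rough sketch
with theta = 1.01 and n = 100 (the table name "compte" and its column "i" are
assumptions for illustration only, since the attached compte_*.sql scripts are
not reproduced in this archive):

  -- compare observed counts per value with the zipfian expectation
  SELECT i,
         count(*) AS observed,
         round((SELECT count(*) FROM compte)
               * power(i, -1.01)
               / (SELECT sum(power(g, -1.01)) FROM generate_series(1, 100) AS g)) AS expected
  FROM compte
  GROUP BY i
  ORDER BY i;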

About the code:

Remove spaces and tabs at end of lines or on empty lines.

zipfn: I would suggest to move the pg_erand48 out and pass the result
instead of passing the thread. the erand call would move to getZipfianRand.

I'm wondering whether the "nearest" hack could be removed so as to simplify
the cache functions code...

Avoid editing lines without changes (spaces/tabs?)
  -  thread->logfile = NULL; /* filled in later */
  +  thread->logfile = NULL; /* filled in later */

The documentation explaining the intuitive distribution talks about an N
but does not provide any hint about its value depending on the parameters.

There is a useless empty line before "" after that.

--
Fabien.

compte_init.sql
Description: application/sql


compte_bench.sql
Description: application/sql


compte_expected.sql
Description: application/sql


compte_results.sql
Description: application/sql

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-13 Thread Alik Khilazhev
> On 13 Jul 2017, at 00:20, Peter Geoghegan  wrote:
>
> Actually, I mean that I wonder how much of a difference it would make
> if this entire block was commented out within _bt_doinsert():
>
> if (checkUnique != UNIQUE_CHECK_NO)
> {
>     …
> }

I am attaching results of tests for 32 and 128 clients for the original and
patched (_bt_doinsert) variants.

—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company

  idx  | level | l_item | blkno | btpo_prev | btpo_next | 
btpo_flags | type | live_items | dead_items | avg_item_size | page_size | 
free_size | highkey 
---+---++---+---+---++--+++---+---+---+-
 pgbench_accounts_pkey | 2 |  1 |   290 | 0 | 0 |   
   2 | r| 10 |  0 |15 |  8192 |  7956 | 
 pgbench_accounts_pkey | 1 |  1 | 3 | 0 |   289 |   
   0 | i|298 |  0 |15 |  8192 |  2196 | 
09 96 01 00 00 00 00 00
 pgbench_accounts_pkey | 1 |  2 |   289 | 3 |   575 |   
   0 | i|285 |  0 |15 |  8192 |  2456 | 
11 2c 03 00 00 00 00 00
 pgbench_accounts_pkey | 1 |  3 |   575 |   289 |   860 |   
   0 | i|285 |  0 |15 |  8192 |  2456 | 
19 c2 04 00 00 00 00 00
 pgbench_accounts_pkey | 1 |  4 |   860 |   575 |  1145 |   
   0 | i|285 |  0 |15 |  8192 |  2456 | 
21 58 06 00 00 00 00 00
 pgbench_accounts_pkey | 1 |  5 |  1145 |   860 |  1430 |   
   0 | i|285 |  0 |15 |  8192 |  2456 | 
29 ee 07 00 00 00 00 00
 pgbench_accounts_pkey | 1 |  6 |  1430 |  1145 |  1715 |   
   0 | i|285 |  0 |15 |  8192 |  2456 | 
31 84 09 00 00 00 00 00
 pgbench_accounts_pkey | 1 |  7 |  1715 |  1430 |  2000 |   
   0 | i|285 |  0 |15 |  8192 |  2456 | 
39 1a 0b 00 00 00 00 00
 pgbench_accounts_pkey | 1 |  8 |  2000 |  1715 |  2285 |   
   0 | i|285 |  0 |15 |  8192 |  2456 | 
41 b0 0c 00 00 00 00 00
 pgbench_accounts_pkey | 1 |  9 |  2285 |  2000 |  2570 |   
   0 | i|285 |  0 |15 |  8192 |  2456 | 
49 46 0e 00 00 00 00 00
 pgbench_accounts_pkey | 1 | 10 |  2570 |  2285 | 0 |   
   0 | i|177 |  0 |15 |  8192 |  4616 | 
 pgbench_accounts_pkey | 0 |  1 | 1 | 0 |  2751 |   
  65 | l| 21 |225 |16 |  8192 |  3228 | 
14 00 00 00 00 00 00 00
 pgbench_accounts_pkey | 0 |  2 |  2751 | 1 |  2746 |   
  65 | l| 40 |182 |16 |  8192 |  3708 | 
3b 00 00 00 00 00 00 00
 pgbench_accounts_pkey | 0 |  3 |  2746 |  2751 |  2750 |   
  65 | l| 43 |240 |16 |  8192 |  2488 | 
65 00 00 00 00 00 00 00
 pgbench_accounts_pkey | 0 |  4 |  2750 |  2746 |  2745 |   
  65 | l| 51 | 57 |16 |  8192 |  5988 | 
97 00 00 00 00 00 00 00
 pgbench_accounts_pkey | 0 |  5 |  2745 |  2750 |  2756 |   
  65 | l| 42 | 47 |16 |  8192 |  6368 | 
c0 00 00 00 00 00 00 00
 pgbench_accounts_pkey | 0 |  6 |  2756 |  2745 |  2748 |   
  65 | l| 54 |139 |16 |  8192 |  4288 | 
f5 00 00 00 00 00 00 00
 pgbench_accounts_pkey | 0 |  7 |  2748 |  2756 |  2755 |   
  65 | l| 57 |333 |16 |  8192 |   348 | 
2d 01 00 00 00 00 00 00
 pgbench_accounts_pkey | 0 |  8 |  2755 |  2748 | 2 |   
  65 | l| 67 |308 |16 |  8192 |   648 | 
6f 01 00 00 00 00 00 00
 pgbench_accounts_pkey | 0 |  9 | 2 |  2755 |  2753 |   
  65 | l| 75 |280 |16 |  8192 |  1048 | 
b9 01 00 00 00 00 00 00
 pgbench_accounts_pkey | 0 | 10 |  2753 | 2 |  2747 |   
  65 | l| 83 |260 |16 |  8192 |  1288 | 
0b 02 00 00 00 00 00 00
 pgbench_accounts_pkey | 0 | 11 |  2747 |  2753 |  2754 |   
  65 | l| 91 |192 |16 |  8192 |  2488 | 
65 02 00 00 00 00 00 00
 pgbench_accounts_pkey | 0 | 12 |  2754 |  2747 | 4 |   
  65 | l|121 |196 |16 

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-12 Thread Peter Geoghegan
On Wed, Jul 12, 2017 at 2:17 PM, Peter Geoghegan  wrote:
> I'd be interested in seeing the difference it makes if Postgres is
> built with the call to _bt_check_unique() commented out within
> nbtinsert.c.

Actually, I mean that I wonder how much of a difference it would make
if this entire block was commented out within _bt_doinsert():

if (checkUnique != UNIQUE_CHECK_NO)
{
...
}


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-12 Thread Peter Geoghegan
On Wed, Jul 12, 2017 at 1:55 PM, Alvaro Herrera
 wrote:
> Not to mention work done with a "buffer cleanup lock" held -- which is
> compounded by the fact that acquiring such a lock is prone to starvation
> if there are many scanners of that index.  I've seen a case where a very
> hot table is scanned so heavily that vacuum is starved for days waiting
> to acquire cleanup on a single page (vacuum was only able to finish
> because the app using the table was restarted).  I'm sure that a uniform
> distribution of keys, with a uniform distribution of values scanned,
> would give a completely different behavior than a highly skewed
> distribution where a single key receives a large fraction of the scans.

Good point.

I'd be interested in seeing the difference it makes if Postgres is
built with the call to _bt_check_unique() commented out within
nbtinsert.c. The UPDATE query doesn't ever change indexed columns, so
there should be no real need for the checking it does in this case.
We're seeing a lot of duplicates in at least part of the keyspace, and
yet the queries themselves should be as HOT-safe as possible.

Another possibly weird thing that I noticed is latency standard
deviation. This is from Alik's response to my request to run that SQL
query:

latency average = 1.375 ms
tps = 93085.250384 (including connections establishing)
tps = 93125.724773 (excluding connections establishing)
SQL script 1: /home/nglukhov/ycsb_read_zipf.sql
 - weight: 1 (targets 50.0% of total)
 - 2782999 transactions (49.8% of total, tps = 46364.447705)
 - latency average = 0.131 ms
 - latency stddev = 0.087 ms
SQL script 2: /home/nglukhov/ycsb_update_zipf.sql
 - weight: 1 (targets 50.0% of total)
 - 2780197 transactions (49.8% of total, tps = 46317.766703)
 - latency average = 2.630 ms
 - latency stddev = 14.092 ms

This is from the 128 client case -- the slow case.

Notice that the standard deviation is very high for
ycsb_update_zipf.sql. I wonder if this degrades because of some kind
of feedback loop, that reaches a kind of tipping point at higher
client counts.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-12 Thread Alvaro Herrera
Peter Geoghegan wrote:

> Now, that might not seem like that much of a difference, but if you
> consider how duplicates are handled in the B-Tree code, and how unique
> index enforcement works, I think it could be. It could lead to heavy
> buffer lock contention, because we sometimes do a lot of work with an
> exclusive buffer lock held.

Not to mention work done with a "buffer cleanup lock" held -- which is
compounded by the fact that acquiring such a lock is prone to starvation
if there are many scanners of that index.  I've seen a case where a very
hot table is scanned so heavily that vacuum is starved for days waiting
to acquire cleanup on a single page (vacuum was only able to finish
because the app using the table was restarted).  I'm sure that a uniform
distribution of keys, with a uniform distribution of values scanned,
would give a completely different behavior than a highly skewed
distribution where a single key receives a large fraction of the scans.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-12 Thread Peter Geoghegan
On Wed, Jul 12, 2017 at 12:30 PM, Peter Geoghegan  wrote:
> On Wed, Jul 12, 2017 at 4:28 AM, Alik Khilazhev
>  wrote:
>> I am attaching results of query that you sent. It shows that there is
>> nothing have changed after executing tests.
>
> But something did change! In the case where performance was good, all
> internal pages on the level above the leaf level have exactly 285
> items, excluding the rightmost page, which unsurprisingly didn't fill.
> However, after increasing client count to get the slow case, the "hot"
> part of the keyspace (the leaf pages owned by the first/leftmost
> internal page on that level) has 353 items rather than 285.

Could you please run the query again for both cases, but this time
change it to get items from the leaf level too, and not just internal
levels? Just remove the "where type != 'l'" clause in one of the CTEs.
That should do it.

It's probably the case that the hot part of the B-tree is actually
significantly less than 353 items (or 285 items), and so the "before"
and "after" difference in how many pages are used for the hot part of
the keyspace could be quite a lot larger than my initial estimate. It
could be that the granularity that is shown on the internal pages is
insufficient to see just how bad the problem is.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-12 Thread Peter Geoghegan
On Wed, Jul 12, 2017 at 4:28 AM, Alik Khilazhev
 wrote:
> I am attaching results of query that you sent. It shows that there is
> nothing have changed after executing tests.

But something did change! In the case where performance was good, all
internal pages on the level above the leaf level have exactly 285
items, excluding the rightmost page, which unsurprisingly didn't fill.
However, after increasing client count to get the slow case, the "hot"
part of the keyspace (the leaf pages owned by the first/leftmost
internal page on that level) has 353 items rather than 285.

Now, that might not seem like that much of a difference, but if you
consider how duplicates are handled in the B-Tree code, and how unique
index enforcement works, I think it could be. It could lead to heavy
buffer lock contention, because we sometimes do a lot of work with an
exclusive buffer lock held.

This is something that I go into a lot of detail on in the Wiki page
on Key normalization:
https://wiki.postgresql.org/wiki/Key_normalization#Avoiding_unnecessary_unique_index_enforcement

Now, I'm not saying that I know for sure that that's what it is. It
seems like a good area to investigate, though. Even if it wasn't
buffer lock contention, we're still talking about the difference
between the hot part of the B-Tree being about 353 pages, versus 285.
Buffer lock contention could naturally limit the growth in size to
"only" 353, by slowing everything down.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-12 Thread Alik Khilazhev
Hello!

I want to say that our company is already engaged in the search for the causes
of the problem and their solution. We also have a few experimental patches
that increase performance for 1000 clients by several times.

In addition, I have fixed thread-safety issues and implemented a per-thread cache
for zeta values. See the attached patch.


pgbench-zipf-02v.patch
Description: Binary data
—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-12 Thread Alik Khilazhev
> On 7 Jul 2017, at 21:53, Peter Geoghegan  wrote:
>
> Is it possible for you to instrument the number of B-Tree page
> accesses using custom instrumentation for pgbench_accounts_pkey?
>
> If that seems like too much work, then it would still be interesting
> to see what the B-Tree keyspace looks like before and after varying
> the "nclient" count from, say, 32 to 128. Maybe there is a significant
> difference in how balanced or skewed it is in each case. Or, the index
> could simply be more bloated.
>
> There is a query that I sometimes use, that itself uses pageinspect,
> to summarize the keyspace quickly. It shows you the highkey for every
> internal page, starting from the root and working down to the lowest
> internal page level (the one just before the leaf level -- level 1),
> in logical/keyspace order. You can use it to visualize the
> distribution of values. It could easily include the leaf level, too,
> but that's less interesting and tends to make the query take ages. I
> wonder what the query will show here.
>
> Here is the query:
> …

I am attaching results of the query that you sent. It shows that nothing has
changed after executing the tests.

before 128

          idx          | level | l_item | blkno | btpo_prev | btpo_next | btpo_flags | type | live_items | dead_items | avg_item_size | page_size | free_size |         highkey
-----------------------+-------+--------+-------+-----------+-----------+------------+------+------------+------------+---------------+-----------+-----------+-------------------------
 pgbench_accounts_pkey |     2 |      1 |   290 |         0 |         0 |          2 | r    |         10 |          0 |            15 |      8192 |      7956 | 
 pgbench_accounts_pkey |     1 |      1 |     3 |         0 |       289 |          0 | i    |        285 |          0 |            15 |      8192 |      2456 | 09 96 01 00 00 00 00 00
 pgbench_accounts_pkey |     1 |      2 |   289 |         3 |       575 |          0 | i    |        285 |          0 |            15 |      8192 |      2456 | 11 2c 03 00 00 00 00 00
 pgbench_accounts_pkey |     1 |      3 |   575 |       289 |       860 |          0 | i    |        285 |          0 |            15 |      8192 |      2456 | 19 c2 04 00 00 00 00 00
 pgbench_accounts_pkey |     1 |      4 |   860 |       575 |      1145 |          0 | i    |        285 |          0 |            15 |      8192 |      2456 | 21 58 06 00 00 00 00 00
 pgbench_accounts_pkey |     1 |      5 |  1145 |       860 |      1430 |          0 | i    |        285 |          0 |            15 |      8192 |      2456 | 29 ee 07 00 00 00 00 00
 pgbench_accounts_pkey |     1 |      6 |  1430 |      1145 |      1715 |          0 | i    |        285 |          0 |            15 |      8192 |      2456 | 31 84 09 00 00 00 00 00
 pgbench_accounts_pkey |     1 |      7 |  1715 |      1430 |      2000 |          0 | i    |        285 |          0 |            15 |      8192 |      2456 | 39 1a 0b 00 00 00 00 00
 pgbench_accounts_pkey |     1 |      8 |  2000 |      1715 |      2285 |          0 | i    |        285 |          0 |            15 |      8192 |      2456 | 41 b0 0c 00 00 00 00 00
 pgbench_accounts_pkey |     1 |      9 |  2285 |      2000 |      2570 |          0 | i    |        285 |          0 |            15 |      8192 |      2456 | 49 46 0e 00 00 00 00 00
 pgbench_accounts_pkey |     1 |     10 |  2570 |      2285 |         0 |          0 | i    |        177 |          0 |            15 |      8192 |      4616 | 
(11 rows)

latency average = 1.375 ms
tps = 93085.250384 (including connections establishing)
tps = 93125.724773 (excluding connections establishing)
SQL script 1: /home/nglukhov/ycsb_read_zipf.sql
 - weight: 1 (targets 50.0% of total)
 - 2782999 transactions (49.8% of total, tps = 46364.447705)
 - latency average = 0.131 ms
 - latency stddev = 0.087 ms
SQL script 2: /home/nglukhov/ycsb_update_zipf.sql
 - weight: 1 (targets 50.0% of total)
 - 2780197 transactions (49.8% of total, tps = 46317.766703)
 - latency average = 2.630 ms
 - latency stddev = 14.092 ms

after 128

          idx          | level | l_item | blkno | btpo_prev | btpo_next | btpo_flags | type | live_items | dead_items | avg_item_size | page_size | free_size |         highkey
-----------------------+-------+--------+-------+-----------+-----------+------------+------+------------+------------+---------------+-----------+-----------+-------------------------
 pgbench_accounts_pkey |     2 |      1 |   290 |         0 |         0 |          2 | r    |         10 |          0 |            15 |      8192 |      7956 | 
 pgbench_accounts_pkey |     1 |      1 |     3 |         0 |       289 |          0 | i    |        353 |          0 |            15 |      8192 |      1096 | 09 96 01 00 00 00 00 00
 pgbench_accounts_pkey |     1 |      2 |   289 |         3 |       575 |          0 | i    |        285 |          0 |            15 |      8192 |      2456 | 11 2c 03 00 00 00 00 

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-10 Thread Fabien COELHO


Hello Alik,

Your description is not very precise. What version of Postgres is used? 
If there is a decline, compared to which version? Is there a link to 
these results?


Benchmarking has been done on master (v10). I am attaching an image with the results.


Ok, thanks.

More precision would be helpful, such as the exact pgbench options used (eg 
how many clients per thread in pgbench, how long it runs, prepared 
transactions, ...).


Intuitively, contention should explain a saturation of the tps 
performance, because more clients are not effective at improving tps as they 
wait for other clients, and the latency would degrade.


But it is unclear to me why the tps would be reduced even with lock 
contention, so something seems amiss.


Performance debugging by mail is an uneasy task.

Maybe you could try zipf with unlogged tables, to check whether skipping 
the WAL write does something.
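For instance (assuming the YCSB-like scripts attached to the first message of
the thread; --unlogged-tables is an existing pgbench initialization option):

  $ pgbench -i -s 10 --unlogged-tables
  $ pgbench -f ycsb_read_zipf.sql -f ycsb_update_zipf.sql -c 128 -j 128 -T 600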


Also Amit's advice about the perf report looks useful.

Given the explanations, the random draw mostly hits values at the 
beginning of the interval, so when the number of client goes higher one 
just get locking contention on the updated row?


Yes, exactly.


Ok. The uniform distribution run, if all other parameters are equal, gives 
a hint about the potential performance when the performance bottleneck is 
hit.


On Workload A with uniform distribution PostgreSQL shows better results 
than MongoDB and MySQL(see attachment). Also you can notice that for 
small number of clients type of distribution does not affect on tps on 
MySQL.


Ok. I assume that you use pgbench for pg and other ad-hoc tools for the 
other targets (mysql & mongodb).


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-10 Thread Amit Kapila
On Mon, Jul 10, 2017 at 12:19 PM, Alik Khilazhev  wrote:

> Hello, Fabien!
>
> Your description is not very precise. What version of Postgres is used? If
> there is a decline, compared to which version? Is there a link to these
> results?
>
>
> Benchmarking has been done on master (v10). I am attaching an image with the results.
>
>
It will be interesting to see what the profiling data with perf says about
this for PostgreSQL.  Can you try to get the perf report?
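For example, something along these lines while the benchmark is running (one
common way to capture a system-wide profile; adjust the duration to taste):

  $ perf record -a -g -- sleep 60
  $ perf report --no-children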


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-10 Thread Alik Khilazhev
Hello, Fabien!

> Your description is not very precise. What version of Postgres is used? If 
> there is a decline, compared to which version? Is there a link to these 
> results?

Benchmarking has been done on master (v10). I am attaching an image with the results.

> Indeed, the function computation is over expensive, and the numerical 
> precision of the implementation is doubtful.
> 
> If there is no better way to compute this function, ISTM that it should be 
> summed in reverse order to accumulate small values first, from (1/n)^s + ... 
> + (1/2)^ s. As 1/1 == 1, the corresponding term is 1, no point in calling pow 
> for this one, so it could be:
> 
>   double ans = 0.0;
>   for (i = n; i >= 2; i--)
> ans += pow(1. / i, theta);
>   return 1.0 + ans;

You are right, it’s better to reverse order.
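As a complete function, that summation could look roughly like this (a sketch
only; the function name is illustrative, not the patch's):

#include <math.h>

/* generalized harmonic number H_{n,s} = sum_{i=1..n} 1/i^s,
 * accumulated from the smallest terms first to limit floating-point error */
static double
generalized_harmonic(long n, double s)
{
    double  sum = 0.0;
    long    i;

    for (i = n; i >= 2; i--)
        sum += pow(1.0 / i, s);
    return 1.0 + sum;           /* the i = 1 term is exactly 1 */
}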

> If the functions when actually used is likely to be called with different 
> parameters, then some caching beyond the last value would seem in order. 
> Maybe a small fixed size array?
> 
> However, it should be somehow thread safe, which does not seem to be the case 
> with the current implementation. Maybe a per-thread cache? Or use a lock only 
> to update a shared cache? At least it should avoid locking to read values…

Yeah, I forgot about thread-safety. I will implement a per-thread cache with a
small fixed array.
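For illustration, such a per-thread cache could be shaped roughly like this (a
sketch only; it loosely follows names that come up later in the thread, such as
ZIPF_CACHE_SIZE and nb_cells, but it is not the patch's actual code):

#define ZIPF_CACHE_SIZE 15              /* small fixed number of cells */

/* one cached parameter set, holding the expensive generalized harmonic number */
typedef struct
{
    long        n;
    double      s;                      /* skew parameter (theta) */
    double      harmonic_n;             /* H_{n,s}, costly to recompute */
    long        last_used;              /* for least-recently-used eviction */
} ZipfCacheCell;

/* embedded in each thread's state, so lookups need no locking */
typedef struct
{
    ZipfCacheCell   cells[ZIPF_CACHE_SIZE];
    int             nb_cells;           /* cells currently filled */
    long            use_counter;        /* bumped on every lookup */
} ZipfCache;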

> Given the explanations, the random draw mostly hits values at the beginning 
> of the interval, so when the number of client goes higher one just get 
> locking contention on the updated row?

Yes, exactly. 

> ISTM that also having the tps achieved with a flat distribution would allow 
> to check this hypothesis.

On Workload A with a uniform distribution PostgreSQL shows better results than
MongoDB and MySQL (see attachment). Also you can notice that for a small number
of clients the type of distribution does not affect tps on MySQL.



And it’s important to mention that postgres was run with the option
synchronous_commit=off, to match the durability of MongoDB
writeConcern=1=false. In this mode there is a possibility of losing all
changes from the last second. If we run postgres with max durability, MongoDB
will lag far behind.
---
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com 
The Russian Postgres Company



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-07 Thread Peter Geoghegan
On Fri, Jul 7, 2017 at 12:59 PM, Alvaro Herrera
 wrote:
> Hmm, this seems potentially very useful.  Care to upload it to
> https://wiki.postgresql.org/wiki/Category:Snippets ?

Sure. I've added it here, under "index maintenance":

https://wiki.postgresql.org/wiki/Index_Maintenance#Summarize_keyspace_of_a_B-Tree_index

It would be a nice additional touch if there was an easy way of taking
the on-disk representation of index tuples (in this case that would be
little-endian signed integers from bt_page_items()), and from that
output actual typed values. Maybe just for a few select datatypes.
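For int8 keys like the ones in this thread, one ad-hoc way is to reverse the
little-endian bytes and cast through bit(64) (a sketch; the hex string here is
pasted from the 'data' column of bt_page_items(), and this one decodes to
103945):

  SELECT ('x' || string_agg(b, '' ORDER BY ord DESC))::bit(64)::bigint AS key
  FROM unnest(string_to_array('09 96 01 00 00 00 00 00', ' '))
         WITH ORDINALITY AS t(b, ord);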

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-07 Thread Alvaro Herrera
Peter Geoghegan wrote:

> Here is the query:
> 
> with recursive index_details as (
>   select
> 'pgbench_accounts_pkey'::text idx
> ), [...]

Hmm, this seems potentially very useful.  Care to upload it to
https://wiki.postgresql.org/wiki/Category:Snippets ?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-07 Thread Peter Geoghegan
On Fri, Jul 7, 2017 at 12:45 AM, Alik Khilazhev
 wrote:
> On scale = 10(1 million rows) it gives following results on machine with 144 
> cores(with synchronous_commit=off):
> nclients  tps
> 1   8842.401870
> 2   18358.140869
> 4   45999.378785
> 8   88713.743199
> 16  170166.998212
> 32  290069.221493
> 64  178128.030553
> 128 88712.825602
> 256 38364.937573
> 512 13512.765878
> 1000    6188.136736

Is it possible for you to instrument the number of B-Tree page
accesses using custom instrumentation for pgbench_accounts_pkey?

If that seems like too much work, then it would still be interesting
to see what the B-Tree keyspace looks like before and after varying
the "nclient" count from, say, 32 to 128. Maybe there is a significant
difference in how balanced or skewed it is in each case. Or, the index
could simply be more bloated.

There is a query that I sometimes use, that itself uses pageinspect,
to summarize the keyspace quickly. It shows you the highkey for every
internal page, starting from the root and working down to the lowest
internal page level (the one just before the leaf level -- level 1),
in logical/keyspace order. You can use it to visualize the
distribution of values. It could easily include the leaf level, too,
but that's less interesting and tends to make the query take ages. I
wonder what the query will show here.

Here is the query:

with recursive index_details as (
  select
'pgbench_accounts_pkey'::text idx
),
size_in_pages_index as (
  select
(pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from
index_details
),
page_stats as (
  select
index_details.*,
stats.*
  from
index_details,
size_in_pages_index,
lateral (select i from generate_series(1, size_pages - 1) i) series,
lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
  select
*
  from
page_stats
  where
type != 'l'),
meta_stats as (
  select
*
  from
index_details s,
lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
  select
*
  from
internal_page_stats
  order by
btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select
1,
blkno,
btpo
  from
internal_items
  where
btpo_prev = 0
and btpo = (select level from meta_stats)
  union
  select
case when level = btpo then o.item + 1 else 1 end,
blkno,
btpo
  from
internal_items i,
ordered_internal_items o
  where
i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
  idx,
  btpo as level,
  item as l_item,
  blkno,
  btpo_prev,
  btpo_next,
  btpo_flags,
  type,
  live_items,
  dead_items,
  avg_item_size,
  page_size,
  free_size,
  -- Only non-rightmost pages have high key.
  case when btpo_next != 0 then (select data from bt_page_items(idx,
blkno) where itemoffset = 1) end as highkey
from
  ordered_internal_items o
  join internal_items i on o.blk = i.blkno
order by btpo desc, item;

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-07 Thread Peter Geoghegan
On Fri, Jul 7, 2017 at 5:17 AM, Robert Haas  wrote:
> How is that possible?  In a Zipfian distribution, no matter how big
> the table is, almost all of the updates will be concentrated on a
> handful of rows - and updates to any given row are necessarily
> serialized, or so I would think.  Maybe MongoDB can be fast there
> since there are no transactions, so it can just lock the row slam in
> the new value and unlock the row, all (I suppose) without writing WAL
> or doing anything hard.

If you're not using the Wired Tiger storage engine, than the locking
is at the document level, which means that a Zipfian distribution is
no worse than any other as far as lock contention goes. That's one
possible explanation. Another is that indexed organized tables
naturally have much better locality, which matters at every level of
the memory hierarchy.

> I'm more curious about why we're performing badly than I am about a
> general-purpose random_zipfian function.  :-)

I'm interested in both. I think that a random_zipfian function would
be quite helpful for modeling certain kinds of performance problems,
like CPU cache misses incurred at the page level.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-07 Thread Robert Haas
On Fri, Jul 7, 2017 at 3:45 AM, Alik Khilazhev
 wrote:
> PostgreSQL shows very bad results in YCSB Workload A (50% SELECT and 50% 
> UPDATE of random row by PK) on benchmarking with big number of clients using 
> Zipfian distribution. MySQL also has decline but it is not significant as it 
> is in PostgreSQL. MongoDB does not have decline at all.

How is that possible?  In a Zipfian distribution, no matter how big
the table is, almost all of the updates will be concentrated on a
handful of rows - and updates to any given row are necessarily
serialized, or so I would think.  Maybe MongoDB can be fast there
since there are no transactions, so it can just lock the row slam in
the new value and unlock the row, all (I suppose) without writing WAL
or doing anything hard.  But MySQL is going to have to hold the row
lock until transaction commit just like we do, or so I would think.
Is it just that their row locking is way faster than ours?

I'm more curious about why we're performing badly than I am about a
general-purpose random_zipfian function.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-07 Thread Fabien COELHO


Hello Alik,

PostgreSQL shows very bad results in YCSB Workload A (50% SELECT and 50% 
UPDATE of random row by PK) on benchmarking with big number of clients 
using Zipfian distribution. MySQL also has decline but it is not 
significant as it is in PostgreSQL. MongoDB does not have decline at 
all. And if pgbench would have Zipfian distribution random number 
generator, everyone will be able to make research on this topic without 
using YCSB.


Your description is not very precise. What version of Postgres is used? If 
there is a decline, compared to which version? Is there a link to these 
results?



This is the reason why I am currently working on random_zipfian function.


The bottleneck of algorithm that I use is that it calculates zeta 
function (it has linear complexity - 
https://en.wikipedia.org/wiki/Riemann_zeta_function). It my cause 
problems on generating huge amount of big numbers.


Indeed, the function computation is overly expensive, and the numerical 
precision of the implementation is doubtful.


If there is no better way to compute this function, ISTM that it should be 
summed in reverse order to accumulate small values first, from (1/n)^s + 
... + (1/2)^ s. As 1/1 == 1, the corresponding term is 1, no point in 
calling pow for this one, so it could be:


   double ans = 0.0;
   for (i = n; i >= 2; i--)
 ans += pow(1. / i, theta);
   return 1.0 + ans;

That’s why I added caching for zeta value. And it works good for cases 
when random_zipfian called with same parameters in script. For example:


That’s why I have a question: should I implement support of caching zeta 
values for calls with different parameters, or not?


I do not envision the random_zipfian function to be used widely, but if it 
is useful to someone this is fine with me. Could an inexpensive 
exponential distribution be used instead for the same benchmarking 
purpose?
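For instance, a variant of the read script using the existing built-in would be
just (a sketch; the parameter value 5.0 is arbitrary):

  \set aid random_exponential(1, 100000 * :scale, 5.0)
  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;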


If the function, when actually used, is likely to be called with different 
parameters, then some caching beyond the last value would seem in order. 
Maybe a small fixed size array?


However, it should be somehow thread safe, which does not seem to be the 
case with the current implementation. Maybe a per-thread cache? Or use a 
lock only to update a shared cache? At least it should avoid locking to 
read values...



P.S. I attaching patch and script - analogue of YCSB Workload A.
Run benchmark with command:
$ pgbench -f  ycsb_read_zipf.sql -f  ycsb_update_zipf.sql


Given the explanations, the random draw mostly hits values at the 
beginning of the interval, so when the number of clients goes higher one 
just gets locking contention on the updated row?


ISTM that also having the tps achieved with a flat distribution would 
allow to check this hypothesis.


On scale = 10(1 million rows) it gives following results on machine with 
144 cores(with synchronous_commit=off):



nclients  tps
1   8842.401870
2   18358.140869
4   45999.378785
8   88713.743199
16  170166.998212
32  290069.221493
64  178128.030553
128 88712.825602
256 38364.937573
512 13512.765878
1000    6188.136736


--
Fabien.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-07 Thread Alik Khilazhev
Hello!

PostgreSQL shows very bad results in YCSB Workload A (50% SELECT and 50% UPDATE
of a random row by PK) when benchmarking with a big number of clients using a
Zipfian distribution. MySQL also shows a decline, but it is not as significant
as it is in PostgreSQL. MongoDB does not show a decline at all. And if pgbench
had a Zipfian-distribution random number generator, everyone would be able to
do research on this topic without using YCSB.
 
This is the reason why I am currently working on random_zipfian function.

The bottleneck of the algorithm that I use is that it calculates the zeta
function (it has linear complexity -
https://en.wikipedia.org/wiki/Riemann_zeta_function). It may cause problems
when generating a huge amount of big numbers.

That’s why I added caching for the zeta value. And it works well for cases when
random_zipfian is called with the same parameters in a script. For example:

… 
\set a random_zipfian(1, 100, 1.2)
\set b random_zipfian(1, 100, 1.2)
…

In the other case, the second call will override the cache of the first and
caching does not make any sense:
…
\set a random_zipfian(1, 100, 1.2)
\set b random_zipfian(1, 200, 1.4)
… 

That’s why I have a question: should I implement support of caching zeta values 
for calls with different parameters, or not? 

P.S. I am attaching a patch and script - an analogue of YCSB Workload A.
Run benchmark with command:
$ pgbench -f  ycsb_read_zipf.sql -f  ycsb_update_zipf.sql

On scale = 10 (1 million rows) it gives the following results on a machine with
144 cores (with synchronous_commit=off):
nclients  tps
1   8842.401870
2   18358.140869
4   45999.378785
8   88713.743199
16  170166.998212
32  290069.221493
64  178128.030553
128 88712.825602
256 38364.937573
512 13512.765878
1000    6188.136736


ycsb_read_zipf.sql
Description: application/sql


pgbench-zipf-01v.patch
Description: Binary data


ycsb_update_zipf.sql
Description: application/sql

—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers