Hello Alik,

PostgreSQL shows very bad results in YCSB Workload A (50% SELECT and 50% UPDATE of a random row by PK) when benchmarking with a large number of clients using a Zipfian distribution. MySQL also shows a decline, but it is not as significant as in PostgreSQL, and MongoDB shows no decline at all. If pgbench had a Zipfian-distributed random number generator, everyone would be able to investigate this topic without using YCSB.

Your description is not very precise. What version of Postgres is used? If there is a decline, compared to which version? Is there a link to these results?

This is the reason why I am currently working on a random_zipfian function.

The bottleneck of the algorithm I use is that it computes a (truncated) zeta function, which has linear complexity in the size of the interval (https://en.wikipedia.org/wiki/Riemann_zeta_function). It may cause problems when generating numbers over huge intervals.

Indeed, the function computation is overly expensive, and the numerical precision of the implementation is doubtful.

If there is no better way to compute this function, ISTM that it should be summed in reverse order so as to accumulate the small values first, from (1/n)^s + ... + (1/2)^s. As 1/1 == 1, the corresponding term is exactly 1, so there is no point in calling pow for that one; it could be:

       double ans = 0.0;

       /* accumulate the smallest terms first to limit rounding error */
       for (i = n; i >= 2; i--)
             ans += pow(1. / i, theta);
       /* the i == 1 term is exactly 1.0, no pow() call needed */
       return 1.0 + ans;

That is why I added caching for the zeta value, and it works well in cases where random_zipfian is called with the same parameters throughout the script.

That is why I have a question: should I implement caching of zeta values for calls with different parameters, or not?

I do not envision the random_zipfian function being used widely, but if it is useful to someone, that is fine with me. Could an inexpensive exponential distribution be used instead for the same benchmarking purpose?
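pgbench already provides random_exponential, so a draw along these lines would be cheap (the parameter value below is arbitrary, just for illustration):

       \set aid random_exponential(1, 100000 * :scale, 10.0)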

If the function, when actually used, is likely to be called with different parameters, then some caching beyond the last value would seem in order. Maybe a small fixed-size array?

However, it should somehow be thread-safe, which does not seem to be the case with the current implementation. Maybe a per-thread cache? Or use a lock only to update a shared cache? At least it should avoid locking to read values...
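Something like the following rough sketch is what I have in mind: a small fixed-size array of (n, s, zeta) cells kept per thread, with least-recently-used replacement, so that no locking is needed at all. The names here (ZipfCache, getZetaCached, ZIPF_CACHE_SIZE...) are made up for the illustration and are not taken from your patch:

    #include <math.h>
    #include <stdint.h>

    typedef int64_t int64;              /* pgbench-style alias, for this sketch only */

    #define ZIPF_CACHE_SIZE 8           /* small fixed number of cached parameter pairs */

    typedef struct
    {
        int64   n;                      /* size of the interval */
        double  s;                      /* exponent (theta) */
        double  zeta;                   /* cached generalized harmonic number */
        int64   last_used;              /* for least-recently-used replacement */
    } ZipfCacheCell;

    typedef struct
    {
        ZipfCacheCell cells[ZIPF_CACHE_SIZE];
        int         nb_cells;           /* number of cells currently filled */
        int64       counter;            /* monotonic counter for LRU bookkeeping */
    } ZipfCache;                        /* one instance per pgbench thread, hence no locking */

    /* sum the smallest terms first to limit floating point rounding error */
    static double
    computeZeta(int64 n, double s)
    {
        double  ans = 0.0;
        int64   i;

        for (i = n; i >= 2; i--)
            ans += pow(1.0 / i, s);
        return 1.0 + ans;               /* the i == 1 term is exactly 1 */
    }

    /* return zeta(n, s) from the cache, computing and storing it on a miss */
    static double
    getZetaCached(ZipfCache *cache, int64 n, double s)
    {
        int     i;
        int     victim = 0;

        for (i = 0; i < cache->nb_cells; i++)
        {
            ZipfCacheCell *cell = &cache->cells[i];

            if (cell->n == n && cell->s == s)
            {
                cell->last_used = cache->counter++;
                return cell->zeta;      /* hit */
            }
            if (cell->last_used < cache->cells[victim].last_used)
                victim = i;             /* remember the least recently used cell */
        }

        /* miss: fill a free cell if any, otherwise evict the LRU one */
        if (cache->nb_cells < ZIPF_CACHE_SIZE)
            victim = cache->nb_cells++;

        cache->cells[victim].n = n;
        cache->cells[victim].s = s;
        cache->cells[victim].zeta = computeZeta(n, s);
        cache->cells[victim].last_used = cache->counter++;
        return cache->cells[victim].zeta;
    }

The obvious price of the per-thread approach is that each thread may recompute the same zeta value once, which seems acceptable if the cache is only there to avoid recomputation on every draw.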

P.S. I am attaching the patch and scripts - an analogue of YCSB Workload A.
Run the benchmark with the command:
$ pgbench -f ycsb_read_zipf.sql -f ycsb_update_zipf.sql
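For context, and without having looked at the attached files in detail, I would expect the two scripts to amount to something like the sketch below; the 0.99 exponent and the use of the standard pgbench_accounts table are assumptions on my part:

       ycsb_read_zipf.sql might be:

              \set aid random_zipfian(1, 100000 * :scale, 0.99)
              SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

       and ycsb_update_zipf.sql:

              \set aid random_zipfian(1, 100000 * :scale, 0.99)
              UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :aid;

With two -f scripts, pgbench chooses one of them at random for each transaction, which yields the 50/50 read/update mix.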

Given the explanations, the random draw mostly hits values at the beginning of the interval, so when the number of clients goes up, one simply gets locking contention on the most frequently updated rows?

ISTM that also having the tps achieved with a flat distribution would allow checking this hypothesis.
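The flat-distribution variant would simply replace the Zipfian draw with a uniform one in both scripts, e.g. (same assumptions as the sketch above):

       \set aid random(1, 100000 * :scale)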

On scale = 10 (1 million rows) it gives the following results on a machine with 144 cores (with synchronous_commit=off):

        nclients        tps
        1               8842.401870
        2               18358.140869
        4               45999.378785
        8               88713.743199
        16              170166.998212
        32              290069.221493
        64              178128.030553
        128             88712.825602
        256             38364.937573
        512             13512.765878
        1000            6188.136736

--
Fabien.