Hi, I'm trying to use random_zipfian() for benchmarking of skewed data sets, and I ran head-first into an issue with rather excessive CPU costs.
Consider an example like this: pgbench -i -s 10000 test pgbench -s 10000 -f zipf.sql -T 30 test where zipf.sql does this: \SET id random_zipfian(1, 100000 * :scale, 0.1) BEGIN; SELECT * FROM pgbench_accounts WHERE aid = :id; END; Unfortunately, this produces something like this: transaction type: zipf.sql scaling factor: 10000 query mode: simple number of clients: 1 number of threads: 1 duration: 30 s number of transactions actually processed: 1 latency average = 43849.143 ms tps = 0.022805 (including connections establishing) tps = 0.022806 (excluding connections establishing) which is somewhat ... not great, I guess. This happens because generalizedHarmonicNumber() does this: for (i = n; i > 1; i--) ans += pow(i, -s); where n happens to be 1000000000 (range passed to random_zipfian), so the loop takes quite a bit of time. So much that we only ever complete a single transaction, because this work happens in the context of the first transction, and so it counts into the 30-second limit. The docs actually do mention performance of this function: The function's performance is poor for parameter values close and above 1.0 and on a small range. But that does not seem to cover the example I just posted, because 0.1 is not particularly close to 1.0 (or above 1.0), and 1e9 values hardly counts as "small range". I see this part of random_zipfian comes from "Quickly Generating Billion-Record Synthetic Databases" which deals with generating data sets, where wasting a bit of time is not a huge deal. But when running benchmarks it matters much more. So maybe there's a disconnect here? Interestingly enough, I don't see this issue on values above 1.0, no matter how close to 1.0 those are. Although the throughput seems lower than with uniform access, so this part of random_zipfian is not quite cheap either. Now, I don't know what to do about this. Ideally, we'd have a faster algorithm to generate zipfian distributions - I don't know if such thing exists though. Or maybe we could precompute the distribution first, not counting it into the benchmark duration. But I guess the very least we can/should do is improving the docs to make it more obvious which cases are expected to be slow. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services