Hi hackers,

the case of the planner's src/backend/utils/adt/selfuncs.c:get_actual_variable_endpoint() spending literally seconds seems to be a well-known issue among hackers (in extreme wild cases it can take over an hour on VLDBs). For those unfamiliar: it is a planner estimation step that reads the real table index (including dead rows) to find the actual min/max values. It is a black-box mechanism that works without any warning, is often hugely affected by the number of dead tuples in indexes, and has no on/off switch or built-in limit on how far it can go. It has been discussed on the pgsql mailing lists several times [1]-[5]. It works fine in 99.9% of cases, until it doesn't and blows up big time on larger systems, and from there the operator doesn't have many choices (a lot of time having already been wasted on identifying the planner as the root cause):

1) one can properly VACUUM (which everybody seems to agree is the proper way to go, but it is often problematic due to various other circumstances, especially on big tables without a serious partitioning strategy) - again, this can be very time-consuming

2) there is no way to trade the CPU/IO burned during planning (actually fetching index pages on multi-TB tables) for less accurate plans, and realistically speaking rewriting queries is often impossible

3) the application might not support prepared statements, and even if it does, simple queries/reports are still affected

4) there is no visibility into how much activity is spent in btree index get_actual_variable_endpoint() alone, so one cannot estimate the system-wide impact
I would like to trigger a discussion on how to give the end-user at least partial control over what the optimizer performs here. I was thinking about several options, each with its pros and cons:

a) the attached quick patch (by Simon Riggs) that puts a maximum allowed cost constraint on the index-lookup machinery as a safeguard (that #define is debatable; in my testing it reduced the hit from ~850ms to 0.6ms +/- 0.3ms at the current value of 20).

b) I was wondering about creating a new wait class "Planner" with the event "ReadingMinMaxIndex" (or similar). The obvious drawback is the naming/categorization: wait events are, well, as the name "WAIT" implies, waits, while those btree lookups could easily be ON-CPU activity.

c) Any other idea, e.g. see [3] or [5] (a cache was proposed there).

d) For completeness: a new GUC/knob to completely disable the functionality (debug_optimizer_minmax_est), but that is really just a stripped-down version of the patch.

e) I was also wondering about some DEBUG/NOTICE elog() when this takes more than, say, an arbitrary 10s, but that could easily spam the log file.

A reproducer on a very small dataset follows. Please note the reproducer here shows the effect on the 1st EXPLAIN run, whereas in the real observed situation (multi-TB unpartitioned table) every consecutive planner operation (just EXPLAIN) on that index was affected (I don't know why LP_DEAD/hint-bit cleanup was not kicking in; maybe it was, but given the scale of the problem it was not helping much).

-Jakub Wartak
[1] - https://www.postgresql.org/message-id/flat/54446AE2.6080909%40BlueTreble.com#f436bb41cf044b30eeec29472a13631e
[2] - https://www.postgresql.org/message-id/flat/db7111f2-05ef-0ceb-c013-c34adf4f4121%40gmail.com
[3] - https://www.postgresql.org/message-id/flat/05C72CF7-B5F6-4DB9-8A09-5AC897653113%40yandex.ru (SnapshotAny vs SnapshotDirty discussion between Tom and Robert)
[4] - https://www.postgresql.org/message-id/flat/CAECtzeVPM4Oi6dTdqVQmjoLkDBVChNj7ed3hNs1RGrBbwCJ7Cw%40mail.gmail.com
[5] - https://postgrespro.com/list/thread-id/2436130 (cache)

s1:
=# drop table test;
=# create table test (id bigint primary key) with (autovacuum_enabled = 'off');
=# insert into test select generate_series(1,10000000); -- ~310MB table, ~210MB index

s2 / start the long-running transaction:
=# begin;
=# select min(id) from test;

s1:
=# delete from test where id>1000000;
=# analyze test;
=# set enable_indexonlyscan = 'off'; -- just in case, to force BitmapHeapScans, which according to backend/access/nbtree/README won't set LP_DEAD, but my bt_page_items() tests indicate that it does (??)
=# set enable_indexscan = 'off';
=# explain (buffers, verbose) select * from test where id > 11000000;
=> Planning: Buffers: shared hit=9155 read=55276 dirtied=55271 written=53617 / Time: 851.761 ms
=# explain (buffers, verbose) select * from test where id > 11000000;
=> Planning: Buffers: shared read=24594 / Time: 49.172 ms
=# vacuum (verbose) test;
=> index scan needed: 39824 pages from table (90.00% of total) had 9000000 dead item identifiers removed
=# explain (buffers, verbose) select * from test where id > 11000000;
=> Planning: Buffers: shared hit=14 read=3 / Time: 0.550 ms

With the patch, the results are:

p=# explain (buffers, verbose) select * from test where id > 11000000;
=> Planning: Buffers: shared hit=17 read=6 dirtied=3 written=5 / Time: 0.253 ms
p=# explain (buffers, verbose) select * from test where id > 11000000;
=> Planning: Buffers: shared hit=11 read=2 dirtied=2 / Time: 0.576 ms

So there's no dramatic hit.
apply_cost_limitation_get_actual_variable_endpoint.patch