Hi hackers,

the case of the planner's src/backend/utils/adt/selfuncs.c:get_actual_variable_endpoint() spending literally seconds seems to be a well-known issue among hackers (in extreme wild cases it can take over an hour on VLDBs). For those unfamiliar: it is a planner estimation step that reads the real table index (including dead rows) to find the actual min/max values. It is a black-box mechanism that works without any warning, is often hugely affected by the number of dead tuples in indexes, and has no on/off switch or built-in limit on how far it can go. It has been discussed on the pgsql mailing lists several times [1]-[5]. It works fine in 99.9% of cases, until it doesn't and blows up big time on larger systems, and from there the operator doesn't have many choices (a lot of time having already been wasted on identifying the planner as the root cause):

1) one can properly VACUUM (which everybody seems to agree is the proper way to go, but it is often problematic due to various other circumstances, especially on big tables without a serious partitioning strategy) - again, this can be very time-consuming

2) there is no way to trade the CPU/IO burned during planning (actually fetching index pages on multi-TB tables) for less accurate plans, and realistically speaking rewriting queries is often impossible

3) the application might not support prepared statements, and even if it does, simple queries/reports are still affected

4) there is no visibility into how much activity is spent in btree index get_actual_variable_endpoint() alone, so one cannot estimate the system-wide impact
I would like to trigger a discussion on how to give the end-user at least partial control over what the optimizer performs here. I was thinking about several options, each with its pros and cons:

a) the attached quick patch (by Simon Riggs) that puts a maximum allowed cost constraint on the index-lookup machinery as a safeguard (that #define is debatable; in my testing it reduced the hit from ~850ms to 0.6ms +/- 0.3ms at the current value of 20).

b) I was wondering about creating a new wait class "Planner" with the event "ReadingMinMaxIndex" (or similar). The obvious drawback is the naming/categorization: wait events are, well, as the name "WAIT" implies, waits, while those btree lookups could easily be ON-CPU activity.

c) Any other idea, e.g. see [3] or [5] (a cache was proposed there).

d) For completeness: a new GUC/knob to completely disable the functionality (debug_optimizer_minmax_est), but that is really just a stripped-down version of the patch.

e) I was also wondering about some DEBUG/NOTICE elog() when this takes more than, say, an arbitrary 10s, but that could easily spam the log file.

A reproducer on a very small dataset follows. Please note the reproducer here shows the effect on the 1st EXPLAIN run, whereas in the real observed situation (multi-TB unpartitioned table) every consecutive planner operation (just EXPLAIN) on that index was affected (I don't know why LP_DEAD/hint-bit cleanup was not kicking in; maybe it was, but given the scale of the problem it was not helping much).

-Jakub Wartak
[1] - https://www.postgresql.org/message-id/flat/54446AE2.6080909%40BlueTreble.com#f436bb41cf044b30eeec29472a13631e
[2] - https://www.postgresql.org/message-id/flat/db7111f2-05ef-0ceb-c013-c34adf4f4121%40gmail.com
[3] - https://www.postgresql.org/message-id/flat/05C72CF7-B5F6-4DB9-8A09-5AC897653113%40yandex.ru (SnapshotAny vs SnapshotDirty discussion between Tom and Robert)
[4] - https://www.postgresql.org/message-id/flat/CAECtzeVPM4Oi6dTdqVQmjoLkDBVChNj7ed3hNs1RGrBbwCJ7Cw%40mail.gmail.com
[5] - https://postgrespro.com/list/thread-id/2436130 (cache)

s1:
=# drop table test;
=# create table test (id bigint primary key) with (autovacuum_enabled = 'off');
=# insert into test select generate_series(1,10000000); -- ~310MB table, ~210MB index

s2 / start the long-running transaction:
=# begin;
=# select min(id) from test;

s1:
=# delete from test where id>1000000;
=# analyze test;
=# set enable_indexonlyscan = 'off'; -- just in case, to force BitmapHeapScans, which according to backend/access/nbtree/README won't set LP_DEAD, but my bt_page_items() tests indicate that it does (??)
=# set enable_indexscan = 'off';
=# explain (buffers, verbose) select * from test where id > 11000000;
=> Planning: Buffers: shared hit=9155 read=55276 dirtied=55271 written=53617 / Time: 851.761 ms
=# explain (buffers, verbose) select * from test where id > 11000000;
=> Planning: Buffers: shared read=24594 / Time: 49.172 ms
=# vacuum (verbose) test;
=> index scan needed: 39824 pages from table (90.00% of total) had 9000000 dead item identifiers removed
=# explain (buffers, verbose) select * from test where id > 11000000;
=> Planning: Buffers: shared hit=14 read=3 / Time: 0.550 ms

With the patch, the results are:

p=# explain (buffers, verbose) select * from test where id > 11000000;
=> Planning: Buffers: shared hit=17 read=6 dirtied=3 written=5 / Time: 0.253 ms
p=# explain (buffers, verbose) select * from test where id > 11000000;
=> Planning: Buffers: shared hit=11 read=2 dirtied=2 / Time: 0.576 ms

So there's no dramatic hit.
apply_cost_limitation_get_actual_variable_endpoint.patch