Re: Reducing planning time on tables with many indexes
On 2022-Aug-19, David Geier wrote: > Beyond that I did some off-CPU profiling to precisely track down which lock > serializes execution. It turned out to be the MyProc::fpInfoLock lightweight > lock. This lock is used in the fast path of the heavyweight lock. In the > contenting case, fpInfoLock is acquired in LW_EXCLUSIVE mode to (1) check if > there is no other process holding a stronger lock, and if not, to reserve a > process local fast path lock slot and (2) to return the fast path lock slots > all in one go. To do so, the current implementation always linearly iterates > over all lock slots. Ah, so this is the aspect that you mentioned to me today. I definitely think that this analysis deserves its own thread, and the fix is its own separate patch. > I have attached the patch to improve the heavyweight lock fast path. It also > for now contains moving out _bt_getrootheight(). For workloads where the > same set of locks is used over and over again, it only needs on average a > single loop iteration to find the relation (instead of a linear scan > before). This allows to increase the number of fast path locks by a lot. In > this patch I increased them from 16 to 64. The code can be further improved > for cases where to be locked relations change frequently and therefore the > chance of not finding a relation and because of that having to linearly > search the whole array is higher. I suggest to put each change in a separate patch: 1. improve fast-path lock algorithm to find the element, perhaps together with increasing the number of elements in the array 2. change _bt_getrootheight However, since patch (1) may have nontrivial performance implications, you would also need to justify the change: not only that improves the case where many locks are acquired, but also that it does not make the case with few locks worse. I strongly suggest to not include C++ comments or any other dirtiness in the patch, as that might deter some potential reviewers. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "It takes less than 2 seconds to get to 78% complete; that's a good sign. A few seconds later it's at 90%, but it seems to have stuck there. Did somebody make percentages logarithmic while I wasn't looking?" http://smylers.hates-software.com/2005/09/08/1995c749.html
Re: Reducing planning time on tables with many indexes
On 8/1/22 15:33, David Geier wrote: Hi Tom, On Wed, Jul 27, 2022 at 7:15 PM Tom Lane wrote: I wrote: > Unfortunately, as things stand today, the planner needs more than the > right to look at the indexes' schemas, because it makes physical accesses > to btree indexes to find out their tree height (and I think there are some > comparable behaviors in other AMs). I've never particularly cared for > that implementation, and would be glad to rip out that behavior if we can > find another way. Maybe we could persuade VACUUM or ANALYZE to store that > info in the index's pg_index row, or some such, and then the planner > could use it with no lock? It seems like _bt_getrootheight() first checks if the height is cached and only if it isn't it accesses index meta pages. If the index locks are only taken for the sake of _bt_getrootheight() accessing index meta pages in case they are not cached, maybe the index locks could be taken conditionally. However, postponing the call to where it is really needed sounds even better. A first step here could just be to postpone fetching _bt_getrootheight() until we actually need it during cost estimation. That would avoid the need to do it at all for indexes that indxpath.c discards as irrelevant, which is a decision made on considerably more information than the proposed patch uses. Hi Tom, I gave the idea of moving _bt_getrootheight() into costsize.c and filling IndexOptInfo in get_relation_info() via syscache instead of relcache a try, but didn't get very far. Moving out _bt_getrootheight() was straightforward, and we should do nevertheless. However, it seems like get_relation_info() strongly depends on the index's Relation for quite some stuff. A fair amount of fields I could actually fill from syscache, but there are some that either need data not stored in syscache (e.g. estimate_rel_size(), Relation::rd_smgr needed by RelationGetNumberOfBlocksInFork()) or need fields that are cached in the index's Relation and would have to be recomputed otherwise (e.g. Relation::rd_indexprs filled by RelationGetIndexExpressions(), Relation::rd_indpred filled by RelationGetIndexPredicate()). Even if we could somehow obtain the missing info from somewhere, recomputing the otherwise cached fields from Relation would likely cause a significant slowdown in the serial case. Beyond that I did some off-CPU profiling to precisely track down which lock serializes execution. It turned out to be the MyProc::fpInfoLock lightweight lock. This lock is used in the fast path of the heavyweight lock. In the contenting case, fpInfoLock is acquired in LW_EXCLUSIVE mode to (1) check if there is no other process holding a stronger lock, and if not, to reserve a process local fast path lock slot and (2) to return the fast path lock slots all in one go. To do so, the current implementation always linearly iterates over all lock slots. The corresponding call stacks are: get_relation_info() CommitTransaction() index_open() ResourceOwnerRelease() relation_open() ResourceOwnerReleaseInternal() LockRelationOid() ProcReleaseLocks() LockAcquireExtended() LockReleaseAll() <-- called twice from ProcReleaseLocks() LWLockAcquire() On top of that there are only 16 fast path lock slots. One slot is always taken up by the parent relation, leaving only 15 slots for the indexes. As soon as a session process runs out of slots, it falls back to the normal lock path which has to mess around with the lock table. To do so it also acquires a lightweight lock in LW_EXCLUSIVE mode. This lightweight lock however is partitioned and therefore does not content. Hence, normal lock acquisition is slower but contents less. To prove above findings I increased the number of fast path lock slots per connection and optimized FastPathGrantRelationLock() and FastPathUnGrantRelationLock(). With these changes the lock contention disappeared and the workload scales linearly (the code I tested with also included moving out _bt_getrootheight()): | Parallel streams | TPS | TPS / stream | |--|--|---| | 1 | 5,253 | 5,253 | | 10 | 51,406 | 5,140 | | 20 | 101,401 | 5,070 | | 30 | 152,023 | 5,067 | | 40 | 200,607 | 5,015 | | 50 | 245,359 | 4,907 | | 60 | 302,994 | 5,049 | However, with the very same setup, the index filtering approach yields 486k TPS with 60 streams and 9,827 TPS with a single stream. The single stream number shows that this is not because it scales even better, but just because less work is spent during planning. A quick perf session showed that a fair amount of time is spent to get the rela
Re: Reducing planning time on tables with many indexes
On 27.07.22, 18:39, "Tom Lane" wrote: [External Email] David Geier writes: > We tracked down the root cause of this slowdown to lock contention in > 'get_relation_info()'. The index lock of every single index of every single > table used in that query is acquired. We attempted a fix by pre-filtering > out all indexes that anyways cannot be used with a certain query, without > taking the index locks (credits to Luc Vlaming for idea and > implementation). The patch does so by caching the columns present in every > index, inside 'struct Relation', similarly to 'rd_indexlist'. I wonder how much thought you gave to the costs imposed by that extra cache space. We have a lot of users who moan about relcache bloat already. But more to the point, I do not buy the assumption that an index's set of columns is a good filter for which indexes are of interest. A trivial counterexample from the regression database is regression=# explain select count(*) from tenk1; QUERY PLAN Aggregate (cost=219.28..219.29 rows=1 width=8) -> Index Only Scan using tenk1_hundred on tenk1 (cost=0.29..194.28 rows=100 00 width=0) (2 rows) It looks to me like the patch also makes unwarranted assumptions about being able to discard all but the smallest index having a given set of columns. This would, for example, possibly lead to dropping the index that has the most useful sort order, or that has the operator class needed to support a specific WHERE clause. Thanks for checking out the patch! Just to make sure we're on the same page: we're only making this assumption if you select no fields at all. If you select any fields at all it will check for column overlap, and if there's any overlap with any referenced field, then the index will not be filtered out. For producing a row count with no referenced fields it is true that it should select the truly cheapest index to produce the row count and there should be some Index-am callback introduced for that. For now it was just a quick-and-dirty solution. Wouldn't a callback that would estimate the amount of data read be good enough though? For sort orders the field to sort by should be listed and hence the index should not be filtered out, or what am I missing? Likely I've missed some fields that are referenced somehow (potentially indirectly), but that shouldn't disqualify the approach completely. In short, I'm not sure I buy this concept at all. I think it might be more useful to attack the locking overhead more directly. I kind of wonder why we need per-index locks at all during planning --- I think that we already interpret AccessShareLock on the parent table as being sufficient to block schema changes on existing indexes. Could you elaborate as to why this approach is not good enough? To me it seems that avoiding work ahead of time is generally useful. Or are you worried that we remove too much? Unfortunately, as things stand today, the planner needs more than the right to look at the indexes' schemas, because it makes physical accesses to btree indexes to find out their tree height (and I think there are some comparable behaviors in other AMs). I've never particularly cared for that implementation, and would be glad to rip out that behavior if we can find another way. Maybe we could persuade VACUUM or ANALYZE to store that info in the index's pg_index row, or some such, and then the planner could use it with no lock? regards, tom lane The thing you're touching on is specific for a btree. Not sure this generalizes to all index types that are out there though? I could see there being some property that allows you to be "no-lock", and then a callback that allows you to cache some generic data that can be transformed when the indexopt info structs are filled. Is that roughly what you have in mind? Best, Luc
Re: Reducing planning time on tables with many indexes
Hi Tom, On Wed, Jul 27, 2022 at 6:39 PM Tom Lane wrote: > David Geier writes: > > We tracked down the root cause of this slowdown to lock contention in > > 'get_relation_info()'. The index lock of every single index of every > single > > table used in that query is acquired. We attempted a fix by pre-filtering > > out all indexes that anyways cannot be used with a certain query, without > > taking the index locks (credits to Luc Vlaming for idea and > > implementation). The patch does so by caching the columns present in > every > > index, inside 'struct Relation', similarly to 'rd_indexlist'. > > I wonder how much thought you gave to the costs imposed by that extra > cache space. We have a lot of users who moan about relcache bloat > already. The current representation could be compacted (e.g. by storing the column indexes as 16-bit integers, instead of using a Bitmapset). That should make the additional space needed neglectable compared to the current size of RelationData. On top of that we could maybe reorder the members of RelationData to save padding bytes. Currently, there is lots of interleaving of members of different sizes. Even when taking cache locality into consideration it looks like a fair amount could be saved, probably accounting for the additional space needed to store the index column data. But more to the point, I do not buy the assumption that > an index's set of columns is a good filter for which indexes are of > interest. A trivial counterexample from the regression database is > > regression=# explain select count(*) from tenk1; > QUERY PLAN > > > > > > Aggregate (cost=219.28..219.29 rows=1 width=8) >-> Index Only Scan using tenk1_hundred on tenk1 (cost=0.29..194.28 > rows=100 > 00 width=0) > (2 rows) > > Only for queries without index conditions, the current version of the patch simply discards all indexes but the one with the least columns. This is case (3) in s64_IsUnnecessaryIndex(). This is an over-simplification with the goal of picking the index which yields least I/O. For single column indexes that works, but it can fall short for multi-column indexes (e.g. [INT, TEXT] index vs [INT, INT]t index have both 2 columns but the latter would be better suited when there's no other index and we want to read the first integer column). What we should do here instead is to discard indexes based on storage size. > It looks to me like the patch also makes unwarranted assumptions about > being able to discard all but the smallest index having a given set > of columns. This would, for example, possibly lead to dropping the > index that has the most useful sort order, or that has the operator > class needed to support a specific WHERE clause.t > Why would that be? If we keep all indexes that contain columns that are used by the query, we also keep the indexes which provide a certain sort order or operator class. > > In short, I'm not sure I buy this concept at all. I think it might > be more useful to attack the locking overhead more directly. I kind > of wonder why we need per-index locks at all during planning --- > I think that we already interpret AccessShareLock on the parent table > as being sufficient to block schema changes on existing indexes. > > As I said in the reply to your other mail, there's huge savings also in the serial case where lock contention is not an issue. It seems like pre-filtering saves work down the road. I'll do some perf runs to track down where exactly the savings come from. One source I can think of is only having to consider a subset of all indexes during path creation. > Unfortunately, as things stand today, the planner needs more than the > right to look at the indexes' schemas, because it makes physical accesses > to btree indexes to find out their tree height (and I think there are some > comparable behaviors in other AMs). I've never particularly cared for > that implementation, and would be glad to rip out that behavior if we can > find another way. Maybe we could persuade VACUUM or ANALYZE to store that > info in the index's pg_index row, or some such, and then the planner > could use it with no lock? > > That's another interesting approach, but I would love to save the planner cycles for the sequential case. -- David Geier (ServiceNow)
Re: Reducing planning time on tables with many indexes
Hi Tom, On Wed, Jul 27, 2022 at 7:15 PM Tom Lane wrote: > I wrote: > > Unfortunately, as things stand today, the planner needs more than the > > right to look at the indexes' schemas, because it makes physical accesses > > to btree indexes to find out their tree height (and I think there are > some > > comparable behaviors in other AMs). I've never particularly cared for > > that implementation, and would be glad to rip out that behavior if we can > > find another way. Maybe we could persuade VACUUM or ANALYZE to store > that > > info in the index's pg_index row, or some such, and then the planner > > could use it with no lock? > It seems like _bt_getrootheight() first checks if the height is cached and only if it isn't it accesses index meta pages. If the index locks are only taken for the sake of _bt_getrootheight() accessing index meta pages in case they are not cached, maybe the index locks could be taken conditionally. However, postponing the call to where it is really needed sounds even better. > > A first step here could just be to postpone fetching _bt_getrootheight() > until we actually need it during cost estimation. That would avoid the > need to do it at all for indexes that indxpath.c discards as irrelevant, > which is a decision made on considerably more information than the > proposed patch uses. > > Having done that, you could look into revising plancat.c to fill the > IndexOptInfo structs from catcache entries instead of opening the > index per se. (You'd have to also make sure that the appropriate > index locks are acquired eventually, for indexes the query does use > at runtime. I think that's the case, but I'm not sure if anything > downstream has been optimized on the assumption the planner did it.) > > I can give this a try. That way we would get rid of the scalability issues. However, what about the runtime savings observed with a single query stream? In that case there is no contention, so it seems like having less indexes to look at further down the road, also yields substantial savings. Any clue where exactly these savings might come from? Or is it actually the calls to _bt_getrootheight()? I can also do a few perf runs to track that down. > This'd probably get us a large part of the way there. Further > optimization of acquisition of tree height etc could be an > optional follow-up. > > regards, tom lane >
Re: Reducing planning time on tables with many indexes
I wrote: > Unfortunately, as things stand today, the planner needs more than the > right to look at the indexes' schemas, because it makes physical accesses > to btree indexes to find out their tree height (and I think there are some > comparable behaviors in other AMs). I've never particularly cared for > that implementation, and would be glad to rip out that behavior if we can > find another way. Maybe we could persuade VACUUM or ANALYZE to store that > info in the index's pg_index row, or some such, and then the planner > could use it with no lock? A first step here could just be to postpone fetching _bt_getrootheight() until we actually need it during cost estimation. That would avoid the need to do it at all for indexes that indxpath.c discards as irrelevant, which is a decision made on considerably more information than the proposed patch uses. Having done that, you could look into revising plancat.c to fill the IndexOptInfo structs from catcache entries instead of opening the index per se. (You'd have to also make sure that the appropriate index locks are acquired eventually, for indexes the query does use at runtime. I think that's the case, but I'm not sure if anything downstream has been optimized on the assumption the planner did it.) This'd probably get us a large part of the way there. Further optimization of acquisition of tree height etc could be an optional follow-up. regards, tom lane
Re: Reducing planning time on tables with many indexes
David Geier writes: > We tracked down the root cause of this slowdown to lock contention in > 'get_relation_info()'. The index lock of every single index of every single > table used in that query is acquired. We attempted a fix by pre-filtering > out all indexes that anyways cannot be used with a certain query, without > taking the index locks (credits to Luc Vlaming for idea and > implementation). The patch does so by caching the columns present in every > index, inside 'struct Relation', similarly to 'rd_indexlist'. I wonder how much thought you gave to the costs imposed by that extra cache space. We have a lot of users who moan about relcache bloat already. But more to the point, I do not buy the assumption that an index's set of columns is a good filter for which indexes are of interest. A trivial counterexample from the regression database is regression=# explain select count(*) from tenk1; QUERY PLAN Aggregate (cost=219.28..219.29 rows=1 width=8) -> Index Only Scan using tenk1_hundred on tenk1 (cost=0.29..194.28 rows=100 00 width=0) (2 rows) It looks to me like the patch also makes unwarranted assumptions about being able to discard all but the smallest index having a given set of columns. This would, for example, possibly lead to dropping the index that has the most useful sort order, or that has the operator class needed to support a specific WHERE clause. In short, I'm not sure I buy this concept at all. I think it might be more useful to attack the locking overhead more directly. I kind of wonder why we need per-index locks at all during planning --- I think that we already interpret AccessShareLock on the parent table as being sufficient to block schema changes on existing indexes. Unfortunately, as things stand today, the planner needs more than the right to look at the indexes' schemas, because it makes physical accesses to btree indexes to find out their tree height (and I think there are some comparable behaviors in other AMs). I've never particularly cared for that implementation, and would be glad to rip out that behavior if we can find another way. Maybe we could persuade VACUUM or ANALYZE to store that info in the index's pg_index row, or some such, and then the planner could use it with no lock? regards, tom lane
Re: [PoC] Reducing planning time on tables with many indexes
Sorry, by accident I sent this one out twice. -- David Geier (ServiceNow) On Wed, Jul 27, 2022 at 2:42 PM David Geier wrote: > Hi hackers, > > We came across a slowdown in planning, where queries use tables with many > indexes. In setups with wide tables it is not uncommon to have easily > 10-100 indexes on a single table. The slowdown is already visible in serial > workloads with just a handful of indexes, but gets drastically amplified > when running queries with more indexes in parallel at high throughput. > > We measured the TPS and planning time of running parallel streams of > simple point look-up queries on a single empty table with 60 columns and 60 > indexes. The query used is 'SELECT * FROM synth_table WHERE col5 = 42'. No > rows are returned because the table is empty. We used a machine with 64 > physical CPU cores. The schema and sysbench script to reproduce these > numbers are attached. We used the TPS as reported by sysbench and obtained > planning time by running 'EXPLAIN ANALYZE' on the same query in a > separately opened connection. We averaged the planning time of 3 successive > 'EXPLAIN ANALYZE' runs. sysbench ran on the same machine with varying > numbers of threads using the following command line: > > sysbench repro.lua --db-driver=pgsql --pgsql-host=localhost > --pgsql-db=postgres --pgsql-port=? --pgsql-user=? --pgsql-password=? > --report-interval=1 --threads=64 run > > The following table shows the results. It is clearly visible that the TPS > flatten out already at 8 parallel streams, while the planning time is > increasing drastically. > > Parallel streams | TPS (before) | Planning time (before) > -|--|--- > 1| 5,486 | 0.13 ms > 2| 8,098 | 0.22 ms > 4| 15,013 | 0.19 ms > 8| 27,107 | 0.29 ms > 16 | 30,938 | 0.43 ms > 32 | 26,330 | 1.68 ms > 64 | 24,314 | 2.48 ms > > We tracked down the root cause of this slowdown to lock contention in > 'get_relation_info()'. The index lock of every single index of every single > table used in that query is acquired. We attempted a fix by pre-filtering > out all indexes that anyways cannot be used with a certain query, without > taking the index locks (credits to Luc Vlaming for idea and > implementation). The patch does so by caching the columns present in every > index, inside 'struct Relation', similarly to 'rd_indexlist'. Then, before > opening (= locking) the indexes in 'get_relation_info()', we check if the > index can actually contribute to the query and if not it is discarded right > away. Caching the index info saves considerable work for every query run > subsequently, because less indexes must be inspected and thereby locked. > This way we also save cycles in any code that later on goes over all > relation indexes. > > The work-in-progress version of the patch is attached. It is still fairly > rough (e.g. uses a global variable, selects the best index in scans without > restrictions by column count instead of physical column size, is missing > some renaming, etc.), but shows the principle. > > The following table shows the TPS, planning time and speed-ups after > applying the patch and rerunning above described benchmark. Now, the > planning time remains roughly constant and TPS roughly doubles each time > the number of parallel streams is doubled. The higher the stream count the > more severe the lock contention is and the more pronounced the gained > speed-up gets. Interestingly, even for a single query stream the speed-up > in planning time is already very significant. This applies also for lower > index counts. For example just with 10 indexes the TPS for a single query > stream goes from 9,159 to 12,558. We can do more measurements if there is > interest in details for a lower number of indexes. > > Parallel streams | TPS (after) | Planning time (after) | Speed-up TPS | > Speed-up planning > > -|-|---|--|-- > 1| 10,344 | 0.046 | 1.9x| > 2.8x > 2| 20,140 | 0.045 ms | 2.5x| > 4.9x > 4| 40,349 | 0.047 ms | 2.7x| > 4.0x > 8| 80,121 | 0.046 ms | 3.0x| > 6.3x > 16 | 152,632 | 0.051 ms | 4.9x| > 8.4x > 32 | 301,359 | 0.052 ms | 11.4x| > 32.3x > 64 | 525,115 | 0.062 ms | 21.6x| > 40.0x > > We are happy to receive your feedback and polish up the patch. > > -- > David Geier > (ServiceNow) >
[PoC] Reducing planning time on tables with many indexes
Hi hackers, We came across a slowdown in planning, where queries use tables with many indexes. In setups with wide tables it is not uncommon to have easily 10-100 indexes on a single table. The slowdown is already visible in serial workloads with just a handful of indexes, but gets drastically amplified when running queries with more indexes in parallel at high throughput. We measured the TPS and planning time of running parallel streams of simple point look-up queries on a single empty table with 60 columns and 60 indexes. The query used is 'SELECT * FROM synth_table WHERE col5 = 42'. No rows are returned because the table is empty. We used a machine with 64 physical CPU cores. The schema and sysbench script to reproduce these numbers are attached. We used the TPS as reported by sysbench and obtained planning time by running 'EXPLAIN ANALYZE' on the same query in a separately opened connection. We averaged the planning time of 3 successive 'EXPLAIN ANALYZE' runs. sysbench ran on the same machine with varying numbers of threads using the following command line: sysbench repro.lua --db-driver=pgsql --pgsql-host=localhost --pgsql-db=postgres --pgsql-port=? --pgsql-user=? --pgsql-password=? --report-interval=1 --threads=64 run The following table shows the results. It is clearly visible that the TPS flatten out already at 8 parallel streams, while the planning time is increasing drastically. Parallel streams | TPS (before) | Planning time (before) -|--|--- 1| 5,486 | 0.13 ms 2| 8,098 | 0.22 ms 4| 15,013 | 0.19 ms 8| 27,107 | 0.29 ms 16 | 30,938 | 0.43 ms 32 | 26,330 | 1.68 ms 64 | 24,314 | 2.48 ms We tracked down the root cause of this slowdown to lock contention in 'get_relation_info()'. The index lock of every single index of every single table used in that query is acquired. We attempted a fix by pre-filtering out all indexes that anyways cannot be used with a certain query, without taking the index locks (credits to Luc Vlaming for idea and implementation). The patch does so by caching the columns present in every index, inside 'struct Relation', similarly to 'rd_indexlist'. Then, before opening (= locking) the indexes in 'get_relation_info()', we check if the index can actually contribute to the query and if not it is discarded right away. Caching the index info saves considerable work for every query run subsequently, because less indexes must be inspected and thereby locked. This way we also save cycles in any code that later on goes over all relation indexes. The work-in-progress version of the patch is attached. It is still fairly rough (e.g. uses a global variable, selects the best index in scans without restrictions by column count instead of physical column size, is missing some renaming, etc.), but shows the principle. The following table shows the TPS, planning time and speed-ups after applying the patch and rerunning above described benchmark. Now, the planning time remains roughly constant and TPS roughly doubles each time the number of parallel streams is doubled. The higher the stream count the more severe the lock contention is and the more pronounced the gained speed-up gets. Interestingly, even for a single query stream the speed-up in planning time is already very significant. This applies also for lower index counts. For example just with 10 indexes the TPS for a single query stream goes from 9,159 to 12,558. We can do more measurements if there is interest in details for a lower number of indexes. Parallel streams | TPS (after) | Planning time (after) | Speed-up TPS | Speed-up planning -|-|---|--|-- 1| 10,344 | 0.046 | 1.9x| 2.8x 2| 20,140 | 0.045 ms | 2.5x| 4.9x 4| 40,349 | 0.047 ms | 2.7x| 4.0x 8| 80,121 | 0.046 ms | 3.0x| 6.3x 16 | 152,632 | 0.051 ms | 4.9x| 8.4x 32 | 301,359 | 0.052 ms | 11.4x| 32.3x 64 | 525,115 | 0.062 ms | 21.6x| 40.0x We are happy to receive your feedback and polish up the patch. -- David Geier (ServiceNow) From 60e300b5cac8f527e61483296df81ab783670ac9 Mon Sep 17 00:00:00 2001 From: David Geier Date: Fri, 15 Jul 2022 11:53:27 +0200 Subject: [PATCH] Index filtering --- src/backend/optimizer/path/Makefile | 3 +- src/backend/optimizer/path/index_filtering.c | 220 ++ src/backend/optimizer/plan/planmain.c | 7 + src/backend/optimizer/util/plancat.c | 124 ++ src/backend/utils/cache/relcache.c| 17 +- src/include/optimizer
Reducing planning time on tables with many indexes
Hi hackers, We came across a slowdown in planning, where queries use tables with many indexes. In setups with wide tables it is not uncommon to have easily 10-100 indexes on a single table. The slowdown is already visible in serial workloads with just a handful of indexes, but gets drastically amplified when running queries with more indexes in parallel at high throughput. We measured the TPS and planning time of running parallel streams of simple point look-up queries on a single empty table with 60 columns and 60 indexes. The query used is 'SELECT * FROM synth_table WHERE col5 = 42'. No rows are returned because the table is empty. We used a machine with 64 physical CPU cores. The schema and sysbench script to reproduce these numbers are attached. We used the TPS as reported by sysbench and obtained planning time by running 'EXPLAIN ANALYZE' on the same query in a separately opened connection. We averaged the planning time of 3 successive 'EXPLAIN ANALYZE' runs. sysbench ran on the same machine with varying numbers of threads using the following command line: sysbench repro.lua --db-driver=pgsql --pgsql-host=localhost --pgsql-db=postgres --pgsql-port=? --pgsql-user=? --pgsql-password=? --report-interval=1 --threads=64 run The following table shows the results. It is clearly visible that the TPS flatten out already at 8 parallel streams, while the planning time is increasing drastically. Parallel streams | TPS (before) | Planning time (before) -|--|--- 1| 5,486 | 0.13 ms 2| 8,098 | 0.22 ms 4| 15,013 | 0.19 ms 8| 27,107 | 0.29 ms 16 | 30,938 | 0.43 ms 32 | 26,330 | 1.68 ms 64 | 24,314 | 2.48 ms We tracked down the root cause of this slowdown to lock contention in 'get_relation_info()'. The index lock of every single index of every single table used in that query is acquired. We attempted a fix by pre-filtering out all indexes that anyways cannot be used with a certain query, without taking the index locks (credits to Luc Vlaming for idea and implementation). The patch does so by caching the columns present in every index, inside 'struct Relation', similarly to 'rd_indexlist'. Then, before opening (= locking) the indexes in 'get_relation_info()', we check if the index can actually contribute to the query and if not it is discarded right away. Caching the index info saves considerable work for every query run subsequently, because less indexes must be inspected and thereby locked. This way we also save cycles in any code that later on goes over all relation indexes. The work-in-progress version of the patch is attached. It is still fairly rough (e.g. uses a global variable, selects the best index in scans without restrictions by column count instead of physical column size, is missing some renaming, etc.), but shows the principle. The following table shows the TPS, planning time and speed-ups after applying the patch and rerunning above described benchmark. Now, the planning time remains roughly constant and TPS roughly doubles each time the number of parallel streams is doubled. The higher the stream count the more severe the lock contention is and the more pronounced the gained speed-up gets. Interestingly, even for a single query stream the speed-up in planning time is already very significant. This applies also for lower index counts. For example just with 10 indexes the TPS for a single query stream goes from 9,159 to 12,558. We can do more measurements if there is interest in details for a lower number of indexes. Parallel streams | TPS (after) | Planning time (after) | Speed-up TPS | Speed-up planning -|-|---|--|-- 1| 10,344 | 0.046 | 1.9x| 2.8x 2| 20,140 | 0.045 ms | 2.5x| 4.9x 4| 40,349 | 0.047 ms | 2.7x| 4.0x 8| 80,121 | 0.046 ms | 3.0x| 6.3x 16 | 152,632 | 0.051 ms | 4.9x| 8.4x 32 | 301,359 | 0.052 ms | 11.4x| 32.3x 64 | 525,115 | 0.062 ms | 21.6x| 40.0x We are happy to receive your feedback and polish up the patch. -- David Geier (ServiceNow) function thread_init() drv = sysbench.sql.driver() con = drv:connect() end function thread_done() con:disconnect() end function event() con:query('SELECT * FROM synth_table WHERE col05 = 52') end schema.sql Description: application/sql From 60e300b5cac8f527e61483296df81ab783670ac9 Mon Sep 17 00:00:00 2001 From: David Geier Date: Fri, 15 Jul 2022 11:53:27 +0200 Subject: [PATCH] Index filtering --- src/backend/optimizer/path/Makefile | 3 +- src/backend/optimizer