RE: temp table on commit delete rows performance issue
> It seems that in your patch, WAL logging is skipped for all tables, not just
> temporary tables.

This code path is only used in two cases though:

* for temporary tables with ON COMMIT DROP
* for truncating tables that were created in the same transaction, or which were already truncated in the same transaction (this is a special case in the TRUNCATE command)

In both cases I believe it's not necessary to log the lock, as the table doesn't exist on the replica yet, or the exclusive lock has already been obtained and logged previously. Regular TRUNCATE commands go through a completely different code path, as those need to be rollbackable if the transaction aborts.

> Upon further consideration, do we really need to acquire AccessExclusiveLocks
> for temporary tables? Since temporary tables can only be accessed within the
> current session, perhaps we can make the following optimizations:

I'm less sure this one is correct in all cases. Logically it makes sense that no other backends can access the table, but I see some threads [1] suggesting that it's technically possible for other backends to take locks on these tables, so it's not *that* obvious there are no edge cases.

[1] https://postgrespro.com/list/thread-id/2477885
RE: temp table on commit delete rows performance issue
> I also encountered a similar performance issue with temporary tables and
> provided a patch to optimize the truncate performance during commit in [1].

Interesting, that is definitely another good way to improve the performance, especially with a large number of temp tables. I think the two optimizations can actually work well together: your optimization truncates only the tables that were actually used, combined with a patch like the attached, which makes sure that no WAL is generated at all for the ON COMMIT DELETE ROWS operation. On my test system this reduces WAL generation for the pgbench test case I posted previously to zero (and therefore brought the WAL replay process from 100% CPU and lagging behind to 0% CPU usage).

-Floris

0001-Remove-any-WAL-logging-for-ON-COMMIT-DELETE-ROWS.patch
Description: 0001-Remove-any-WAL-logging-for-ON-COMMIT-DELETE-ROWS.patch
RE: temp table on commit delete rows performance issue
> I didn't investigate your particular issue but generally speaking creating a
> table, even a temporary one, is an expensive operation.
>
> Note that it's far from being a separate file on the disk. It affects catalog
> tables, shared buffers, all the corresponding locks, etc. If you have indexes
> for a temporary table it makes the situation even worse. Sooner or later
> VACUUM will happen for your bloated catalog, and this is not fun under
> heavy load.

Yes, creation is expensive, but note that this test case does not create the temp table for every transaction. It creates it once on startup of the connection, so there'll be no catalog bloat as the connections are generally long-lived. The part that remains expensive (and which was unexpected to me) is the truncate that does happen for every transaction.

> Is there any particular reason why you don't want to simply change the target
> table directly? If you do it in a transaction you are safe.

This is a fair point which I didn't explain in my earlier message. We find that for various use cases the temp table approach is a really convenient way of interacting with Postgres. You can easily binary COPY whatever data you want into it without worrying too much about data size, and then handle it from there on the db side. Without it, you'd need to resort to passing the data as query parameters, which has complications like a limit on the number of bound params per query. That's definitely possible to use as well, but the binary COPY to a temp table is quite convenient.

-Floris
temp table on commit delete rows performance issue
Hi hackers,

I'm looking for some input on an issue I've observed. A common pattern I've seen is using temporary tables to put data in before updating the real tables. Something roughly like:

On session start:

    CREATE TEMP TABLE temp_t1 (...) ON COMMIT DELETE ROWS;

On update:

    BEGIN;
    COPY temp_t1 FROM STDIN (FORMAT BINARY);
    INSERT INTO t1 (SELECT * FROM temp_t1 ...) ON CONFLICT DO UPDATE SET ...;
    -- potentially some other operations on temp table to put data into real table t1
    COMMIT;

This pattern starts to break down under certain exceptional circumstances of high concurrency. The ON COMMIT DELETE ROWS does a truncate that is fairly expensive and doesn't work well in high-concurrency scenarios. It's especially noticeable under the following circumstances:

- high max_connections setting
- high number of temp tables per session
- concurrent writers at fairly short intervals

The impact is twofold: TPS on the primary degrades, and the WAL replay process on the replica becomes completely overloaded (100% CPU even though not a lot of WAL is being generated).

Here is a very simple pgbench example that showcases the degradation (run with max_connections=2000 to clearly show it).
test_truncate.sql:

    begin;
    set client_min_messages=warning;
    create temp table if not exists ut0 (a int, b text, c text) on commit delete rows;
    commit;

test_no_truncate.sql:

    begin;
    set client_min_messages=warning;
    create temp table if not exists ut0 (a int, b text, c text); -- do not do anything on commit
    commit;

pgbench -n -f test_truncate.sql -T 10 postgres -c 1 -j 1
pgbench (18devel)
tps = 2693.806376 (without initial connection time)

pgbench -n -f test_truncate.sql -T 10 postgres -c 10 -j 10
pgbench (18devel)
tps = 12119.358458 (without initial connection time)

pgbench -n -f test_no_truncate.sql -T 10 postgres -c 1 -j 1
pgbench (18devel)
tps = 22685.877281 (without initial connection time)

pgbench -n -f test_no_truncate.sql -T 10 postgres -c 10 -j 10
pgbench (18devel)
tps = 200359.458154 (without initial connection time)

For test_truncate.sql with 10 threads, the WAL replay process is spinning at 100% CPU. The reason seems to be that, for a large part, the ON COMMIT DELETE ROWS follows the same code path as a regular truncate. This means:

* taking an exclusive lock on the temp table (WAL logged)
* invalidating the rel cache (WAL logged)

On the replica, these exclusive locks are very expensive to replay, so the replica spends most of its cycles trying to get LWLocks on the non-fast path. On the primary, I believe the degradation under concurrency comes from the rel cache invalidation. There doesn't seem to be a good reason for this behavior, as the replica has nothing to do for a temporary table truncation, so this lock doesn't need to be WAL logged. The generated WAL is just LOCK+INVALIDATION+COMMIT, and the replica spins at 100% CPU trying to replay it.
rmgr: Standby     len (rec/tot): 42/42, tx: 8154875, lsn: 0/3AFFE270, prev 0/3AFFE218, desc: LOCK xid 8154875 db 5 rel 18188
rmgr: Standby     len (rec/tot): 42/42, tx: 8154875, lsn: 0/3AFFE2A0, prev 0/3AFFE270, desc: LOCK xid 8154875 db 5 rel 18191
rmgr: Standby     len (rec/tot): 42/42, tx: 8154875, lsn: 0/3AFFE2D0, prev 0/3AFFE2A0, desc: LOCK xid 8154875 db 5 rel 18192
rmgr: Transaction len (rec/tot): 62/62, tx: 8154875, lsn: 0/3AFFE300, prev 0/3AFFE2D0, desc: INVALIDATION ; inval msgs: relcache 18191 relcache 18192
rmgr: Transaction len (rec/tot): 82/82, tx: 8154875, lsn: 0/3AFFE340, prev 0/3AFFE300, desc: COMMIT 2024-07-16 12:56:42.878147 CEST; inval msgs: relcache 18191 relcache 18192
rmgr: Standby     len (rec/tot): 42/42, tx: 8154876, lsn: 0/3AFFE398, prev 0/3AFFE340, desc: LOCK xid 8154876 db 5 rel 18188
rmgr: Standby     len (rec/tot): 42/42, tx: 8154876, lsn: 0/3AFFE3C8, prev 0/3AFFE398, desc: LOCK xid 8154876 db 5 rel 18191
rmgr: Standby     len (rec/tot): 42/42, tx: 8154876, lsn: 0/3AFFE3F8, prev 0/3AFFE3C8, desc: LOCK xid 8154876 db 5 rel 18192
rmgr: Transaction len (rec/tot): 62/62, tx: 8154876, lsn: 0/3AFFE428, prev 0/3AFFE3F8, desc: INVALIDATION ; inval msgs: relcache 18191 relcache 18192
rmgr: Transaction len (rec/tot): 82/82, tx: 8154876, lsn: 0/3AFFE468, prev 0/3AFFE428, desc: COMMIT 2024-07-16 12:56:42.878544 CEST; inval msgs: relcache 18191 relcache 18192

Have people observed this behavior before? Would the community be open to accepting a patch that changes the behavior of ON COMMIT DELETE ROWS to not WAL log the lock taken and/or the relcache invalidation? I think:

* An exclusive lock needs to be taken on the primary, but does not need to be WAL logged.
* Rel cache invalidation should not be necessary, I think (it currently happens only for the TOAST table + index, not for the regular temp table).

-Floris
RE: pg_init_privs corruption.
> Kirill Reshke writes:
> > As you can see, after drop role there are invalid records in
> > the pg_init_privs system relation. After this, pg_dump generates SQL
> > statements, some of which are based on the content of pg_init_privs,
> > resulting in an invalid dump.

As far as I can see, this is the same case as what I reported a few years ago here:
https://www.postgresql.org/message-id/flat/1574068566573.13088%40Optiver.com#488bd647ce6f5d2c92764673a7c58289

There was a discussion with some options, but no fix back then.

-Floris
RE: Commitfest 2021-11 Patch Triage - Part 2
> Daniel Gustafsson writes:
> > 2773: libpq compression
> > ===
> > This patch intended to provide libpq connection compression to
> > "replace SSL compression", which was doomed when the patch was written,
> > and has since been removed altogether. The initial approach didn't
> > get much traction but there was significant discussion and work, which
> > has since fizzled out. The patch has been updated but there hasn't
> > been meaningful review the past months; the last comments seem to
> > imply there being a fair amount of question marks left in here.
> > Robert, having been very involved in this, do you have any thoughts on
> > where we are and where to go (if at all IYO)?
>
> I'm not Robert, but I still have an opinion here, and it's that this feature
> would at best be an attractive nuisance. If you need compression on a
> database session, it probably means that the connection is over the open
> internet, which means that you need encryption even more. And we know
> that compression and encryption do not play well together. The reason
> compression was taken out of the latest TLS standards is not that they
> wouldn't have liked to have it, nor that applying compression in a separate
> code layer would be any safer. I fear offering this would merely lead people
> to build CVE-worthy setups.

I don't have a strong opinion on the patch, but I'd just like to mention that there are definitely use cases for using compression on private networks (with ssl off). We have a PG extension that hooks into the COPY ... FROM command and accepts lz4-compressed data for the COPY data message specifically. This has proven invaluable to us even on private networks, because it gives a good compression ratio for very little CPU decompression overhead.
RE: Why timestamptz_pl_interval and timestamptz_mi_interval are not immutable?
> What do I miss?
>
> --
> Best regards,
> Alexander Pyhalov,
> Postgres Professional

See for example around DST changes:

postgres=# begin;
BEGIN
postgres=# show timezone;
     TimeZone
------------------
 Europe/Amsterdam
(1 row)

postgres=# select '2021-03-27 15:00 +0100'::timestamptz + interval '1d';
        ?column?
------------------------
 2021-03-28 15:00:00+02
(1 row)

postgres=# set timezone to UTC;
SET
postgres=# select '2021-03-27 15:00 +0100'::timestamptz + interval '1d';
        ?column?
------------------------
 2021-03-28 14:00:00+00
(1 row)

postgres=# select '2021-03-28 15:00:00+02' = '2021-03-28 14:00:00+00';
 ?column?
----------
 f
(1 row)
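For what it's worth, the same session-timezone dependence can be sketched outside Postgres. Here is a rough Python analogue: zoneinfo's wall-clock arithmetic stands in for timestamptz + interval under a session timezone, so this mirrors rather than reuses Postgres's implementation.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

ams = ZoneInfo("Europe/Amsterdam")
start = datetime(2021, 3, 27, 15, 0, tzinfo=ams)  # '2021-03-27 15:00 +0100'

# "+ interval '1 day'" with TimeZone=Europe/Amsterdam: same wall-clock time
# the next day, which crosses the DST jump (offset becomes +02).
local_plus_day = start + timedelta(days=1)

# "+ interval '1 day'" with TimeZone=UTC: 24 absolute hours later.
utc_plus_day = start.astimezone(timezone.utc) + timedelta(days=1)

# The two results are different instants, so the addition depends on the
# session TimeZone setting and cannot be immutable.
print(local_plus_day.astimezone(timezone.utc).hour)  # 13  (== 15:00+02)
print(utc_plus_day.hour)                             # 14  (== 14:00+00)
```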
RE: visibility map corruption
> > I wonder if it's related to this issue:
> > https://www.postgresql.org/message-id/20210423234256.hwopuftipdmp3...@alap3.anarazel.de
>
> Have you increased autovacuum_freeze_max_age from its default? This
> already sounds like the kind of database where that would make sense.

autovacuum_freeze_max_age is indeed increased in our setup (it is set to 500M). However, we do regularly run manual VACUUM (FREEZE) on individual tables in the database, including this one. A lot of tables in the database follow an INSERT-only pattern, and since the server is not running v13 yet, these tables would only rarely be touched by autovacuum. Autovacuum would sometimes kick in on several of these tables at the same time, at unfortunate moments. Therefore we have a regular job that VACUUM (FREEZE)s tables with an xact age higher than a (low, 10M) threshold ourselves.
RE: visibility map corruption
> On Sun, Jul 4, 2021 at 1:44 PM Floris Van Nee wrote:
> > We recently ran into an issue where the visibility map of a relation was
> > corrupt, running Postgres 12.4. The error we'd get when running a SELECT *
> > from this table is:
> >
> > could not access status of transaction 3704450152
> > DETAIL: Could not open file "pg_xact/0DCC": No such file or directory.
>
> Have you ever used pg_upgrade on this database?

Yes. The last time (from v11 to v12) was in October 2020. The transaction id in the tuples (the one PG is trying to check in the tx log) dated from February 2021. I do believe (but am not 100% certain) that the affected table already existed at the time of the last pg_upgrade, though.
visibility map corruption
Hi hackers,

We recently ran into an issue where the visibility map of a relation was corrupt, running Postgres 12.4. The error we'd get when running a SELECT * from this table is:

could not access status of transaction 3704450152
DETAIL: Could not open file "pg_xact/0DCC": No such file or directory.

On the lists I could find several similar reports, but corruption like this could obviously have a very wide range of root causes; see [1] [2] [3] for example - not all of them have their root cause known. This particular case is similar to the reported cases above, but also has some differences.

The following query returns ~21,000 rows, which indicates something inconsistent between the visibility map and the pd_all_visible flag on the page:

select * from pg_check_frozen('tbl');

Looking at one of the affected pages with pageinspect:

=# SELECT lp, lp_off, lp_flags, lp_len, t_xmin, t_xmax, t_field3, t_ctid, t_infomask2, t_infomask, t_hoff, t_oid
   FROM heap_page_items(get_raw_page('tbl', 726127));

 lp | lp_off | lp_flags | lp_len |   t_xmin   |   t_xmax   | t_field3 |   t_ctid   | t_infomask2 | t_infomask | t_hoff | t_oid
----+--------+----------+--------+------------+------------+----------+------------+-------------+------------+--------+-------
  1 |   6328 |        1 |   1864 | 3704450155 | 3704450155 |        1 | (726127,1) |         249 |       8339 |     56 |
  2 |   4464 |        1 |   1864 | 3704450155 | 3704450155 |        1 | (726127,2) |         249 |       8339 |     56 |
  3 |   2600 |        1 |   1864 | 3704450155 | 3704450155 |        1 | (726127,3) |         249 |       8339 |     56 |
  4 |    680 |        1 |   1920 | 3704450155 | 3704450155 |        1 | (726127,4) |         249 |       8339 |     56 |
(4 rows)

t_infomask shows that the HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID bits are both unset. This pg_visibility() call shows the inconsistency between the VM and the page, with pd_all_visible=false:

=# select * from pg_visibility('tbl', 726127);
 all_visible | all_frozen | pd_all_visible
-------------+------------+----------------
 t           | t          | f
(1 row)

Looking at other affected pages shows the same information.
What's interesting is that out of the affected tuples returned by pg_check_frozen, over 99% belong to one transaction (3704450155, as seen above) and the remaining few are from one other transaction that occurred at roughly the same time. I find it hard to believe that this is due to some random bit flipping, because many pages are affected, but the "incorrect" ones in total come from only two specific transactions which occurred at roughly the same time. There were also no server crashes or other known failures around the time of these transactions.

I'm not ruling out any other kind of failure still, but I also cannot really explain how this could have happened. The server runs PG 12.4 with full_page_writes=on and data_checksums=off. It's a large analytics database. The VM inconsistencies also occur on the streaming replicas.

I realize these cases are pretty rare and hard to debug, but I wanted to share the information I found so far here for reference. Maybe someone has an idea what occurred, or maybe someone in the future finds it useful when they encounter something similar.

I have no idea how the inconsistency between the VM and PD_ALL_VISIBLE started - from looking through the code I can't really find any way this could occur. However, for it to lead to the problem described here, I believe there must be *no* SELECT that touches that particular page after the insert/update transaction and before the transaction log gets truncated. If the page is read before the transaction log gets truncated, then the hint bit HEAP_XMIN_COMMITTED will get set and future reads will succeed regardless of tx log truncation. One of the replicas had this happen to it: its affected pages are identical to the primary's, except that the HEAP_XMIN_COMMITTED flag is set (note that the VM inconsistency is still there on the replica: PD_ALL_VISIBLE=false even though the VM shows all_frozen=all_visible=true).
But I can query these rows on the replica without issues, because it doesn't check the tx log when it sees that HEAP_XMIN_COMMITTED is set.

-Floris

[1] https://postgrespro.com/list/thread-id/2422376
[2] https://postgrespro.com/list/thread-id/2501800
[3] https://postgrespro.com/list/thread-id/2321949
RE: non-HOT update not looking at FSM for large tuple update
Hi Noah,

Thanks for taking a look at this patch.

> In evaluating whether this is a good choice of value, I think about the
> expected page lifecycle. A tuple barely larger than fillfactor (roughly
> len=1+BLCKSZ*fillfactor/100) will start on a roughly-empty page. As long as
> the tuple exists, the server will skip that page for inserts. Updates can cause
> up to floor(99/fillfactor) same-size versions of the tuple to occupy the page
> simultaneously, creating that many line pointers. At the fillfactor=10
> minimum, it's good to accept otherwise-empty pages having at least nine line
> pointers, so a page can restart the aforementioned lifecycle. Tolerating even
> more line pointers helps when updates reduce tuple size or when the page
> was used for smaller tuples before it last emptied. At the BLCKSZ=8192
> default, this maxPaddedFsmRequest allows 36 line pointers (good or
> somewhat high). At the BLCKSZ=1024 minimum, it allows 4 line pointers
> (low). At the BLCKSZ=32768 maximum, 146 (likely excessive). I'm not
> concerned about optimizing non-default block sizes, so let's keep your
> proposal.

Agreed. You briefly mention this already, but the case that caused me to report this was exactly the one where under normal circumstances each UPDATE is small, while in rare cases the updated tuple grows to 1 kB (in the specific case we encountered, extra info would occasionally be written to a field that is otherwise NULL). Suppose this 1 kB UPDATE does not fit into the current page (so no HOT update); then a new page would be created (HEAD behavior). However, it is very likely that the next updates to this same tuple will be the regular size again. This causes the higher number of line pointers on the page.

-Floris
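The quoted lifecycle arithmetic can be spelled out in a few lines. This is a sketch only; the constants and formulas are taken from the quote above, not from the Postgres source.

```python
BLCKSZ = 8192  # default block size, as in the quote

def smallest_skipped_tuple_len(fillfactor):
    # A tuple barely larger than fillfactor: roughly len = 1 + BLCKSZ*fillfactor/100.
    return 1 + BLCKSZ * fillfactor // 100

def max_same_size_versions(fillfactor):
    # Updates can leave up to floor(99/fillfactor) same-size versions on the
    # page at once, each occupying a line pointer.
    return 99 // fillfactor

# At the fillfactor=10 minimum, an otherwise-empty page can be left with
# nine line pointers after such a lifecycle.
print(smallest_skipped_tuple_len(10), max_same_size_versions(10))  # 820 9
```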
RE: non-HOT update not looking at FSM for large tuple update
Hi,

> This patch fails to consider that len may be bigger than MaxHeapTupleSize *
> 0.98, which in this case triggers a reproducible PANIC:

Good catch! I've adapted the patch with your suggested fix.

> One different question I have, though, is why we can't "just" teach vacuum
> to clean up trailing unused line pointers. As in, can't we trim the line pointer
> array when vacuum detects that the trailing line pointers on the page are all
> unused?
>
> The only documentation that I could find that this doesn't happen is in the
> comment on PageIndexTupleDelete and PageRepairFragmentation, both not
> very descriptive on why we can't shrink the page->pd_linp array. One is
> "Unlike heap pages, we compact out the line pointer for the removed tuple."
> (Jan. 2002), and the other is "It doesn't remove unused line pointers! Please
> don't change this." (Oct. 2000), but I can't seem to find the documentation /
> conversations on the implications that such shrinking would have.

This is an interesting alternative indeed. I also can't find any documentation or conversation about this, and the message is rather cryptic. Hopefully someone on the list still remembers the reasoning behind the comment in PageRepairFragmentation.

-Floris

v3-Allow-inserting-tuples-into-almost-empty-pages.patch
Description: v3-Allow-inserting-tuples-into-almost-empty-pages.patch
RE: non-HOT update not looking at FSM for large tuple update
> I've added this to the commitfest as a bug fix and added you as an author.

Thanks. The patch looks good to me, but I guess there needs to be someone else reviewing too? Also, would this be a backpatchable bugfix?

-Floris
RE: non-HOT update not looking at FSM for large tuple update
> That makes sense, although the exact number seems precisely tailored to your
> use case. 2% gives 164 bytes of slack and doesn't seem too large. Updated
> patch attached.

Yeah, I tried picking it as conservatively as I could, but 2% is obviously great too. :-) I can't think of any large drawbacks of a slightly larger value either. Thanks for posting the patch!

-Floris
RE: non-HOT update not looking at FSM for large tuple update
> In this case, the vast majority has between 8050-8099 bytes free according to
> the FSM. That means that, for this particular case, for a fillfactor of 10,
> we'd need to subtract ~120 bytes or so in order to properly recycle the pages.

Also, I think this "fudge" factor would need to be defined as a percentage of the page size as well. 100 bytes on an 8 kB page is quite different from 100 bytes on a 1 kB page (although I have no idea if people ever actually compile PG with a different page size, it is supposed to be supported).

I also understand the temptation to define it based on the relation's fillfactor, as you did in the patch. However, upon some further thought I wonder if that's useful. A relation with a higher fillfactor will have a lower 'saveFreeSpace' variable, so it's less likely to run into issues finding a fresh page, except when the tuple you're inserting/updating is even larger. If that happens, though, you'll still want to look for a page that's completely empty (except for the line pointers). So the only proper metric is "how many unused line pointers do we expect on empty pages", and the fillfactor doesn't say much about this. Since this is probably difficult to estimate at all, we may be better off just defining it off MaxHeapTupleSize completely. For example, if we expect 1.5% of the page could be line pointers, then:

targetFreeSpace = MaxHeapTupleSize * 0.985

-Floris
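To make the percentages concrete, here is a small sketch. It assumes MaxHeapTupleSize = 8160 bytes for the default 8 kB block size and 4-byte line pointers; both constants are assumptions from memory, not values taken from the patch.

```python
MAX_HEAP_TUPLE_SIZE = 8160  # assumed value for BLCKSZ = 8192
LINE_POINTER_SIZE = 4       # assumed sizeof(ItemIdData)

def slack_bytes(factor):
    # targetFreeSpace = MaxHeapTupleSize * factor; whatever is left over is
    # slack that tolerates leftover unused line pointers on "empty" pages.
    return MAX_HEAP_TUPLE_SIZE - int(MAX_HEAP_TUPLE_SIZE * factor)

for factor in (0.985, 0.98):
    slack = slack_bytes(factor)
    print(factor, slack, slack // LINE_POINTER_SIZE)
```

Under these assumptions, 1.5% leaves 123 bytes (about 30 line pointers) and 2% leaves 164 bytes (about 41), which lines up with the "164 bytes of slack" figure mentioned elsewhere in the thread.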
RE: non-HOT update not looking at FSM for large tuple update
Hi John,

> One idea is to take your -50 idea and make it more general and safe, by
> scaling the fudge factor based on fillfactor, such that if fillfactor is less
> than 100, the requested freespace is a bit smaller than the max. It's still a
> bit of a hack, though. I've attached a draft of this idea.

You're right, that'd work better. Though, one thing I forgot to mention earlier is that in the "wild" where this occurred, the UPDATEs with these large tuple sizes are much rarer than UPDATEs with a much smaller tuple size. So in reality, when this happens, the empty pages contain more unused line pointers and we'd need to subtract more bytes in order to find those pages in the FSM. This is the (partial) output of the pg_freespace function, bucketed by free space, for a real-life table with fillfactor=10 under the mixed load that I've described:

 free |   count
------+---------
 7750 |    2003
 7800 |    7113
 7850 |    1781
 7900 |    6803
 7950 |   13643
 8000 |   64779
 8050 | 2469665
 8100 |   61869
(163 rows)

The 'free' column is the bucket, where the value is the lower limit. So free=7750 means between 7750-7799 bytes free, and count is the number of pages that have this amount free according to the FSM. In this case, the vast majority has between 8050-8099 bytes free according to the FSM. That means that, for this particular case, for a fillfactor of 10, we'd need to subtract ~120 bytes or so in order to properly recycle the pages.

-Floris
non-HOT update not looking at FSM for large tuple update
Hi hackers,

Recently we found a table that was slowly, but consistently, increasing in size. The table has a low fillfactor set and is updated very frequently. As expected, almost all updates are HOT updates, but for some of the non-HOT updates it always wanted to use a new page, rather than reuse an existing empty page. This led to a steady growth in table size (and a steady growth in the number of empty pages in the table).

I've managed to create a very simple example that reproduces the problem (the original problem occurred on 12.4, but I've tested this example on latest master). It only occurs for updates where the new tuple is larger than what "fillfactor" would normally allow. In real life, this would only be a very small portion of the updates to a table of course, but in this example every update is this large.

Create a table with a low fillfactor and insert one row into it. Note that, in this case, the row that we're inserting is by itself larger than the "max fill factor space".

create table t1 (a int primary key, b text) with (fillfactor=10);
insert into t1 select 1, (select string_agg('1', '') from generate_series(1,1000)); -- 1000 byte text field
vacuum t1;

postgres=# select * from pg_freespace('t1');
 blkno | avail
-------+-------
     0 |  7104
(1 row)

This looks alright - there's one page and the available space is indeed roughly 1000 bytes less, because of our tuple and the page header.

Now, in a different backend, initiate a longer query:

select pg_sleep(600); -- just sleep 600 seconds so that we have enough time to do some updates during this

Then, in the original backend, update the tuple 7 times:

-- execute this 7 times
update t1 set b=(select string_agg((random()*9)::int::text, '') from generate_series(1,1000)) where a=1;

Cancel the pg_sleep call. Then execute:

vacuum t1; -- cleans rows and updates the fsm

postgres=# select * from pg_freespace('t1');
 blkno | avail
-------+-------
     0 |  8128
     1 |  7104
(2 rows)

This still looks OK.
There's an extra page, because a total of 8 tuples needed to be kept alive for the pg_sleep query. These didn't fit on one page, so a new page was created.

Now, repeat the process (the pg_sleep, update 7 times, cancel the pg_sleep, and vacuum).

postgres=# select * from pg_freespace('t1');
 blkno | avail
-------+-------
     0 |  8128
     1 |  8128
     2 |  7104
(3 rows)

This does not look good anymore. The tuple was on page 1, so at first there were several HOT updates on page 1. Then, when page 1 was full, it needed to search for another page to put the tuple on. It did not consider page 0, but instead decided to create a new page 2. Repeating this process creates a new page each time, never reusing the empty old pages.

The reason it does not consider page 0 is this piece of code in the function RelationGetBufferForTuple in hio.c:

    /* Compute desired extra freespace due to fillfactor option */
    saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                   HEAP_DEFAULT_FILLFACTOR);
    ...
    if (len + saveFreeSpace > MaxHeapTupleSize)
    {
        /* can't fit, don't bother asking FSM */
        targetBlock = InvalidBlockNumber;
        use_fsm = false;
    }

The problem here is twofold. First, for any non-HOT update of a tuple that's larger than the fillfactor size, the FSM will not be used; instead a new page is chosen. This seems to rely on the false assumption that every existing page has at least one tuple on it. Secondly, and this is a bit trickier: even if we would ask the FSM to come up with a free page with a free size of MaxHeapTupleSize, it wouldn't find anything. This is because the FSM tracks free space excluding any unused line pointers.
In this example, if we look at block 0:

postgres=# select * from page_header(get_raw_page('t1', 0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/16D75A0 |        0 |     5 |    52 |  8192 |    8192 |     8192 |       4 |         0
(1 row)

postgres=# select * from heap_page_items(get_raw_page('t1', 0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+--------
  1 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  2 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  3 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  4 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
RE: Index Skip Scan (new UniqueKeys)
> One UniqueKey can have multiple corresponding expressions, which gives us
> also the possibility of having one unique key with (t1.a, t2.a), and it looks
> now similar to EquivalenceClass.

I believe the current definition of a unique key with two expressions (t1.a, t2.a) means that it's unique on the tuple (t1.a, t2.a) - this gives weaker guarantees than uniqueness on (t1.a) and uniqueness on (t2.a).

> The idea behind this query sounds questionable to me; more transparent
> would be to do this without distinct, skipping will actually do exactly the
> same stuff just under another name. But if allowing skipping on constants
> does not bring significant changes in the code, probably it's fine.

Yeah indeed, I didn't say it's a query that people should generally write. :-) It's better to write it as a regular SELECT with LIMIT 1, of course. However, it's more about being consistent and predictable to the user: if SELECT DISTINCT ON (a) * FROM t1 runs fast, then it doesn't make sense to the user if SELECT DISTINCT ON (a) * FROM t1 WHERE a=2 runs slow. Supporting it also makes the implementation more consistent, with few code changes.

> > Yeah, there's definitely some double work there, but the actual impact may
> > be limited - it doesn't actually allocate a new path key, but looks it up
> > in root->canon_pathkeys and returns that path key.
> >
> > I wrote it like this because I couldn't find a way to identify, from a
> > certain PathKey, the actual location in the index of that column. The
> > constructed path keys list filters out all redundant path keys. An index
> > on (a,a,b,a,b) becomes path keys (a,b). Now if we skip on (a,b) we actually
> > need to use prefix=3. But how to get from PathKey=b to that number 3, I
> > didn't find a solid way except doing this. Maybe there is one though?
> I don't think there is a direct way, but why not modify build_index_paths to
> also provide this information, or compare index_pathkeys expressions with
> the indextlist without actually creating those pathkeys again?

I agree there could be other ways - I don't currently have a strong preference for either. I can have a look at this later.

> And a couple of words about this thread [1]. It looks to me like a strange
> way of interacting with the community. Are you going to duplicate everything
> there, or what are your plans? At the very least you could try to include
> everyone involved in the recipients list, not exclude some of the authors.

When I wrote the first mail in the thread, I went to this thread [1] and included everyone from there, but I see now that I only included the to: and cc: people and forgot the original thread author, you. I'm sorry about that - I should've looked better to make sure I had everyone. In any case, my plan is to keep the patch at least applicable to master, as I believe it can be helpful for discussions about both patches.

[1] https://www.postgresql.org/message-id/20200609102247.jdlatmfyeecg52fi%40localhost
RE: [PATCH] Keeps tracking the uniqueness with UniqueKey
Hi Andy,

A small thing I found:

+static List *
+get_exprs_from_uniqueindex(IndexOptInfo *unique_index,
+                           List *const_exprs,
+                           List *const_expr_opfamilies,
+                           Bitmapset *used_varattrs,
+                           bool *useful,
+                           bool *multi_nullvals)
...
+    indexpr_item = list_head(unique_index->indexprs);
+    for (c = 0; c < unique_index->ncolumns; c++)
+    {

I believe the for loop must be over unique_index->nkeycolumns, rather than ncolumns. It shouldn't include the extra non-key columns. This can currently also lead to invalid memory accesses a few lines later, when it accesses unique_index->opfamily[c]: this array only has nkeycolumns entries.

-Floris

From: Andy Fan
Sent: Sunday 19 July 2020 5:03 AM
To: Dmitry Dolgov <9erthali...@gmail.com>
Cc: David Rowley; PostgreSQL Hackers; Tom Lane; Ashutosh Bapat; rushabh.lat...@gmail.com
Subject: Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

Fixed a test case in v10.

--
Best Regards
Andy Fan
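To illustrate the loop-bound issue mentioned above with a toy model (Python standing in for the C here, and all numbers invented for illustration): an index with non-key columns has ncolumns > nkeycolumns, while per-key arrays such as opfamily only have nkeycolumns entries, so iterating up to ncolumns runs past the end.

```python
class ToyIndexInfo:
    """Toy stand-in for IndexOptInfo: an index on (a, b) INCLUDE (c)."""
    def __init__(self):
        self.ncolumns = 3             # key + included columns
        self.nkeycolumns = 2          # key columns only
        self.opfamily = [1984, 1994]  # per-key array: length == nkeycolumns

idx = ToyIndexInfo()

# Looping over ncolumns touches opfamily[2], one past the end
# (an IndexError here; a silent invalid memory access in C).
try:
    for c in range(idx.ncolumns):
        idx.opfamily[c]
    overran = False
except IndexError:
    overran = True

# The suggested fix loops over nkeycolumns and stays in bounds.
key_opfamilies = [idx.opfamily[c] for c in range(idx.nkeycolumns)]

print(overran, key_opfamilies)  # True [1984, 1994]
```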
RE: Index Skip Scan (new UniqueKeys)
> > > - the uniquekeys that is built, still contains some redundant keys, that are
> > > normally eliminated from the path keys lists.
> >
> > I guess you're talking about:
> >
> > + if (EC_MUST_BE_REDUNDANT(ec))
> > +     continue;
> >
> > Can you add some test cases to your changes to show the effect of it? It
> > seems to me redundant keys are already eliminated at this point by either
> > make_pathkeys_for_uniquekeys or even earlier for distinct on, but could be
> > I've missed something.

The build_uniquekeys function calls make_pathkeys_for_uniquekeys, which checks for uniqueness using pathkey_is_unique, but not for constantness. Consider a query like:

SELECT DISTINCT ON (a,b) * FROM t1 WHERE a=1 ORDER BY a,b,c

All the regular path keys filter out 'a' for constantness here - so they would end up with distinct pathkeys of (b) and sort pathkeys of (b,c). The unique keys would end up with (a,b) though. In the previous patch, the pathkeys were compared to the unique keys in create_distinct_paths, so it wouldn't consider the skip scan. Due to the other changes I made in create_distinct_paths/query_has_uniquekeys_for, it will choose a correct plan now, even without the EC_MUST_BE_REDUNDANT check, so it's difficult to give an actual failing test case now.

However, since all other code filters out constant keys, I think uniqueness should do it too - otherwise you could get into problems later on. It's also more consistent: if you already know something is unique by just (b), it doesn't make sense to store that it's unique by (a,b). Now that I think of it, the best place to do this EC_MUST_BE_REDUNDANT check is probably inside make_pathkeys_for_uniquekeys, rather than build_uniquekeys. It's probably good to move it there.
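The "constant keys are redundant" argument above, in miniature (plain Python with hypothetical data; not planner code): once the WHERE clause pins `a` to a single value, uniqueness on (a,b) and uniqueness on (b) alone are the same property, so storing (a,b) carries no extra information.

```python
# Rows satisfying WHERE a=1: column a is constant across the result set.
rows = [(1, 10), (1, 20), (1, 30)]

# The index guarantees rows are unique on (a, b):
unique_on_ab = len({(a, b) for a, b in rows}) == len(rows)

# With a constant column, uniqueness on (b) alone follows automatically:
unique_on_b = len({b for _, b in rows}) == len(rows)

assert unique_on_ab and unique_on_b
```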
> Along the lines I'm also curious about this part:
>
> -    ListCell   *k;
> -    List       *exprs = NIL;
> -
> -    foreach(k, ec->ec_members)
> -    {
> -        EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
> -        exprs = lappend(exprs, mem->em_expr);
> -    }
> -
> -    result = lappend(result, makeUniqueKey(exprs, false, false));
> +    EquivalenceMember *mem = (EquivalenceMember *) lfirst(list_head(ec->ec_members));
>
> I'm curious about this myself, maybe someone can clarify. It looks like
> generally speaking there could be more than one member (if not
> ec_has_volatile), which "representing knowledge that multiple items are
> effectively equal". Is this information not interesting enough to preserve it
> in unique keys?

Yeah, that's a good question. Hence my question about the choice for Expr rather than EquivalenceClass for the UniqueKeys patch to Andy/David. When storing just Expr, it is rather difficult to check equivalence in joins, for example. Suppose, later on, we decide to add join support to the distinct skip scan. Consider a query like this:

SELECT DISTINCT t1.a FROM t1 JOIN t2 ON t1.a=t2.a

As far as my understanding goes (I didn't look into it in detail though), I think here the distinct_pathkey will have an EqClass {t1.a, t2.a}. That results in a UniqueKey with expr (t1.a) (because currently we only take the first Expr in the list to construct the UniqueKey). We could also construct *two* unique keys, one with Expr (t1.a) and one with Expr (t2.a), but I don't think that's the correct approach either, as it will explode when you have multiple pathkeys, each having multiple Exprs inside their EqClasses. That makes it difficult to check if we can perform the DISTINCT skip scan on table t2 as well (theoretically we could, but for that we need to check equivalence classes rather than expressions).

> > - the distinct_pathkeys may be NULL, even though there's a possibility for
> > skipping. But it wouldn't create the uniquekeys in this case.
> > This makes the planner not choose skip scans even though it could. For
> > example in queries that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1
> > ORDER BY a,b; Since a is constant, it's eliminated from regular pathkeys.
>
> What would be the point of skipping if it's a constant?

For the query:

SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b;

There may be 1000s of records with a=1. We're only interested in the first one though. The traditional non-skip approach would still scan all records with a=1. Skip would just fetch the first one with a=1, then skip to the next prefix and stop the scan.

> > - to combat the issues mentioned earlier, there's now a check in
> > build_index_paths that checks if the query_pathkeys matches the
> > useful_pathkeys. Note that we have to use the path keys here rather than
> > any of the unique keys. The unique keys are only Expr nodes - they do not
> > contain the necessary information about ordering. Due to elimination of
> > some constant path keys, we have to search the attributes of the index to
> > find the correct prefix to use in skipping.
>
> IIUC here you mean this function, right?
>
> + prefix = find_index_prefix_for_pathkey(root,
RE: Index Skip Scan (new UniqueKeys)
Hi Dmitry,

Also took another look at the patch now, and found a case of incorrect data. It looks related to the new way of creating the paths, as I can't recall seeing this in earlier versions.

create table t1 as select a,b,b%5 as c, random() as d from generate_series(1, 10) a, generate_series(1,100) b;
create index on t1 (a,b,c);

postgres=# explain select distinct on (a) * from t1 order by a,b desc,c;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Sort  (cost=2.92..2.94 rows=10 width=20)
   Sort Key: a, b DESC, c
   ->  Index Scan using t1_a_b_c_idx on t1  (cost=0.28..2.75 rows=10 width=20)
         Skip scan: true
(4 rows)

With the 'order by a, b desc, c' we expect the value of column 'b' to always be 100. With index_skipscan enabled, it always gives 1 though. It's not correct that the planner chooses a skip scan followed by sort in this case.

-Floris
RE: Index Skip Scan
> * Suspicious performance difference between different type of workload,
> mentioned by Tomas (unfortunately I had no chance yet to investigate).

His benchmark results indeed most likely point to multiple comparisons being done. Since the most likely place where these occur is _bt_readpage, I suspect this is called multiple times. Looking at your patch, I think that's indeed the case. For example, suppose a page contains [1,2,3,4,5] and the planner makes a complete misestimation and chooses a skip scan here. The first call to _bt_readpage will compare every tuple on the page already and store everything in the workspace, which will now contain [1,2,3,4,5]. However, when a skip is done, the elements on the page (not the workspace) are compared to find the next one. Then, another _bt_readpage is done, starting at the new offnum. So we'll compare every tuple (except 1) on the page again. The workspace now contains [2,3,4,5]. For the next tuple we'll end up with [3,4,5], etc. So tuple 5 actually gets compared 5 times in _bt_readpage alone.

-Floris
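A toy illustration of the quadratic behaviour described in the message above (plain Python, not the actual nbtree code): if each skip triggers a fresh _bt_readpage starting one offset further on the same page, the tail of the page is compared again every time.

```python
page = [1, 2, 3, 4, 5]   # tuples on one leaf page
comparisons = 0

def read_page(start):
    """Stand-in for _bt_readpage: compares every tuple from `start` onwards."""
    global comparisons
    comparisons += len(page) - start

# One readpage per skip, each restarting at the next offset:
for offnum in range(len(page)):
    read_page(offnum)

# 5 + 4 + 3 + 2 + 1 = 15 comparisons, instead of 5 for a single pass:
assert comparisons == 15
```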
RE: Index Skip Scan
> Hm, I wasn't aware about this one, thanks for bringing this up. Btw, Floris, I
> would appreciate if in the future you can make it more visible that changes you
> suggest contain some fixes. E.g. it wasn't clear for me from your previous email
> that that's the case, and it doesn't make sense to pull into different direction
> when we're trying to achieve the same goal :)

I wasn't aware that this particular case could be triggered before I saw Dilip's email, otherwise I'd have mentioned it here of course. It's just that because my patch handles filter conditions in general, it works for this case too.

> > > In the patch I posted a week ago these cases are all handled
> > > correctly, as it introduces this extra logic in the Executor.
> >
> > Okay, so I think we can merge those fixes in Dmitry's patch set.
>
> I'll definitely take a look at suggested changes in filtering part.

It may be possible to just merge the filtering part into your patch, but I'm not entirely sure. Basically you have to pull the information about skipping one level up, out of the node, into the generic IndexNext code. I'm eager to get some form of skip scans into master - any kind of patch that makes this possible is fine by me. Long term I think my version provides a more generic approach, with which we can optimize a much broader range of queries. However, since many more eyes have seen your patch so far, I hope yours can be committed much sooner. My knowledge of the committer process is limited though. That's why I've just posted mine so far, in the hope of collecting some feedback, also on how we should continue with the process.

-Floris
RE: Index Skip Scan
> > > On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote:
> > >
> > > I was just wondering how the distinct will work with the "skip scan"
> > > if we have some filter? I mean every time we select the unique row
> > > based on the prefix key and that might get rejected by an external
> > > filter right?

Yeah, you're correct. This patch only handles the index conditions and doesn't handle any filters correctly. There's a check in the planner for the IndexScan, for example, that only columns that exist in the index are used. However, this check is not sufficient, as your example shows. There are a number of ways we can force a 'filter' rather than an 'index condition' and still choose a skip scan (WHERE b!=0 is another one, I think). This leads to incorrect query results. This patch would need some logic in the planner to never choose the skip scan in these cases. A better long-term solution is to adapt the rest of the executor to work correctly in the cases of external filters (this ties in with the previous visibility discussion as well, as that's basically also an external filter, although a special case). In the patch I posted a week ago these cases are all handled correctly, as it introduces this extra logic in the Executor.

-Floris
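A toy model of why a plain filter breaks a naive skip scan, as described in the message above (plain Python with hypothetical data; not executor code): the skip scan surfaces exactly one row per distinct prefix, so a filter applied afterwards can discard a prefix entirely instead of falling through to the next row with the same prefix.

```python
# (a, b) tuples as they appear in an index on (a, b); DISTINCT ON (a).
rows = [(1, 0), (1, 3), (2, 0), (2, 5)]
flt = lambda a, b: b != 0        # WHERE b != 0, applied as a filter

# Naive skip scan: pick the first row per prefix, THEN apply the filter.
skip_result, seen = [], set()
for a, b in rows:
    if a not in seen:
        seen.add(a)
        if flt(a, b):
            skip_result.append((a, b))

# Correct DISTINCT ON (a): apply the filter first, then take the
# first surviving row per prefix.
correct, seen = [], set()
for a, b in rows:
    if flt(a, b) and a not in seen:
        seen.add(a)
        correct.append((a, b))

assert correct == [(1, 3), (2, 5)]
assert skip_result == []   # both prefixes were wrongly dropped
```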
RE: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()
> Attached is v5, which inlines in a targeted fashion, pretty much in the same
> way as the earliest version. This is the same as v4 in every other way.
> Perhaps you can test this.

Thank you for the new patch. With the new one I am indeed able to reproduce a performance increase. It is very difficult to reliably reproduce it when running on a large number of cores though, due to the NUMA architecture. For tests with a small number of cores, I pin all of them to the same node. With that, I see a significant performance increase for v5 compared to master. However, if I pin pgbench to a different node than the node that Postgres is pinned to, this leads to a 20% performance degradation compared to having all of them on the same node, as well as the stddev increasing by a factor of 2 (regardless of patch). With that, it becomes very difficult to see any kind of performance increase due to the patch.

For a large number of pgbench workers, I cannot specifically pin the pgbench worker on the same node as the Postgres backend connection it's handling. Leaving it to the OS gives very unreliable results - when I run the 30 workers / 30 connections test, I sometimes see periods of up to 30 minutes where it's doing it 'correctly', but it could also randomly run at the -20% performance for a long time. So far my best bet at explaining this is the NUMA performance hit. I'd like to be able to specifically schedule some Postgres backends to run on one node, while other Postgres backends run on a different node, but this isn't straightforward. For now, I see no issues with the patch though. However, in real life situations there may be other, more important, optimizations for people that use big multi-node machines. Thoughts?

-Floris
RE: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()
> The cfbot thinks it doesn't even apply anymore --- conflict with the dedup
> patch, no doubt?

Minor conflict with that patch indeed. Attached is a rebased version. I'm running some tests now - will post the results here when finished.

-Floris

v4-0001-Avoid-pipeline-stall-in-_bt_compare.patch
v4-0002-Inline-_bt_compare.patch
RE: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()
> The interesting thing now is the role of the "negative infinity test" patch
> (the 0003-* patch) in all of this. I suspect that it may not be helping us
> much here. I wonder, could you test the following configurations to settle
> this question?
>
> * with 30 clients (i.e. repeat the test that you reported on most recently)
>
> * with 30 clients (i.e. repeat the test that you reported got us that nice
>   ~8.6% increase in TPS)
>
> * with 30 clients -- a new test, to see if performance is at all helped by
>   the "negative infinity test" patch (the 0003-* patch).
>
> It seems like a good idea to repeat the other two tests as part of performing
> this third test, out of general paranoia. Intel seem to roll out a microcode
> update for a spectre-like security issue about every other day.

I ran all the tests on two different machines, several times for 1 hour each time. I'm still having a hard time getting reliable results for the 30 clients case though. I'm pretty certain the patches bring a performance benefit, but how high exactly is difficult to say. As for applying only patch 1+2 or all three patches - I found no significant difference between these two cases. It looks like all the performance benefit is in the first two patches.

-Floris
RE: Index Skip Scan
> this point me and Jesper inclined to go with the second option. But maybe
> I'm missing something, are there any other suggestions?

Unfortunately I figured this would need a more invasive fix. I tend to agree that it'd be better not to skip in situations like this. I think it'd make most sense if any plan for these 'prepare/fetch' queries did not use a skip, but rather a materialize node, right?

-Floris
RE: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()
> Can you test this version, Floris? The second two patches are probably not
> helping here, so it would be useful if you could just test 0001-*, and then test
> all three together. I can toss the latter two patches if there is no additional
> speedup.

Here's the results for runs with respectively 1 client, 9 clients and 30 clients on master, v2-0001, v2-0001+0002+0003 and for completeness also the previous v1 version of the patches. I ran the tests for 45 minutes each this time which seems to give more stable results. I'd say applying just v2-0001 is actually slightly hurtful for single-core performance. Applying all of them gives a good improvement though. It looks like the performance improvement is also more noticeable at higher core counts now.

number of clients: 1
number of threads: 1
duration: 2700 s
number of transactions actually processed: 139314796
latency average = 0.019 ms
latency stddev = 0.001 ms
tps = 51598.071835 (including connections establishing)
tps = 51598.098715 (excluding connections establishing)

number of clients: 1
number of threads: 1
duration: 2700 s
number of transactions actually processed: 137257492
latency average = 0.020 ms
latency stddev = 0.001 ms
tps = 50836.107076 (including connections establishing)
tps = 50836.133137 (excluding connections establishing)

number of clients: 1
number of threads: 1
duration: 2700 s
number of transactions actually processed: 141721881
latency average = 0.019 ms
latency stddev = 0.001 ms
tps = 52489.584928 (including connections establishing)
tps = 52489.611373 (excluding connections establishing)

number of clients: 1
number of threads: 1
duration: 2700 s
number of transactions actually processed: 141663780
latency average = 0.019 ms
latency stddev = 0.001 ms
tps = 52468.065549 (including connections establishing)
tps = 52468.093018 (excluding connections establishing)

number of clients: 9
number of threads: 9
duration: 2700 s
number of transactions actually processed: 1197242115
latency average = 0.020 ms
latency stddev = 0.001 ms
tps = 443422.987601 (including connections establishing)
tps = 443423.306495 (excluding connections establishing)

number of clients: 9
number of threads: 9
duration: 2700 s
number of transactions actually processed: 1187890004
latency average = 0.020 ms
latency stddev = 0.002 ms
tps = 439959.241392 (including connections establishing)
tps = 439959.588125 (excluding connections establishing)

number of clients: 9
number of threads: 9
duration: 2700 s
number of transactions actually processed: 1203412941
latency average = 0.020 ms
latency stddev = 0.002 ms
tps = 445708.478915 (including connections establishing)
tps = 445708.798583 (excluding connections establishing)

number of clients: 9
number of threads: 9
duration: 2700 s
number of transactions actually processed: 1195359533
latency average = 0.020 ms
latency stddev = 0.001 ms
tps = 442725.734267 (including connections establishing)
tps = 442726.052676 (excluding connections establishing)

number of clients: 30
number of threads: 30
duration: 2700 s
number of transactions actually processed: 2617037081
latency average = 0.031 ms
latency stddev = 0.011 ms
tps = 969272.811990 (including connections establishing)
tps = 969273.960316 (excluding connections establishing)

number of clients: 30
number of threads: 30
duration: 2700 s
number of transactions actually processed: 2736881585
latency average = 0.029 ms
latency stddev = 0.011 ms
tps = 1013659.581348 (including connections establishing)
tps = 1013660.819277 (excluding connections establishing)

number of clients: 30
number of threads: 30
duration: 2700 s
number of transactions actually processed: 2844199686
latency average = 0.028 ms
latency stddev = 0.011 ms
tps = 1053407.074721 (including connections establishing)
tps = 1053408.220093 (excluding connections establishing)

number of clients: 30
number of threads: 30
duration: 2700 s
number of transactions actually processed: 2693765822
latency average = 0.030 ms
latency stddev = 0.011 ms
tps = 997690.883117 (including connections establishing)
tps = 997692.051005 (excluding connections establishing)
RE: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()
> > I could do some tests with the patch on some larger machines. What exact
> > tests do you propose? Are there some specific postgresql.conf settings and
> > pgbench initialization you recommend for this? And was the test above just
> > running 'pgbench -S' select-only with specific -T, -j and -c parameters?

With Andres' instructions I ran a couple of tests. With your patches I can reproduce a speedup of ~3% on single core tests reliably on a dual-socket 36-core machine for the pgbench select-only test case. When using the full scale test my results are way too noisy even for large runs unfortunately. I also tried some other queries (for example select's that return 10 or 100 rows instead of just 1), but can't see much of a speed-up there either, although it also doesn't hurt. So I guess the most noticeable one is the select-only benchmark for 1 core:

transaction type:
scaling factor: 300
query mode: prepared
number of clients: 1
number of threads: 1
duration: 600 s
number of transactions actually processed: 30255419
latency average = 0.020 ms
latency stddev = 0.001 ms
tps = 50425.693234 (including connections establishing)
tps = 50425.841532 (excluding connections establishing)

transaction type:
scaling factor: 300
query mode: prepared
number of clients: 1
number of threads: 1
duration: 600 s
number of transactions actually processed: 31363398
latency average = 0.019 ms
latency stddev = 0.001 ms
tps = 52272.326597 (including connections establishing)
tps = 52272.476380 (excluding connections establishing)

This is the one with 40 clients, 40 threads. Not really an improvement, and still quite noisy.
transaction type:
scaling factor: 300
query mode: prepared
number of clients: 40
number of threads: 40
duration: 600 s
number of transactions actually processed: 876846915
latency average = 0.027 ms
latency stddev = 0.015 ms
tps = 1461407.539610 (including connections establishing)
tps = 1461422.084486 (excluding connections establishing)

transaction type:
scaling factor: 300
query mode: prepared
number of clients: 40
number of threads: 40
duration: 600 s
number of transactions actually processed: 872633979
latency average = 0.027 ms
latency stddev = 0.038 ms
tps = 1454387.326179 (including connections establishing)
tps = 1454396.879195 (excluding connections establishing)

For tests that don't use the full machine (eg. 10 clients, 10 threads) I see speed-ups as well, but not as high as the single-core run. It seems there are other bottlenecks (on the machine) coming into play.

-Floris
RE: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()
> He reported that the problem went away with the patches applied. The
> following pgbench SELECT-only result was sent to me privately:
>
> before:
> single: tps = 30300.202363 (excluding connections establishing)
> all cores: tps = 1026853.447047 (excluding connections establishing)
>
> after:
> single: tps = 31120.452919 (excluding connections establishing)
> all cores: tps = 1032376.361006 (excluding connections establishing)
>
> (Actually, he tested something slightly different -- he inlined
> _bt_compare() in his own way instead of using my 0002-*, and didn't use the
> 0003-* optimization at all.)
>
> Apparently this was a large multi-socket machine. Those are hard to come by.

I could do some tests with the patch on some larger machines. What exact tests do you propose? Are there some specific postgresql.conf settings and pgbench initialization you recommend for this? And was the test above just running 'pgbench -S' select-only with specific -T, -j and -c parameters?

-Floris
RE: Index Skip Scan
Hi Dmitry,

Thanks for the new patch! I tested it and managed to find a case that causes some issues. Here's how to reproduce:

drop table if exists t;
create table t as select a,b,b%2 as c,10 as d from generate_series(1,5) a, generate_series(1,1000) b;
create index on t (a,b,c,d);

-- correct
postgres=# begin; declare c scroll cursor for select distinct on (a) a,b,c,d from t order by a desc, b desc; fetch forward all from c; fetch backward all from c; commit;
BEGIN
DECLARE CURSOR
 a |  b   | c | d
---+------+---+----
 5 | 1000 | 0 | 10
 4 | 1000 | 0 | 10
 3 | 1000 | 0 | 10
 2 | 1000 | 0 | 10
 1 | 1000 | 0 | 10
(5 rows)

 a |  b   | c | d
---+------+---+----
 1 | 1000 | 0 | 10
 2 | 1000 | 0 | 10
 3 | 1000 | 0 | 10
 4 | 1000 | 0 | 10
 5 | 1000 | 0 | 10
(5 rows)

-- now delete some rows
postgres=# delete from t where a=3;
DELETE 1000

-- and rerun: error is thrown
postgres=# begin; declare c scroll cursor for select distinct on (a) a,b,c,d from t order by a desc, b desc; fetch forward all from c; fetch backward all from c; commit;
BEGIN
DECLARE CURSOR
 a |  b   | c | d
---+------+---+----
 5 | 1000 | 0 | 10
 4 | 1000 | 0 | 10
 2 | 1000 | 0 | 10
 1 | 1000 | 0 | 10
(4 rows)

ERROR:  lock buffer_content is not held
ROLLBACK

A slightly different situation arises when executing the cursor with an ORDER BY a, b instead of the ORDER BY a DESC, b DESC:

-- recreate table again and execute the delete as above
postgres=# begin; declare c scroll cursor for select distinct on (a) a,b,c,d from t order by a, b; fetch forward all from c; fetch backward all from c; commit;
BEGIN
DECLARE CURSOR
 a | b | c | d
---+---+---+----
 1 | 1 | 1 | 10
 2 | 1 | 1 | 10
 4 | 1 | 1 | 10
 5 | 1 | 1 | 10
(4 rows)

 a |  b  | c | d
---+-----+---+----
 5 |   1 | 1 | 10
 4 |   1 | 1 | 10
 2 | 827 | 1 | 10
 1 |   1 | 1 | 10
(4 rows)

COMMIT

And lastly, you'll also get incorrect results if you do the delete slightly differently:

-- leave one row where a=3 and b=1000
postgres=# delete from t where a=3 and b<=999;
-- the cursor query above won't show any of the a=3 rows even though they should
-Floris
RE: Index Skip Scan
> Note in particular that index scans cannot return the same index tuple twice --
> processing a page at a time ensures that that cannot happen.
>
> Can a loose index scan return the same tuple (i.e. a tuple with the same heap
> TID) to the executor more than once?

The loose index scan shouldn't return a tuple twice. It should only be able to skip 'further', so that shouldn't be a problem. Out of curiosity, why can't index scans return the same tuple twice? Is there something in the executor that isn't able to handle this?

-Floris
Re: Index Skip Scan
Hi Dmitry,

> > On Wed, Jan 22, 2020 at 07:50:30AM +, Floris Van Nee wrote:
> >
> > Anyone please correct me if I'm wrong, but I think one case where the
> > current patch relies on some data from the page it has locked before it in
> > checking this hi/lo key. I think it's possible for the following sequence
> > to happen. Suppose we have a very simple one leaf-page btree containing
> > four elements: leaf page 1 = [2,4,6,8]
> > We do a backwards index skip scan on this and have just returned our first
> > tuple (8). The buffer is left pinned but unlocked. Now, someone else comes
> > in and inserts a tuple (value 5) into this page, but suppose the page
> > happens to be full. So a page split occurs. As far as I know, a page split
> > could happen at any random element in the page. One of the situations we
> > could be left with is:
> > Leaf page 1 = [2,4]
> > Leaf page 2 = [5,6,8]
> > However, our scan is still pointing to leaf page 1.
>
> In case if we just returned a tuple, the next action would be either
> check the next page for another key or search down to the tree. Maybe

But it won't look at the 'next page for another key', but rather at the 'same page for another key', right? In the _bt_scankey_within_page shortcut we're taking, there's no stepping to a next page involved. It just locks the page again that it previously also locked.

> I'm missing something in your scenario, but the latter will land us on a
> required page (we do not point to any leaf here), and before the former
> there is a check for high/low key. Is there anything else missing?

Let me try to clarify. After we return the first tuple, so->currPos.buf is pointing to page=1 in my example (it's the only page after all). We've returned item=8. Then the split happens and the items get rearranged as in my example. We're still pointing with so->currPos.buf to page=1, but the page now contains [2,4].
The split happened to the right, so there's a page=2 with [5,6,8], however the ongoing index scan is unaware of that. Now _bt_skip gets called to fetch the next tuple. It starts by checking _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir), the result of which will be 'true': we're comparing the skip key to the low key of the page. So it thinks the next key can be found on the current page. It locks the page and does a _binsrch, finding item=4 to be returned. The problem here is that _bt_scankey_within_page mistakenly returns true, thereby limiting the search to just the page that it's pointing to already. It may be fine to just fix this function to return the proper value (I guess it'd also need to look at the high key in this example). It could also be fixed by not looking at the lo/hi key of the page, but to use the local tuple buffer instead. We already did a _read_page once, so if we have any matching tuples on that specific page, we have them locally in the buffer already. That way we never need to lock the same page twice.
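The failure sequence above can be sketched with a toy model (plain Python; none of this is the real buffer/lock machinery, and the check below is a deliberate simplification of _bt_scankey_within_page for a backward scan): a check that only consults the page's low key cannot detect that the upper half of the page moved right during the split.

```python
def scankey_within_page(page, key):
    # Simplified stand-in for _bt_scankey_within_page on a backward scan:
    # it only checks the page's low key, so it cannot tell that tuples
    # above the split point were moved to a new right sibling.
    return page[0] <= key

def binsrch_backward(page, key):
    """Largest tuple on the page that sorts before `key`."""
    return max(t for t in page if t < key)

page1 = [2, 4, 6, 8]               # scan returned 8 from this page, pin kept
page1, page2 = [2, 4], [5, 6, 8]   # concurrent split moves 5,6,8 right

skip_key = 8
assert scankey_within_page(page1, skip_key)    # wrongly answers "yes"
assert binsrch_backward(page1, skip_key) == 4  # so the scan returns 4 ...
assert binsrch_backward(page2, skip_key) == 6  # ... although 6 is correct
```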
RE: Index Skip Scan
> Could you please elaborate why does it sound like that? If I understand
> correctly, to stop a scan only "between" pages one need to use only
> _bt_readpage/_bt_steppage? Other than that there is no magic with scan
> position in the patch, so I'm not sure if I'm missing something here.

Anyone please correct me if I'm wrong, but I think there is one case where the current patch relies on some data from the page it has locked before, in checking this hi/lo key. I think it's possible for the following sequence to happen. Suppose we have a very simple one leaf-page btree containing four elements:

leaf page 1 = [2,4,6,8]

We do a backwards index skip scan on this and have just returned our first tuple (8). The buffer is left pinned but unlocked. Now, someone else comes in and inserts a tuple (value 5) into this page, but suppose the page happens to be full. So a page split occurs. As far as I know, a page split could happen at any random element in the page. One of the situations we could be left with is:

Leaf page 1 = [2,4]
Leaf page 2 = [5,6,8]

However, our scan is still pointing to leaf page 1. For non-skip scans this is not a problem, as we already read all matching elements into our local buffer and we'll return those. But the skip scan currently:
a) checks the lo-key of the page to see if the next prefix can be found on leaf page 1
b) finds out that this is actually true
c) does a search on the page and returns value=4 (while it should have returned value=6)

Peter, is my understanding about the btree internals correct so far?

Now that I look at the patch again, I fear there currently may also be such a dependency in the "Advance forward but read backward" case. It saves the offset number of a tuple in a variable, then does a _bt_search (releasing the lock and pin on the page).
At this point, anything can happen to the tuples on this page - the page may be compacted by vacuum such that the offset number you have in your variable does not match the actual offset number of the tuple on the page anymore. Then, at the check for (nextOffset == startOffset) later, there's a possibility the offsets are different even though they relate to the same tuple. -Floris
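The offset-number hazard described above can be sketched with a toy model (hypothetical values; the real code works with OffsetNumbers on a buffer page): a saved offset identifies a slot, not a tuple, so compaction can shift the same tuple to a different offset while the lock is released.

```python
page = ["dead", "A", "B"]        # line-pointer slots 0..2
start_offset = page.index("B")   # remember where tuple "B" was: offset 2

# Lock and pin are released; vacuum removes the dead tuple and compacts:
page = [t for t in page if t != "dead"]   # page is now ["A", "B"]

# Re-searching for the same tuple now yields a different offset:
next_offset = page.index("B")             # offset 1

# A (nextOffset == startOffset) comparison no longer holds, even though
# both offsets referred to the same tuple "B":
assert next_offset != start_offset
assert page[next_offset] == "B"
```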
RE: Index Skip Scan
Hi all,

I reviewed the latest version of the patch. Overall some good improvements I think. Please find my feedback below.

- I think I mentioned this before - it's not that big of a deal, but it just looks weird and inconsistent to me:

create table t2 as (select a, b, c, 10 d from generate_series(1,5) a, generate_series(1,100) b, generate_series(1,1) c);
create index on t2 (a,b,c desc);

postgres=# explain select distinct on (a,b) a,b,c from t2 where a=2 and b>=5 and b<=5 order by a,b,c desc;
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Index Only Scan using t2_a_b_c_idx on t2  (cost=0.43..216.25 rows=500 width=12)
   Skip scan: true
   Index Cond: ((a = 2) AND (b >= 5) AND (b <= 5))
(3 rows)

postgres=# explain select distinct on (a,b) a,b,c from t2 where a=2 and b=5 order by a,b,c desc;
                                       QUERY PLAN
------------------------------------------------------------------------------------------
 Unique  (cost=0.43..8361.56 rows=500 width=12)
   ->  Index Only Scan using t2_a_b_c_idx on t2  (cost=0.43..8361.56 rows=9807 width=12)
         Index Cond: ((a = 2) AND (b = 5))
(3 rows)

When doing a distinct on (params) and having equality conditions for all params, it falls back to the regular index scan even though there's no reason not to use the skip scan here. It's much faster to write b between 5 and 5 now rather than writing b=5. I understand this was a limitation of the unique-keys patch at the moment, which could be addressed in the future. I think for the sake of consistency it would make sense if this eventually gets addressed.

- nodeIndexScan.c, line 126

This sets xs_want_itup to true in all cases (even for non skip-scans). I don't think this is acceptable given the side-effects this has (the page will never be unpinned in between returned tuples in _bt_drop_lock_and_maybe_pin).

- nbtsearch.c, _bt_skip, line 1440

_bt_update_skip_scankeys(scan, indexRel);

This function is called twice now - once in the else {} and immediately after that outside of the else. The second call can be removed I think.
- nbtsearch.c, _bt_skip, line 1597

LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
scan->xs_itup = (IndexTuple) PageGetItem(page, itemid);

This is an UNLOCK followed by a read of the unlocked page. That looks incorrect?

- nbtsearch.c, _bt_skip, line 1440

if (BTScanPosIsValid(so->currPos) &&
    _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))

Is it allowed to look at the high key / low key of the page without having a read lock on it?

- nbtsearch.c, line 1634

if (_bt_readpage(scan, indexdir, offnum)) ... else error()

Is it really guaranteed that a match can be found on the page itself? Isn't it possible that an extra index condition, not part of the scan key, makes none of the keys on the page match?

- nbtsearch.c in general

Most of the code seems to rely quite heavily on the fact that xs_want_itup forces _bt_drop_lock_and_maybe_pin to never release the buffer pin. Have you considered that compacting of a page may still happen even if you hold the pin? [1] I've been trying to come up with cases in which this may break the patch, but I haven't been able to produce such a scenario - so it may be fine. But it would be good to consider again. One thing I was thinking of was a scenario where page splits and/or compacting would happen in between returning tuples. Could this break the _bt_scankey_within_page check such that it thinks the scan key is within the current page, while it actually isn't? Mainly for backward and/or cursor scans. Forward scans shouldn't be a problem I think. Apologies for being a bit vague as I don't have a clear example ready when it would go wrong. It may well be fine, but it was one of the things on my mind.

[1] https://postgrespro.com/list/id/1566683972147.11...@optiver.com

-Floris
jsonb_set() strictness considered harmful to data
FWIW I've been bitten by this 'feature' more than once as well, accidentally erasing a column. Now I usually write

js = jsonb_set(js, '{<path>}', coalesce(new_column, 'null'::jsonb))

to prevent erasing the whole column, and instead set the value to a jsonb null, but I also found the STRICT behavior very surprising at first.

-Floris
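For illustration, the STRICT behavior and the coalesce workaround can be modeled in a few lines of Python (a simplified sketch of jsonb_set semantics, not the real implementation; the single-level path handling here is an assumption for brevity):

```python
import json

def jsonb_set_strict(target, path, new_value):
    """Mimic a STRICT SQL function: any NULL (None) argument makes
    the whole result NULL -- this is what silently erases the column."""
    if target is None or path is None or new_value is None:
        return None
    obj = json.loads(target)
    cur = obj
    for key in path[:-1]:          # walk down to the parent of the target key
        cur = cur[key]
    cur[path[-1]] = json.loads(new_value)
    return json.dumps(obj)

def jsonb_set_coalesce(target, path, new_value):
    """The workaround: coalesce a NULL new value to a JSON null literal,
    so the document survives and only the field becomes null."""
    return jsonb_set_strict(target, path,
                            new_value if new_value is not None else 'null')

# The surprising STRICT behavior: the whole document becomes NULL.
assert jsonb_set_strict('{"a": 1}', ['a'], None) is None
# The workaround keeps the document and stores a JSON null instead.
assert jsonb_set_coalesce('{"a": 1}', ['a'], None) == '{"a": null}'
```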
Re: Index Skip Scan
> Sorry for long delay. Here is more or less what I had in mind. After changing
> read_closest to read the whole page I couldn't resist to just merge it into
> readpage itself, since it's basically the same. It could raise questions about
> performance and how intrusive this patch is, but I hope it's not that much of a
> problem (in the worst case we can split it back). I've also added a few tests
> for the issue you've mentioned. Thanks again, I appreciate how much effort
> you put into reviewing!

Putting it into one function makes sense I think. Looking at the patch, I think in general there are some good improvements in there. I'm afraid I did manage to find another incorrect query result though, having to do with the keepPrev part and skipping to the first tuple on an index page:

postgres=# drop table if exists b;
create table b as select a,b::int2 b,(b%2)::int2 c from generate_series(1,5) a, generate_series(1,366) b;
create index on b (a,b,c);
analyze b;
DROP TABLE
SELECT 1830
CREATE INDEX
ANALYZE
postgres=# set enable_indexskipscan=1;
SET
postgres=# select distinct on (a) a,b,c from b where b>=1 and c=0 order by a,b;
 a | b | c
---+---+---
 1 | 2 | 0
 2 | 4 | 0
 3 | 4 | 0
 4 | 4 | 0
 5 | 4 | 0
(5 rows)

postgres=# set enable_indexskipscan=0;
SET
postgres=# select distinct on (a) a,b,c from b where b>=1 and c=0 order by a,b;
 a | b | c
---+---+---
 1 | 2 | 0
 2 | 2 | 0
 3 | 2 | 0
 4 | 2 | 0
 5 | 2 | 0
(5 rows)

-Floris
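For reference, the DISTINCT ON semantics that the skip scan has to reproduce can be written as a small Python model (a sketch for reasoning about correctness, not PostgreSQL code):

```python
def distinct_on_first_match(rows, qual):
    """Reference semantics for DISTINCT ON (a) ... WHERE qual ORDER BY a,b:
    for each value of the first column, return the first row in index order
    that satisfies the qual. A correct skip scan must match this, including
    right after skipping to a new prefix."""
    out = []
    seen = set()
    for row in sorted(rows):
        if row[0] not in seen and qual(row):
            seen.add(row[0])
            out.append(row)
    return out

# The test case from above: (a, b, b % 2), qual is b >= 1 and c = 0.
rows = [(a, b, b % 2) for a in range(1, 6) for b in range(1, 367)]
result = distinct_on_first_match(rows, lambda r: r[1] >= 1 and r[2] == 0)
# The first qualifying tuple per prefix is b=2 for every a -- the patched
# skip scan returning b=4 for a >= 2 is therefore wrong.
assert result == [(a, 2, 0) for a in range(1, 6)]
```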
Re: Optimize single tuple fetch from nbtree index
>> It seems that it contradicts the very idea of your patch, so probably we
>> should look for other ways to optimize this use-case.
>> Maybe this restriction can be relaxed for write-only tables, that never
>> have to reread the page because of visibility, or something like that.
>> Also we probably can add to IndexScanDescData info about expected number
>> of tuples, to allow index work more optimal and avoid the overhead for
>> other loads.

> The idea of the patch is exactly to relax this limitation. I forgot to update
> that README file though. The current implementation of the patch should be
> correct like this - that's why I added the look-back code on the page if the
> tuple couldn't be found anymore on the same location on the page. Similarly,
> it'll look on the page to the right if it detected a page split. These two
> measures combined should give a correct implementation of the 'it's possible
> that a scan stops in the middle of a page' relaxation. However, as Peter and
> Tom pointed out earlier, they feel that the performance advantage that this
> approach gives does not outweigh the extra complexity at this time. I'd be
> open to other suggestions though.

Although now that I think of it - do you mean the case where the tuple that we returned to the caller after _bt_first actually gets deleted (not moved) from the page? I guess that can theoretically happen if _bt_first returns a non-visible tuple (one not yet marked DEAD in the index at the time of _bt_first). For my understanding, would a situation like the following lead to this (in my patch)?

1) Backend 1 does an index scan and returns the first tuple on _bt_first - this tuple is actually deleted in the heap already, however it's not marked dead yet in the index.
2) Backend 1 does a heap fetch to check actual visibility and determines the tuple is actually dead.
3) While backend 1 is busy doing the heap fetch (so in between _bt_first and _bt_next), backend 2 comes in and manages to somehow do 1) a _bt_killitems on the page to mark tuples dead, as well as 2) compact items on the page, thereby actually removing this item from the page.
4) Now backend 1 tries to find the next tuple in _bt_next - it first tries to locate the tuple where it left off, but cannot find it anymore because it got removed completely by backend 2.

If this is indeed possible then it's a bad issue unfortunately, and quite hard to reproduce, as a lot of things need to happen concurrently while doing a visibility check.

As for your patch, I've had some time to take a look at it. For the two TODOs:

+ /* TODO Is it possible that currPage is not valid anymore? */
+ Assert(BTScanPosIsValid(so->currPos))

This Assert exists already a couple of lines earlier at the start of this function.

+ * TODO It is not clear to me
+ * why to check scanpos validity based on currPage value.
+ * I wonder, if we need currPage at all? Is there any codepath that
+ * assumes that currPage is not the same as BufferGetBlockNumber(buf)?
+ */

The comments in the source mention the following about this:

 * We note the buffer's block number so that we can release the pin later.
 * This allows us to re-read the buffer if it is needed again for hinting.
 */
so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);

As we figured out earlier, so->currPos.buf gets set to invalid when we release the pin by the unpin macro. So, if we don't store the currPage number somewhere else, we cannot obtain the pin again if we need it during killitems. I think that's the reason that currPage is stored.

Other than the two TODOs in the code, I think the comments really help clarify what's going on in the code - I'd be happy if this gets added.

-Floris
Re: Optimize single tuple fetch from nbtree index
> It seems that it contradicts the very idea of your patch, so probably we
> should look for other ways to optimize this use-case.
> Maybe this restriction can be relaxed for write-only tables, that never
> have to reread the page because of visibility, or something like that.
> Also we probably can add to IndexScanDescData info about expected number
> of tuples, to allow index work more optimal and avoid the overhead for
> other loads.

The idea of the patch is exactly to relax this limitation. I forgot to update that README file though. The current implementation of the patch should be correct like this - that's why I added the look-back code on the page if the tuple couldn't be found anymore on the same location on the page. Similarly, it'll look on the page to the right if it detected a page split. These two measures combined should give a correct implementation of the 'it's possible that a scan stops in the middle of a page' relaxation. However, as Peter and Tom pointed out earlier, they feel that the performance advantage that this approach gives does not outweigh the extra complexity at this time. I'd be open to other suggestions though.

> That's true. It took me quite some time to understand that existing code
> is correct.
> There is a comment for the structure's field that claims that
> BufferIsValid is the same as BufferIsPinned in ScanPos context.
> Attached patch contains some comments' updates. Any suggestions on how
> to improve them are welcome.

I'll have a look tomorrow. Thanks a lot for writing this up!

-Floris
Re: Optimize single tuple fetch from nbtree index
> Hello,
> welcome to hackers with your first patch)

Thank you.

> Though, I got interested in the comment inconsistency you have found.
> I added debug message into this code branch of the patch and was able to
> see it in regression.diffs after 'make check':
> Speaking of your patch, it seems that the buffer was unpinned and pinned
> again between two reads,
> and the condition of holding it continuously has not been met.

May I ask what makes you conclude that the condition of holding the pin continuously has not been met? Your reply encouraged me to dig a little bit more into this today. First, I wanted to check if indeed the pin was continuously held by the backend or not. I added some debug info to ReleaseBuffer for this: it turned out that the pin on the buffer was definitely never released by the backend between the calls to _bt_first and _bt_next. So the buffer got compacted while the backend held a pin on it.

After some more searching I found the following code: _bt_vacuum_one_page in nbtinsert.c. This function compacts one single page without taking a super-exclusive lock. It is used during inserts to make room on a page. I verified that if I comment out the calls to this function, the compacting never happens while I have a pin on the buffer. So I guess that answers my own question: cleaning up garbage during inserts is one of the cases where compacting may happen even while other backends hold a pin on the buffer. Perhaps this should also be phrased more clearly in the comments in eg. _bt_killitems? Because currently those comments make it look like this case never occurs.

> While reading the code to answer your question, I noticed that
> BTScanPosIsPinned macro name is misleading.
> It calls BufferIsValid(), not BufferIsPinned() as one could expect.
> And BufferIsValid in bufmgr.h comment explicitly states that it
> shouldn't be confused with BufferIsPinned.
> The same goes for BTScanPosUnpinIfPinned().

I agree the name is misleading.
It clearly does something else than how it's named. However, I don't believe this introduces problems in these particular pieces of code, as long as the macros are always used. BTScanPosIsPinned actually checks whether the position is valid and not necessarily whether it's pinned, as you mentioned. However, any time the buffer gets unpinned using the macro BTScanPosUnpin, the buffer gets set to Invalid by the macro as well. Therefore, any consecutive call to BTScanPosIsPinned should indeed return false. It'd definitely be nice if this gets clarified in the comments though.

-Floris
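As a small illustration of the invariant described above (a toy Python model, not the actual C macros):

```python
class ScanPos:
    """Toy model of the nbtree invariant discussed above: the unpin macro
    always resets the buffer to invalid, so 'is valid' doubles as
    'is pinned' -- provided the macros are used consistently."""
    INVALID = -1

    def __init__(self, buf):
        self.buf = buf

    def is_pinned(self):
        # BTScanPosIsPinned really checks validity, not the pin count.
        return self.buf != ScanPos.INVALID

    def unpin(self):
        # BTScanPosUnpin invalidates the buffer as it unpins, which is
        # what keeps the is_pinned() check honest.
        self.buf = ScanPos.INVALID

pos = ScanPos(buf=42)
assert pos.is_pinned()
pos.unpin()
assert not pos.is_pinned()   # valid-implies-pinned holds after unpin
```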
Re: Index Skip Scan
> Yes, the check should be for that. However, the query in question
> doesn't have any query_pathkeys, and hence query_uniquekeys in
> standard_qp_callback(), so therefore it isn't supported.
> Your scenario is covered by one of the test cases in case the
> functionality is supported. But, I think that is outside the scope of
> the current patch.

Ah alright, thanks. That makes it clear why it doesn't work. From a user point of view I think it's rather strange that

SELECT DISTINCT ON (a) a,b FROM a WHERE a BETWEEN 2 AND 2

would give a fast skip scan, even though the more likely query that someone would write,

SELECT DISTINCT ON (a) a,b FROM a WHERE a=2

would not. It is something we could leave up to the next patch though.

Something else I just noticed which I'm just writing here for awareness; I don't think it's that pressing at the moment and can be left to another patch. When there are multiple indexes on a table, the planner gets confused and doesn't select an index-only skip scan even though it could. I'm guessing it just takes the first available index based on the DISTINCT clause and then doesn't look further, eg. with an index on (a,b) and (a,c,b):

postgres=# explain select distinct on (a) a,c,b FROM a;
                             QUERY PLAN
--------------------------------------------------------------------
 Index Scan using a_a_b_idx on a  (cost=0.29..1.45 rows=5 width=12)
   Skip scan mode: true
(2 rows)

-> This could be an index-only scan with the (a,c,b) index.

> For the records, the purpose of `_bt_read_closest` is not so much to reduce
> amount of data we read from a page, but more to correctly handle those
> situations we were discussing before with reading forward/backward in cursors,
> since for that in some cases we need to remember previous values for stepping
> to the next. I've limited number of items, fetched in this function just
> because I was misled by having a check for dead tuples in `_bt_skip`.
> Of course
> we can modify it to read a whole page and leave the visibility check for the
> code after `index_getnext_tid` (although in case we know that all tuples on
> this page are visible I guess it's not strictly necessary, but we still get
> improvement from skipping itself).

I understand and I agree - the primary purpose why we chose this function was to make it work correctly. I don't think it would be something for this patch to use the optimization of partially reading a page. My point was however, if this optimization were allowed in a future patch, it would have great performance benefits. To fix the current patch, we'd indeed need to read the full page. It'd be good to take a close look at the implementation of this function then, because messing around with the previous/next item is also not trivial. I think the current implementation also has a problem when the item that is skipped to is the first item on the page. Eg. (this depends on page size):

postgres=# drop table if exists b;
create table b as select a,b from generate_series(1,5) a, generate_series(1,366) b;
create index on b (a,b);
analyze b;
DROP TABLE
SELECT 1830
CREATE INDEX
ANALYZE
postgres=# select distinct on(a) a,b from b;
 a | b
---+---
 1 | 1
 2 | 2   <-- (2,1) is the first item on the page and doesn't get selected by the
            read_closest function. It returns the second item on the page, which is (2,2).
 3 | 2
 4 | 2
 5 | 2
(5 rows)

-Floris
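A small reference model of what a corrected read has to produce (plain Python, just to pin down the expected output when a prefix starts exactly at a page boundary):

```python
def first_per_prefix(items):
    """Reference result for DISTINCT ON (a) over index order: the first
    tuple of each prefix. A correct (partial or full) page read must
    return this even when the first tuple of a prefix is also the first
    tuple on an index page."""
    out, seen = [], set()
    for a, b in items:
        if a not in seen:
            seen.add(a)
            out.append((a, b))
    return out

# Mirrors the repro above: 366 tuples per prefix, so each new prefix
# lands close to an index page boundary.
items = sorted((a, b) for a in range(1, 6) for b in range(1, 367))
# Correct output is (a, 1) for every a -- the buggy read_closest returned
# (2, 2) instead of (2, 1).
assert first_per_prefix(items) == [(a, 1) for a in range(1, 6)]
```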
Re: Optimize single tuple fetch from nbtree index
Hi Peter,

> Actually, having looked at the test case in more detail, that now
> seems less likely. The test case seems designed to reward making it
> cheaper to access one specific tuple among a fairly large group of
> related tuples -- reducing the number of inner scans is not going to
> be possible there.
> If this really is totally representative of the case that Floris cares
> about, I suppose that the approach taken more or less makes sense.
> Unfortunately, it doesn't seem like an optimization that many other
> users would find compelling, partly because it's only concerned with
> fixed overheads, and partly because most queries don't actually look
> like this.

Thanks for taking a look. Unfortunately this is exactly the case I care about. I'm a bit puzzled as to why this case wouldn't come up more often for other users though. We have many large tables with timeseries data, and it seems to me that with timeseries data, two of the most common queries are:

(1) What is the state of { a1, a2, a3 ... } at timepoint t? (but you don't know that there's an update *exactly* at timepoint t - so you're left with trying to find the latest update smaller than t)
(2) What is the state of { a } at timepoints { t1, t2, t3 ... }?

Given that a1, a2, a3 ... are independently updating, but similar, time series (eg. sensor a1 and sensor a2 both provide a temperature value and update independently from each other). Both of these can also be done with some kind of DISTINCT ON clause, but this is often already much slower than just doing a nested loop of fast index lookups with LIMIT 1 (this depends on the frequency of the timeseries data itself versus the sampling rate of your query though; for high-frequency time series and/or low-frequency sampling, the LIMIT 1 approach is much faster).

Note that there is actually some related work to this - in the Index Skip Scan thread [1] a function called _bt_read_closest was developed which also partially reads the page.
A Skip Scan has a very similar access pattern to the use case I describe here, because it's also very likely to just require one tuple from the page. Even though the implementation in that patch is currently incorrect, performance of the Skip Scan would likely also be quite a bit faster if it had a correct implementation of this partial page read and it wouldn't have to read the full page every time.

I have one further question about these index offsets. There are several comments in master that indicate that it's impossible that an item moves 'left' on a page if we continuously hold a pin on the page. For example, _bt_killitems has a comment like this:

 * Note that if we hold a pin on the target page continuously from initially
 * reading the items until applying this function, VACUUM cannot have deleted
 * any items from the page, and so there is no need to search left from the
 * recorded offset. (This observation also guarantees that the item is still
 * the right one to delete, which might otherwise be questionable since heap
 * TIDs can get recycled.) This holds true even if the page has been modified
 * by inserts and page splits, so there is no need to consult the LSN.

Still, exactly this case happens in practice. In my tests I was able to get behavior like:

1) pin + lock a page in _bt_first
2) read a tuple, record its indexOffset (for example offset=100) and heap tid
3) unlock the page, but *keep* the pin (end of _bt_first in my patch)
4) lock the page again in _bt_next (we still hold the pin, so vacuum shouldn't have occurred)
5) look inside the current page for the heap tid that we registered earlier
6) find that the tuple is now at indexOffset=98, eg. it moved left. This should not be possible.

This case sometimes randomly happens when running 'make check', which is why I added code in my patch to also look left on the page from the previous indexOffset. However, this is in contradiction with the comments (and code) of _bt_killitems.
Is the comment incorrect/outdated or is there a bug in vacuum or any other part of Postgres that might move index items left even though there are others holding a pin? -Floris [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGKW4dXTP9G%2BWBskjT09tzD%2B9aMWEm%3DFpeb6RS5SXfPyKw%40mail.gmail.com#21abe755d5cf36aabaaa048c8a282169
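The re-find logic described above can be sketched roughly like this (a simplified Python model with hypothetical names, not the actual patch code):

```python
def relocate_tid(page, remembered_off, tid):
    """Sketch of re-finding a tuple after re-locking a page while holding
    only the pin in between: the tuple may have shifted from the
    remembered offset. Check the remembered offset first, then search
    right (inserts push items right), then left (the surprising case:
    compaction by _bt_vacuum_one_page can move items left)."""
    n = len(page)
    if remembered_off < n and page[remembered_off] == tid:
        return remembered_off
    for off in range(remembered_off + 1, n):        # moved right
        if page[off] == tid:
            return off
    for off in range(remembered_off - 1, -1, -1):   # moved left
        if page[off] == tid:
            return off
    return None  # gone: page split moved it right, or it was removed

# 't3' was compacted away, so 't4' now sits left of its remembered offset.
page = ['t1', 't2', 't4', 't5']
assert relocate_tid(page, 3, 't5') == 3   # unmoved
assert relocate_tid(page, 3, 't4') == 2   # found left of remembered offset
```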
Re: Index Skip Scan
Thanks for the new patch. I've reviewed the skip scan patch, but haven't taken a close look at the uniquekeys patch yet.

In my previous review I mentioned that queries of the form

select distinct on(a) a,b from a where a=1;

do not lead to a skip scan with the patch, even though the skip scan would be much faster. It's about this piece of code in planner.c:

/* Consider index skip scan as well */
if (enable_indexskipscan &&
    IsA(path, IndexPath) &&
    ((IndexPath *) path)->indexinfo->amcanskip &&
    root->distinct_pathkeys != NIL)

The root->distinct_pathkeys is already filtered for redundant keys, so column 'a' is not in there anymore. Still, it'd be much faster to use the skip scan here, because a regular scan will scan all entries with a=1, even though we're really only interested in the first one. In previous versions, this would be fixed by changing the check in planner.c to use root->uniq_distinct_pathkeys instead of root->distinct_pathkeys, but things change a bit now that the patch is rebased on the unique-keys patch. Would it be valid to change this check to root->query_uniquekeys != NIL to consider skip scans also for this query?

Second is about the use of _bt_skip and _bt_read_closest in nbtsearch.c. I don't think _bt_read_closest is correctly implemented, and I'm not sure if it can be used at all, due to concerns by Tom and Peter about such an approach. I had a similar idea to only partially read items from a page for another use case, for which I submitted a patch last Friday. However, both Tom and Peter find this idea quite scary [1]. You could take a look at my patch on that thread to see the approach taken to correctly partially read a page (well, correct as far as I can see so far...), but perhaps we need to just use the regular _bt_readpage function which reads everything, although this is unfortunate from a performance point of view, since most of the time we're indeed just interested in the first tuple. Eg.
it looks like there are some mix-ups between index offsets and heap tids in _bt_read_closest:

/*
 * Save the current item and the previous, even if the
 * latter does not pass scan key conditions
 */
ItemPointerData tid = prevItup->t_tid;
OffsetNumber prevOffnum = ItemPointerGetOffsetNumber(&tid);

_bt_saveitem(so, itemIndex, prevOffnum, prevItup);
itemIndex++;
_bt_saveitem(so, itemIndex, offnum, itup);
itemIndex++;

The 'prevOffnum' is the offset number of the heap tid, and not the index offset, so it looks like just a random item is saved. Furthermore, index offsets may change due to insertions and vacuums, so if we, at any point, release the lock, these offsets are not necessarily valid anymore. However, currently the patch just reads the closest item and then doesn't consider the page at all anymore if the first tuple skipped to turns out to be not visible. Consider the following sql to illustrate:

create table a (a int, b int, c int);
insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 1) ks);
create index on a (a,b);
analyze a;

select distinct on (a) a,b from a order by a,b;
 a | b
---+---
 1 | 1
 2 | 1
 3 | 1
 4 | 1
 5 | 1
(5 rows)

delete from a where a=2 and b=1;
DELETE 1

select distinct on (a) a,b from a order by a,b;
 a | b
---+-----
 1 | 1
 2 | 249   ->> this should be b=2, because we deleted a=2 && b=1. However, it
             doesn't consider any tuples from that page anymore and gives us the
             first tuple from the next page.
 3 | 1
 4 | 1
 5 | 1
(5 rows)

-Floris

[1] https://www.postgresql.org/message-id/flat/26641.1564778586%40sss.pgh.pa.us#dd8f23e1704f45447185894e1c2a4f2a
Re: Optimize single tuple fetch from nbtree index
Hi Tom,

Thanks for your quick reply!

> Regardless of whether there's actually a LIMIT 1? That seems disastrous
> for every other case than the narrow one where the optimization wins.
> Because every other case is now going to have to touch the index page
> twice. That's more CPU and about double the contention --- if you could
> not measure any degradation from that, you're not measuring the right
> thing.

I thought the same as well at first. Note that I did measure a degradation of 2-3% in some cases, as mentioned, but initially I also expected worse. Do you have any ideas on cases that would suffer the most? I thought the tight inner nested loop that I posted in my performance tests would have this index lookup as bottleneck. I know they are the bottleneck for the LIMIT 1 query (because these improve by a factor 2-3 with the patch). And my theory is that for a LIMIT 3, the price paid for this optimization is highest, because it would touch the page twice and read all items from it, while only returning three of them.

> In principle, you could pass down knowledge of whether there's a LIMIT,
> using the same mechanism used to enable top-N sorting. But it'd have to
> also go through the AM interface layer, so I'm not sure how messy this
> would be.

This was an idea I had as well and I would be willing to implement such a thing if this is deemed interesting enough by the community. However, I didn't want to do this for the first version of this patch, as it would be quite some extra work, which would be useless if the idea of the patch itself gets rejected already. :-) I'd appreciate any pointers in the right direction - I can take a look at how top-N sorting pushes the LIMIT down. Given enough interest for the basic idea of this patch, I will implement it.

>> This calls _bt_first in nbtsearch.c, which will, if there are scankeys,
>> descend the tree to a leaf page and read just the first (or possibly two)
>> tuples. It won't touch the rest of the page yet. If indeed just one tuple
>> was required, there won't be a call to _bt_next and we're done. If we do
>> need more than one tuple, _bt_next will resume reading tuples from the
>> index page at the point where we left off.

> How do you know how many index entries you have to fetch to get a tuple
> that's live/visible to the query?

Indeed we don't know that - that's why this initial patch does not make any assumptions about this and just assumes the good-weather scenario that everything is visible. I'm not sure if it's possible to give an estimation of this and whether or not that would be useful. Currently, if it turns out that the tuple is not visible, there'll just be another call to _bt_next which will resume reading the page as normal. I'm open to implementing any suggestions that may improve this.

>> - We need to take into account page splits, insertions and vacuums while we
>> do not have the read-lock in between _bt_first and the first call to
>> _bt_next. This made my patch quite a bit more complicated than my initial
>> implementation.

> Meh. I think the odds that you got this 100% right are small, and the
> odds that it would be maintainable are smaller. There's too much that
> can happen if you're not holding any lock --- and there's a lot of active
> work on btree indexes, which could break whatever assumptions you might
> make today.

Agreed, which is also why I posted this initial version of the patch here already, to get some input from the experts on this topic about what assumptions can be made now and in the future. If it turns out that it's completely not feasible to do an optimization like this, because of other constraints in the btree implementation, then we're done pretty quickly here. :-) For what it's worth: the patch at least passes make check consistently - I caught a lot of these edge cases related to page splits and insertions while running the regression tests, which run the modified bits of code quite often and in parallel. There may be plenty of edge cases left however...

> I'm not unalterably opposed to doing something like this, but my sense
> is that the complexity and probable negative performance impact on other
> cases are not going to look like a good trade-off for optimizing the
> case at hand.

I do think it could be a big win if we could get something like this working. Cases with a LIMIT seem common enough to me to make it possible to add some extra optimizations, especially if that could lead to 2-3x the TPS for these kinds of queries. However, it indeed needs to be within a reasonable complexity. If it turns out that in order to optimize this, we need to add a lot of extra complexity, it may not be worth it.

> BTW, you haven't even really made the case that optimizing a query that
> behaves this way is the right thing to be doing ... maybe some other
> plan shape that isn't a nestloop around a LIMIT query would be a better
> solution.

It is prett
Optimize single tuple fetch from nbtree index
Hi hackers,

While I was reviewing some code in another patch, I stumbled upon a possible optimization in the btree index code in nbtsearch.c for queries using 'LIMIT 1'. I have written a small patch that implements this optimization, but I'd like some feedback on the design of the patch, whether it is correct at all to use this optimization, and whether the performance tradeoffs are deemed worth it by the community.

Basically, an example of the case I'd like to optimize is the following. Given a table 'tbl' with an index on columns (k, ts DESC):

SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp ORDER BY k, ts DESC LIMIT 1;

And, even more importantly, when this query gets called in an inner loop like:

SELECT * FROM generate_series(:start_ts, :end_ts, :interval) ts -- perhaps thousands of iterations; could also be a loop over values of 'k' rather than timestamps, this is just an example
CROSS JOIN LATERAL (
  SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp ORDER BY k, ts DESC LIMIT 1
) _;

With time-series data, this case often arises, as you have a certain natural key for which you store updates as they occur. Getting the state of k at a specific time then boils down to the given query, which is almost always the fastest way to get this information, since the index scan with LIMIT 1 is very fast already. However, there seems to be a possibility to make this even faster (up to nearly 3x faster in test cases that use this nested loop of index scans).

Every time the index scan is done, all tuples from the leaf page are read in nbtsearch.c:_bt_readpage. The idea of this patch is to make an exception for this *only* the first time amgettuple gets called. This calls _bt_first in nbtsearch.c, which will, if there are scankeys, descend the tree to a leaf page and read just the first (or possibly two) tuples. It won't touch the rest of the page yet. If indeed just one tuple was required, there won't be a call to _bt_next and we're done.
If we do need more than one tuple, _bt_next will resume reading tuples from the index page at the point where we left off.

There are a few caveats:
- Possible performance decrease for queries that need a small number of tuples (but more than one), because they now need to lock the same page twice. This can happen in several cases, for example: LIMIT 3; LIMIT 1 but the first tuple returned does not match other scan conditions; LIMIT 1 but the tuple returned is not visible; no LIMIT at all but there are only a few matching rows.
- We need to take into account page splits, insertions and vacuums while we do not hold the read-lock in between _bt_first and the first call to _bt_next. This made my patch quite a bit more complicated than my initial implementation.

I did performance tests for some best-case and worst-case test scenarios. TPS results were stable and reproducible in re-runs on my, otherwise idle, server. Attached are the full results and how to reproduce. I picked test cases that show best performance as well as worst performance compared to master. Summary: the greatest performance improvement can be seen for the cases with the subquery in a nested loop. In a nested loop of 100 times, the performance is roughly two times better; for 1 times the performance is roughly three times better. For most test cases that don't use LIMIT 1, I couldn't find a noticeable difference, except for the nested loop with a LIMIT 3 (or similarly, a nested loop without any LIMIT clause that returns just three tuples). This is also theoretically the worst-case test case, because it has to lock the page again and then read it, just for one tuple. In this case, I saw TPS decrease by 2-3% in a few cases (details in the attached file), due to it having to lock/unlock the same page in both _bt_first and _bt_next.

A few concrete questions to the community:
- Does the community also see this as a useful optimization?
- Is the way it is currently implemented safe?
I struggled quite a bit to get everything working with respect to page splits and insertions. In particular, I don't know why, in my patch, _bt_find_offset_for_tid needs to consider searching for items with an offset *before* the passed offset. As far as my understanding goes, this could only happen when the index gets vacuumed in the meantime. However, we hold a pin on the buffer the whole time (we even assert this), so vacuum should not have been possible. Still, this code gets triggered sometimes, so it seems to be necessary. Perhaps someone in the community who's more of an expert on this can shed some light on it.

- What are considered acceptable performance tradeoffs in this case? Is a performance degradation in any part generally not acceptable at all?

I'd also welcome any feedback on the process - this is my first patch and while I tried to follow the guidelines, I may have missed something along the way.

Attachments:
- out_limit.txt: pgbench resu
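For context, the 'state of k at timepoint t' lookup that the LIMIT 1 query implements is essentially a binary search over the updates sorted by timestamp; a minimal Python model of the semantics (unrelated to the patch internals, just to show why exactly one index tuple is needed per lookup):

```python
import bisect

def state_at(timestamps, values, ts):
    """Latest update with timestamp <= ts (an 'as of' lookup).
    In SQL: ... WHERE k = :val AND ts <= :t ORDER BY k, ts DESC LIMIT 1.
    With a (k, ts DESC) index, the answer is the single tuple the
    descent lands next to -- reading the rest of the leaf page is wasted
    work in the common case."""
    i = bisect.bisect_right(timestamps, ts)
    return values[i - 1] if i > 0 else None

timestamps = [1, 3, 7]          # update times for one key, sorted
values = ['a', 'b', 'c']        # the state written at each update
assert state_at(timestamps, values, 5) == 'b'   # latest update <= 5
assert state_at(timestamps, values, 0) is None  # no update yet at t=0
```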
Re: Index Skip Scan
> For the general forward direction but for a backwards cursor scroll,
> we'd return the lowest value for each distinct prefix, but for the
> general backwards direction (DESC case) we'd return the highest value
> for each distinct prefix. Looking at IndexNext() the cursor direction
> seems to be estate->es_direction and the general scan direction is
> indicated by the plan's indexorderdir. Can't we just pass both of
> those to index_skip() to have it decide what to do? If we also pass in
> indexorderdir then index_skip() should know if it's to return the
> highest or lowest value, right?

Correct, with these two values correct behavior can be deduced. The implementation of this is a bit cumbersome though. Consider a case like:

SELECT DISTINCT ON (a) a,b,c FROM a WHERE c = 2

(with an index on (a,b,c))

Data (imagine every tuple here actually occurs 10.000 times in the index to see the benefit of skipping):
1,1,1
1,1,2
1,2,2
1,2,3
2,2,1
2,2,3
3,1,1
3,1,2
3,2,2
3,2,3

Creating a cursor on this query and then moving forward, you should get (1,1,2), (3,1,2). In the current implementation of the patch, after _bt_first, it skips over (1,1,2) to (2,2,1). It checks quals and moves forward one-by-one until it finds a match. This match only comes at (3,1,2) however. Then it skips to the end.

If you move the cursor backwards from the end of the cursor, you should still get (3,1,2), (1,1,2). A possible implementation would start at the end and do a skip to the beginning of the prefix: (3,1,1). Then it needs to move forward one-by-one to find the first matching (minimum) item (3,1,2). When it finds it, it needs to skip backwards to the beginning of prefix 2: (2,2,1). It needs to move forwards to find the minimum element, but should stop as soon as it detects that the prefix doesn't match anymore (because there is no match for prefix 2, it will move all the way from (2,2,1) to (3,1,1)).
It then needs to skip backwards again to the start of prefix 1: (1,1,1) and scan forward to find (1,1,2). Perhaps anyone can think of an easier way to implement it? I do think being able to use DISTINCT ON is very useful and it's worth the extra complications.

In the future we can add even more useful skipping features to it, for example:

SELECT DISTINCT ON (a) * FROM a WHERE b = 2

After skipping to the next prefix of column a, we can start a new search for (a,b)=(prefix,2) to avoid having to move one-by-one from the start of the prefix to the first matching element. There are many other useful optimizations possible. That won't have to be for this patch though :-)

-Floris
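The backward-cursor scheme described above can be sketched as a toy Python model (the data and the c = 2 qual come from the example; the function is an invented stand-in, not the actual nbtree code):

```python
# Toy model of the backward cursor scheme sketched above: for each prefix
# in descending order, jump to the start of the prefix and scan forward
# within it until the first tuple matching the quals is found.
DATA = [(1, 1, 1), (1, 1, 2), (1, 2, 2), (1, 2, 3),
        (2, 2, 1), (2, 2, 3),
        (3, 1, 1), (3, 1, 2), (3, 2, 2), (3, 2, 3)]  # sorted on (a, b, c)

def distinct_on_a_backward(data, qual):
    out = []
    for prefix in sorted({t[0] for t in data}, reverse=True):
        # "skip backward" to the start of this prefix, then move forward
        # one-by-one; a prefix with no matching tuple yields nothing
        for t in data:
            if t[0] != prefix:
                continue
            if qual(t):
                out.append(t)
                break
    return out

print(distinct_on_a_backward(DATA, lambda t: t[2] == 2))
# -> [(3, 1, 2), (1, 1, 2)]
```

Note how prefix 2 contributes nothing: the model scans (2,2,1) and (2,2,3) without a match, which mirrors the "moves all the way to (3,1,1)" behavior described above.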
Re: Index Skip Scan
> Thanks for testing! Could you provide a test case to show what exactly is the
> problem?

Note that in the case of a regular non-skip scan, this cursor backwards case works because the Unique node on top does not support backwards scanning at all. Therefore, when creating the cursor, the actual plan contains a Materialize node on top of the Unique + Index Scan nodes. The 'fetch backwards' therefore never reaches the index scan, as it just fetches tuples from the materialize node.

-Floris
Re: Index Skip Scan
> Thanks for testing! Could you provide a test case to show what exactly is the
> problem?

create table a (a int, b int, c int);
insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 1) ks);
create index on a (a,b);
analyze a;
set enable_indexskipscan=1; -- setting this to 0 yields different results
set random_page_cost=1;
explain SELECT DISTINCT ON (a) a,b FROM a;

BEGIN;
DECLARE c SCROLL CURSOR FOR SELECT DISTINCT ON (a) a,b FROM a;
FETCH FROM c;
FETCH BACKWARD FROM c;
FETCH 6 FROM c;
FETCH BACKWARD 6 FROM c;
FETCH 6 FROM c;
FETCH BACKWARD 6 FROM c;
END;
Re: Index Skip Scan
> Here is finally a new version of the patch, where all the mentioned issues
> seems to be fixed and the corresponding new tests should keep it like that
> (I've skipped all the pubs at PostgresLondon for that).

Thanks for the new patch! Really appreciate the work you're putting into it. I verified that the backwards index scan is indeed functioning now. However, I'm afraid it's not that simple, as I think the cursor case is broken now. I think having just the 'scan direction' in the btree code is not enough to get this working properly, because we need to know whether we want the minimum or maximum element of a certain prefix. There are basically four cases:

- Forward Index Scan + Forward cursor: we want the minimum element within a prefix and we want to skip 'forward' to the next prefix
- Forward Index Scan + Backward cursor: we want the minimum element within a prefix and we want to skip 'backward' to the previous prefix
- Backward Index Scan + Forward cursor: we want the maximum element within a prefix and we want to skip 'backward' to the previous prefix
- Backward Index Scan + Backward cursor: we want the maximum element within a prefix and we want to skip 'forward' to the next prefix

These cases make it rather complicated unfortunately. They do somewhat tie in with the previous discussion on this thread about being able to skip to the min or max value. If we ever want to support a sort of minmax scan, we'll encounter the same issues.

Also, I think in planner.c, line 4831, we should actually be looking at whether uniq_distinct_pathkeys is NIL, rather than the current check for distinct_pathkeys. That'll make the planner pick the skip scan even with queries like "select distinct on (a) a,b where a=2". Currently, it doesn't pick the skip scan here, because distinct_pathkeys does not contain "a" anymore. This leads to it scanning every item for a=2 even though it can stop after the first one.

I'll do some more tests with the patch.

-Floris
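The four combinations above can be captured in a small lookup. This is only an illustrative sketch (function name and string encoding invented here, not patch code): the element wanted per prefix depends on the plan's scan direction alone, while the physical skip direction depends on whether the two directions agree.

```python
# The four index-direction x cursor-direction cases enumerated above.
def skip_strategy(index_dir, cursor_dir):
    # forward plan -> minimum element per prefix, backward plan -> maximum;
    # the skip moves to the next prefix only when both directions agree
    want = "min" if index_dir == "forward" else "max"
    skip = "next" if index_dir == cursor_dir else "previous"
    return want, skip

assert skip_strategy("forward", "forward") == ("min", "next")
assert skip_strategy("forward", "backward") == ("min", "previous")
assert skip_strategy("backward", "forward") == ("max", "previous")
assert skip_strategy("backward", "backward") == ("max", "next")
```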
Re: Index Skip Scan
> Try the attached patch -- it has DEBUG1 traces with the contents of
> index tuples from key points during index scans, allowing you to see
> what's going on both as a B-Tree is descended, and as a range scan is
> performed. It also shows details of the insertion scankey that is set
> up within _bt_first(). This hasn't been adopted to the patch at all,
> so you'll probably need to do that.

Thanks! That works quite nicely. I've pinpointed the problem to within _bt_skip. I'll try to illustrate with my test case. The data in the table is (a,b)=(1,1), (1,2) ... (1,1), (2, 1), (2,2), ... (2,1) until (5,1). Running the query:

SELECT DISTINCT ON (a) a,b FROM a WHERE b=2;

The flow is like this: _bt_first is called first - it sees there are no suitable scan keys to start at a custom location in the tree, so it just starts from the beginning and searches until it finds the first tuple (1,2). After the first tuple was yielded, _bt_skip kicks in. It constructs an insert scan key with a=1 and nextkey=true, so doing the _bt_search + _bt_binsrch on this, it finds the first tuple larger than this: (2,1). This is not the tuple that it stops at though, because after that it does this:

    if (ScanDirectionIsForward(dir))
        /* Move back for _bt_next */
        offnum = OffsetNumberPrev(offnum);

    /* Now read the data */
    if (!_bt_readpage(scan, dir, offnum))
    {
        /*
         * There's no actually-matching data on this page. Try to advance to
         * the next page. Return false if there's no matching data at all.
         */
        LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
        if (!_bt_steppage(scan, dir))

First, it takes the previous tuple with OffsetNumberPrev (so the tuple before (2,1), which is (1,1)). This is done because, if this tuple were to be returned, there would be a call to _bt_next afterwards, which would then conveniently be on the tuple (2,1) that we want. However, _bt_readpage messes things up, because it only reads tuples that match all the provided keys (so where b=2). The only tuple it'll return is (2,2).
This will be the tuple that is set; however, on the call to _bt_next, the tuple is first incremented, so we'll find (2,3) there, which doesn't match our keys. This leads it to skip (2,2) in our result set.

I was wondering about something else: don't we also have another problem with updating this current index tuple by skipping before calling btgettuple/_bt_next? I see there's some code in btgettuple to kill dead tuples when scan->kill_prior_tuple is true. I'm not too familiar with the concept of killing dead tuples while doing index scans, but by looking at the code it seems to be possible that btgettuple returns a tuple, the caller processes it and sets kill_prior_tuple to true in order to have it killed. However, then the skip scan kicks in, which sets the current tuple to a completely different tuple. Then, on the next call of btgettuple, the wrong tuple gets killed. Is my reasoning correct here or am I missing something?

-Floris
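To make the failure mode concrete, here is a toy Python replay of the flow described above, on a smaller data set (b running 1..3 per prefix). All names are stand-ins for the nbtree routines, not the real code:

```python
# Toy replay of the buggy flow: index tuples sorted on (a, b), qual b == 2.
# After each yielded match, "_bt_skip" jumps to the first tuple of the next
# prefix, steps one position back (OffsetNumberPrev, for _bt_next), but the
# page read keeps only qual-matching tuples, and _bt_next then advances
# *past* the current match -- losing one matching tuple per skip.
DATA = [(a, b) for a in range(1, 6) for b in range(1, 4)]

def buggy_skip_scan(data, qual):
    out = []
    pos = 0
    # _bt_first: scan from the start until the first qual-matching tuple
    while pos < len(data) and not qual(data[pos]):
        pos += 1
    out.append(data[pos])
    while True:
        cur_prefix = data[pos][0]
        # _bt_skip: locate the first tuple of the next prefix
        nxt = next((i for i, t in enumerate(data) if t[0] > cur_prefix), None)
        if nxt is None:
            break
        pos = nxt - 1                 # OffsetNumberPrev
        # _bt_readpage keeps only qual-matching tuples from here on
        matches = [i for i in range(pos, len(data)) if qual(data[i])]
        if len(matches) < 2:
            break
        # matches[0] becomes the "current" tuple but is never yielded;
        # _bt_next advances past it to matches[1]
        pos = matches[1]
        out.append(data[pos])
    return out

print(buggy_skip_scan(DATA, lambda t: t[1] == 2))
# -> [(1, 2), (3, 2), (5, 2)]
```

The model reproduces the reported symptom: every other prefix's match (here (2,2) and (4,2)) is silently dropped.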
Re: Index Skip Scan
> Thanks for testing! You're right, looks like in the current implementation in
> case of backwards scan there is one unnecessary extra step forward. It seems
> this mistake was made, since I was concentrating only on the backward scans
> with a cursor, and used not exactly correct approach to wrap up after a scan
> was finished. Give me a moment, I'll tighten it up.

Thanks. Looking forward to it. I think I found some other strange behavior. Given the same table as in my previous e-mail, the following queries also return inconsistent results. I spent some time trying to debug it, but can't easily pinpoint the cause. It looks like it also skips over one value too much; my guess is during the _bt_skippage call in _bt_skip?

Perhaps a question: when stepping through code in GDB, is there an easy way to pretty print for example the contents of an IndexTuple? I saw there's some tools out there that can pretty print plans, but viewing tuples is more complicated I guess.

-- this one is OK
postgres=# select distinct on (a) a,b from a where b>1;
 a | b
---+---
 1 | 2
 2 | 2
 3 | 2
 4 | 2
 5 | 2
(5 rows)

-- this one is not OK, it seems to skip too much
postgres=# select distinct on (a) a,b from a where b=2;
 a | b
---+---
 1 | 2
 3 | 2
 5 | 2
(3 rows)
Re: Index Skip Scan
The following sql statement seems to have incorrect results - some logic in the backwards scan is currently not entirely right.

-Floris

drop table if exists a;
create table a (a int, b int, c int);
insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 1) ks);
create index on a (a,b);
analyze a;
select distinct on (a) a,b from a order by a desc, b desc;
explain select distinct on (a) a,b from a order by a desc, b desc;

DROP TABLE
CREATE TABLE
INSERT 0 5
CREATE INDEX
ANALYZE
 a | b
---+---
 5 | 1
 5 | 1
 4 | 1
 3 | 1
 2 | 1
 1 | 1
(6 rows)

                            QUERY PLAN
------------------------------------------------------------------------------
 Index Only Scan Backward using a_a_b_idx on a  (cost=0.29..1.45 rows=5 width=8)
   Scan mode: Skip scan
(2 rows)
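For reference, the result the query above should produce can be computed with a trivial Python model of DISTINCT ON (one row per value of a, ordered by a desc, b desc): five rows, not the six returned by the skip scan.

```python
# Expected output of: select distinct on (a) a,b from a order by a desc, b desc
# over the generated data (a in 1..5, b always 1): each prefix exactly once.
rows = [(a, 1) for a in range(1, 6)]
result = []
for a, b in sorted(rows, key=lambda t: (-t[0], -t[1])):
    if not result or result[-1][0] != a:   # keep the first row of each prefix
        result.append((a, b))
print(result)  # -> [(5, 1), (4, 1), (3, 1), (2, 1), (1, 1)]
```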
Re: Index Skip Scan
> To address this, probably we can do something like in the attached patch.
> Altogether with distinct_pathkeys uniq_distinct_pathkeys are stored, which is
> the same, but without the constants elimination. It's being used then for
> getting the real number of distinct keys, and to check the order of the columns
> to not consider index skip scan if it's different. Hope it doesn't look too hacky.

Thanks! I've verified that it works now. I was wondering if we're not too strict in some cases now though. Consider the following queries:

postgres=# explain(analyze) select distinct on (m,f) m,f from t where m='M2';
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Index Only Scan using t_m_f_t_idx on t  (cost=0.29..11.60 rows=40 width=5) (actual time=0.056..0.469 rows=10 loops=1)
   Scan mode: Skip scan
   Index Cond: (m = 'M2'::text)
   Heap Fetches: 10
 Planning Time: 0.095 ms
 Execution Time: 0.490 ms
(6 rows)

postgres=# explain(analyze) select distinct on (f) m,f from t where m='M2';
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=0.29..849.83 rows=10 width=5) (actual time=0.088..10.920 rows=10 loops=1)
   ->  Index Only Scan using t_m_f_t_idx on t  (cost=0.29..824.70 rows=10052 width=5) (actual time=0.087..8.524 rows=1 loops=1)
         Index Cond: (m = 'M2'::text)
         Heap Fetches: 1
 Planning Time: 0.078 ms
 Execution Time: 10.944 ms
(6 rows)

This is basically the opposite case - when distinct_pathkeys matches the filtered list of index keys, an index skip scan could be considered. Currently, the user needs to write 'distinct m,f' explicitly, even though he specifies in the WHERE clause that 'm' can only have one value anyway. Perhaps it's fine like this, but it could be a small improvement for consistency.

-Floris
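The consistency argument can be illustrated with toy data (invented here, not the actual t table): once m is pinned to a single value by the qual, DISTINCT ON (f) and DISTINCT ON (m,f) necessarily select the same number of rows, which is why a skip scan over the (m,f) index prefix would be valid for both.

```python
# With m fixed by the WHERE clause, DISTINCT ON (f) and DISTINCT ON (m, f)
# have the same cardinality over the filtered rows.
rows = [("M1", "a"), ("M2", "a"), ("M2", "a"), ("M2", "b"), ("M3", "b")]
filtered = [r for r in rows if r[0] == "M2"]

distinct_f = {f for _, f in filtered}
distinct_mf = {(m, f) for m, f in filtered}
print(len(distinct_f), len(distinct_mf))  # -> 2 2
```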
Re: Index Skip Scan
Hi,

Thanks for the helpful replies.

> Yes, good catch, I'll investigate. Looks like in the current implementation
> something is not quite right, when we have this order of columns in an index
> and where clause (e.g. in the examples above everything seems fine if we create
> index over (feedcode, market) and not the other way around).

I did a little bit of investigation and it seems to occur because in pathkeys.c the function pathkey_is_redundant considers pathkeys redundant if there is an equality condition with a constant in the corresponding WHERE clause.

 * 1. If the new pathkey's equivalence class contains a constant, and isn't
 * below an outer join, then we can disregard it as a sort key.  An example:
 *         SELECT ... WHERE x = 42 ORDER BY x, y;

In planner.c it builds the list of distinct_pathkeys, which is then used for the index skip scan to skip over the first length(distinct_pathkeys) columns when it does a skip. In my query, the distinct_pathkeys list only contains 'feedcode' and not 'market', because 'market' was considered redundant due to the WHERE clause. However, the index skip scan interprets this as that it has to skip over just the first column.

We need to get this number of prefix columns to skip differently while building the plan. We need the 'real' number of distinct keys without throwing away the redundant ones. However, I'm not sure if this information can still be obtained while calling create_skipscan_unique_path? But I'm sure people here will have much better ideas than me about this :-)

-Floris
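A toy model of the failure (invented data and helper, the real skipping of course happens inside nbtree): treating the skip prefix as length(distinct_pathkeys) = 1 skips whole market groups, while the correct prefix length of 2 skips (market, feedcode) groups.

```python
# Index on (market, feedcode); after each visited tuple, skip to the first
# tuple whose leading n_skip columns differ.  n_skip=1 (redundant pathkey
# dropped) loses all TEST feedcodes but the first.
INDEX = sorted([("A", "f1"), ("TEST", "f1"), ("TEST", "f2"),
                ("TEST", "f3"), ("Z", "f1")])

def skip_scan(index, qual, n_skip):
    out, pos = [], 0
    while pos < len(index):
        if qual(index[pos]):
            out.append(index[pos])
        prefix = index[pos][:n_skip]
        # skip to the first tuple with a different n_skip-column prefix
        while pos < len(index) and index[pos][:n_skip] == prefix:
            pos += 1
    return out

qual = lambda t: t[0] == "TEST"
print(skip_scan(INDEX, qual, n_skip=2))  # correct: all three TEST feedcodes
print(skip_scan(INDEX, qual, n_skip=1))  # buggy: only ("TEST", "f1")
```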
Re: Index Skip Scan
Actually I'd like to add something to this. I think I've found a bug in the current implementation. Would someone be able to check?

Given a table definition of (market text, feedcode text, updated_at timestamptz, value float8) and an index on (market, feedcode, updated_at desc) (note that this table slightly deviates from what I described in my previous mail), filled with data.

The following query uses an index skip scan and returns just 1 row (incorrect!):

select distinct on (market, feedcode) market, feedcode from streams.base_price where market='TEST'

The following query still uses the regular index scan and returns many more rows (correct):

select distinct on (market, feedcode) * from streams.base_price where market='TEST'

It seems that partially filtering on one of the distinct columns triggers incorrect behavior where too many rows in the index are skipped.

-Floris
Re: Index Skip Scan
After some talks with Jesper at PGCon about the Index Skip Scan, I started testing this patch, because it seems to have great potential in speeding up many of our queries (great conference by the way, really enjoyed my first time being there!). I haven't looked in depth at the code itself, but I focused on some testing with real data that we have.

Let me start by sketching our primary use case for this, as it is similar, but slightly different than what was discussed earlier in this thread. I think this use case is something a lot of people who handle timeseries data have. Our database has many tables with timeseries data. We don't update rows, but just insert new rows each time. One example of this would be a table with prices for instruments. Instruments are identified by a column called feedcode. Prices of instruments update with a certain frequency. Each time it updates, we insert a new row with the new value and the timestamp at that time. So in the simplest form, you could see it as a table like this:

create table prices (feedcode text, updated_at timestamptz, value float8);
-- there'll be some other columns as well, this is just an example
create unique index on prices (feedcode, updated_at desc);

This table perfectly fits the criteria for the index skip scan as there are relatively few distinct feedcodes, but each of them has quite a lot of price inserts (imagine 1000 distinct feedcodes, each of them having one price per second). We normally partition our data by a certain time interval, so let's say we're just looking at one day of prices here. We have other cases with higher update frequencies and/or more distinct values though.

Typical queries on this table involve querying the price at a certain point in time, or simply querying the latest update.
If we know the feedcode, this is easy:

select * from prices
where feedcode='A' and updated_at <= '2019-06-01 12:00'
order by feedcode, updated_at desc limit 1

Unfortunately, this gets hard if you want to know the price of everything at a certain point in time. The query then becomes:

select distinct on (feedcode) * from prices
where updated_at <= '2019-06-01 12:00'
order by feedcode, updated_at desc

Up until now (even with this patch) this uses a regular index scan + a unique node which scans the full index, which is terribly slow and is also not constant - as the table grows it becomes slower and slower. Obviously there are currently already ways to speed this up using the recursive loose index scan, but I think everybody agrees that those are pretty unreadable. However, since they're also several orders of magnitude faster, we actually use them everywhere. Eg.

-- certain point in time
-- first query *all* distinct feedcodes (disregarding time), then do an index scan
-- for every feedcode found to see if it has an update in the time window we're
-- interested in -- this essentially means we're doing 2*N index scans
with recursive t as (
   select feedcode from prices order by feedcode, updated_at desc limit 1
   union all
   select n.feedcode from t
   cross join lateral (select feedcode from prices where feedcode > t.feedcode
                       order by feedcode, updated_at desc limit 1) n
)
select n.* from t
cross join lateral (select * from prices
                    where feedcode=t.feedcode and updated_at <= '2019-06-01 12:00'
                    order by feedcode, updated_at desc limit 1) n

-- just latest value
-- if we're interested in just the latest value, it can actually be optimized to
-- just N index scans like this.
-- to make it even more confusing - there's a tradeoff here.. if you're querying
-- a timestamp close to the latest available timestamp, it is often faster to use
-- this method anyway and just put the filter for updated_at inside this query.
This avoids the overhead of 2*N index scans, at the expense that the LIMIT 1 may have to scan several tuples before finding one that matches the timestamp criteria. With the 2*N method above we're guaranteed that the first tuple it sees is the correct tuple, but we're doing many more scans...

with recursive t as (
   select * from prices order by feedcode, updated_at desc limit 1
   union all
   select n.* from t
   cross join lateral (select * from prices where feedcode > t.feedcode
                       order by feedcode, updated_at desc limit 1) _
)
select * from t

I hope this makes our current situation clear. Please do ask if I need to elaborate on something here.

So what changes with this patch? The great thing is that the recursive CTE is not required anymore! This is a big win for readability and it helps performance as well. It makes everything much better. I am really happy with these results. If the index skip scan triggers, it is easily over 100x faster than the naive 'distinct on' query in earlier versions of Postgres. It is also quite a bit faster than the recursive CTE version of the query. I have a few remarks though. I tested some of our queries with the patch and found that the following
Re: partitioning performance tests after recent patches
Hi David,

Thanks for your reply. I really appreciate your work on run-time pruning! Here's the output of explain/analyze for HEAD. At run-time, technically all partitions could be pruned directly. However, one partition remains in the output of explain/analyze because of other difficulties with removing all of them, if I remember correctly? Still, that partition is never executed. The only difference I can see is the Limit node on top, as well as apparently another partition appearing in the analyze output (p4096_4096, the last partition, remains in the first plan; p4096_1, the first partition, remains in the second plan).

-- select_now.sql
explain(analyze, verbose, buffers on) select * from :tbl where a='abc' and updated_at between now() and now()+interval '1d';

 Append  (cost=0.16..8949.61 rows=4096 width=112) (actual time=0.000..0.000 rows=0 loops=1)
   Subplans Removed: 4095
   ->  Index Scan using p4096_4096_a_updated_at_idx on public.p4096_4096  (cost=0.16..2.18 rows=1 width=112) (never executed)
         Output: p4096_4096.a, p4096_4096.b, p4096_4096.c, p4096_4096.d, p4096_4096.updated_at
         Index Cond: ((p4096_4096.a = 'abc'::text) AND (p4096_4096.updated_at >= now()) AND (p4096_4096.updated_at <= (now() + '1 day'::interval)))
 Planning Time: 237.603 ms
 Execution Time: 0.475 ms

-- select_now_limit.sql
explain(analyze, verbose, buffers on) select * from :tbl where a='abc' and updated_at between now() and now()+interval '1d' order by a, updated_at desc limit 1;

 Limit  (cost=645.53..647.56 rows=1 width=112) (actual time=0.002..0.002 rows=0 loops=1)
   Output: p4096_1.a, p4096_1.b, p4096_1.c, p4096_1.d, p4096_1.updated_at
   ->  Append  (cost=645.53..8949.61 rows=4096 width=112) (actual time=0.000..0.000 rows=0 loops=1)
         Subplans Removed: 4095
         ->  Index Scan using p4096_1_a_updated_at_idx on public.p4096_1  (cost=0.57..2.03 rows=1 width=54) (never executed)
               Output: p4096_1.a, p4096_1.b, p4096_1.c, p4096_1.d, p4096_1.updated_at
               Index Cond: ((p4096_1.a = 'abc'::text) AND (p4096_1.updated_at >= now()) AND
(p4096_1.updated_at <= (now() + '1 day'::interval)))
 Planning Time: 3897.687 ms
 Execution Time: 0.491 ms

Regarding the nested loops - thanks for your explanation. I can see this is more complicated than I initially thought. It may be doable to determine if your set of pruned partitions is still valid, but it's more difficult to determine if, on top of that, extra partitions must be included due to widening of the range.

-Floris

________
From: David Rowley
Sent: Monday, April 15, 2019 1:25 AM
To: Floris Van Nee
Cc: Pg Hackers
Subject: Re: partitioning performance tests after recent patches

[External]

On Mon, 15 Apr 2019 at 07:19, Floris Van Nee wrote:
> 3) What could be causing the big performance difference between case 7
> (simple SELECT) and 8 (simple SELECT with ORDER BY LIMIT 1)? For 4096
> partitions, TPS of 7) is around 5, while adding the ORDER BY LIMIT 1
> makes TPS drop well below 1. In theory, run-time pruning of the right chunk
> should take exactly the same amount of time in both cases, because both are
> pruning timestamp now() on the same number of partitions. The resulting plans
> are also identical with the exception of the top LIMIT-node (in PG11 they
> differ slightly as a MergeAppend is chosen for the ORDER BY instead of an
> Append, in HEAD with ordered append this is not necessary anymore). Am I
> missing something here?

With the information provided, I don't really see any reason why the ORDER BY LIMIT would slow it down if the plan is the same apart from the LIMIT node. Please share the EXPLAIN ANALYZE output of each.

> 4) A more general question about run-time pruning in nested loops, like the
> one for case 14. I believe I read in one of the previous threads that
> run-time pruning only reoccurs if it determines that the value that
> determines which partitions must be excluded has changed in between
> iterations. How is this defined? Eg.
> let's say partitions are 1-day wide and
> the first iteration of the loop filters on the partitioned table for
> timestamp between 14-04-2019 12:00 and 14-04-2019 20:00 (dynamically
> determined). Then the second iteration comes along and now filters on values
> between 14-04-2019 12:00 and 14-04-2019 19:00. The partition that should be
> scanned hasn't changed, because both timestamps fall into the same partition.
> Is the full process of run-time pruning applied again, or is there some kind
> of shortcut that first checks if the previous pruning result is still valid
> even if the value has changed slightly? If not, would this be a possible
> optimization, as I think it's a case that occurs very often?

I don't k
Re: speeding up planning with partitions
Thanks for the details! Indeed the versions with now()/current_date use the runtime pruning rather than planning time. I wasn't aware of the use of 'today' though - that could be useful in case we're sure statements won't be prepared. Previously (v10/partly v11) it was necessary to make sure that statements on partitioned tables were never prepared, because run-time pruning wasn't available - using a generic plan was almost always a bad option. Now in v12 it seems to be a tradeoff between whether or not run-time pruning can occur. If pruning is possible at planning time it's probably still better not to prepare statements, whereas if run-time pruning has to occur, it's better to prepare them.

One unrelated thing I noticed (but I'm not sure if it's worth a separate email thread) is that the changed default of jit=on in v12 doesn't work very well with a large number of partitions, of which a large number get excluded at run-time. A query that has an estimated cost above jit_optimize_above_cost takes about 30 seconds to run (for a table with 1000 partitions), because JIT is optimizing the full plan. Without JIT it's barely 20ms (+400ms planning). I can give more details in a separate thread if it's deemed interesting.

 Planning Time: 411.321 ms
 JIT:
   Functions: 5005
   Options: Inlining false, Optimization true, Expressions true, Deforming true
   Timing: Generation 721.472 ms, Inlining 0.000 ms, Optimization 16312.195 ms, Emission 12533.611 ms, Total 29567.278 ms

-Floris
Re: speeding up planning with partitions
Hi all,

First of all I would like to thank everyone involved in this patch for their hard work on this. This is a big step forward. I've done some performance and functionality testing with the patch that was committed to master and it looks very good.

I had a question about the performance of pruning of functions like now() and current_date. I know these are handled differently, as they cannot be excluded during the first phases of planning. However, currently, this new patch makes the performance difference between the static timestamp variant and now() very obvious (even more than before). Consider

select * from partitioned_table where ts >= now()

or

select * from partitioned_table where ts >= '2019-04-04'

The second plans in less than a millisecond, whereas the first takes +- 180ms for a table with 1000 partitions. Both end up with the same plan.

I'm not too familiar with the code that handles this, but is there a possibility for improvement in this area? Or is the stage at which exclusion for now()/current_date occurs already too far into the process to make any good improvements to this? My apologies if this is considered off-topic for this patch, but I ran into this issue specifically when I was testing this patch, so I thought I'd ask here about it. I do think a large number of use-cases for tables with a large number of partitions involve a timestamp for partition key, and naturally people will start writing queries for this that use functions such as now() and current_date.

Thanks again for your work on this patch!

-Floris

From: Amit Langote
Sent: Tuesday, April 2, 2019 7:50 AM
To: Tom Lane
Cc: David Rowley; Imai Yoshikazu; jesper.peder...@redhat.com; Imai, Yoshikazu; Amit Langote; Alvaro Herrera; Robert Haas; Justin Pryzby; Pg Hackers
Subject: Re: speeding up planning with partitions

[External]

Thanks for taking a look.
On 2019/04/02 2:34, Tom Lane wrote:
> Amit Langote writes:
>> On 2019/03/30 0:29, Tom Lane wrote:
>>> That seems like probably an independent patch --- do you want to write it?
>
>> Here is that patch.
>> It revises get_relation_constraints() such that the partition constraint
>> is loaded in only the intended cases.
>
> So I see the problem you're trying to solve here, but I don't like this
> patch a bit, because it depends on root->inhTargetKind which IMO is a
> broken bit of junk that we need to get rid of. Here is an example of
> why, with this patch applied:
>
> regression=# create table p (a int) partition by list (a);
> CREATE TABLE
> regression=# create table p1 partition of p for values in (1);
> CREATE TABLE
> regression=# set constraint_exclusion to on;
> SET
> regression=# explain select * from p1 where a = 2;
>                 QUERY PLAN
> ------------------------------------------
>  Result  (cost=0.00..0.00 rows=0 width=0)
>    One-Time Filter: false
> (2 rows)
>
> So far so good, but watch what happens when we include the same case
> in an UPDATE on some other partitioned table:
>
> regression=# create table prtab (a int, b int) partition by list (a);
> CREATE TABLE
> regression=# create table prtab2 partition of prtab for values in (2);
> CREATE TABLE
> regression=# explain update prtab set b=b+1 from p1 where prtab.a=p1.a and p1.a=2;
>                              QUERY PLAN
> ---------------------------------------------------------------------
>  Update on prtab  (cost=0.00..82.30 rows=143 width=20)
>    Update on prtab2
>    ->  Nested Loop  (cost=0.00..82.30 rows=143 width=20)
>          ->  Seq Scan on p1  (cost=0.00..41.88 rows=13 width=10)
>                Filter: (a = 2)
>          ->  Materialize  (cost=0.00..38.30 rows=11 width=14)
>                ->  Seq Scan on prtab2  (cost=0.00..38.25 rows=11 width=14)
>                      Filter: (a = 2)
> (8 rows)
>
> No constraint exclusion, while in v10 you get
>
>  Update on prtab  (cost=0.00..0.00 rows=0 width=0)
>    ->  Result  (cost=0.00..0.00 rows=0 width=0)
>          One-Time Filter: false
>
> The reason is that this logic supposes that root->inhTargetKind describes
> *all* partitioned tables in the query, which is obviously wrong.
>
> Now maybe we could make it work by doing something like
>
> if (rel->reloptkind == RELOPT_BASEREL &&
>     (root->inhTargetKind == INHKIND_NONE ||
>      rel->relid != root->parse->resultRelation))

Ah, you're right. inhTargetKind has to be checked in conjunction with checking whether the relation is the target relation.

> but I find that pretty messy, plus it's violating the concept that we
> shouldn't be allowing messiness from inheritance_planner to leak into
> other places.

I'm afraid that we'll have to live with this particular hack as long as we have inheritance_planner(), but we maybe could somewhat reduce the extent to which the hack is spread into other planner files. How about we move the part of get_relation_constraints() that loads the partition constrai