Re: A comment fix
On Mon, May 11, 2020 at 02:22:36PM +0900, Michael Paquier wrote: > Looks right to me, so will fix if there are no objections. > read_local_xlog_page() uses the replay location when in recovery. Done this part now. -- Michael signature.asc Description: PGP signature
Re: refactoring basebackup.c
Hi Robert, Please see my comments inline below. On Tue, May 12, 2020 at 12:33 AM Robert Haas wrote: > Yeah, that needs to be tested. Right now the chunk size is 32kB but it > might be a good idea to go larger. Another thing is that right now the > chunk size is tied to the protocol message size, and I'm not sure > whether the size that's optimal for disk reads is also optimal for > protocol messages.

A balance is needed in the number of packets sent across the network, so if the disk read size and the network packet size could be unified, it might provide a better optimization.

> > I don't think it's a particularly bad thing that we include a small > amount of progress for sending an empty file, a directory, or a > symlink. That could make the results more meaningful if you have a > database with lots of empty relations in it. However, I agree that the > effect of compression shouldn't be included. To get there, I think we > need to redesign the wire protocol. Right now, the server has no way > of letting the client know how many uncompressed bytes it's sent, and > the client has no way of figuring it out without uncompressing, which > seems like something we want to avoid. > >

I agree here too, except that if we have too many tar files one might cringe, but sending the extra amount from these tar files looks okay to me.

> There are some other problems with the current wire protocol, too: > > 1. The syntax for the BASE_BACKUP command is large and unwieldy. We > really ought to adopt an extensible options syntax, like COPY, VACUUM, > EXPLAIN, etc. do, rather than using a zillion ad-hoc bolt-ons, each > with bespoke lexer and parser support. > > 2. The client is sent a list of tablespaces and is supposed to use > that to expect an equal number of archives, computing the name for > each one on the client side from the tablespace info.
However, I think > we should be able to support modes like "put all the tablespaces in a > single archive" or "send a separate archive for every 256GB" or "ship > it all to the cloud and don't send me any archives". To get there, I > think we should have the server send the archive name to the clients, > and the client should just keep receiving the next archive until it's > told that there are no more. Then if there's one archive or ten > archives or no archives, the client doesn't have to care. It just > receives what the server sends until it hears that there are no more. > It also doesn't know how the server is naming the archives; the server > can, for example, adjust the archive names based on which compression > format is being chosen, without knowledge of the server's naming > conventions needing to exist on the client side. > >

One thing to remember here is that a trade-off would need to be made between the number of options we provide and people coming back to report which combinations do not work. For example, if a user script has "put all the tablespaces in a single archive" and later on somebody makes some script changes to break it down at "256 GB" and there is a conflict, then which one takes precedence needs to be chosen. When the number of options like this becomes very large, this could lead to some complications.

> I think we should keep support for the current BASE_BACKUP command but > either add a new variant using extensible options, or else invent a > whole new command with a different name (BACKUP, SEND_BACKUP, > whatever) that takes extensible options. This command should send back > all the archives and the backup manifest using a single COPY stream > rather than multiple COPY streams. Within the COPY stream, we'll > invent a sub-protocol, e.g. based on the first letter of the message, > e.g.: > > t = Tablespace boundary. No further message payload.
Indicates, for > progress reporting purposes, that we are advancing to the next > tablespace. > f = Filename. The remainder of the message payload is the name of the > next file that will be transferred. > d = Data. The next four bytes contain the number of uncompressed bytes > covered by this message, for progress reporting purposes. The rest of > the message is payload, possibly compressed. Could be empty, if the > data is being shipped elsewhere, and these messages are only being > sent to update the client's notion of progress. >

I support this.

> I thought about that a bit, too. There might be some way to unify that > by having some common context object that's defined by basebackup.c > and all archivers get it, so that they have some commonly-desired > details without needing bespoke code, but I'm not sure at this point > whether that will actually produce a nicer result. Even if we don't > have it initially, it seems like it wouldn't be very hard to add it > later, so I'm not too stressed about it.

-- Sumanta Mukherjee EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
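To make the proposed sub-protocol concrete, here is a stand-alone sketch of a client-side parser for the 't'/'f'/'d' messages described above. The exact framing is my assumption, not something the thread pins down: I NUL-terminate file names and, since the sketch carries uncompressed payloads, let the wire payload length equal the 4-byte uncompressed count (a compressed stream would need its own length field).

```python
import struct

def process_messages(buf):
    """Parse a stream of the proposed COPY sub-protocol messages:
    't' = tablespace boundary (no payload),
    'f' = file name (NUL-terminated here -- an assumption),
    'd' = 4-byte big-endian uncompressed byte count, then payload.
    Returns the parsed events and the running progress counter."""
    pos = 0
    progress = 0
    events = []
    while pos < len(buf):
        msg_type = buf[pos:pos + 1]
        pos += 1
        if msg_type == b't':
            events.append(('tablespace',))
        elif msg_type == b'f':
            end = buf.index(b'\x00', pos)
            events.append(('file', buf[pos:end].decode()))
            pos = end + 1
        elif msg_type == b'd':
            (n,) = struct.unpack_from('>I', buf, pos)
            pos += 4
            # In this sketch the payload is uncompressed, so its wire
            # length equals n; with compression the message would have
            # to carry the compressed length separately.
            pos += n
            progress += n
            events.append(('data', n))
        else:
            raise ValueError('unknown message type: %r' % msg_type)
    return events, progress
```

Note how the client needs no knowledge of archive names or counts: it just consumes messages until the stream ends, which is the point of the redesign.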
Re: Parallel copy
On Mon, May 11, 2020 at 11:52 PM Robert Haas wrote: > > > Case-3: > > In copy command, for performing foreign key checks, we take KEY SHARE > > lock on primary key table rows which in turn will increment the command > > counter and update the snapshot. Now, as we share the snapshots at > > the beginning of the command, we can't allow it to be changed later. > > So, unless we do something special for it, I think we can't allow > > parallelism in such cases. > > This sounds like much more of a problem to me; it'd be a significant > restriction that would kick in routine cases where the user isn't > doing anything particularly exciting. The command counter presumably > only needs to be updated once per command, so maybe we could do that > before we start parallelism. However, I think we would need to have > some kind of dynamic memory structure to which new combo CIDs can be > added by any member of the group, and then discovered by other members > of the group later. At the end of the parallel operation, the leader > must discover any combo CIDs added by others to that table before > destroying it, even if it has no immediate use for the information. We > can't allow a situation where the group members have inconsistent > notions of which combo CIDs exist or what their mappings are, and if > KEY SHARE locks are being taken, new combo CIDs could be created. >

AFAIU, we don't generate combo CIDs for this case. See below code in heap_lock_tuple():

    /*
     * Store transaction information of xact locking the tuple.
     *
     * Note: Cmax is meaningless in this context, so don't set it; this avoids
     * possibly generating a useless combo CID. Moreover, if we're locking a
     * previously updated tuple, it's important to preserve the Cmax.
     *
     * Also reset the HOT UPDATE bit, but only if there's no update; otherwise
     * we would break the HOT chain.
     */
    tuple->t_data->t_infomask &= ~HEAP_XMAX_BITS;
    tuple->t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
    tuple->t_data->t_infomask |= new_infomask;
    tuple->t_data->t_infomask2 |= new_infomask2;

I don't understand why we need to do something special for combo CIDs if they are not generated during this operation. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [PATCH] hs_standby_disallowed test fix
On 2020/05/12 12:05, Fujii Masao wrote: On 2020/05/12 8:03, Michail Nikolaev wrote: Hello. There is a recent commit about changes in the way read-only commands are prevented from being executed [1]. It seems like the hs_standby_disallowed test is broken now. So, a simple patch to fix the test is attached.

Thanks for the report and patch! LGTM. Pushed. Thanks!

Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Is it useful to record whether plans are generic or custom?
Hi, When we run prepared statements, as far as I know, running EXPLAIN is the only way to know whether the executed plans are generic or custom. There seems to be no way to know how many times a prepared statement was executed as generic and custom. I think it may be useful to record the number of generic and custom plans, mainly to ensure that prepared statements are executed with the expected plan type. If people also feel it's useful, I'm going to think about adding columns such as 'generic plans' and 'custom plans' to pg_stat_statements. As you know, pg_stat_statements can now track not only 'calls' but also 'plans', so we can infer which plan type was executed from them. When both 'calls' and 'plans' were incremented, the plan type would be custom. When only 'calls' was incremented, it would be generic. But considering cases such as when only the plan phase has succeeded and the execution phase has failed, this inference can be wrong. Thoughts? Regards, -- Atsushi Torikoshi
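The inference described in the mail above can be written as a small decision rule. This is only an illustration of the heuristic (and of its failure mode), not code from pg_stat_statements; the function name and the 'unknown'/'idle' labels are made up for the sketch.

```python
def infer_plan_type(calls_delta, plans_delta):
    """Guess a prepared statement's plan type from the change in
    pg_stat_statements' 'calls' and 'plans' counters between two
    samples.  The 'unknown' branch is the caveat from the mail: when
    planning succeeds but execution fails, 'plans' advances while
    'calls' does not, so the heuristic loses track."""
    if calls_delta > 0 and plans_delta > 0:
        return 'custom'    # planned and executed in the same interval
    if calls_delta > 0:
        return 'generic'   # executed without fresh planning
    if plans_delta > 0:
        return 'unknown'   # planned but not executed (e.g. execution error)
    return 'idle'
```

Dedicated 'generic plans' / 'custom plans' counters would make this guesswork unnecessary, which is the point of the proposal.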
Re: Parallel copy
On Mon, May 11, 2020 at 11:52 PM Robert Haas wrote: > > I wonder why you're still looking at this instead of looking at just > speeding up the current code, especially the line splitting, >

Because the line splitting is just 1-2% of the overall work in common cases. See the data shared by Vignesh for various workloads [1]. The time it takes is in the range of 0.5-12% approximately, and for cases like a table with a few indexes, it is not more than 1-2%.

[1] - https://www.postgresql.org/message-id/CALDaNm3r8cPsk0Vo_-6AXipTrVwd0o9U2S0nCmRdku1Dn-Tpqg%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PG 13 release notes, first draft
At Mon, 11 May 2020 20:12:04 -0400, Bruce Momjian wrote in > On Thu, May 7, 2020 at 09:22:02PM -0700, Noah Misch wrote: > > On Thu, May 07, 2020 at 09:38:34AM -0400, Bruce Momjian wrote: > > > > > > - Crash recovery was losing tuples written via COPY TO. This fixes > > > > > > the bug. > > > > > > > > > > This was not backpatched? > > > > > > > > Right. > > > > > > Oh. So you are saying we could lose COPY data on a crash, even after a > > > commit. That seems bad. Can you show me the commit info? I can't find > > > it. > > > > commit c6b9204 > > Author: Noah Misch > > AuthorDate: Sat Apr 4 12:25:34 2020 -0700 > > Commit: Noah Misch > > CommitDate: Sat Apr 4 12:25:34 2020 -0700 > > > > Skip WAL for new relfilenodes, under wal_level=minimal. > > > > Until now, only selected bulk operations (e.g. COPY) did this. If a > > given relfilenode received both a WAL-skipping COPY and a WAL-logged > > operation (e.g. INSERT), recovery could lose tuples from the COPY. See > > src/backend/access/transam/README section "Skipping WAL for New > > RelFileNode" for the new coding rules. Maintainers of table access > > methods should examine that section. > > OK, so how do we want to document this? Do I mention in the text below > the WAL skipping item that this fixes a bug where a mix of simultaneous > COPY and INSERT into a table could lose rows during crash recovery, or > create a new item?

FWIW, as discussed upthread, I suppose that the API change is not going to be in the relnotes. Something like this?

- Fix bug in WAL-skipping optimization

Previously, a transaction doing both COPY and WAL-logged operations like INSERT while wal_level=minimal could lead to loss of COPYed rows through crash recovery. Also, this fix extends the WAL-skipping optimization to all kinds of bulk insert operations.

regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Two fsync related performance issues?
On 2020/05/12 9:42, Paul Guo wrote: Hello hackers, 1. StartupXLOG() does fsync on the whole data directory early in the crash recovery. I'm wondering if we could skip some directories (at least the pg_log/, table directories) since WAL, etc. could ensure consistency.

I agree that we can skip the log directory, but I'm not sure if skipping the table directories is really safe. Also ISTM that we can skip the directories whose contents are removed or zeroed during recovery, for example, pg_snapshots, pg_subtrans, etc.

Here is the related code:

    if (ControlFile->state != DB_SHUTDOWNED &&
        ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
    {
        RemoveTempXlogFiles();
        SyncDataDirectory();
    }

I have this concern since I saw an issue in a real product environment where the startup process needed 10+ seconds to start WAL replay after a relaunch due to elog(PANIC) (it was seen on the Postgres-based product Greenplum, but it is a common issue in Postgres also). I highly suspect the delay was mostly due to this. It has also been noticed that on public clouds fsync is much slower than on local storage, so the slowness should be more severe in the cloud. If we at least disable fsync on the table directories we could skip a lot of file fsyncs - this may save a lot of seconds during crash recovery.

2. CheckPointTwoPhase() This may be a small issue. See the code below:

    for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
        RecreateTwoPhaseFile(gxact->xid, buf, len);

RecreateTwoPhaseFile() writes a state file for a prepared transaction and does fsync. It might be good to do the fsync for all files once after writing them, given that the kernel is able to do asynchronous flushing while those file contents are being written. If TwoPhaseState->numPrepXacts is large we could do batching to avoid the fd resource limit. I did not test them yet, but this should be able to speed up checkpoint/restartpoint a bit. Any thoughts?

It seems worth making the patch and measuring the performance improvement.
Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
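The batching idea for CheckPointTwoPhase() boils down to a two-pass structure: write every state file first, then fsync them all. A minimal stand-alone sketch (the function name and the (name, data) entry format are invented for illustration; real code would also cap the number of concurrently open fds, as the mail notes):

```python
import os

def write_state_files_batched(entries, dirpath):
    """Write every two-phase state file first, then fsync them in a
    second pass.  Separating the writes from the fsyncs lets the kernel
    start writeback asynchronously while later files are still being
    written, instead of stalling on one fsync per file."""
    fds = []
    try:
        for name, data in entries:
            fd = os.open(os.path.join(dirpath, name),
                         os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o600)
            os.write(fd, data)
            fds.append(fd)          # defer the fsync
        for fd in fds:              # second pass: flush everything
            os.fsync(fd)
    finally:
        for fd in fds:
            os.close(fd)
```

Whether this wins in practice depends on how much writeback the kernel actually overlaps, which is exactly what the proposed measurement would show.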
Re: PG 13 release notes, first draft
Bruce Momjian writes: > On Mon, May 11, 2020 at 11:08:35PM -0400, Chapman Flack wrote: >> 'specify' ? > I like that word if Tom prefers it. 'specify' works for me. regards, tom lane
Re: PG 13 release notes, first draft
On Mon, May 11, 2020 at 11:08:35PM -0400, Chapman Flack wrote: > On 05/11/20 22:48, Bruce Momjian wrote: > > On Mon, May 11, 2020 at 10:07:23PM -0400, Tom Lane wrote: > >> Bruce Momjian writes: > >>> Allow Unicode escapes, e.g., E'\u', U&'\', to represent any > >>> character available in the database encoding, even when the database > >>> encoding is not UTF-8 (Tom Lane) > >> > >> How about "to be used for" instead of "to represent"? "Represent" kind > >> of sounds like we're using these on output, which we aren't. > > > > Uh, I think "used for" is worse though, since we are not using it. I > > don't get the "output" feel of the word at all. > > 'specify' ? I like that word if Tom prefers it. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: PG 13 release notes, first draft
On 05/11/20 22:48, Bruce Momjian wrote: > On Mon, May 11, 2020 at 10:07:23PM -0400, Tom Lane wrote: >> Bruce Momjian writes: >>> Allow Unicode escapes, e.g., E'\u', U&'\', to represent any >>> character available in the database encoding, even when the database >>> encoding is not UTF-8 (Tom Lane) >> >> How about "to be used for" instead of "to represent"? "Represent" kind >> of sounds like we're using these on output, which we aren't. > > Uh, I think "used for" is worse though, since we are not using it. I > don't get the "output" feel of the word at all. 'specify' ? -Chap
Re: PG 13 release notes, first draft
On Tue, May 12, 2020 at 6:11 AM Justin Pryzby wrote: > > On Mon, May 11, 2020 at 10:52:41AM +0530, Amit Kapila wrote: > > > 1. We have allowed an (auto)vacuum to display additional information > > > about heap or index in case of an error in commit b61d161c14 [1]. > > > Now, in general, it might not be worth saying much about error > > > information but I think this one could help users in case they have > > > some corruption. For example, if one of the indexes on a relation has > > > some corrupted data (due to bad hardware or some bug), it will let the > > > user know the index information, and the user can take appropriate > > > action like either Reindex or maybe drop and recreate the index to > > > overcome the problem. > > I'm not opposed to including it, but I think it's still true that the user > doesn't need to know in advance that the error message will be additionally > helpful in the event of corruption. If we were to include more "error" items, > we might also include these: > > 71a8a4f6e36547bb060dbcc961ea9b57420f7190 Add backtrace support for error > reporting > 17a28b03645e27d73bf69a95d7569b61e58f06eb Improve the internal implementation > of ereport(). > 05f18c6b6b6e4b44302ee20a042cedc664532aa2 Added relation name in error > messages for constraint checks. > 33753ac9d7bc83dd9ccee9d5e678ed78a0725b4e Add object names to partition > integrity violations. >

I think the first one (Add backtrace support for error reporting) seems quite useful, as it can help to detect problems faster. Having a simple rule like Bruce's w.r.t. "error messages" makes it easier to decide whether to take a particular commit or not, but I feel some of these could help users learn about the new functionality and might encourage them to upgrade to the new version. Sure, nobody is going to upgrade for these features alone, but along with other things, improved error handling is a good thing to know about. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [PATCH] hs_standby_disallowed test fix
On 2020/05/12 8:03, Michail Nikolaev wrote: Hello. There is a recent commit about changes in the way read-only commands are prevented from being executed [1]. It seems like the hs_standby_disallowed test is broken now. So, a simple patch to fix the test is attached.

Thanks for the report and patch! LGTM. I just wonder why the standbycheck regression test isn't run by default in the buildfarm, which caused us not to notice this issue for a long time. Maybe it's because it's difficult to set up a hot-standby environment in the regression tests? If so, we might need to merge the standbycheck test into the TAP tests for recovery.

Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: PG 13 release notes, first draft
On 2020/05/12 9:34, Bruce Momjian wrote: Could you add "Vinayak Pokale" as a co-author of the following feature, since I sometimes referred to his old patch while creating mine [1]? === E.1.3.1.6. System Views - Add system view pg_stat_progress_analyze to report analyze progress (Álvaro Herrera, Tatsuro Yamada) + Add system view pg_stat_progress_analyze to report analyze progress (Álvaro Herrera, Tatsuro Yamada, Vinayak Pokale) === Done.

Hi Bruce, Thank you! Regards, Tatsuro Yamada
Re: PG 13 release notes, first draft
On Mon, May 11, 2020 at 10:23:53PM -0400, Tom Lane wrote: > Bruce Momjian writes: > > Well, are you suggesting a new section because the glossary shouldn't be > > listed under source code, or because you want the function reformatting > > added? We just need to understand what the purpose is. We already have > > the glossary listed, since that is new and user-visible. > > The implication of what you say here is that "is it user-visible?" > is a criterion for whether to release-note something. By that logic > we probably *should* relnote the function table layout changes, because > they sure as heck are user-visible. People might or might not notice > addition of a glossary, but I think just about every user consults > the function/operator tables regularly. > > I concur with Alvaro's position that if we are listing documentation > changes, pushing them under "Source Code" is not the way to do it. > That subsection has always been understood to be "stuff you don't > care about if you're not a hacker". > > So that sort of leads me to the conclusion that "major documentation > changes" might be a reasonable sub-heading for the release notes. OK, section and item added, patch attached, -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription + diff --git a/doc/src/sgml/release-13.sgml b/doc/src/sgml/release-13.sgml index 83470e9ba1..34211e2b68 100644 --- a/doc/src/sgml/release-13.sgml +++ b/doc/src/sgml/release-13.sgml @@ -2331,7 +2331,7 @@ Use the directory of the pg_upgrade binary as the default new 'bindir' location -Source Code +Documentation @@ -2347,6 +2347,27 @@ Add a glossary to the documentation (Corey Huinker, Jürgen Purtz, Roger Har + + + + +Reformat tables containing function information for better clarity (Tom Lane) + + + + + + + + + +Source Code + + +
Re: PG 13 release notes, first draft
On Tue, May 12, 2020 at 6:36 AM Bruce Momjian wrote: > > On Mon, May 11, 2020 at 07:41:55PM -0500, Justin Pryzby wrote: > > On Mon, May 11, 2020 at 10:52:41AM +0530, Amit Kapila wrote: > > > One more observation: > > > > > > Allow inserts to trigger autovacuum activity (Laurenz Albe, Darafei > > > Praliaskouski) > > > This new behavior allows pages to be set as all-visible, which then > > > allows index-only scans, ... > > > > > The above sentence sounds to mean that this feature allows index-only > > > scans in more number of cases after this feature. Is that what you > > > intend to say? If so, is that correct? Because I think this will > > > allow index-only scans to skip "Heap Fetches" in more cases. > > > > I think what you mean is that the autovacuum feature, in addition to > > encouraging the *planner* to choose an indexonly scan, will *also* allow (at > > execution time) fewer heap fetches for a plan which would have > > already/previously used IOS. Right ? So maybe it should say "allows OR > > IMPROVES index-only scans" or "allows plans which use IOS to run more > > efficiently". > > Yes, I see your point now. New text is: > > This new behavior reduces the work necessary when the table > needs to be frozen and allows pages to be set as all-visible. > All-visible pages allow index-only scans to access fewer heap rows. >

The new text LGTM. Thanks. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: No core file generated after PostgresNode->start
Robert Haas writes: > On Mon, May 11, 2020 at 4:24 PM Antonin Houska wrote: >> Could "sysctl kernel.core_pattern" be the problem? I discovered this setting >> sometime when I also couldn't find the core dump on linux. > Well, I'm running on macOS and the core files normally show up in > /cores, but in this case they didn't. I have a standing note to check the permissions on /cores after any macOS upgrade, because every so often Apple decides that that directory ought to be read-only. regards, tom lane
Re: PG 13 release notes, first draft
On Mon, May 11, 2020 at 10:07:23PM -0400, Tom Lane wrote: > Bruce Momjian writes: > > I like your wording, but the "that encoding" wasn't clear enough for me, > > so I reworded it to: > > > Allow Unicode escapes, e.g., E'\u', U&'\', to represent any > > character available in the database encoding, even when the database > > encoding is not UTF-8 (Tom Lane) > > How about "to be used for" instead of "to represent"? "Represent" kind > of sounds like we're using these on output, which we aren't. Uh, I think "used for" is worse though, since we are not using it. I don't get the "output" feel of the word at all. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: PG 13 release notes, first draft
On Tue, May 12, 2020 at 10:54:52AM +0900, Michael Paquier wrote: > On Mon, May 11, 2020 at 07:18:56PM -0400, Bruce Momjian wrote: > > On Fri, May 8, 2020 at 11:55:33AM +0900, Michael Paquier wrote: > >> Should e2e02191 be added to the notes? This commit means that we > >> actually dropped support for Windows 2000 (finally) at run-time. > > > > Oh, yes. This is much more important than the removal of support for > > non-ELF BSD systems, which I already listed. The new text is: > > > > Remove support for Windows 2000 (Michael Paquier) > > Sounds fine to me. > > >> At the same time I see no mention of 79dfa8af, which added better > >> error handling when backends the SSL context with incorrect bounds. > > > > I skipped that commit since people don't normally care about better > > error messages until they see the error message, and then they are happy > > it is there, unless this is some chronic error message problem we are > > fixing. > > Okay. > > > I thought this fell into the previous category about error messages, but > > coloring is different. Can we say these utilities now honor the color > > environment variables? > > Exactly, I actually became aware of that possibility after plugging > in the common logging APIs to oid2name and vacuumlo as of fc8cb94b so > this was not mentioned in the log message. And anything using > src/common/logging.c can make use of the colorized output with > PG_COLOR[S] set. > > > Are these the only new ones? > > I can recall an extra one in this case: pgbench as of 30a3e77. And I > don't see any new callers of pg_logging_init() in the stuff that > already existed in ~12. I am not sure we even mentioned this in 12. Should we document this somewhere? Maybe a blog posting? -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: No core file generated after PostgresNode->start
On Mon, May 11, 2020 at 4:24 PM Antonin Houska wrote: > Could "sysctl kernel.core_pattern" be the problem? I discovered this setting > sometime when I also couldn't find the core dump on linux. Well, I'm running on macOS and the core files normally show up in /cores, but in this case they didn't. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: PG 13 release notes, first draft
Bruce Momjian writes: > Well, are you suggesting a new section because the glossary shouldn't be > listed under source code, or because you want the function reformatting > added? We just need to understand what the purpose is. We already have > the glossary listed, since that is new and user-visible. The implication of what you say here is that "is it user-visible?" is a criterion for whether to release-note something. By that logic we probably *should* relnote the function table layout changes, because they sure as heck are user-visible. People might or might not notice addition of a glossary, but I think just about every user consults the function/operator tables regularly. I concur with Alvaro's position that if we are listing documentation changes, pushing them under "Source Code" is not the way to do it. That subsection has always been understood to be "stuff you don't care about if you're not a hacker". So that sort of leads me to the conclusion that "major documentation changes" might be a reasonable sub-heading for the release notes. regards, tom lane
Re: PG 13 release notes, first draft
On 2020-May-11, Bruce Momjian wrote: > On Mon, May 11, 2020 at 08:34:56PM -0400, Alvaro Herrera wrote: > > Yes, we had to touch the source code in order to add documentation; but > > so what? Everything touches the source code, but that doesn't mean it > > should be listed there. I don't see what's the problem with having a > > new subsection in the relnotes entitled "Documentation" where these two > > items (glossary + new doc table format) are listed. It's not > > like it's going to cost us a lot of space or anything. > Well, are you suggesting a new section because the glossary shouldn't be > listed under source code, or because you want the function reformatting > added? We just need to understand what the purpose is. We already have > the glossary listed, since that is new and user-visible.

IMO the table reformatting change is significant enough to be noteworthy. I'm suggesting that a new Documentation subsection would list both that and the glossary, separately from Source Code -- so it'd be E.1.3.10 and Source Code would be E.1.3.11.

-- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PG 13 release notes, first draft
Bruce Momjian writes: > I like your wording, but the "that encoding" wasn't clear enough for me, > so I reworded it to: > Allow Unicode escapes, e.g., E'\u', U&'\', to represent any > character available in the database encoding, even when the database > encoding is not UTF-8 (Tom Lane) How about "to be used for" instead of "to represent"? "Represent" kind of sounds like we're using these on output, which we aren't. regards, tom lane
Re: pg13: xlogreader API adjust
At Mon, 11 May 2020 16:33:36 -0400, Alvaro Herrera wrote in > Hello > > Per discussion in thread [1], I propose the following patch to give > another adjustment to the xlogreader API. This results in a small but > not insignificant net reduction of lines of code. What this patch does > is adjust the signature of these new xlogreader callbacks, making the > API simpler. The changes are: > > * the segment_open callback installs the FD in xlogreader state itself, > instead of passing the FD back. This was suggested by Kyotaro > Horiguchi in that thread[2]. > > * We no longer pass segcxt to segment_open; it's in XLogReaderState, > which is already an argument. > > * We no longer pass seg/segcxt to WALRead; instead, that function takes > them from XLogReaderState, which is already an argument. > (This means XLogSendPhysical has to drink more of the fake_xlogreader > kool-aid.) > > I claim the reason to do it now instead of pg14 is to make it simpler > for third-party xlogreader callers to adjust. > > (Some might be thinking that I do this to avoid an API change later, but > my guts tell me that we'll adjust xlogreader again in pg14 for the > encryption stuff and other reasons, so.) > > > [1] https://postgr.es/m/20200406025651.fpzdb5yyb7qyh...@alap3.anarazel.de > [2] > https://postgr.es/m/20200508.114228.963995144765118400.horikyota@gmail.com

The simplified interface of WALRead looks far better to me since it no longer has unreasonable duplicates of parameters. I agree with the discussion about third-party xlogreader callers, but I'm not sure about the back-patching burden. I'm not sure why wal_segment_open and WalSndSegmentOpen were modified in different ways regarding error handling of BasicOpenFile; I prefer the WalSndSegmentOpen way. However, that difference doesn't harm anything, so I'm fine with the current patch.

+	fake_xlogreader.seg = *sendSeg;
+	fake_xlogreader.segcxt = *sendCxt;

fake_xlogreader.seg is a different instance from *sendSeg.
WALRead modifies fake_xlogreader.seg but does not modify *sendSeg, so the change doesn't persist. On the other hand, WalSndSegmentOpen reads *sendSeg, which is not under the control of WALRead. Maybe we had better make fake_xlogreader a global variable of walsender.c that covers the current sendSeg and sendCxt. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
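The persistence problem Horiguchi points out is plain by-value copy semantics: `fake_xlogreader.seg = *sendSeg` copies the struct, so later updates to the copy never reach the original. A tiny analogue (the struct and field names mirror the xlogreader code, but the block itself is only an illustration, not PostgreSQL code):

```python
from dataclasses import dataclass

@dataclass
class WALOpenSegment:
    ws_file: int = -1      # stand-in for the segment's file descriptor

send_seg = WALOpenSegment()
# Like `fake_xlogreader.seg = *sendSeg` in C: a by-value copy.
fake_seg = WALOpenSegment(ws_file=send_seg.ws_file)
# WALRead opens a segment and records it in the copy...
fake_seg.ws_file = 42
# ...but the original, which WalSndSegmentOpen consults, never sees it.
assert send_seg.ws_file == -1
```

This is why making the walsender keep its state inside one shared xlogreader (rather than copying between the two) resolves the inconsistency.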
Re: PG 13 release notes, first draft
On Mon, May 11, 2020 at 07:18:56PM -0400, Bruce Momjian wrote: > On Fri, May 8, 2020 at 11:55:33AM +0900, Michael Paquier wrote: >> Should e2e02191 be added to the notes? This commit means that we >> actually dropped support for Windows 2000 (finally) at run-time. > > Oh, yes. This is much more important than the removal of support for > non-ELF BSD systems, which I already listed. The new text is: > > Remove support for Windows 2000 (Michael Paquier) Sounds fine to me. >> At the same time I see no mention of 79dfa8af, which added better >> error handling when backends the SSL context with incorrect bounds. > > I skipped that commit since people don't normally care about better > error messages until they see the error message, and then they are happy > it is there, unless this is some chronic error message problem we are > fixing. Okay. > I thought this fell into the previous category about error messages, but > coloring is different. Can we say these utilities now honor the color > environment variables? Exactly, I actually became aware of that possibility after plugging in the common logging APIs to oid2name and vacuumlo as of fc8cb94b so this was not mentioned in the log message. And anything using src/common/logging.c can make use of the colorized output with PG_COLOR[S] set. > Are these the only new ones? I can recall an extra one in this case: pgbench as of 30a3e77. And I don't see any new callers of pg_logging_init() in the stuff that already existed in ~12. -- Michael signature.asc Description: PGP signature
Re: A patch for get origin from commit_ts.
On Mon, May 11, 2020 at 04:43:11PM +0800, movead...@highgo.ca wrote: > But I cannot find any code that gets 'origin' from commit_ts; it looks like it > is producing data which nobody cares about. Can I ask what the purpose > of the 'origin' in commit_ts is? Do you think we should add some support > for this otherwise-unused data? I have not thought about this matter, but it seems to me that you should add this patch to the upcoming commitfest for evaluation: https://commitfest.postgresql.org/28/ This is going to take a couple of months, though, as the main focus lately is the stability of 13. -- Michael signature.asc Description: PGP signature
Re: PG 13 release notes, first draft
On Mon, May 11, 2020 at 6:23 PM Bruce Momjian wrote: > OK, I was able to add some of it cleanly: > > This allows efficient btree indexing of low cardinality columns by > storing duplicate keys only once. Users upgrading with pg_upgrade > will need to use REINDEX to make use of this feature. That seems like a reasonable compromise. Thanks! -- Peter Geoghegan
Re: PG 13 release notes, first draft
On Tue, May 12, 2020 at 8:51 AM Bruce Momjian wrote: > On Fri, May 8, 2020 at 12:07:09PM +0900, Amit Langote wrote: > > I have attached a patch with my suggestions above. > > OK, I slightly modified the wording of your first change, patch > attached. Thanks. I checked that what you committed looks fine. -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
Re: PG 13 release notes, first draft
On Mon, May 11, 2020 at 09:15:43PM -0400, Bruce Momjian wrote: > On Mon, May 11, 2020 at 05:33:40PM -0700, Peter Geoghegan wrote: > > On Mon, May 11, 2020 at 5:14 PM Bruce Momjian wrote: > > > Agreed. How is this? > > > > > > This allows efficient btree indexing of low cardinality columns. > > > Users upgrading with pg_upgrade will need to use REINDEX to make > > > use of > > > this feature. > > > > I still think that the release notes should say that the key is only > > stored once, while TIDs that identify table rows are stored together > > as an array. Everything that's helpful (or harmful) about the feature > > happens as a consequence of that. This isn't hard to grasp > > intuitively, and is totally in line with how things like Oracle bitmap > > indexes are presented to ordinary users. > > I still don't think these details belong in the release notes. OK, I was able to add some of it cleanly: This allows efficient btree indexing of low cardinality columns by storing duplicate keys only once. Users upgrading with pg_upgrade will need to use REINDEX to make use of this feature. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: PG 13 release notes, first draft
On Mon, May 11, 2020 at 05:33:40PM -0700, Peter Geoghegan wrote: > On Mon, May 11, 2020 at 5:14 PM Bruce Momjian wrote: > > Agreed. How is this? > > > > This allows efficient btree indexing of low cardinality columns. > > Users upgrading with pg_upgrade will need to use REINDEX to make > > use of > > this feature. > > I still think that the release notes should say that the key is only > stored once, while TIDs that identify table rows are stored together > as an array. Everything that's helpful (or harmful) about the feature > happens as a consequence of that. This isn't hard to grasp > intuitively, and is totally in line with how things like Oracle bitmap > indexes are presented to ordinary users. I still don't think these details belong in the release notes. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: PG 13 release notes, first draft
On Mon, May 11, 2020 at 07:41:01PM -0500, Justin Pryzby wrote: > |Allow function call backtraces of errors to be logged (Peter Eisentraut, > Álvaro Herrera) > |Server variable backtrace_functions specifies which C functions should > generate backtraces on error. > > I think the details in the description are eclipsing the most important thing: > backtraces on Assert(). I would say "Support for showing backtraces on > error". Uh, you mean this adds backtraces for errors and asserts? Are non-developers running assert builds? > Regarding this one: > |Add system view pg_shmem_allocations to display shared memory usage (Andres > Freund, Robert Haas) > |WHAT IS THE ENTRY WITH NO NAME? > > There seem to be two special, "unnamed" cases: > src/backend/storage/ipc/shmem.c-/* output shared memory allocated but > not counted via the shmem index */ > src/backend/storage/ipc/shmem.c:values[0] = > CStringGetTextDatum(""); > ... > src/backend/storage/ipc/shmem.c-/* output as-of-yet unused shared > memory */ > src/backend/storage/ipc/shmem.c-nulls[0] = true; > > That seems to be adequately documented: > https://www.postgresql.org/docs/devel/view-pg-shmem-allocations.html > |NULL for unused memory and for anonymous allocations. OK, thanks. Comment removed. > I would remove this part: > "Previously, this could only be set at server start." OK, you are saying it is already clear; agreed, removed. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: PG 13 release notes, first draft
On Mon, May 11, 2020 at 05:09:54PM -0400, Alvaro Herrera wrote: > On 2020-May-11, Alvaro Herrera wrote: > > > Hello > > > > > + > > > + > > > + > > > +Add psql commands to report operator classes and operator families > > > (Sergey Cherkashin, Nikita Glukhov, Alexander Korotkov) > > > + > > > > I think this item should list the commands in question: > > \dA, \dAc, \dAf, \dAo, \dAp > > (All the other psql entries in the relnotes do that). > > Sorry, it's the last four only -- \dA is an older command. OK, confirmed, thanks. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: PG 13 release notes, first draft
On Mon, May 11, 2020 at 07:41:55PM -0500, Justin Pryzby wrote: > On Mon, May 11, 2020 at 10:52:41AM +0530, Amit Kapila wrote: > > One more observation: > > > > Allow inserts to trigger autovacuum activity (Laurenz Albe, Darafei > > Praliaskouski) > > This new behavior allows pages to be set as all-visible, which then > > allows index-only scans, ... > > > The above sentence sounds to mean that this feature allows index-only > > scans in more number of cases after this feature. Is that what you > > intend to say? If so, is that correct? Because I think this will > > allow index-only scans to skip "Heap Fetches" in more cases. > > I think what you mean is that the autovacuum feature, in addition to > encouraging the *planner* to choose an indexonly scan, will *also* allow (at > execution time) fewer heap fetches for a plan which would have > already/previously used IOS. Right ? So maybe it should say "allows OR > IMPROVES index-only scans" or "allows plans which use IOS to run more > efficiently". Yes, I see your point now. New text is: This new behavior reduces the work necessary when the table needs to be frozen and allows pages to be set as all-visible. All-visible pages allow index-only scans to access fewer heap rows. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: PG 13 release notes, first draft
On Mon, May 11, 2020 at 08:34:56PM -0400, Alvaro Herrera wrote: > On 2020-May-11, Bruce Momjian wrote: > > > We have long discussed how much of the release notes is to reward > > behavior, and we have settled on having the names on the items, and the > > Acknowledgments section at the bottom. > > Yes, we had to touch the source code in order to add documentation; but > so what? Everything touches the source code, but that doesn't mean it > should be listed there. I don't see what's the problem with having a > new subsection in the relnotes entitled "Documentation" where these two > items appears (glossary + new doc table format) are listed. It's not > like it's going to cost us a lot of space or anything. > > I don't think there is any circularity argument BTW -- we're not going > to document that we added release notes. And changing the table format > is not entirely pointless, given that we've historically had trouble > with these tables (read: they're totally unusable in PDF). Well, are you suggesting a new section because the glossary shouldn't be listed under source code, or because you want the function reformatting added? We just need to understand what the purpose is. We already have the glossary listed, since that is new and user-visible. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: PG 13 release notes, first draft
On Mon, May 11, 2020 at 04:50:50PM -0400, Alvaro Herrera wrote: > Hello > > > + > > + > > + > > +Add psql commands to report operator classes and operator families (Sergey > > Cherkashin, Nikita Glukhov, Alexander Korotkov) > > + > > I think this item should list the commands in question: > \dA, \dAc, \dAf, \dAo, \dAp > (All the other psql entries in the relnotes do that). Good idea. I added this paragraph: The new commands are \dAc, \dAf, \dAo, and \dAp. I didn't see any changes to \dA except regression tests. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Two fsync related performance issues?
Hello hackers, 1. StartupXLOG() does an fsync on the whole data directory early in crash recovery. I'm wondering if we could skip some directories (at least pg_log/ and the table directories), since WAL replay, etc., could ensure consistency. Here is the related code:

if (ControlFile->state != DB_SHUTDOWNED &&
    ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
{
    RemoveTempXlogFiles();
    SyncDataDirectory();
}

I have this concern because I saw an issue in a real production environment where the startup process needed 10+ seconds to start WAL replay after a relaunch due to elog(PANIC) (it was seen on the Postgres-based product Greenplum, but it is a common issue in Postgres as well). I highly suspect the delay was mostly due to this. It has also been observed that fsync on public clouds is much slower than on local storage, so the slowness should be more severe in the cloud. If we at least skipped the fsyncs on the table directories we could avoid a lot of per-file fsyncs, which could save many seconds during crash recovery.

2. CheckPointTwoPhase() This may be a small issue. See the code below:

for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
    RecreateTwoPhaseFile(gxact->xid, buf, len);

RecreateTwoPhaseFile() writes a state file for a prepared transaction and does an fsync. It might be better to write all the files first and then fsync them once at the end, given that the kernel is able to flush asynchronously while those file contents are being written. If TwoPhaseState->numPrepXacts is large, we could batch the fsyncs to avoid the fd resource limit. I have not tested these yet, but they should be able to speed up checkpoint/restartpoint a bit. Any thoughts? Regards.
Re: PG 13 release notes, first draft
On Mon, May 11, 2020 at 10:52:41AM +0530, Amit Kapila wrote: > > 1. We have allowed an (auto)vacuum to display additional information > > about heap or index in case of an error in commit b61d161c14 [1]. > > Now, in general, it might not be worth saying much about error > > information but I think this one could help users in case they have > > some corruption. For example, if one of the indexes on a relation has > > some corrupted data (due to bad hardware or some bug), it will let the > > user know the index information, and the user can take appropriate > > action like either Reindex or maybe drop and recreate the index to > > overcome the problem. I'm not opposed to including it, but I think it's still true that the user doesn't need to know in advance that the error message will be additionally helpful in the event of corruption. If we were to include more "error" items, we might also include these: 71a8a4f6e36547bb060dbcc961ea9b57420f7190 Add backtrace support for error reporting 17a28b03645e27d73bf69a95d7569b61e58f06eb Improve the internal implementation of ereport(). 05f18c6b6b6e4b44302ee20a042cedc664532aa2 Added relation name in error messages for constraint checks. 33753ac9d7bc83dd9ccee9d5e678ed78a0725b4e Add object names to partition integrity violations. > One more observation: > > Allow inserts to trigger autovacuum activity (Laurenz Albe, Darafei > Praliaskouski) > This new behavior allows pages to be set as all-visible, which then > allows index-only scans, ... > The above sentence sounds to mean that this feature allows index-only > scans in more number of cases after this feature. Is that what you > intend to say? If so, is that correct? Because I think this will > allow index-only scans to skip "Heap Fetches" in more cases. 
I think what you mean is that the autovacuum feature, in addition to encouraging the *planner* to choose an indexonly scan, will *also* allow (at execution time) fewer heap fetches for a plan which would have already/previously used IOS. Right ? So maybe it should say "allows OR IMPROVES index-only scans" or "allows plans which use IOS to run more efficiently". Separate from Amit's comment, I suggest to say: | This new behavior allows autovacuum to set pages as all-visible, which then | allows index-only scans, ... ..otherwise it sounds like this feature implemented the concept of "all-visible". -- Justin
Re: PG 13 release notes, first draft
|Allow function call backtraces of errors to be logged (Peter Eisentraut, Álvaro Herrera) |Server variable backtrace_functions specifies which C functions should generate backtraces on error. I think the details in the description are eclipsing the most important thing: backtraces on Assert(). I would say "Support for showing backtraces on error". Regarding this one: |Add system view pg_shmem_allocations to display shared memory usage (Andres Freund, Robert Haas) |WHAT IS THE ENTRY WITH NO NAME? There seems to be two special, "unnamed" cases: src/backend/storage/ipc/shmem.c-/* output shared memory allocated but not counted via the shmem index */ src/backend/storage/ipc/shmem.c:values[0] = CStringGetTextDatum(""); ... src/backend/storage/ipc/shmem.c-/* output as-of-yet unused shared memory */ src/backend/storage/ipc/shmem.c-nulls[0] = true; That seems to be adequately documented: https://www.postgresql.org/docs/devel/view-pg-shmem-allocations.html |NULL for unused memory and for anonymous allocations. I would remove this part: "Previously, this could only be set at server start." -- Justin
Re: PG 13 release notes, first draft
On 2020-May-11, Bruce Momjian wrote: > We have long discussed how much of the release notes is to reward > behavior, and we have settled on having the names on the items, and the > Acknowledgments section at the bottom. Yes, we had to touch the source code in order to add documentation; but so what? Everything touches the source code, but that doesn't mean it should be listed there. I don't see what the problem is with having a new subsection in the relnotes entitled "Documentation" where these two items (glossary + new doc table format) are listed. It's not like it's going to cost us a lot of space or anything. I don't think there is any circularity argument BTW -- we're not going to document that we added release notes. And changing the table format is not entirely pointless, given that we've historically had trouble with these tables (read: they're totally unusable in PDF). -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PG 13 release notes, first draft
On Mon, May 11, 2020 at 03:19:50PM +0900, Tatsuro Yamada wrote: > Hi Bruce, > > On 2020/05/05 12:16, Bruce Momjian wrote: > > I have committed the first draft of the PG 13 release notes. You can > > see them here: > > > > https://momjian.us/pgsql_docs/release-13.html > > > > It still needs markup, word wrap, and indenting. The community doc > > build should happen in a few hours. > > Thanks for working on this! :-D > > Could you add "Vinayak Pokale" as a co-author of the following feature since > I sometimes read his old patch to create a patch [1] ? > > === > E.1.3.1.6. System Views > > - Add system view pg_stat_progress_analyze to report analyze progress > (Álvaro Herrera, Tatsuro Yamada) > > + Add system view pg_stat_progress_analyze to report analyze progress > (Álvaro Herrera, Tatsuro Yamada, Vinayak Pokale) > === Done. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: PG 13 release notes, first draft
On Mon, May 11, 2020 at 5:14 PM Bruce Momjian wrote: > Agreed. How is this? > > This allows efficient btree indexing of low cardinality columns. > Users upgrading with pg_upgrade will need to use REINDEX to make use > of > this feature. I still think that the release notes should say that the key is only stored once, while TIDs that identify table rows are stored together as an array. Everything that's helpful (or harmful) about the feature happens as a consequence of that. This isn't hard to grasp intuitively, and is totally in line with how things like Oracle bitmap indexes are presented to ordinary users. -- Peter Geoghegan
Re: PG 13 release notes, first draft (ltree dot star)
On Sun, May 10, 2020 at 03:09:47PM -0500, Justin Pryzby wrote: > > In ltree, when using adjacent asterisks with braces, e.g. "*{2}.*{3}", > > properly interpret that as "*{5}" (Nikita Glukhov) > > I think that should say ".*" not "*", as in: > > > In ltree, when using adjacent asterisks with braces, e.g. ".*{2}.*{3}", > > properly interpret that as "*{5}" (Nikita Glukhov) > > The existing text clearly came from the commit message, which (based on its > regression tests) I think was the source of the missing dot. > > commit 9950c8aadf0edd31baec74a729d47d94af636c06 > Author: Tom Lane > Date: Sat Mar 28 18:31:05 2020 -0400 > > Fix lquery's behavior for consecutive '*' items. > > Something like "*{2}.*{3}" should presumably mean the same as > "*{5}", but it didn't. Improve that. > ... OK, fixed to be: In ltree, when using adjacent asterisks with braces, e.g. ".*{2}.*{3}", properly interpret that as ".*{5}" (Nikita Glukhov) I added two dots. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: PG 13 release notes, first draft
On Mon, May 11, 2020 at 10:52:41AM +0530, Amit Kapila wrote: > One more observation: > > Allow inserts to trigger autovacuum activity (Laurenz Albe, Darafei > Praliaskouski) > This new behavior allows pages to be set as all-visible, which then > allows index-only scans, ... > > The above sentence sounds to mean that this feature allows index-only > scans in more number of cases after this feature. Is that what you > intend to say? If so, is that correct? Because I think this will Yes. > allow index-only scans to skip "Heap Fetches" in more cases. Uh, by definition an index-only scan only scans the index, not the heap, right? It is true there are fewer heap fetches, but I thought fewer heap fetches meant more index-only scans. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: PG 13 release notes, first draft
On Sat, May 9, 2020 at 11:16:27AM +0530, Amit Kapila wrote: > On Tue, May 5, 2020 at 8:46 AM Bruce Momjian wrote: > > > > I have committed the first draft of the PG 13 release notes. You can > > see them here: > > > > https://momjian.us/pgsql_docs/release-13.html > > > > Thanks for the work. I was today going through the release notes and > was wondering whether we should consider adding information about some > other work done for PG13. > 1. We have allowed an (auto)vacuum to display additional information > about heap or index in case of an error in commit b61d161c14 [1]. > Now, in general, it might not be worth saying much about error > information but I think this one could help users in case they have > some corruption. For example, if one of the indexes on a relation has > some corrupted data (due to bad hardware or some bug), it will let the > user know the index information, and the user can take appropriate > action like either Reindex or maybe drop and recreate the index to > overcome the problem. I mentioned my approach to error message changes in a previous email today: I skipped that commit since people don't normally care about better error messages until they see the error message, and then they are happy it is there, unless this is some chronic error message problem we are fixing. > 2. In the "Source Code" section, we can add information about > infrastructure enhancement for parallelism. Basically, "Allow > relation extension and page lock to conflict among parallel-group > members" [2][3]. This will allow improving the parallelism further in > many cases like (a) we can allow multiple workers to operate on a heap > and index in a parallel vacuum, (b) we can allow parallel Inserts, > etc. Uh, if there is no user-visible behavior change, this seems too low level for the release notes. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: PG 13 release notes, first draft
On Fri, May 8, 2020 at 09:14:01AM +0200, Fabien COELHO wrote: > > It seems (a) pointless > > I disagree, on the very principle of free software values as a social > movement. > > Documentation improvements should be encouraged, and recognizing these in > the release notes contributes to do that for what is a lot of unpaid work > given freely by many people. I do not see this as "pointless", on the > contrary, having something "free" in a mostly mercantile world is odd enough > to deserve some praise. > > How many hours have you spent on the function operator table improvements? > If someone else had contributed that and only that to a release, would it > not justify two lines of implicit thanks somewhere down in the release > notes? > > Moreover adding a documentation section costs next to nothing, so what is > the actual point of not doing it? Also, having some documentation > improvements listed under "source code" does not make sense: writing good, > precise and structured English is not "source code". We have long discussed how much of the release notes is to reward behavior, and we have settled on having the names on the items, and the Acknowledgments section at the bottom. If you want to revisit that decision, you should start a new thread because doing it for just this item doesn't make sense. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: PG 13 release notes, first draft
On Mon, May 11, 2020 at 05:05:29PM -0700, Peter Geoghegan wrote: > On Mon, May 11, 2020 at 4:10 PM Bruce Momjian wrote: > > > think that you should point out that deduplication works by storing > > > the duplicates in the obvious way: Only storing the key once per > > > distinct value (or once per distinct combination of values in the case > > > of multi-column indexes), followed by an array of TIDs (i.e. a posting > > > list). Each TID points to a separate row in the table. > > > > These are not details that should be in the release notes since the > > internal representation is not important for its use. > > I am not concerned about describing the specifics of the on-disk > representation, and I don't feel too strongly about the storage > parameter (leave it out). I only ask that the wording convey the fact > that the deduplication feature is not just a quantitative improvement > -- it's a qualitative behavioral change, that will help data > warehousing in particular. This wasn't the case with the v12 work on > B-Tree duplicates (as I said last year, I thought of the v12 stuff as > fixing a problem more than an enhancement). > > With the deduplication feature added to Postgres v13, the B-Tree code > can now gracefully deal with low cardinality data by compressing the > duplicates as needed. This is comparable to bitmap indexes in > proprietary database systems, but without most of their disadvantages > (in particular around heavyweight locking, deadlocks that abort > transactions, etc). It's easy to imagine this making a big difference > with analytics workloads. The v12 work made indexes with lots of > duplicates 15%-30% smaller (compared to v11), but the v13 work can > make them 60% - 80% smaller in many common cases (compared to v12). In > extreme cases indexes might even be ~12x smaller (though that will be > rare). Agreed. How is this? This allows efficient btree indexing of low cardinality columns. 
Users upgrading with pg_upgrade will need to use REINDEX to make use of this feature. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Event trigger code comment duplication
I was skimming through the code in event_trigger.c and noticed that, while most of the stanzas that reference IsUnderPostmaster refer back to the code comment beginning on line 675, the block for table rewrite copies it in verbatim starting at line 842. The currentEventTriggerState comments at lines 730 and 861 seem to be the same too. https://github.com/postgres/postgres/blob/60c90c16c1885bb9aa2047b66f958b865c5d397e/src/backend/commands/event_trigger.c#L675 https://github.com/postgres/postgres/blob/60c90c16c1885bb9aa2047b66f958b865c5d397e/src/backend/commands/event_trigger.c#L730 https://github.com/postgres/postgres/blob/60c90c16c1885bb9aa2047b66f958b865c5d397e/src/backend/commands/event_trigger.c#L842 https://github.com/postgres/postgres/blob/60c90c16c1885bb9aa2047b66f958b865c5d397e/src/backend/commands/event_trigger.c#L861 https://github.com/postgres/postgres/blob/4d1563717fb1860168a40b852e1d61a33ecdd62f/src/backend/commands/event_trigger.c#L785 I did also notice a difference with the test on line 861 compared to line 785, though I am unable to evaluate whether the absence of a "rewriteList" is expected (there is a "dropList" at 785 ...). David J.
Re: PG 13 release notes, first draft
On Thu, May 7, 2020 at 09:22:02PM -0700, Noah Misch wrote: > On Thu, May 07, 2020 at 09:38:34AM -0400, Bruce Momjian wrote: > > > > > - Crash recovery was losing tuples written via COPY TO. This fixes > > > > > the bug. > > > > > > > > This was not backpatched? > > > > > > Right. > > > > Oh. So you are saying we could lose COPY data on a crash, even after a > > commit. That seems bad. Can you show me the commit info? I can't find > > it. > > commit c6b9204 > Author: Noah Misch > AuthorDate: Sat Apr 4 12:25:34 2020 -0700 > Commit: Noah Misch > CommitDate: Sat Apr 4 12:25:34 2020 -0700 > > Skip WAL for new relfilenodes, under wal_level=minimal. > > Until now, only selected bulk operations (e.g. COPY) did this. If a > given relfilenode received both a WAL-skipping COPY and a WAL-logged > operation (e.g. INSERT), recovery could lose tuples from the COPY. See > src/backend/access/transam/README section "Skipping WAL for New > RelFileNode" for the new coding rules. Maintainers of table access > methods should examine that section. OK, so how do we want to document this? Do I mention in the text below the WAL skipping item that this fixes a bug where a mix of simultaneous COPY and INSERT into a table could lose rows during crash recovery, or create a new item? -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: PG 13 release notes, first draft
On Mon, May 11, 2020 at 4:10 PM Bruce Momjian wrote: > > think that you should point out that deduplication works by storing > > the duplicates in the obvious way: Only storing the key once per > > distinct value (or once per distinct combination of values in the case > > of multi-column indexes), followed by an array of TIDs (i.e. a posting > > list). Each TID points to a separate row in the table. > > These are not details that should be in the release notes since the > internal representation is not important for its use. I am not concerned about describing the specifics of the on-disk representation, and I don't feel too strongly about the storage parameter (leave it out). I only ask that the wording convey the fact that the deduplication feature is not just a quantitative improvement -- it's a qualitative behavioral change, that will help data warehousing in particular. This wasn't the case with the v12 work on B-Tree duplicates (as I said last year, I thought of the v12 stuff as fixing a problem more than an enhancement). With the deduplication feature added to Postgres v13, the B-Tree code can now gracefully deal with low cardinality data by compressing the duplicates as needed. This is comparable to bitmap indexes in proprietary database systems, but without most of their disadvantages (in particular around heavyweight locking, deadlocks that abort transactions, etc). It's easy to imagine this making a big difference with analytics workloads. The v12 work made indexes with lots of duplicates 15%-30% smaller (compared to v11), but the v13 work can make them 60% - 80% smaller in many common cases (compared to v12). In extreme cases indexes might even be ~12x smaller (though that will be rare). -- Peter Geoghegan
Re: PG 13 release notes, first draft
On Fri, May 8, 2020 at 12:07:09PM +0900, Amit Langote wrote: > > OK, I used this wording: > > > > Allow logical replication into partitioned tables on subscribers > > (Amit > > Langote) > > > > Previously, subscribers could only receive rows into non-partitioned > > tables. > > This is fine, thanks. > > I have attached a patch with my suggestions above. OK, I slightly modified the wording of your first change, patch attached. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription + diff --git a/doc/src/sgml/release-13.sgml b/doc/src/sgml/release-13.sgml index 239586c04b..51ebb9b2ba 100644 --- a/doc/src/sgml/release-13.sgml +++ b/doc/src/sgml/release-13.sgml @@ -273,7 +273,7 @@ Allow partitionwise joins to happen in more cases (Ashutosh Bapat, Etsuro Fujita For example, partitionwise joins can now happen between partitioned -tables where the ancestors do not exactly match. +tables even when their partition bounds do not match exactly. @@ -307,7 +307,7 @@ Allow partitioned tables to be logically replicated via publications (Amit Lango Previously, partitions had to be replicated individually. Now partitioned tables can be published explicitly causing all partitions to be automatically published. Addition/removal of partitions from -partitioned tables are automatically added/removed from publications. The CREATE PUBLICATION option publish_via_partition_root controls whether partitioned tables are published as their own or their ancestors. +partitioned tables are automatically added/removed from publications. The CREATE PUBLICATION option publish_via_partition_root controls whether changes to partitions are published as their own or their ancestor's.
Re: PG 13 release notes, first draft
On Fri, May 8, 2020 at 11:55:33AM +0900, Michael Paquier wrote: > On Mon, May 04, 2020 at 11:16:00PM -0400, Bruce Momjian wrote: > > I have committed the first draft of the PG 13 release notes. You can > > see them here: > > > > https://momjian.us/pgsql_docs/release-13.html > > > > It still needs markup, word wrap, and indenting. The community doc > > build should happen in a few hours. > > Thanks Bruce for compiling the release notes. Here are some comments > from me, after looking at the state of the notes as of f2ff203. > > Should e2e02191 be added to the notes? This commit means that we > actually dropped support for Windows 2000 (finally) at run-time. Oh, yes. This is much more important than the removal of support for non-ELF BSD systems, which I already listed. The new text is: Remove support for Windows 2000 (Michael Paquier) > At the same time I see no mention of 79dfa8af, which added better > error handling when backends the SSL context with incorrect bounds. I skipped that commit since people don't normally care about better error messages until they see the error message, and then they are happy it is there, unless this is some chronic error message problem we are fixing. > What about fc8cb94, which basically means that vacuumlo and oid2name > are able to now support coloring output for their logging? I thought this fell into the previous category about error messages, but coloring is different. Can we say these utilities now honor the color environment variables? Are these the only new ones? > > Document color support (Peter Eisentraut) > > [...] > > THIS WAS NOT DOCUMENTED BEFORE? > > Not sure that there is a point to add that to the release notes. OK, removed. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: PG 13 release notes, first draft
On Thu, May 7, 2020 at 11:54:12AM -0700, Peter Geoghegan wrote: > Hi Bruce, > > On Mon, May 4, 2020 at 8:16 PM Bruce Momjian wrote: > > I have committed the first draft of the PG 13 release notes. You can > > see them here: > > > > https://momjian.us/pgsql_docs/release-13.html > > I see that you have an entry for the deduplication feature: > > "More efficiently store duplicates in btree indexes (Anastasia > Lubennikova, Peter Geoghegan)" > > I would like to provide some input on this. Fortunately it's much > easier to explain than the B-Tree work that went into Postgres 12. I - Well, that's good! :-) > think that you should point out that deduplication works by storing > the duplicates in the obvious way: Only storing the key once per > distinct value (or once per distinct combination of values in the case > of multi-column indexes), followed by an array of TIDs (i.e. a posting > list). Each TID points to a separate row in the table. These are not details that should be in the release notes since the internal representation is not important for its use. > It won't be uncommon for this to make indexes as much as 3x smaller > (it depends on a number of different factors that you can probably > guess). I wrote a summary of how it works for power users in the > B-Tree documentation chapter, which you might want to link to in the > release notes: > > https://www.postgresql.org/docs/devel/btree-implementation.html#BTREE-DEDUPLICATION > > Users that pg_upgrade will have to REINDEX to actually use the > feature, regardless of which version they've upgraded from. There are > also some limited caveats about the data types that can use > deduplication, and stuff like that -- see the documentation section I > linked to. I have added text to this about pg_upgrade: Users upgrading with pg_upgrade will need to use REINDEX to make use of this feature. 
> Finally, you might want to note that the feature is enabled by > default, and can be disabled by setting the "deduplicate_items" index > storage option to "off". (We have yet to make a final decision on > whether the feature should be enabled before the first stable release > of Postgres 13, though -- I have an open item for that.) Well, again, I don't think the average user needs to know this can be disabled. They can look at the docs of this feature to see that. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
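As a sketch of the user-visible surface discussed above (index and table names are illustrative):

```sql
-- Deduplication is on by default for new btree indexes in v13; it can be
-- disabled per index through the storage option mentioned above.
CREATE INDEX t_val_idx ON t (val) WITH (deduplicate_items = off);

-- Indexes carried over by pg_upgrade must be rebuilt before they can use
-- the new representation:
REINDEX INDEX t_val_idx;
```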
[PATCH] hs_standby_disallowed test fix
Hello. There is a recent commit changing the way read-only commands are prevented from being executed [1]. It seems the hs_standby_disallowed test is broken now. So, a simple patch to fix the test is attached. Thanks, Michail. [1] https://www.postgresql.org/message-id/flat/154701965766.11631.2240747476287499810.pgcf%40coridan.postgresql.org#168075c6e89267e11b862aa0c55b1910 diff --git a/src/test/regress/expected/hs_standby_disallowed.out b/src/test/regress/expected/hs_standby_disallowed.out index dff0953e9a..8d3cafa5ce 100644 --- a/src/test/regress/expected/hs_standby_disallowed.out +++ b/src/test/regress/expected/hs_standby_disallowed.out @@ -64,7 +64,7 @@ SELECT count(*) FROM hs1; (1 row) COMMIT PREPARED 'foobar'; -ERROR: COMMIT PREPARED cannot run inside a transaction block +ERROR: cannot execute COMMIT PREPARED during recovery ROLLBACK; BEGIN; SELECT count(*) FROM hs1; @@ -86,7 +86,7 @@ SELECT count(*) FROM hs1; (1 row) ROLLBACK PREPARED 'foobar'; -ERROR: ROLLBACK PREPARED cannot run inside a transaction block +ERROR: cannot execute ROLLBACK PREPARED during recovery ROLLBACK; -- Locks BEGIN;
Re: PG 13 release notes, first draft
On Thu, May 7, 2020 at 02:08:58PM -0400, Chapman Flack wrote: > On 05/07/20 09:46, Bruce Momjian wrote: > > Ah, very good point. New text is: > > > > Allow Unicode escapes, e.g., E'\u', in databases that do not > > use UTF-8 encoding (Tom Lane) > > > > The Unicode characters must be available in the database encoding. > > ... > > > > I am only using E'\u' as an example. > > Hmm, how about: > > Allow Unicode escapes, e.g., E'\u' or U&'\', to represent > any character available in the database encoding, even when that > encoding is not UTF-8. > > which I suggest as I recall more clearly that the former condition > was not that such escapes were always rejected in other encodings; it was > that they were rejected if they represented characters outside of ASCII. > (Yossarian let out a respectful whistle.) I like your wording, but the "that encoding" wasn't clear enough for me, so I reworded it to: Allow Unicode escapes, e.g., E'\u', U&'\', to represent any character available in the database encoding, even when the database encoding is not UTF-8 (Tom Lane) > My inclination is to give at least one example each of the E and U& > form, if only so the casual reader of the notes may think "say! I hadn't > heard of that other form!" and be inspired to find out about it. But > perhaps it seems too much. Sure, works for me. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
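A minimal illustration of the two escape forms under discussion (the codepoint U+00E9 is chosen here for illustration because it also exists in LATIN1):

```sql
-- Previously these failed for non-ASCII characters when the database
-- encoding was not UTF-8; in v13 they work whenever the character can be
-- represented in the database encoding.
SELECT E'caf\u00e9';   -- string-literal form, yields 'café'
SELECT U&'caf\00e9';   -- SQL-standard Unicode string form, same result
```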
Re: PG 13 release notes, first draft
On Mon, May 11, 2020 at 08:41:00PM +0300, Alexander Korotkov wrote: > On Mon, May 11, 2020 at 4:20 PM Bruce Momjian wrote: > > Sorry for the delay in replying. Yes, I agree we should list all of > > those operator class cases where we now take arguments. I looked at the > > patch but got confused and missed the doc changes that clearly need to > > be in the release notes. I see these operator classes now taking > > parameters, as you helpfully listed in your commit message: > > > > tsvector_ops > > gist_ltree_ops > > gist__ltree_ops > > gist_trgm_ops > > gist_hstore_ops > > gist__int_ops > > gist__intbig_ops > > > > I assume the double-underscore is because the first underscore is to > > separate words, and the second one is for the array designation, right? > > Yes, this is true. OK. > > So my big question is whether people will understand when they are using > > these operator classes, since many of them are defaults. Can you use an > > operator class parameter when you are just using the default operator > > class and not specifying its name? > > Actually no. Initial version of patch allowed to explicitly specify > DEFAULT keyword instead of opclass name. But I didn't like idea to > allow keyword instead of name there. > > I've tried to implement syntax allowing specifying parameters without > both new keyword and opclass name, but that causes a lot of grammar > problems. > > Finally, I've decided to provide parameters functionality only when > specifying opclass name. My motivation is that opclass parameters is > functionality for advanced users, who are deeply concerned into what > opclass do. For this category of users it's natural to know the > opclass name. Yes, that is good analysis, and your final decision was probably correct. I now see that the complexity is not a big problem. > > What I thinking that just saying > > the operator class take arguments might not be helpful. 
I think I see > > enough detail in the documentation to write release note items for > > these, but I will have to point out they need to specify the operator > > class, even if it is the default, right? > > My point was that we should specify where to look to find new > functionality. We can don't write opclass names, because those names > might be confusing for users who are not aware of them. We may > briefly say that new parameters are introduced for GiST for tsvector, > contrib/intarray, contrib/pg_trgm, contrib/ltree, contrib/hstore. > What do you think? OK, I have applied the attached patch, which I now think is the right level of detail, given your information above. Thanks. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription + diff --git a/doc/src/sgml/release-13.sgml b/doc/src/sgml/release-13.sgml new file mode 100644 index 239586c..c0840c2 *** a/doc/src/sgml/release-13.sgml --- b/doc/src/sgml/release-13.sgml *** Author: Alexander Korotkov + + Indexes created on four and eight-byte integer array, tsvector, trigram, + ltree, and hstore columns can now control these GiST index parameters, + rather than using the defaults.
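For concreteness, a sketch of the syntax the release note points at — note that, per the discussion above, the opclass name must be spelled out even when it is the default (index, table, and parameter values here are illustrative):

```sql
-- Requires the pg_trgm extension; siglen is one of the new GiST opclass
-- parameters added in v13.
CREATE INDEX t_txt_trgm_idx ON t USING gist (txt gist_trgm_ops (siglen = 32));

-- The core tsvector GiST opclass takes a signature-length parameter as well.
CREATE INDEX t_tsv_idx ON t USING gist (tsv tsvector_ops (siglen = 256));
```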
Remaining routine pre-beta tasks
There are a couple of things left to do yet: * Collapse temporary OID assignments down to the permanent range. AFAICS there's no reason to delay this step any longer, since there aren't any open issues that seem likely to touch the catalog data. * In recent years we've usually done a preliminary pgindent run before beta1 (and then another right before making the stable branch). I've been holding off on this to avoid breaking WIP patches for open issues, but we really ought to be close to done on those. Barring objections, I'll do the OID renumbering tomorrow (Tuesday) and a pgindent run on Thursday. regards, tom lane
Re: PG 13 release notes, first draft
On 2020-May-11, Alvaro Herrera wrote: > Hello > > > + > > + > > + > > +Add psql commands to report operator classes and operator families (Sergey > > Cherkashin, Nikita Glukhov, Alexander Korotkov) > > + > > I think this item should list the commands in question: > \dA, \dAc, \dAf, \dAo, \dAp > (All the other psql entries in the relnotes do that). Sorry, it's the last four only -- \dA is an older command. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgsql: Show opclass and opfamily related information in psql
On 2020-Mar-08, Alexander Korotkov wrote: > Show opclass and opfamily related information in psql > > This commit provides psql commands for listing operator classes, operator > families and its contents in psql. New commands will be useful for exploring > capabilities of both builtin opclasses/opfamilies as well as > opclasses/opfamilies defined in extensions. I had chance to use these new commands this morning. I noticed the ORDER BY clause of \dAo is not very useful; for example: =# \dAo+ brin datetime_minmax_ops List of operators of operator families AM │Opfamily Name│ Operator │ Strategy │ Purpose │ Sort opfamily ──┼─┼───┼──┼─┼─── brin │ datetime_minmax_ops │ < (date, date) │1 │ search │ brin │ datetime_minmax_ops │ < (date, timestamp with time zone) │1 │ search │ brin │ datetime_minmax_ops │ < (date, timestamp without time zone) │1 │ search │ brin │ datetime_minmax_ops │ < (timestamp with time zone, date) │1 │ search │ brin │ datetime_minmax_ops │ < (timestamp with time zone, timestamp with time zone)│1 │ search │ brin │ datetime_minmax_ops │ < (timestamp with time zone, timestamp without time zone) │1 │ search │ brin │ datetime_minmax_ops │ < (timestamp without time zone, date) │1 │ search │ brin │ datetime_minmax_ops │ < (timestamp without time zone, timestamp with time zone) │1 │ search │ brin │ datetime_minmax_ops │ < (timestamp without time zone, timestamp without time zone) │1 │ search │ brin │ datetime_minmax_ops │ <= (date, date) │2 │ search │ brin │ datetime_minmax_ops │ <= (date, timestamp with time zone) │2 │ search │ brin │ datetime_minmax_ops │ <= (date, timestamp without time zone) │2 │ search │ brin │ datetime_minmax_ops │ <= (timestamp with time zone, date) │2 │ search │ brin │ datetime_minmax_ops │ <= (timestamp with time zone, timestamp with time zone) │2 │ search │ brin │ datetime_minmax_ops │ <= (timestamp with time zone, timestamp without time zone)│2 │ search │ Note how operator for strategy 1 are all together, then strategy 2, and so 
on. But I think we'd prefer the operators to be grouped together for the same types (just like \dAp already works); so I would change the clause from: ORDER BY 1, 2, o.amopstrategy, 3; to: ORDER BY 1, 2, pg_catalog.format_type(o.amoplefttype, NULL), pg_catalog.format_type(o.amoprighttype, NULL), o.amopstrategy; which gives this table: AM │Opfamily Name│ Operator │ Strategy │ Purpose │ Sort opfamily ──┼─┼───┼──┼─┼─── brin │ datetime_minmax_ops │ < (date, date) │1 │ search │ brin │ datetime_minmax_ops │ <= (date, date) │2 │ search │ brin │ datetime_minmax_ops │ = (date, date) │3 │ search │ brin │ datetime_minmax_ops │ >= (date, date) │4 │ search │ brin │ datetime_minmax_ops │ > (date, date) │5 │ search │ brin │ datetime_minmax_ops │ < (date, timestamp with time zone) │1 │ search │ brin │ datetime_minmax_ops │ <= (date, timestamp with time zone) │2 │ search │ brin │ datetime_minmax_ops │ = (date, timestamp with time zone) │3 │ search │ brin │ datetime_minmax_ops │ >= (date, timestamp with time zone) │4 │ search │ brin │ datetime_minmax_ops │ > (date, timestamp with time zone) │5 │ search │ Also, while I'm going about this, ISTM it'd make sense to list same-class operators first, followed by cross-class operators. That requires to add "o.amoplefttype = o.amoprighttype DESC," after "ORDER BY 1, 2,". For brin's integer_minmax_ops, the resulting list would have first (bigint,bigint) then (integer,integer) then (smallint,smallint), then all the rest: brin │ integer_minmax_ops│ < (bigint, bigint) │1 │ search │ brin │ integer_minmax_ops
Re: PG 13 release notes, first draft
Hello > + > + > + > +Add psql commands to report operator classes and operator families (Sergey > Cherkashin, Nikita Glukhov, Alexander Korotkov) > + I think this item should list the commands in question: \dA, \dAc, \dAf, \dAo, \dAp (All the other psql entries in the relnotes do that). Thanks -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: JSON output from psql
On Mon, May 11, 2020 at 1:24 PM Robert Haas wrote: > On Fri, May 8, 2020 at 7:32 PM Gurjeet Singh wrote: > > There's no standard format that comes to mind, but perhaps an output > format similar to that of (array of row_to_json()) would be desirable. For > example, `select relname, relnamespace from pg_class;` would emit the > following: > > > > [ > > {"relname": "pgclass", "relnamespace": 11}, > > {"relname": "pg_statistic", "relnamespace": 11}, > > ] > > I don't see why psql needs any special support. You can already > generate this using the existing server side functions, if you want > it. > That's a good point! It might still be desirable, perhaps to shift the performance cost of JSON conversion from the server to the client. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/
pg13: xlogreader API adjust
Hello Per discussion in thread [1], I propose the following patch to give another adjustment to the xlogreader API. This results in a small but not insignificant net reduction of lines of code. What this patch does is adjust the signature of these new xlogreader callbacks, making the API simpler. The changes are: * the segment_open callback installs the FD in xlogreader state itself, instead of passing the FD back. This was suggested by Kyotaro Horiguchi in that thread[2]. * We no longer pass segcxt to segment_open; it's in XLogReaderState, which is already an argument. * We no longer pass seg/segcxt to WALRead; instead, that function takes them from XLogReaderState, which is already an argument. (This means XLogSendPhysical has to drink more of the fake_xlogreader kool-aid.) I claim the reason to do it now instead of pg14 is to make it simpler for third-party xlogreader callers to adjust. (Some might be thinking that I do this to avoid an API change later, but my guts tell me that we'll adjust xlogreader again in pg14 for the encryption stuff and other reasons, so.) [1] https://postgr.es/m/20200406025651.fpzdb5yyb7qyh...@alap3.anarazel.de [2] https://postgr.es/m/20200508.114228.963995144765118400.horikyota@gmail.com -- Álvaro Herrera Developer, https://www.PostgreSQL.org/ diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 7cee8b92c9..4b6f4ada6e 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -1050,8 +1050,6 @@ err: * Read 'count' bytes into 'buf', starting at location 'startptr', from WAL * fetched from timeline 'tli'. * - * 'seg/segcxt' identify the last segment used. - * * Returns true if succeeded, false if an error occurs, in which case * 'errinfo' receives error details.
* @@ -1061,7 +1059,6 @@ err: bool WALRead(XLogReaderState *state, char *buf, XLogRecPtr startptr, Size count, TimeLineID tli, - WALOpenSegment *seg, WALSegmentContext *segcxt, WALReadError *errinfo) { char *p; @@ -1078,34 +1075,34 @@ WALRead(XLogReaderState *state, int segbytes; int readbytes; - startoff = XLogSegmentOffset(recptr, segcxt->ws_segsize); + startoff = XLogSegmentOffset(recptr, state->segcxt.ws_segsize); /* * If the data we want is not in a segment we have open, close what we * have (if anything) and open the next one, using the caller's * provided openSegment callback. */ - if (seg->ws_file < 0 || - !XLByteInSeg(recptr, seg->ws_segno, segcxt->ws_segsize) || - tli != seg->ws_tli) + if (state->seg.ws_file < 0 || + !XLByteInSeg(recptr, state->seg.ws_segno, state->segcxt.ws_segsize) || + tli != state->seg.ws_tli) { XLogSegNo nextSegNo; - if (seg->ws_file >= 0) + if (state->seg.ws_file >= 0) state->routine.segment_close(state); - XLByteToSeg(recptr, nextSegNo, segcxt->ws_segsize); - seg->ws_file = state->routine.segment_open(state, nextSegNo, - segcxt, ); + XLByteToSeg(recptr, nextSegNo, state->segcxt.ws_segsize); + state->routine.segment_open(state, nextSegNo, ); + Assert(state->seg.ws_file >= 0); /* shouldn't happen */ /* Update the current segment info. */ - seg->ws_tli = tli; - seg->ws_segno = nextSegNo; + state->seg.ws_tli = tli; + state->seg.ws_segno = nextSegNo; } /* How many bytes are within this segment? 
*/ - if (nbytes > (segcxt->ws_segsize - startoff)) - segbytes = segcxt->ws_segsize - startoff; + if (nbytes > (state->segcxt.ws_segsize - startoff)) + segbytes = state->segcxt.ws_segsize - startoff; else segbytes = nbytes; @@ -1115,7 +1112,7 @@ WALRead(XLogReaderState *state, /* Reset errno first; eases reporting non-errno-affecting errors */ errno = 0; - readbytes = pg_pread(seg->ws_file, p, segbytes, (off_t) startoff); + readbytes = pg_pread(state->seg.ws_file, p, segbytes, (off_t) startoff); #ifndef FRONTEND pgstat_report_wait_end(); @@ -1127,7 +1124,7 @@ WALRead(XLogReaderState *state, errinfo->wre_req = segbytes; errinfo->wre_read = readbytes; errinfo->wre_off = startoff; - errinfo->wre_seg = *seg; + errinfo->wre_seg = state->seg; return false; } diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c index bbd801513a..0fc27215df 100644 --- a/src/backend/access/transam/xlogutils.c +++ b/src/backend/access/transam/xlogutils.c @@ -784,31 +784,28 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa } /* XLogReaderRoutine->segment_open callback for local pg_wal files */ -int +void wal_segment_open(XLogReaderState *state, XLogSegNo nextSegNo, - WALSegmentContext *segcxt, TimeLineID *tli_p) + TimeLineID *tli_p) { TimeLineID tli = *tli_p; char path[MAXPGPATH]; - int fd; - XLogFilePath(path, tli, nextSegNo, segcxt->ws_segsize); - fd = BasicOpenFile(path, O_RDONLY | PG_BINARY); - if
Re: making update/delete of inheritance trees scale better
On Mon, May 11, 2020 at 4:22 PM Tom Lane wrote: > Robert Haas writes: > > If the parent is RTI 1, and the children are RTIs 2..6, what > > varno/varattno will we use in RTI 1's tlist to represent a column that > > exists in both RTI 2 and RTI 3 but not in RTI 1, 4, 5, or 6? > > Fair question. We don't have any problem representing the column > as it exists in any one of those children, but we lack a notation > for the "union" or whatever you want to call it, except in the case > where the parent relation has a corresponding column. Still, this > doesn't seem that hard to fix. My inclination would be to invent > dummy parent-rel columns (possibly with negative attnums? not sure if > that'd be easier or harder than adding them in the positive direction) > to represent such "union" columns. Ah, that makes sense. If we can invent dummy columns on the parent rel, then most of what I was worrying about no longer seems very worrying. I'm not sure what's involved in inventing such dummy columns, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: No core file generated after PostgresNode->start
Robert Haas wrote: > On Sun, May 10, 2020 at 11:21 PM Andy Fan wrote: > > Looks this doesn't mean a crash. If the test > > case(subscription/t/013_partition.pl) > > failed, test framework kill some process, which leads the above message. > > So you can > > ignore this issue now. Thanks > > I think there might be a real issue here someplace, though, because I > couldn't get a core dump last week when I did have a crash happening > locally. I didn't poke into it very hard though so I never figured out > exactly why not, but ulimit -c unlimited didn't help. Could "sysctl kernel.core_pattern" be the problem? I discovered this setting some time ago, when I also couldn't find a core dump on Linux. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: JSON output from psql
On Fri, May 8, 2020 at 7:32 PM Gurjeet Singh wrote: > There's no standard format that comes to mind, but perhaps an output format > similar to that of (array of row_to_json()) would be desirable. For example, > `select relname, relnamespace from pg_class;` would emit the following: > > [ > {"relname": "pgclass", "relnamespace": 11}, > {"relname": "pg_statistic", "relnamespace": 11}, > ] I don't see why psql needs any special support. You can already generate this using the existing server side functions, if you want it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
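A sketch of what "the existing server side functions" can already produce, matching the format proposed above:

```sql
-- json_agg over a subquery row yields one JSON array with an object per
-- row, using the column names as keys.
SELECT json_agg(t)
FROM (SELECT relname, relnamespace FROM pg_class) AS t;
```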
Re: making update/delete of inheritance trees scale better
Robert Haas writes: > If the parent is RTI 1, and the children are RTIs 2..6, what > varno/varattno will we use in RTI 1's tlist to represent a column that > exists in both RTI 2 and RTI 3 but not in RTI 1, 4, 5, or 6? Fair question. We don't have any problem representing the column as it exists in any one of those children, but we lack a notation for the "union" or whatever you want to call it, except in the case where the parent relation has a corresponding column. Still, this doesn't seem that hard to fix. My inclination would be to invent dummy parent-rel columns (possibly with negative attnums? not sure if that'd be easier or harder than adding them in the positive direction) to represent such "union" columns. This concept would only need to exist within the planner I think, since after setrefs.c there'd be no trace of those dummy columns. > I think we have quite a bit of code that expects to be able to translate > between parent-rel expressions and child-rel expressions, and that's > going to be pretty problematic here. ... shrug. Sure, we'll need to be able to do that mapping. Why will it be any harder than any other parent <-> child mapping? The planner would know darn well what the mapping is while it's inventing the dummy columns, so it just has to keep that info around for use later. > Maybe your answer is - let's just fix all that stuff. That could well > be right, but my first reaction is to think that it sounds hard. I have to think that it'll net out as less code, and certainly less complicated code, than trying to extend inheritance_planner in its current form to do what we wish it'd do. regards, tom lane
Re: gcov coverage data not full with immediate stop
On Mon, May 11, 2020 at 4:04 PM Tom Lane wrote: > This is the wrong thread to be debating that in, though. True. > Also I wonder if this is really RMT turf? I think it is, but the RMT is permitted -- even encouraged -- to consider the views of non-RMT members when making its decision. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: gcov coverage data not full with immediate stop
Robert Haas writes: > I agree, but also, we should start thinking about when to branch. I, > too, have patches that aren't critical enough to justify pushing them > post-freeze, but which are still good improvements that I'd like to > get into the tree. I'm queueing them right now to avoid the risk of > destabilizing things, but that generates more work, for me and for > other people, if their patches force me to rebase or the other way > around. I know there's always a concern with removing the focus on > release N too soon, but the open issues list is 3 items long right > now, and 2 of those look like preexisting issues, not new problems in > v13. Meanwhile, we have 20+ active committers. Yeah. Traditionally we've waited till the start of the next commitfest (which I'm assuming is July 1, for lack of an Ottawa dev meeting to decide differently). But it seems like things are slow enough that perhaps we could branch earlier, like June 1, and give the committers a chance to deal with some of their own stuff before starting the CF. This is the wrong thread to be debating that in, though. Also I wonder if this is really RMT turf? regards, tom lane
Re: Add "-Wimplicit-fallthrough" to default flags (was Re: pgsql: Support FETCH FIRST WITH TIES)
On Mon, May 11, 2020 at 4:40 PM Tom Lane wrote: > > Julien Rouhaud writes: > > On Mon, May 11, 2020 at 3:41 PM Tom Lane wrote: > >> Why? It uses "fallthrough" which is a legal spelling per level 4. > > > GCC documentation mentions [ \t]*FALLTHR(OUGH|U)[ \t]* for level 4 > > (out of the view other alternatives), which AFAICT is case sensitive > > (level 3 has fall(s | |-)?thr(ough|u)[ \t.!]*(-[^\n\r]*)?). > > Oh, I'd missed that that was case sensitive. Ugh --- that seems > unreasonable. Maybe we'd better settle for level 3 after all; > I don't think there's much room to doubt the intentions of a > comment spelled that way. Agreed.
Re: COPY, lock release and MVCC
On Fri, May 8, 2020 at 4:58 AM Laurenz Albe wrote: > I happened to notice that COPY TO releases the ACCESS SHARE lock > on the table right when the command ends rather than holding it > until the end of the transaction: That seems inconsistent with what an INSERT statement would do, and thus bad. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
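A hypothetical way to observe the reported behavior from a single session (the table name is illustrative; per the report, the ACCESS SHARE lock taken by COPY TO would already be gone when pg_locks is queried, whereas the lock taken by a SELECT would persist until COMMIT):

```sql
BEGIN;
COPY some_table TO STDOUT;

-- If COPY TO released its lock at command end, this returns no row:
SELECT relation::regclass, mode, granted
FROM pg_catalog.pg_locks
WHERE relation = 'some_table'::regclass;
COMMIT;
```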
Re: No core file generated after PostgresNode->start
On Sun, May 10, 2020 at 11:21 PM Andy Fan wrote: > Looks this doesn't mean a crash. If the test > case(subscription/t/013_partition.pl) > failed, test framework kill some process, which leads the above message. So > you can > ignore this issue now. Thanks I think there might be a real issue here someplace, though, because I couldn't get a core dump last week when I did have a crash happening locally. I didn't poke into it very hard though so I never figured out exactly why not, but ulimit -c unlimited didn't help. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: gcov coverage data not full with immediate stop
On Mon, May 11, 2020 at 12:56 AM Tom Lane wrote: > > I think we should definitely get this fixed for pg13 ... > > -1 for shoving in such a thing so late in the cycle. We've survived > without it for years, we can do so for a few months more. I agree, but also, we should start thinking about when to branch. I, too, have patches that aren't critical enough to justify pushing them post-freeze, but which are still good improvements that I'd like to get into the tree. I'm queueing them right now to avoid the risk of destabilizing things, but that generates more work, for me and for other people, if their patches force me to rebase or the other way around. I know there's always a concern with removing the focus on release N too soon, but the open issues list is 3 items long right now, and 2 of those look like preexisting issues, not new problems in v13. Meanwhile, we have 20+ active committers. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: making update/delete of inheritance trees scale better
On Mon, May 11, 2020 at 2:48 PM Tom Lane wrote: > > There doesn't seem to be any execution-time > > problem with such a representation, but there might be a planning-time > > problem with building it, > > Possibly. We manage to cope with not-all-alike children now, of course, > but I think it might be true that no one plan node has Vars from > dissimilar children. Even so, the Vars are self-identifying, so it > seems like this ought to be soluble. If the parent is RTI 1, and the children are RTIs 2..6, what varno/varattno will we use in RTI 1's tlist to represent a column that exists in both RTI 2 and RTI 3 but not in RTI 1, 4, 5, or 6? I suppose the answer is 2 - or 3, but I guess we'd pick the first child as the representative of the class. We surely can't use varno 1, because then there's no varattno that makes any sense. But if we use 2, now we have the tlist for RTI 1 containing expressions with a child's RTI as the varno. I could be wrong, but I think that's going to make setrefs.c throw up and die, and I wouldn't be very surprised if there were a bunch of other things that crashed and burned, too. I think we have quite a bit of code that expects to be able to translate between parent-rel expressions and child-rel expressions, and that's going to be pretty problematic here. Maybe your answer is - let's just fix all that stuff. That could well be right, but my first reaction is to think that it sounds hard. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: refactoring basebackup.c
On Fri, May 8, 2020 at 5:27 PM Andres Freund wrote: > I wonder if there's cases where recursively forwarding like this will > cause noticable performance effects. The only operation that seems > frequent enough to potentially be noticable would be "chunks" of the > file. So perhaps it'd be good to make sure we read in large enough > chunks? Yeah, that needs to be tested. Right now the chunk size is 32kB but it might be a good idea to go larger. Another thing is that right now the chunk size is tied to the protocol message size, and I'm not sure whether the size that's optimal for disk reads is also optimal for protocol messages. > This, to me, indicates that we should measure the progress solely based > on how much of the "source" data was processed. The overhead of tar, the > reduction due to compression, shouldn't be included. I don't think it's a particularly bad thing that we include a small amount of progress for sending an empty file, a directory, or a symlink. That could make the results more meaningful if you have a database with lots of empty relations in it. However, I agree that the effect of compression shouldn't be included. To get there, I think we need to redesign the wire protocol. Right now, the server has no way of letting the client know how many uncompressed bytes it's sent, and the client has no way of figuring it out without uncompressing, which seems like something we want to avoid. There are some other problems with the current wire protocol, too: 1. The syntax for the BASE_BACKUP command is large and unwieldy. We really ought to adopt an extensible options syntax, like COPY, VACUUM, EXPLAIN, etc. do, rather than using a zillion ad-hoc bolt-ons, each with bespoke lexer and parser support. 2. The client is sent a list of tablespaces and is supposed to use that to expect an equal number of archives, computing the name for each one on the client side from the tablespace info. 
However, I think we should be able to support modes like "put all the tablespaces in a single archive" or "send a separate archive for every 256GB" or "ship it all to the cloud and don't send me any archives". To get there, I think we should have the server send the archive name to the clients, and the client should just keep receiving the next archive until it's told that there are no more. Then if there's one archive or ten archives or no archives, the client doesn't have to care. It just receives what the server sends until it hears that there are no more. It also doesn't know how the server is naming the archives; the server can, for example, adjust the archive names based on which compression format is being chosen, without knowledge of the server's naming conventions needing to exist on the client side. I think we should keep support for the current BASE_BACKUP command but either add a new variant using an extensible options, or else invent a whole new command with a different name (BACKUP, SEND_BACKUP, whatever) that takes extensible options. This command should send back all the archives and the backup manifest using a single COPY stream rather than multiple COPY streams. Within the COPY stream, we'll invent a sub-protocol, e.g. based on the first letter of the message, e.g.: t = Tablespace boundary. No further message payload. Indicates, for progress reporting purposes, that we are advancing to the next tablespace. f = Filename. The remainder of the message payload is the name of the next file that will be transferred. d = Data. The next four bytes contain the number of uncompressed bytes covered by this message, for progress reporting purposes. The rest of the message is payload, possibly compressed. Could be empty, if the data is being shipped elsewhere, and these messages are only being sent to update the client's notion of progress. > I've not though enough about the specifics, but I think it looks like > it's going roughly in a better direction. 
Good to hear. > One thing I wonder about is how stateful the interface is. Archivers > will pretty much always track which file is currently open etc. Somehow > such a repeating state machine seems a bit ugly - but I don't really > have a better answer. I thought about that a bit, too. There might be some way to unify that by having some common context object that's defined by basebackup.c and all archivers get it, so that they have some commonly-desired details without needing bespoke code, but I'm not sure at this point whether that will actually produce a nicer result. Even if we don't have it initially, it seems like it wouldn't be very hard to add it later, so I'm not too stressed about it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: making update/delete of inheritance trees scale better
Robert Haas writes: > I believe that you'd want to have happen here is for each child to > emit the row identity columns that it knows about, and emit NULL for > the others. Then when you do the Append you end up with a row format > that includes all the individual identity columns, but for any > particular tuple, only one set of such columns is populated and the > others are all NULL. Yeah, that was what I'd imagined in my earlier thinking about this. > There doesn't seem to be any execution-time > problem with such a representation, but there might be a planning-time > problem with building it, Possibly. We manage to cope with not-all-alike children now, of course, but I think it might be true that no one plan node has Vars from dissimilar children. Even so, the Vars are self-identifying, so it seems like this ought to be soluble. regards, tom lane
Re: making update/delete of inheritance trees scale better
On Mon, May 11, 2020 at 8:58 AM Ashutosh Bapat wrote: > What happens if there's a mixture of foreign and local partitions or > mixture of FDWs? Injecting junk columns from all FDWs in the top level > target list will cause error because those attributes won't be > available everywhere. I think that we're talking about a plan like this:

Update
-> Append
   -> a bunch of children

I believe that what you'd want to have happen here is for each child to emit the row identity columns that it knows about, and emit NULL for the others. Then when you do the Append you end up with a row format that includes all the individual identity columns, but for any particular tuple, only one set of such columns is populated and the others are all NULL. There doesn't seem to be any execution-time problem with such a representation, but there might be a planning-time problem with building it, because when you're writing a tlist for the Append node, what varattno are you going to use for the columns that exist only in one particular child and not the others? The fact that setrefs processing happens so late seems like an annoyance in this case. Maybe it would be easier to have one Update node per kind of row identity, i.e. if there's more than one such notion then...

Placeholder
-> Update
   -> Append
      -> all children with one notion of row identity
-> Update
   -> Append
      -> all children with another notion of row identity

...and so forth. But I'm not sure. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
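The NULL-padded row-identity idea can be modeled in a few lines. The struct layout and names below are purely illustrative, not PostgreSQL's executor representation: the Append output row carries one identity slot per child, and only the emitting child's slot is non-NULL.

```c
/*
 * Toy model of per-child row identity columns: each child fills in
 * only its own slot; the consumer finds the single non-NULL slot.
 */
#include <stdbool.h>

#define NCHILDREN 3

typedef struct IdentityRow
{
    bool isnull[NCHILDREN];     /* NULL bitmap for the identity slots */
    long ident[NCHILDREN];      /* e.g. ctid for local, rowid for an FDW */
} IdentityRow;

/* Child 'childno' emits its identity value; all other slots are NULL. */
IdentityRow emit_identity(int childno, long value)
{
    IdentityRow row;

    for (int i = 0; i < NCHILDREN; i++)
    {
        row.isnull[i] = (i != childno);
        row.ident[i] = 0;
    }
    row.ident[childno] = value;
    return row;
}

/* ModifyTable side: find which child produced this tuple. */
int identity_source(const IdentityRow *row)
{
    for (int i = 0; i < NCHILDREN; i++)
        if (!row->isnull[i])
            return i;
    return -1;                  /* no identity: shouldn't happen */
}
```

The planning-time question raised above (what varattno to assign for child-only columns) is exactly the part this runtime model glosses over.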
Re: Parallel copy
I wonder why you're still looking at this instead of just speeding up the current code, especially the line splitting, per previous discussion, and then coming back to study this issue more after that's done. On Mon, May 11, 2020 at 8:12 AM Amit Kapila wrote: > Apart from this, we have analyzed the other cases as mentioned below > where we need to decide whether we can allow parallelism for the copy > command. > Case-1: > Do we want to enable parallelism for a copy when transition tables are > involved? I think it would be OK not to support this. > Case-2: > a. When there are BEFORE/INSTEAD OF triggers on the table. > b. For partitioned tables, we can't support multi-inserts when there > are any statement-level insert triggers. > c. For inserts into foreign tables. > d. If there are volatile default expressions or the where clause > contains a volatile expression. Here, we can check if the expression > is parallel-safe, then we can allow parallelism. This all sounds fine. > Case-3: > In copy command, for performing foreign key checks, we take KEY SHARE > lock on primary key table rows which in turn will increment the command > counter and update the snapshot. Now, as we share the snapshots at > the beginning of the command, we can't allow it to be changed later. > So, unless we do something special for it, I think we can't allow > parallelism in such cases. This sounds like much more of a problem to me; it'd be a significant restriction that would kick in for routine cases where the user isn't doing anything particularly exciting. The command counter presumably only needs to be updated once per command, so maybe we could do that before we start parallelism. However, I think we would need to have some kind of dynamic memory structure to which new combo CIDs can be added by any member of the group, and then discovered by other members of the group later. 
At the end of the parallel operation, the leader must discover any combo CIDs added by others to that table before destroying it, even if it has no immediate use for the information. We can't allow a situation where the group members have inconsistent notions of which combo CIDs exist or what their mappings are, and if KEY SHARE locks are being taken, new combo CIDs could be created. > Case-4: > For Deferred Triggers, it seems we record CTIDs of tuples (via > ExecARInsertTriggers->AfterTriggerSaveEvent) and then execute deferred > triggers at transaction end using AfterTriggerFireDeferred or at end > of the statement. I think this could be left for the future. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
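A toy version of the shared combo CID mapping discussed above might look like the following. Real code would live in dynamic shared memory with proper locking so that any group member can add entries and every member sees a consistent mapping; all of that, and the names, are hypothetical here.

```c
/*
 * Sketch of an append-only combo CID table that every member of a
 * parallel group could consult.  A combo CID maps a small integer id
 * to a (cmin, cmax) pair; the mapping must be identical across all
 * group members.  Shared-memory allocation and locking are omitted.
 */
typedef unsigned int CommandId;

typedef struct ComboCidEntry
{
    CommandId cmin;
    CommandId cmax;
} ComboCidEntry;

#define MAX_COMBOCIDS 64

static ComboCidEntry combocids[MAX_COMBOCIDS];
static int ncombocids = 0;

/* Return the combo id for (cmin, cmax), adding a new entry if needed. */
int combo_cid_lookup_or_add(CommandId cmin, CommandId cmax)
{
    for (int i = 0; i < ncombocids; i++)
        if (combocids[i].cmin == cmin && combocids[i].cmax == cmax)
            return i;
    combocids[ncombocids].cmin = cmin;
    combocids[ncombocids].cmax = cmax;
    return ncombocids++;
}
```

The leader's job at the end of the parallel operation, in this model, is simply to copy any entries it has not yet seen out of the shared table into its local one before the shared table goes away.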
Re: 2020-05-14 Press Release Draft
On Sun, May 10, 2020 at 10:08:46PM -0400, Jonathan S. Katz wrote: > * Ensure that a detatched partition has triggers that come from its former > parent removed. I would have said: "fix for issue which prevented/precluded detaching partitions which have inherited ROW triggers" > * Several fixes for `REINDEX CONCURRENTLY`, particular with dealing with issue > when a `REINDEX CONCURRENTLY` operation fails. ".. in particular relating to an issue ..." > * Avoid scanning irrelevant timelines during archive recovery, which can > eliminate attempts to fetch nonexistent WAL files from archive storage. I feel like this is phrased backwards. The goal is to avoid (attempting to) fetch nonextant WALs, and the mechanism is by skipping timelines. Maybe: * Avoid attempting to fetch nonexistent WAL files from archive storage during * recovery by skipping irrelevant timelines. -- Justin
Re: PG 13 release notes, first draft
On Mon, May 11, 2020 at 4:20 PM Bruce Momjian wrote: > Sorry for the delay in replying. Yes, I agree we should list all of > those operator class cases where we now take arguments. I looked at the > patch but got confused and missed the doc changes that clearly need to > be in the release notes. I see these operator classes now taking > parameters, as you helpfully listed in your commit message: >
> tsvector_ops
> gist_ltree_ops
> gist__ltree_ops
> gist_trgm_ops
> gist_hstore_ops
> gist__int_ops
> gist__intbig_ops
>
> I assume the double-underscore is because the first underscore is to > separate words, and the second one is for the array designation, right? Yes, this is true. > So my big question is whether people will understand when they are using > these operator classes, since many of them are defaults. Can you use an > operator class parameter when you are just using the default operator > class and not specifying its name? Actually, no. The initial version of the patch allowed explicitly specifying the DEFAULT keyword instead of an opclass name, but I didn't like the idea of allowing a keyword in place of a name there. I tried to implement syntax that lets parameters be specified without either a new keyword or the opclass name, but that caused a lot of grammar problems. Finally, I decided to provide the parameters functionality only when the opclass name is specified. My motivation is that opclass parameters are functionality for advanced users, who care deeply about what the opclass does. For this category of users it's natural to know the opclass name. > What I am thinking is that just saying > the operator classes take arguments might not be helpful. I think I see > enough detail in the documentation to write release note items for > these, but I will have to point out they need to specify the operator > class, even if it is the default, right? My point was that we should specify where to look to find the new functionality. 
We don't have to write the opclass names, because those names might be confusing for users who are not aware of them. We could briefly say that new parameters are introduced for the GiST opclasses for tsvector, contrib/intarray, contrib/pg_trgm, contrib/ltree, and contrib/hstore. What do you think? -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: 2pc leaks fds
Hi Antonin, thanks for the review. On 2020-May-11, Antonin Houska wrote: > While looking at the changes, I've noticed a small comment issue in the > XLogReaderRoutine structure definition: > > * "tli_p" is an input/output argument. XLogRead() uses it to pass the > > The XLogRead() function has been renamed to WALRead() in 0dc8ead4. (This > incorrect comment was actually introduced by that commit.) Ah. I'll fix this, thanks for pointing it out. (It might be that the TLI situation can be improved with some callback, too.) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: making update/delete of inheritance trees scale better
Hi Ashutosh, Thanks for chiming in. On Mon, May 11, 2020 at 9:58 PM Ashutosh Bapat wrote: > On Fri, May 8, 2020 at 7:03 PM Amit Langote wrote: > > I do think that doing this would be worthwhile even if we may be > > increasing ModifyTable's per-row overhead slightly, because planning > > overhead of the current approach is very significant, especially for > > partition trees with beyond a couple of thousand partitions. As to > > how bad the problem is, trying to create a generic plan for `update > > foo set ... where key = $1`, where foo has over 2000 partitions, > > causes OOM even on a machine with 6GB of memory. > > Per row overhead would be incurred for every row whereas the plan time > overhead is one-time or in case of a prepared statement almost free. > So we need to compare it esp. when there are 2000 partitions and all > of them are being updated. I assume that such UPDATEs would be uncommon. > But generally I agree that this would be a > better approach. It might help using PWJ when the result relation > joins with other partitioned table. 
It does, because the plan below ModifyTable is the same as if the query were SELECT instead of UPDATE; with my PoC:

explain (costs off) update foo set a = foo2.a + 1 from foo foo2 where foo.a = foo2.a;
                       QUERY PLAN
--------------------------------------------------------
 Update on foo
   Update on foo_1
   Update on foo_2
   ->  Append
         ->  Merge Join
               Merge Cond: (foo_1.a = foo2_1.a)
               ->  Sort
                     Sort Key: foo_1.a
                     ->  Seq Scan on foo_1
               ->  Sort
                     Sort Key: foo2_1.a
                     ->  Seq Scan on foo_1 foo2_1
         ->  Merge Join
               Merge Cond: (foo_2.a = foo2_2.a)
               ->  Sort
                     Sort Key: foo_2.a
                     ->  Seq Scan on foo_2
               ->  Sort
                     Sort Key: foo2_2.a
                     ->  Seq Scan on foo_2 foo2_2
(20 rows)

as opposed to what you get today:

explain (costs off) update foo set a = foo2.a + 1 from foo foo2 where foo.a = foo2.a;
                    QUERY PLAN
--------------------------------------------------
 Update on foo
   Update on foo_1
   Update on foo_2
   ->  Merge Join
         Merge Cond: (foo_1.a = foo2.a)
         ->  Sort
               Sort Key: foo_1.a
               ->  Seq Scan on foo_1
         ->  Sort
               Sort Key: foo2.a
               ->  Append
                     ->  Seq Scan on foo_1 foo2
                     ->  Seq Scan on foo_2 foo2_1
   ->  Merge Join
         Merge Cond: (foo_2.a = foo2.a)
         ->  Sort
               Sort Key: foo_2.a
               ->  Seq Scan on foo_2
         ->  Sort
               Sort Key: foo2.a
               ->  Append
                     ->  Seq Scan on foo_1 foo2
                     ->  Seq Scan on foo_2 foo2_1
(23 rows)

> > For child result relations that are foreign tables, their FDW adds > junk attribute(s) to the query’s targetlist by updating it in-place > (AddForeignUpdateTargets). However, as the child tables will no > longer get their own parsetree, we must use some hack around this > interface to obtain the foreign table specific junk attributes and add > them to the original/parent query’s targetlist. Assuming that all or > most of the children will belong to the same FDW, we will end up with > only a handful such junk columns in the final targetlist. I am not > sure if it's worthwhile to change the API of AddForeignUpdateTargets > to require FDWs to not scribble on the passed-in parsetree as part of > this patch. > > What happens if there's a mixture of foreign and local partitions or > mixture of FDWs? 
> Injecting junk columns from all FDWs in the top level > target list will cause error because those attributes won't be > available everywhere. That is a good question and something I struggled with ever since I started thinking about implementing this. For the problem that FDWs may inject junk columns that could neither be present in local tables (root parent and other local children) nor in other FDWs, I couldn't think of any solution other than to restrict what those junk columns can be -- to require them to be either "ctid", "wholerow", or a set of only *inherited* user columns. I think that's what Tom was getting at when he said the following in the email I cited in my first email: "...It gets a bit harder if the tree contains some foreign tables, because they might have different concepts of row identity, but I'd think in most cases you could still combine those into a small number of output columns." Maybe I misunderstood what Tom said, but I can't imagine how to let these junk columns be anything that *all* tables contained in an inheritance tree, especially the root parent, cannot emit, if they are to be emitted out of a single plan. > > As for how ModifyTable will create the new tuple for updates, I
Re: Add "-Wimplicit-fallthrough" to default flags (was Re: pgsql: Support FETCH FIRST WITH TIES)
Julien Rouhaud writes: > On Mon, May 11, 2020 at 3:41 PM Tom Lane wrote: >> Why? It uses "fallthrough" which is a legal spelling per level 4. > GCC documentation mentions [ \t]*FALLTHR(OUGH|U)[ \t]* for level 4 > (out of the view other alternatives), which AFAICT is case sensitive > (level 3 has fall(s | |-)?thr(ough|u)[ \t.!]*(-[^\n\r]*)?). Oh, I'd missed that that was case sensitive. Ugh --- that seems unreasonable. Maybe we'd better settle for level 3 after all; I don't think there's much room to doubt the intentions of a comment spelled that way. regards, tom lane
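For reference, here is a toy switch statement whose comment satisfies the stricter level-4 pattern, which only accepts the uppercase spellings FALLTHROUGH or FALLTHRU (optionally surrounded by tabs and spaces); lowercase "fallthrough", as in zic.c, only passes at level 3.

```c
/*
 * Demonstrates a fall-through comment accepted by GCC at
 * -Wimplicit-fallthrough=4: the comment must match the case-sensitive
 * pattern [ \t]*FALLTHR(OUGH|U)[ \t]*.
 */
int classify(int n)
{
    int score = 0;

    switch (n)
    {
        case 2:
            score += 10;
            /* FALLTHROUGH */
        case 1:
            score += 1;
            break;
        default:
            score = -1;
            break;
    }
    return score;
}
```

Compiled with `gcc -Wimplicit-fallthrough=4`, this produces no warning; replace the comment with `/* fallthrough */` and the warning reappears.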
Re: Add "-Wimplicit-fallthrough" to default flags (was Re: pgsql: Support FETCH FIRST WITH TIES)
On Mon, May 11, 2020 at 3:41 PM Tom Lane wrote: > > Julien Rouhaud writes: > > Note that timezone/zic.c would also have to be changed. > > Why? It uses "fallthrough" which is a legal spelling per level 4. GCC documentation mentions [ \t]*FALLTHR(OUGH|U)[ \t]* for level 4 (out of the view other alternatives), which AFAICT is case sensitive (level 3 has fall(s | |-)?thr(ough|u)[ \t.!]*(-[^\n\r]*)?).
Re: Add "-Wimplicit-fallthrough" to default flags (was Re: pgsql: Support FETCH FIRST WITH TIES)
Julien Rouhaud writes: > Note that timezone/zic.c would also have to be changed. Why? It uses "fallthrough" which is a legal spelling per level 4. regards, tom lane
Re: PG 13 release notes, first draft
On Thu, May 7, 2020 at 08:12:30PM +0300, Alexander Korotkov wrote: > On Thu, May 7, 2020 at 2:46 AM Bruce Momjian wrote: > > > > Allow CREATE INDEX to specify the GiST signature length and maximum > > number of integer ranges (Nikita Glukhov) > > > > Should we specify which particular opclasses are affected? Or at > least mention it affects core and particular contribs? Sorry for the delay in replying. Yes, I agree we should list all of those operator class cases where we now take arguments. I looked at the patch but got confused and missed the doc changes that clearly need to be in the release notes. I see these operator classes now taking parameters, as you helpfully listed in your commit message:

tsvector_ops
gist_ltree_ops
gist__ltree_ops
gist_trgm_ops
gist_hstore_ops
gist__int_ops
gist__intbig_ops

I assume the double-underscore is because the first underscore is to separate words, and the second one is for the array designation, right? So my big question is whether people will understand when they are using these operator classes, since many of them are defaults. Can you use an operator class parameter when you are just using the default operator class and not specifying its name? What I am thinking is that just saying the operator classes take arguments might not be helpful. I think I see enough detail in the documentation to write release note items for these, but I will have to point out they need to specify the operator class, even if it is the default, right? -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: making update/delete of inheritance trees scale better
On Fri, May 8, 2020 at 7:03 PM Amit Langote wrote: > > Here is a sketch for implementing the design that Tom described here: > https://www.postgresql.org/message-id/flat/357.1550612935%40sss.pgh.pa.us > > In short, we would like to have only one plan for ModifyTable to get > tuples out of to update/delete, not N for N child result relations as > is done currently. > > I suppose things are the way they are because creating a separate plan > for each result relation makes the job of ModifyTable node very > simple, which is currently this: > > 1. Take the plan's output tuple, extract the tupleid of the tuple to > update/delete in the currently active result relation, > 2. If delete, go to 3, else if update, filter out the junk columns > from the above tuple > 3. Call ExecUpdate()/ExecDelete() on the result relation with the new > tuple, if any > > If we make ModifyTable do a little more work for the inheritance case, > we can create only one plan but without "expanding" the targetlist. > That is, it will contain entries only for attributes that are assigned > values in the SET clause. This makes the plan reusable across result > relations, because all child relations must have those attributes, > even though the attribute numbers might be different. Anyway, the > work that ModifyTable will now have to do is this: > > 1. Take the plan's output tuple, extract tupleid of the tuple to > update/delete and "tableoid" > 2. Select the result relation to operate on using the tableoid > 3. If delete, go to 4, else if update, fetch the tuple identified by > tupleid from the result relation and fill in the unassigned columns > using that "old" tuple, also filtering out the junk columns > 4. 
Call ExecUpdate()/ExecDelete() on the result relation with the new > tuple, if any > > I do think that doing this would be worthwhile even if we may be > increasing ModifyTable's per-row overhead slightly, because planning > overhead of the current approach is very significant, especially for > partition trees with beyond a couple of thousand partitions. As to > how bad the problem is, trying to create a generic plan for `update > foo set ... where key = $1`, where foo has over 2000 partitions, > causes OOM even on a machine with 6GB of memory. Per row overhead would be incurred for every row whereas the plan time overhead is one-time or in case of a prepared statement almost free. So we need to compare it esp. when there are 2000 partitions and all of them are being updated. But generally I agree that this would be a better approach. It might help using PWJ when the result relation joins with other partitioned table. I am not sure whether that effectively happens today by partition pruning. More on this later. > > The one plan shared by all result relations will be same as the one we > would get if the query were SELECT, except it will contain junk > attributes such as ctid needed to identify tuples and a new "tableoid" > junk attribute if multiple result relations will be present due to > inheritance. One major way in which this targetlist differs from the > current per-result-relation plans is that it won't be passed through > expand_targetlist(), because the set of unassigned attributes may not > be unique among children. As mentioned above, those missing > attributes will be filled by ModifyTable doing some extra work, > whereas previously they would have come with the plan's output tuple. > > For child result relations that are foreign tables, their FDW adds > junk attribute(s) to the query’s targetlist by updating it in-place > (AddForeignUpdateTargets). 
However, as the child tables will no > longer get their own parsetree, we must use some hack around this > interface to obtain the foreign table specific junk attributes and add > them to the original/parent query’s targetlist. Assuming that all or > most of the children will belong to the same FDW, we will end up with > only a handful such junk columns in the final targetlist. I am not > sure if it's worthwhile to change the API of AddForeignUpdateTargets > to require FDWs to not scribble on the passed-in parsetree as part of > this patch. What happens if there's a mixture of foreign and local partitions or mixture of FDWs? Injecting junk columns from all FDWs in the top level target list will cause error because those attributes won't be available everywhere. > > As for how ModifyTable will create the new tuple for updates, I have > decided to use a ProjectionInfo for each result relation, which > projects a full, *clean* tuple ready to be put back into the relation. > When projecting, plan’s output tuple serves as OUTER tuple and the old > tuple fetched to fill unassigned attributes serves as SCAN tuple. By > having this ProjectionInfo also serve as the “junk filter”, we don't > need JunkFilters. The targetlist that this projection computes is > same as that of the result-relation-specific plan. Initially, I > thought to generate this "expanded" targetlist
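The per-row flow described in this thread — the single shared plan emits only SET-clause values plus "tableoid"/tupleid junk columns, and ModifyTable routes by tableoid and fills unassigned columns from the fetched old tuple — can be sketched as a toy model. All structures and names here are illustrative, not PostgreSQL's executor data structures.

```c
/*
 * Toy model of the proposed ModifyTable steps: route the tuple by its
 * tableoid junk column, then build the full new tuple by merging the
 * plan's output (assigned columns) with the old tuple (the rest).
 */
#include <stdbool.h>
#include <stddef.h>

#define NCOLS 3

typedef struct Tuple
{
    int values[NCOLS];
} Tuple;

/* One "result relation" per child, addressed by tableoid. */
typedef struct ResultRel
{
    unsigned int tableoid;
    Tuple stored;               /* stands in for the fetched old tuple */
} ResultRel;

/* Step 2: route the tableoid junk column to the matching result rel. */
ResultRel *route_by_tableoid(ResultRel *rels, int nrels, unsigned int oid)
{
    for (int i = 0; i < nrels; i++)
        if (rels[i].tableoid == oid)
            return &rels[i];
    return NULL;
}

/*
 * Step 3: assigned columns come from the plan's output; unassigned
 * columns are filled in from the old tuple.
 */
Tuple fill_new_tuple(const ResultRel *rel,
                     const bool assigned[NCOLS],
                     const int newvals[NCOLS])
{
    Tuple result;

    for (int i = 0; i < NCOLS; i++)
        result.values[i] = assigned[i] ? newvals[i]
                                       : rel->stored.values[i];
    return result;
}
```

This is the trade-off discussed above in miniature: the shared plan stays small (only assigned columns), at the cost of an extra per-row fetch-and-merge inside ModifyTable.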
Re: Concurrency bug in amcheck
On Tue, Apr 28, 2020 at 4:05 AM Peter Geoghegan wrote: > On Mon, Apr 27, 2020 at 4:17 AM Alexander Korotkov > wrote: > > > Assuming it doesn't seem we actually need any items on deleted pages, > > > I can propose to delete them on primary as well. That would make > > > contents of primary and standby more consistent. What do you think? > > > > So, my proposal is following. Backpatch my fix upthread to 11. In > > master additionally make _bt_unlink_halfdead_page() remove page items > > on primary as well. Do you have any objections? > > What I meant was that we might as well review the behavior of > _bt_unlink_halfdead_page() here, since we have to change it anyway. > But I think you're right: No matter what happens or doesn't happen to > _bt_unlink_halfdead_page(), the fact is that deleted pages may or may > not have a single remaining item (which was originally the "top > parent" item from the page at offset number P_HIKEY), now and forever. > We have to conservatively assume that it could be either state, now > and forever. That means that we definitely have to give up on the > check, per your patch. So, +1 for backpatching that back to 11. Thank you. I've worked a bit on the comments and commit message. I would appreciate your review. > (BTW, I don't think that this is a concurrency issue, except in the > sense that a test case that recreates the false positive is sensitive > to concurrency -- I imagine you agree with this.) Yes, I agree it's related to concurrency, but not exactly a concurrency issue. > I like your idea of making the primary consistent with the REDO > routine on the master branch only. I wonder if that will make it > possible to change btree_mask() so that wal_consistency_checking can > check deleted pages as well. The contents of a deleted page's special > area do matter, and yet we don't currently verify that it matches (we > use mask_page_content() within btree_mask() for deleted pages, which > seems inappropriately broad). 
In particular, the left and right > sibling links should be consistent with the primary on a deleted page. Thank you. 2nd patch is proposed for master and makes btree page unlink remove all the items from the page being deleted. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company 0001-Fix-amcheck-for-page-checks-concurrent-to-replay-of-.patch Description: Binary data 0002-Remove-btree-page-items-after-page-unlink.patch Description: Binary data
Re: Add "-Wimplicit-fallthrough" to default flags (was Re: pgsql: Support FETCH FIRST WITH TIES)
On Mon, May 11, 2020 at 2:07 PM Andrew Dunstan wrote: > > > On 5/11/20 7:29 AM, Julien Rouhaud wrote: > > On Mon, May 11, 2020 at 6:47 AM Tom Lane wrote: > >> Alvaro Herrera writes: > >>> On 2020-May-06, Alvaro Herrera wrote: > This doesn't allow whitespace between "fall" and "through", which means > we generate 217 such warnings currently. Or we can just use > -Wimplicit-fallthrough=3, which does allow whitespace (among other > detritus). > >>> If we're OK with patching all those places, I volunteer to do so. Any > >>> objections? Or I can keep it at level 3, which can be done with minimal > >>> patching. > >> If we're gonna do it at all, I think we should go for the level 4 > >> warnings. Level 3's idea of a fallthrough comment is too liberal > >> for my tastes... > > Here's a patch that also takes care of cleaning all warning due to the > > level 4 setting (at least the one I got with my other config options). > > I went with "FALLTHRU" as this seemed more used. > > > > Note that timezone/zic.c would also have to be changed. > > > > Since this is external code maybe we should leave that at level 3? I > think that should be doable via a Makefile override. Yes, that was my concern. The best way I found to avoid changing zic.c is something like this in src/timezone/Makefile, before the zic target:

ifneq (,$(filter -Wimplicit-fallthrough=4,$(CFLAGS)))
CFLAGS := $(filter-out -Wimplicit-fallthrough=4,$(CFLAGS))
CFLAGS += '-Wimplicit-fallthrough=3'
endif
Re: gcov coverage data not full with immediate stop
On Mon, May 11, 2020 at 2:30 PM Alexander Lakhin wrote: > > But I can confirm that calling __gcov_flush() in SignalHandlerForCrashExit() > really improves a code coverage report. > > Finally I've managed to get an expected coverage when I performed > $node_standby->stop() (but __gcov_flush() in SignalHandlerForCrashExit() > helps too). What happens if a coverage tool other than gcov is used? From that perspective, it's better to perform a clean shutdown in the TAP tests instead of an immediate one, if that's possible. -- Best Wishes, Ashutosh Bapat
Re: Parallel copy
On Wed, Apr 15, 2020 at 11:49 PM Andres Freund wrote: > > To be clear, I was only thinking of using a ringbuffer to indicate split > boundaries. And that workers would just pop entries from it before they > actually process the data (stored outside of the ringbuffer). Since the > split boundaries will always be read in order by workers, and the > entries will be tiny, there's no need to avoid copying out entries. > I think the binary mode processing will be slightly different because, unlike the text and csv formats, the data is stored in Length, Value format for each column and there are no line markers. I don't think there will be a big difference but still, we need to record somewhere what the format of the data in the ring buffers is. Basically, we can copy the data in Length, Value format and once the writers know about the format, they will parse the data in the appropriate format. We currently also have a different way of parsing the binary format, see NextCopyFrom. I think we need to be careful about avoiding duplicate work as much as possible. Apart from this, we have analyzed the other cases as mentioned below where we need to decide whether we can allow parallelism for the copy command. Case-1: Do we want to enable parallelism for a copy when transition tables are involved? Basically, during the copy, we capture tuples in transition tables for certain cases, like when an after-statement trigger accesses the same relation on which we have a trigger. See the example below [1]. We decide this in the function MakeTransitionCaptureState. For such cases, we collect minimal tuples in the tuple store after processing them so that later after-statement triggers can access them. Now, if we want to enable parallelism for such cases, we instead need to store and access tuples from a shared tuple store (sharedtuplestore.c/sharedtuplestore.h). 
However, it doesn't have the facility to store tuples in-memory, so we always need to store and access from a file which could be costly unless we also have an additional way to store minimal tuples in shared memory till work_memory and then in shared tuple store. It is possible to do all this or part of this work to enable parallel copy for such cases but I am not sure if it is worth it. We can decide to not enable parallelism for such cases and later allow if we see demand for the same and it will also help us to not introduce additional work/complexity in the first version of the patch. Case-2: The Single Insertion mode (CIM_SINGLE) is performed in various scenarios and whether we can allow parallelism for those depends on case to case basis which is discussed below: a. When there are BEFORE/INSTEAD OF triggers on the table. We don't allow multi-inserts in such cases because such triggers might query the table we're inserting into and act differently if the tuples that have already been processed and prepared for insertion are not there. Now, if we allow parallelism with such triggers the behavior would depend on if the parallel worker has already inserted or not that particular row. I guess such functions should ideally be marked as parallel-unsafe. So, in short in this case whether to allow parallelism or not depends upon the parallel-safety marking of this function. b. For partitioned tables, we can't support multi-inserts when there are any statement-level insert triggers. This is because as of now, we expect that any before row insert and statement-level insert triggers are on the same relation. Now, there is no harm in allowing parallelism for such cases but it depends upon if we have the infrastructure (basically allow tuples to be collected in shared tuple store) to support statement-level insert triggers. c. For inserts into foreign tables. 
We can't allow parallelism in this case because each worker needs to establish an FDW connection and operate in a separate transaction. Now, unless we have the capability to provide a two-phase commit protocol for "Transactions involving multiple postgres foreign servers" (which is being discussed in a separate thread [2]), we can't allow this. d. If there are volatile default expressions or the where clause contains a volatile expression. Here, we can check if the expression is parallel-safe, then we can allow parallelism. Case-3: In the copy command, for performing foreign key checks, we take a KEY SHARE lock on primary key table rows which in turn will increment the command counter and update the snapshot. Now, as we share the snapshots at the beginning of the command, we can't allow it to be changed later. So, unless we do something special for it, I think we can't allow parallelism in such cases. I couldn't think of many problems if we allow parallelism in such cases. One inconsistency, if we allow FK checks via workers, would be that at the end of COPY the value of command_counter will not be what we expect as we wouldn't have accounted for that from workers. Now, if COPY is being done in a
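The Length, Value column layout mentioned earlier in this thread for binary COPY can be illustrated with a small standalone parser. This is a simplified model, not PostgreSQL's actual NextCopyFrom logic (the real binary format also has a file header and a per-tuple field count); the point is only that binary data has explicit length prefixes instead of line markers.

```c
/*
 * Toy parser for Length,Value encoded columns: each field is a 4-byte
 * big-endian signed length (-1 means NULL) followed by that many bytes
 * of data.  Returns bytes consumed, or -1 on a malformed/short buffer.
 */
#include <stddef.h>
#include <stdint.h>

typedef struct Field
{
    int32_t len;                /* -1 means NULL */
    const unsigned char *data;  /* points into the input buffer */
} Field;

static int32_t read_be32s(const unsigned char *p)
{
    return (int32_t) (((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
                      ((uint32_t) p[2] << 8) | (uint32_t) p[3]);
}

long parse_lv_row(const unsigned char *buf, size_t buflen,
                  Field *fields, int nfields)
{
    size_t off = 0;

    for (int i = 0; i < nfields; i++)
    {
        if (off + 4 > buflen)
            return -1;          /* truncated length prefix */
        int32_t len = read_be32s(buf + off);
        off += 4;

        if (len == -1)          /* NULL column: no value bytes follow */
        {
            fields[i].len = -1;
            fields[i].data = NULL;
            continue;
        }
        if (len < 0 || off + (size_t) len > buflen)
            return -1;          /* bad length or truncated value */
        fields[i].len = len;
        fields[i].data = buf + off;
        off += (size_t) len;
    }
    return (long) off;
}
```

Because every field is self-describing, a ring-buffer entry for binary mode only needs to record where a tuple starts; the consumer can then walk the fields without any line-splitting pass.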
Re: Add "-Wimplicit-fallthrough" to default flags (was Re: pgsql: Support FETCH FIRST WITH TIES)
On 5/11/20 7:29 AM, Julien Rouhaud wrote:
> On Mon, May 11, 2020 at 6:47 AM Tom Lane wrote:
>> Alvaro Herrera writes:
>>> On 2020-May-06, Alvaro Herrera wrote:
>>>> This doesn't allow whitespace between "fall" and "through", which means
>>>> we generate 217 such warnings currently. Or we can just use
>>>> -Wimplicit-fallthrough=3, which does allow whitespace (among other
>>>> detritus).
>>> If we're OK with patching all those places, I volunteer to do so. Any
>>> objections? Or I can keep it at level 3, which can be done with minimal
>>> patching.
>> If we're gonna do it at all, I think we should go for the level 4
>> warnings. Level 3's idea of a fallthrough comment is too liberal
>> for my tastes...
> Here's a patch that also takes care of cleaning all warnings due to the
> level 4 setting (at least the ones I got with my other config options).
> I went with "FALLTHRU" as this seemed more used.
>
> Note that timezone/zic.c would also have to be changed.

Since this is external code, maybe we should leave that at level 3? I think that should be doable via a Makefile override.

cheers

andrew

-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Add "-Wimplicit-fallthrough" to default flags (was Re: pgsql: Support FETCH FIRST WITH TIES)
On Mon, May 11, 2020 at 6:47 AM Tom Lane wrote: > > Alvaro Herrera writes: > > On 2020-May-06, Alvaro Herrera wrote: > >> This doesn't allow whitespace between "fall" and "through", which means > >> we generate 217 such warnings currently. Or we can just use > >> -Wimplicit-fallthrough=3, which does allow whitespace (among other > >> detritus). > > > If we're OK with patching all those places, I volunteer to do so. Any > > objections? Or I can keep it at level 3, which can be done with minimal > > patching. > > If we're gonna do it at all, I think we should go for the level 4 > warnings. Level 3's idea of a fallthrough comment is too liberal > for my tastes... Here's a patch that also takes care of cleaning all warning due to the level 4 setting (at least the one I got with my other config options). I went with "FALLTHRU" as this seemed more used. Note that timezone/zic.c would also have to be changed. diff --git a/configure b/configure index 83abe872aa..4dfe4fcfe1 100755 --- a/configure +++ b/configure @@ -5552,6 +5552,97 @@ if test x"$pgac_cv_prog_CXX_cxxflags__Wmissing_format_attribute" = x"yes"; then fi + +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports -Wimplicit-fallthrough=4, for CFLAGS" >&5 +$as_echo_n "checking whether ${CC} supports -Wimplicit-fallthrough=4, for CFLAGS... " >&6; } +if ${pgac_cv_prog_CC_cflags__Wimplicit_fallthrough_4+:} false; then : + $as_echo_n "(cached) " >&6 +else + pgac_save_CFLAGS=$CFLAGS +pgac_save_CC=$CC +CC=${CC} +CFLAGS="${CFLAGS} -Wimplicit-fallthrough=4" +ac_save_c_werror_flag=$ac_c_werror_flag +ac_c_werror_flag=yes +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. 
*/ + +int +main () +{ + + ; + return 0; +} +_ACEOF +if ac_fn_c_try_compile "$LINENO"; then : + pgac_cv_prog_CC_cflags__Wimplicit_fallthrough_4=yes +else + pgac_cv_prog_CC_cflags__Wimplicit_fallthrough_4=no +fi +rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext +ac_c_werror_flag=$ac_save_c_werror_flag +CFLAGS="$pgac_save_CFLAGS" +CC="$pgac_save_CC" +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_prog_CC_cflags__Wimplicit_fallthrough_4" >&5 +$as_echo "$pgac_cv_prog_CC_cflags__Wimplicit_fallthrough_4" >&6; } +if test x"$pgac_cv_prog_CC_cflags__Wimplicit_fallthrough_4" = x"yes"; then + CFLAGS="${CFLAGS} -Wimplicit-fallthrough=4" +fi + + + { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CXX} supports -Wimplicit-fallthrough=4, for CXXFLAGS" >&5 +$as_echo_n "checking whether ${CXX} supports -Wimplicit-fallthrough=4, for CXXFLAGS... " >&6; } +if ${pgac_cv_prog_CXX_cxxflags__Wimplicit_fallthrough_4+:} false; then : + $as_echo_n "(cached) " >&6 +else + pgac_save_CXXFLAGS=$CXXFLAGS +pgac_save_CXX=$CXX +CXX=${CXX} +CXXFLAGS="${CXXFLAGS} -Wimplicit-fallthrough=4" +ac_save_cxx_werror_flag=$ac_cxx_werror_flag +ac_cxx_werror_flag=yes +ac_ext=cpp +ac_cpp='$CXXCPP $CPPFLAGS' +ac_compile='$CXX -c $CXXFLAGS $CPPFLAGS conftest.$ac_ext >&5' +ac_link='$CXX -o conftest$ac_exeext $CXXFLAGS $CPPFLAGS $LDFLAGS conftest.$ac_ext $LIBS >&5' +ac_compiler_gnu=$ac_cv_cxx_compiler_gnu + +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. 
*/ + +int +main () +{ + + ; + return 0; +} +_ACEOF +if ac_fn_cxx_try_compile "$LINENO"; then : + pgac_cv_prog_CXX_cxxflags__Wimplicit_fallthrough_4=yes +else + pgac_cv_prog_CXX_cxxflags__Wimplicit_fallthrough_4=no +fi +rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext +ac_ext=c +ac_cpp='$CPP $CPPFLAGS' +ac_compile='$CC -c $CFLAGS $CPPFLAGS conftest.$ac_ext >&5' +ac_link='$CC -o conftest$ac_exeext $CFLAGS $CPPFLAGS $LDFLAGS conftest.$ac_ext $LIBS >&5' +ac_compiler_gnu=$ac_cv_c_compiler_gnu + +ac_cxx_werror_flag=$ac_save_cxx_werror_flag +CXXFLAGS="$pgac_save_CXXFLAGS" +CXX="$pgac_save_CXX" +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_prog_CXX_cxxflags__Wimplicit_fallthrough_4" >&5 +$as_echo "$pgac_cv_prog_CXX_cxxflags__Wimplicit_fallthrough_4" >&6; } +if test x"$pgac_cv_prog_CXX_cxxflags__Wimplicit_fallthrough_4" = x"yes"; then + CXXFLAGS="${CXXFLAGS} -Wimplicit-fallthrough=4" +fi + + # This was included in -Wall/-Wformat in older GCC versions { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports -Wformat-security, for CFLAGS" >&5 diff --git a/configure.in b/configure.in index ecdf172396..94fba67261 100644 --- a/configure.in +++ b/configure.in @@ -496,6 +496,8 @@ if test "$GCC" = yes -a "$ICC" = no; then PGAC_PROG_CXX_CFLAGS_OPT([-Wendif-labels]) PGAC_PROG_CC_CFLAGS_OPT([-Wmissing-format-attribute]) PGAC_PROG_CXX_CFLAGS_OPT([-Wmissing-format-attribute]) + PGAC_PROG_CC_CFLAGS_OPT([-Wimplicit-fallthrough=4]) + PGAC_PROG_CXX_CFLAGS_OPT([-Wimplicit-fallthrough=4]) # This was included in -Wall/-Wformat in older GCC versions PGAC_PROG_CC_CFLAGS_OPT([-Wformat-security]) PGAC_PROG_CXX_CFLAGS_OPT([-Wformat-security]) diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c index
Re: Index Skip Scan
> On Mon, May 11, 2020 at 04:04:00PM +0530, Dilip Kumar wrote:
>
> > > +static inline bool
> > > +_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
> > > + Buffer buf, ScanDirection dir)
> > > +{
> > > + OffsetNumber low, high;
> > > + Page page = BufferGetPage(buf);
> > > + BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
> > > +
> > > + low = P_FIRSTDATAKEY(opaque);
> > > + high = PageGetMaxOffsetNumber(page);
> > > +
> > > + if (unlikely(high < low))
> > > + return false;
> > > +
> > > + return (_bt_compare(scan->indexRelation, key, page, low) > 0 &&
> > > + _bt_compare(scan->indexRelation, key, page, high) < 1);
> > > +}
> > >
> > > I think the high key condition should be changed to
> > > _bt_compare(scan->indexRelation, key, page, high) < 0 ? Because if
> > > prefix qual is equal to the high key then also there is no point in
> > > searching on the current page so we can directly skip.
> >
> > From nbtree/README and comments to functions like _bt_split I've got an
> > impression that the high key could be equal to the last item on the leaf
> > page, so there is a point in searching. Is that incorrect?
>
> But IIUC, here we want to decide whether we will get the next key in
> the current page or not?

In general this function does what it says: it checks whether or not the provided scankey could be found within the page. All the logic about finding a proper next key to fetch is implemented at the call site, and within this function we want to test whatever was passed in. Does that answer the question?
Re: MultiXact\SLRU buffers configuration
> On 8 May 2020, at 21:36, Andrey M. Borodin wrote:
>
> *** The problem ***
> I'm investigating some cases of reduced database performance due to
> MultiXactOffsetLock contention (80% MultiXactOffsetLock, 20% IO DataFileRead).
> The problem manifested itself during index repack and constraint validation,
> both being effectively full table scans.
> The database workload contains a lot of select for share\select for update
> queries. I've tried to construct a synthetic workload generator and could not
> achieve a similar lock configuration: I see a lot of different locks in wait
> events, particularly a lot more MultiXactMemberLocks. But from my experiments
> with synthetic workload, contention on MultiXactOffsetLock can be reduced by
> increasing NUM_MXACTOFFSET_BUFFERS=8 to bigger numbers.
>
> *** Question 1 ***
> Is it safe to increase the number of buffers of MultiXact\all SLRUs, recompile
> and run the database as usual?
> I cannot experiment much with production, but I'm mostly sure that bigger
> buffers will solve the problem.
>
> *** Question 2 ***
> Probably, we could add GUCs for the SLRU sizes? Are there any reasons not to
> make them configurable? I think multis, clog, subtransactions and others will
> benefit from bigger buffers. But, probably, too many knobs can be confusing.
>
> *** Question 3 ***
> The MultiXact offset lock is always taken as an exclusive lock, which makes the
> MultiXact offset subsystem single-threaded. If someone has a good idea how to
> make it more concurrency-friendly, I'm willing to put some effort into this.
> Probably, I could just add LWLocks for each offset buffer page. Is it
> something worth doing? Or are there any hidden caveats and difficulties?

I've created a benchmark[0] imitating MultiXact pressure on my laptop: 7 clients are concurrently running the query "select * from table where primary_key = ANY ($1) for share", where $1 is an array of identifiers, so that each tuple in the table is locked by a different set of XIDs.
During this benchmark I observe contention on MultiXactOffsetControlLock in pg_stat_activity:

Friday, 8 May 2020 15:08:37 (every 1s)

  pid  |         wait_event         | wait_event_type | state  |                     query
-------+----------------------------+-----------------+--------+------------------------------------------------
 41344 | ClientRead                 | Client          | idle   | insert into t1 select generate_series(1,100,1)
 41375 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41377 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41378 |                            |                 | active | select * from t1 where i = ANY ($1) for share
 41379 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41381 |                            |                 | active | select * from t1 where i = ANY ($1) for share
 41383 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41385 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
(8 rows)

Finally, the benchmark measures the time to execute the select for update 42 times.

I went ahead and created 3 patches:
1. Configurable SLRU buffer sizes for MultiXactOffsets and MultiXactMembers
2. Reduce locking level to shared on read of MultiXactId members
3. Configurable cache size

I've found out that:
1. When the MultiXact working set does not fit into the buffers, benchmark times grow very high. Yet, very big buffers slow the benchmark down too. For this benchmark the optimal SLRU size is 32 pages for offsets and 64 pages for members (defaults are 8 and 16 respectively).
2. The lock optimisation increases performance by 5% on default SLRU sizes. Actually, the benchmark does not explicitly read MultiXactId members, but when it replaces one set with another it has to read the previous set. I understand that we can construct a benchmark to demonstrate dominance of any algorithm, and 5% on a synthetic workload is not a very big number. But it just makes sense to try to take a shared lock for reading.
3. Manipulations with the cache size do not affect the benchmark at all.
It's somewhat expected: the benchmark is designed to defeat the cache, so either way OffsetControlLock would not be stressed. For our workload, I think we will just increase the SLRU sizes. But the patch set may be useful for tuning and as a performance optimisation of MultiXact.

Also, MultiXacts seem not to be a very good fit for the SLRU design. I think it would be better to use a B-tree as a container, or at least make MultiXact members extendable in place (reserve some size when a multixact is created). When we want to extend the number of locks for a tuple, currently we will: 1. Iterate through all SLRU buffers for offsets to read
Re: Index Skip Scan
On Sun, May 10, 2020 at 11:17 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > On Sat, Apr 11, 2020 at 03:17:25PM +0530, Dilip Kumar wrote: > > > > Some more comments... > > Thanks for reviewing. Since this patch took much longer than I expected, > it's useful to have someone to look at it with a "fresh eyes". > > > + so->skipScanKey->nextkey = ScanDirectionIsForward(dir); > > + _bt_update_skip_scankeys(scan, indexRel); > > + > > ... > > + /* > > + * We haven't found scan key within the current page, so let's scan from > > + * the root. Use _bt_search and _bt_binsrch to get the buffer and offset > > + * number > > + */ > > + so->skipScanKey->nextkey = ScanDirectionIsForward(dir); > > + stack = _bt_search(scan->indexRelation, so->skipScanKey, > > +, BT_READ, scan->xs_snapshot); > > > > Why do we need to set so->skipScanKey->nextkey = > > ScanDirectionIsForward(dir); multiple times? I think it is fine to > > just set it once? > > I believe it was necessary for previous implementations, but in the > current version we can avoid this, you're right. > > > +static inline bool > > +_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key, > > + Buffer buf, ScanDirection dir) > > +{ > > + OffsetNumber low, high; > > + Page page = BufferGetPage(buf); > > + BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page); > > + > > + low = P_FIRSTDATAKEY(opaque); > > + high = PageGetMaxOffsetNumber(page); > > + > > + if (unlikely(high < low)) > > + return false; > > + > > + return (_bt_compare(scan->indexRelation, key, page, low) > 0 && > > + _bt_compare(scan->indexRelation, key, page, high) < 1); > > +} > > > > I think the high key condition should be changed to > > _bt_compare(scan->indexRelation, key, page, high) < 0 ? Because if > > prefix qual is equal to the high key then also there is no point in > > searching on the current page so we can directly skip. 
> > From nbtree/README and comments to functions like _bt_split I've got an
> > impression that the high key could be equal to the last item on the leaf
> > page, so there is a point in searching. Is that incorrect?

But IIUC, here we want to decide whether we will get the next key on the current page or not. Is my understanding correct? If our key (the last tuple's key) is equal to the high key, then the maximum key on this page is the same as what we already got from the last tuple, so why would we want to look at this page at all? It will not give us a new key. So ideally, we should only be looking into this page if our last tuple's key is smaller than the high key. Am I missing something?

> >
> > + /* Check if an index skip scan is possible. */
> > + can_skip = enable_indexskipscan & index->amcanskip;
> > +
> > + /*
> > + * Skip scan is not supported when there are qual conditions, which are not
> > + * covered by index. The reason for that is that those conditions are
> > + * evaluated later, already after skipping was applied.
> > + *
> > + * TODO: This implementation is too restrictive, and doesn't allow e.g.
> > + * index expressions. For that we need to examine index_clauses too.
> > + */
> > + if (root->parse->jointree != NULL)
> > + {
> > + ListCell *lc;
> > +
> > + foreach(lc, (List *)root->parse->jointree->quals)
> > + {
> > + Node *expr, *qual = (Node *) lfirst(lc);
> > + Var *var;
> >
> > I think we can avoid checking for expression if can_skip is already false.
>
> Yes, that makes sense. I'll include your suggestions into the next
> rebased version I'm preparing.

Ok.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: 2020-05-14 Press Release Draft
On Mon, 11 May 2020 at 14:09, Jonathan S. Katz wrote:
> Attached is a draft of the press release for the 2020-05-14 cumulative
> update. Please let me know your feedback by 2020-05-13 :)

Hi,

Thanks for drafting those up. For:

* Several fixes for GENERATED columns, including an issue where it was possible to crash or corrupt data in a table when the output of the generated column was the exact copy of a physical column on the table.

I think it's important to include the "or if the expression called a function which could, in certain cases, return its own input" part. The reason I think that's important is that there's likely no legitimate case for having the expression be an exact copy of the column.

David
Re: gcov coverage data not full with immediate stop
Hello Alvaro,

11.05.2020 06:42, Alvaro Herrera wrote:
> (Strangely, I was just thinking about these branches of mine as I
> closed my week last Friday...)
>
> On 2020-May-10, Alexander Lakhin wrote:
>
>> So if we want to make the coverage reports more precise, I see the three
>> ways:
>> 1. Change the stop mode in teardown_node to fast (probably only when
>> configured with --enable-coverage);
>> 2. Explicitly stop nodes in TAP tests (where it's important) -- seems
>> too tedious and troublesome;
>> 3. Explicitly call __gcov_flush in SIGQUIT handler (quickdie)?
> I tried your idea 3 a long time ago and my experiments didn't show an
> increase in coverage [1]. But I like this idea the best, and maybe I
> did something wrong. Attached is the patch I had (on top of
> fc115d0f9fc6), but I don't know if it still applies.

Thanks for the reference to that discussion and your patch. As I see it, the issue with that patch is that quickdie() is not the only SIGQUIT handler. When a backend is interrupted with SIGQUIT, it exits in SignalHandlerForCrashExit(). In fact, if I only add __gcov_flush() in SignalHandlerForCrashExit(), it raises test coverage for `make check -C src/test/recovery/` from 106198 lines/6319 functions to 106420 lines/6328 functions.

It's not yet clear to me what happens when __gcov_flush() is called inside quickdie(). The test coverage then changes to 108432 lines/5417 functions (the number of function calls decreased). And, for example, in coverage/src/backend/utils/cache/catcache.c.gcov.html I see:

147    8 : int2eqfast(Datum a, Datum b)
...
153    0 : int2hashfast(Datum datum)

but without __gcov_flush in quickdie() we have:

147    78038 : int2eqfast(Datum a, Datum b)
...
153    255470 : int2hashfast(Datum datum)

So it needs more investigation. But I can confirm that calling __gcov_flush() in SignalHandlerForCrashExit() really improves the code coverage report.
I tried to develop a test to elevate a coverage for gist: https://coverage.postgresql.org/src/backend/access/gist/gistxlog.c.gcov.html (Please look at the attached test if it could be interesting.) and came to this issue with a coverage. I tried to play with GCOV_PREFIX, but without luck. Yesterday I found the more recent discussion: https://www.postgresql.org/message-id/flat/44ecae53-9861-71b7-1d43-4658acc52519%402ndquadrant.com#d02e2e61212831fbceadf290637913a0 (where probably the same problem came out). Finally I've managed to get an expected coverage when I performed $node_standby->stop() (but __gcov_flush() in SignalHandlerForCrashExit() helps too). Best regards, Alexander diff --git a/src/test/recovery/t/021_indexes.pl b/src/test/recovery/t/021_indexes.pl new file mode 100644 index 00..902133d55a --- /dev/null +++ b/src/test/recovery/t/021_indexes.pl @@ -0,0 +1,102 @@ +# Test testing indexes replication +use strict; +use warnings; +use PostgresNode; +use TestLib; +use Test::More tests => 1; + +# Initialize master node +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1); +$node_master->start; + +# Add a table for gist index check +$node_master->safe_psql('postgres', +qq{ +create table gist_point_tbl(id int, p point); +insert into gist_point_tbl(id, p) +select g, point(g*10, g*10) from generate_series(1, 11000) g; +create index gist_pointidx on gist_point_tbl using gist(p);}); + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create streaming standby from backup +my $node_standby = get_new_node('standby'); +$node_standby->init_from_backup($node_master, $backup_name, +has_streaming => 1); +$node_standby->start; + +$node_master->safe_psql('postgres', qq{ +create temp table gist_point_tbl_t(id int, p point); +create index gist_pointidx_t on gist_point_tbl_t using gist(p); +insert into gist_point_tbl_t(id, p) +select g, point(g*10, g*10) from generate_series(1, 1000) g; +set 
enable_seqscan=off; +set enable_bitmapscan=off; +explain (costs off) +select p from gist_point_tbl_t where p <@ box(point(0,0), point(100, 100)); +}); + +$node_master->safe_psql('postgres', qq{ +create unlogged table gist_point_tbl_u(id int, p point); +create index gist_pointidx_u on gist_point_tbl_u using gist(p); +insert into gist_point_tbl_u(id, p) +select g, point(g*10, g*10) from generate_series(1, 1000) g; +set enable_seqscan=off; +set enable_bitmapscan=off; +explain (costs off) +select p from gist_point_tbl_u where p <@ box(point(0,0), point(100, 100)); +}); + +$node_master->safe_psql('postgres', qq{ +insert into gist_point_tbl (id, p) +select g, point(g*10, g*10) from generate_series(1, 1000) g;}); + +$node_master->safe_psql('postgres', "delete from gist_point_tbl where id < 500"); + +$node_master->safe_psql('postgres', qq{ +create table test as (select x, box(point(x, x),point(x, x)) + from generate_series(1,400) as x); + +create index test_idx on test using gist (box); + +set enable_seqscan TO false; +set
A patch for get origin from commit_ts.
Hello hackers,

I am researching the 'origin' field in PostgreSQL; mainly it is used in logical decoding to filter out transactions from non-local sources. I notice that the origin is stored in commit_ts, so I think it should be possible to get the origin of a transaction from commit_ts. But I cannot find any code that reads the origin back from commit_ts; it is as if we are producing data which nobody cares about. Can I know what the purpose of the origin in commit_ts is? Do you think we should add some support for this otherwise-unused data? For example, I added a function to get the origin from commit_ts:

===
postgres=# select pg_xact_commit_origin('490');
 pg_xact_commit_origin
-----------------------
 test_origin
(1 row)

postgres=# select pg_xact_commit_origin('491');
 pg_xact_commit_origin
-----------------------
 test_origin1
(1 row)

postgres=#
===

Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca

get_origin_from_commit_ts.patch
Description: Binary data
Re: 2pc leaks fds
Alvaro Herrera wrote:
> On 2020-May-08, Kyotaro Horiguchi wrote:
>
> > I agree to the direction of this patch. Thanks for the explanation.
> > The patch looks good to me except the two points below.
>
> Thanks! I pushed the patch.

I tried to follow the discussion but haven't had a better idea. This looks better than the previous version. Thanks!

While looking at the changes, I've noticed a small comment issue in the XLogReaderRoutine structure definition:

 * "tli_p" is an input/output argument. XLogRead() uses it to pass the

The XLogRead() function has been renamed to WALRead() in 0dc8ead4. (This incorrect comment was actually introduced by that commit.)

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com
Re: pg_regress cleans up tablespace twice.
At Fri, 21 Feb 2020 17:05:07 +0900 (JST), Kyotaro Horiguchi wrote in
> At Thu, 20 Feb 2020 14:23:22 +0900, Michael Paquier wrote in
> > Removing this code from pg_regress.c makes also sense to me. Now, the
> > patch breaks "vcregress installcheck" as this is missing to patch
> > installcheck_internal() for the tablespace path creation. I would
> > also recommend using a full path for the directory location to avoid
> > any potential issues if this code is refactored or moved around, the
> > patch now relying on the current path used.
>
> Hmm. Right. I confused the database directory and the tablespace
> directory. The tablespace directory should be provided by the test script,
> even though the database directory is preexisting in installcheck. To
> avoid useless future failures, I would do that for all
> subcommands, as regress/GNUmakefile does.

Tablespace directory cleanup is not done for all testing targets. Actually, it is not done for the tools under bin/ except pg_upgrade. On the other hand, it was done by every pg_regress run for the Windows build. So I made vcregress.pl do the same: specifically, it always sets up the tablespace before pg_regress is executed.

There is one place where --outputdir is specified for pg_regress, pg_upgrade/test.sh. It is explained as follows:

# Send installcheck outputs to a private directory. This avoids conflict when
# check-world runs pg_upgrade check concurrently with src/test/regress check.
# To retrieve interesting files after a run, use pattern tmp_check/*/*.diffs.
outputdir="$temp_root/regress"
EXTRA_REGRESS_OPTS="$EXTRA_REGRESS_OPTS --outputdir=$outputdir"

where $temp_root is $(TOP)/src/bin/pg_upgrade/tmp_check/regress. Thus the current regress/GNUmakefile breaks this consideration, and the current vcregress (of the Windows build) does the right thing in light of the comment. Don't we need to avoid cleaning up "$(TOP)/src/test/regress/tablespace" in that case? (See the second patch attached.)

regards.
-- Kyotaro Horiguchi NTT Open Source Software Center >From fb351e0b075fbb2e6398aa0991b5a824e3c95e5a Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 30 Apr 2020 14:06:51 +0900 Subject: [PATCH 1/2] Move tablespace cleanup out of pg_regress. Tablespace directory is cleaned-up in regress/GNUmakefile for all platforms other than Windows. For Windoiws pg_regress does that on behalf of make, which is ugly. In addition to that, pg_regress does the clean up twice at once. This patch moves the cleanup code out of pg_regress into vcregress.pl, which is more sutable place. --- src/test/regress/pg_regress.c | 22 -- src/tools/msvc/vcregress.pl | 14 ++ 2 files changed, 14 insertions(+), 22 deletions(-) diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c index 38b2b1e8e1..c56bfaf7f5 100644 --- a/src/test/regress/pg_regress.c +++ b/src/test/regress/pg_regress.c @@ -491,28 +491,6 @@ convert_sourcefiles_in(const char *source_subdir, const char *dest_dir, const ch snprintf(testtablespace, MAXPGPATH, "%s/testtablespace", outputdir); -#ifdef WIN32 - - /* - * On Windows only, clean out the test tablespace dir, or create it if it - * doesn't exist. On other platforms we expect the Makefile to take care - * of that. (We don't migrate that functionality in here because it'd be - * harder to cope with platform-specific issues such as SELinux.) - * - * XXX it would be better if pg_regress.c had nothing at all to do with - * testtablespace, and this were handled by a .BAT file or similar on - * Windows. See pgsql-hackers discussion of 2008-01-18. 
- */ - if (directory_exists(testtablespace)) - if (!rmtree(testtablespace, true)) - { - fprintf(stderr, _("\n%s: could not remove test tablespace \"%s\"\n"), - progname, testtablespace); - exit(2); - } - make_directory(testtablespace); -#endif - /* finally loop on each file and do the replacement */ for (name = names; *name; name++) { diff --git a/src/tools/msvc/vcregress.pl b/src/tools/msvc/vcregress.pl index f95f7a5c7a..74d37a31de 100644 --- a/src/tools/msvc/vcregress.pl +++ b/src/tools/msvc/vcregress.pl @@ -104,6 +104,7 @@ exit 0; sub installcheck_internal { my ($schedule, @EXTRA_REGRESS_OPTS) = @_; + my @args = ( "../../../$Config/pg_regress/pg_regress", "--dlpath=.", @@ -114,6 +115,7 @@ sub installcheck_internal "--no-locale"); push(@args, $maxconn) if $maxconn; push(@args, @EXTRA_REGRESS_OPTS); + CleanupTablespaceDirectory(); system(@args); my $status = $? >> 8; exit $status if $status; @@ -143,6 +145,7 @@ sub check "--temp-instance=./tmp_check"); push(@args, $maxconn) if $maxconn; push(@args, $temp_config) if $temp_config; + CleanupTablespaceDirectory(); system(@args); my $status = $? >> 8; exit $status if $status; @@ -178,6 +181,7 @@ sub ecpgcheck sub isolationcheck { chdir "../isolation"; + CleanupTablespaceDirectory();
Re: Problem with logical replication
On Sun, May 10, 2020 at 07:08:03PM -0300, Euler Taveira wrote:
> I attached a patch with the described solution. I also included a test that
> covers this scenario.

- Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel));
+ Assert(GetRelationIdentityOrPK(rel) == RelationGetRelid(idxrel));

Not much of a fan of adding a routine to relcache.c to do the work of two routines already present, so I think that we had better add an extra condition based on RelationGetPrimaryKeyIndex, and give up on GetRelationIdentityOrPK() in execReplication.c. Wouldn't it also be better to cross-check the replica identity here depending on whether RelationGetReplicaIndex() returns an invalid OID or not?
-- 
Michael

signature.asc
Description: PGP signature