Re: [doc] remove reference to pg_dump pre-8.1 switch behaviour
On Mon, Nov 30, 2020 at 03:46:19PM -0500, Tom Lane wrote:
> Michael Paquier writes:
> > So this comes down to 5 items, as per the attached.  Thoughts?
>
> These items look fine to me, except this bit seems a bit awkward:
>
> +    Note that the delayed indexing technique used for GIN
> +    (see for details) makes this advice
> +    less necessary, but for very large updates it may still be best to
> +    drop and recreate the index.
>
> Less necessary than what?  Maybe instead write
>
>   When fastupdate is enabled (see ...), the penalty is much less than
>   when it is not.  But for very large updates it may still be best to
>   drop and recreate the index.

Thanks, that's indeed better.  I used your wording, looked at that
again, and applied that.
--
Michael

signature.asc Description: PGP signature
Re: autovac issue with large number of tables
On 2020/12/01 16:23, Masahiko Sawada wrote:
On Tue, Dec 1, 2020 at 1:48 PM Kasahara Tatsuhito wrote:
On Mon, Nov 30, 2020 at 8:59 PM Fujii Masao wrote:
On 2020/11/30 10:43, Masahiko Sawada wrote:
On Sun, Nov 29, 2020 at 10:34 PM Kasahara Tatsuhito wrote:

Hi, thanks for your comments.

On Fri, Nov 27, 2020 at 9:51 PM Fujii Masao wrote:
On 2020/11/27 18:38, Kasahara Tatsuhito wrote:
On Fri, Nov 27, 2020 at 1:43 AM Fujii Masao wrote:
On 2020/11/26 10:41, Kasahara Tatsuhito wrote:
On Wed, Nov 25, 2020 at 8:46 PM Masahiko Sawada wrote:
On Wed, Nov 25, 2020 at 4:18 PM Kasahara Tatsuhito wrote:
On Wed, Nov 25, 2020 at 2:17 PM Masahiko Sawada wrote:
On Fri, Sep 4, 2020 at 7:50 PM Kasahara Tatsuhito wrote:
On Wed, Sep 2, 2020 at 2:10 AM Kasahara Tatsuhito wrote:

> I wonder if we could have table_recheck_autovac do two probes of the
> stats data.  First probe the existing stats data, and if it shows the
> table to be already vacuumed, return immediately.  If not, *then*
> force a stats re-read, and check a second time.

Does the above mean that the second and subsequent
table_recheck_autovac() will be improved to first check using the
previously refreshed statistics?  I think that certainly works.

If that's correct, I'll try to create a patch for the PoC.

I still don't know how to reproduce Jim's troubles, but I was able to
reproduce what was probably a very similar problem.

This problem seems to be more likely to occur in cases where you have
a large number of tables, i.e., a large amount of stats, and many
small tables need VACUUM at the same time.

So I followed Tom's advice and created a patch for the PoC.
This patch enables a flag in the table_recheck_autovac function to use
the existing stats next time if VACUUM (or ANALYZE) has already been
done by another worker on the check after the stats have been updated.
If the tables continue to require VACUUM after the refresh, then a
refresh will be required instead of using the existing statistics.

I did a simple test with HEAD and HEAD + this PoC patch.
The tests were conducted in two cases.
(I changed a few configurations; see the attached scripts.)

1. Normal VACUUM case
   - SET autovacuum = off
   - CREATE tables with 100 rows
   - DELETE 90 rows from each table
   - SET autovacuum = on and restart PostgreSQL
   - Measure the time it takes for all tables to be VACUUMed

2. Anti-wraparound VACUUM case
   - CREATE blank tables
   - SELECT all of these tables (to generate stats)
   - SET autovacuum_freeze_max_age to low values and restart PostgreSQL
   - Consume a lot of XIDs by using txid_current()
   - Measure the time it takes for all tables to be VACUUMed

For each test case, the following results were obtained by changing
the autovacuum_max_workers parameter to 1, 2, 3 (default), 5 and 10,
and by changing the number of tables to 1000, 5000, 1 and 2.

Due to the poor VM environment (2 VCPU/4 GB), the results are a little
unstable, but I think it's enough to show a trend.

===
[1. Normal VACUUM case]
tables:1000
  autovacuum_max_workers 1:   (HEAD) 20 sec   VS  (with patch) 20 sec
  autovacuum_max_workers 2:   (HEAD) 18 sec   VS  (with patch) 16 sec
  autovacuum_max_workers 3:   (HEAD) 18 sec   VS  (with patch) 16 sec
  autovacuum_max_workers 5:   (HEAD) 19 sec   VS  (with patch) 17 sec
  autovacuum_max_workers 10:  (HEAD) 19 sec   VS  (with patch) 17 sec

tables:5000
  autovacuum_max_workers 1:   (HEAD) 77 sec   VS  (with patch) 78 sec
  autovacuum_max_workers 2:   (HEAD) 61 sec   VS  (with patch) 43 sec
  autovacuum_max_workers 3:   (HEAD) 38 sec   VS  (with patch) 38 sec
  autovacuum_max_workers 5:   (HEAD) 45 sec   VS  (with patch) 37 sec
  autovacuum_max_workers 10:  (HEAD) 43 sec   VS  (with patch) 35 sec

tables:1
  autovacuum_max_workers 1:   (HEAD) 152 sec  VS  (with patch) 153 sec
  autovacuum_max_workers 2:   (HEAD) 119 sec  VS  (with patch) 98 sec
  autovacuum_max_workers 3:   (HEAD) 87 sec   VS  (with patch) 78 sec
  autovacuum_max_workers 5:   (HEAD) 100 sec  VS  (with patch) 66 sec
  autovacuum_max_workers 10:  (HEAD) 97 sec   VS  (with patch) 56 sec

tables:2
  autovacuum_max_workers 1:   (HEAD) 338 sec  VS  (with patch) 339 sec
  autovacuum_max_workers 2:   (HEAD) 231 sec  VS  (with patch) 229 sec
  autovacuum_max_workers 3:   (HEAD) 220 sec  VS  (with patch) 191 sec
  autovacuum_max_workers 5:   (HEAD) 234 sec  VS  (with patch) 147 sec
  autovacuum_max_workers 10:  (HEAD) 320 sec  VS  (with patch) 113 sec

[2. Anti-wraparound VACUUM case]
tables:1000
  autovacuum_max_workers 1:   (HEAD) 19 sec   VS  (with patch) 18 sec
  autovacuum_max_workers 2:   (HEAD) 14 sec   VS  (with patch) 15 sec
  autovacuum_max_workers 3:   (HEAD) 14 sec   VS  (with patch) 14 sec
  autovacuum_max_workers 5:   (HEAD) 14 sec   VS  (with p
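For illustration only, Tom's two-probe idea that the PoC implements can be sketched in Python pseudocode (this is not the actual C patch; StatsCollector and the dict-based snapshot are invented stand-ins for the stats collector machinery):

```python
# Sketch of the two-probe recheck: probe cheap cached stats first, and
# only pay for a forced stats re-read when the cached data still says
# the table needs a VACUUM.

class StatsCollector:
    def __init__(self):
        self.data = {}   # table name -> needs_vacuum (bool)
        self.reads = 0   # counts expensive snapshot reloads

    def reload(self):
        self.reads += 1
        return dict(self.data)  # fresh snapshot of the stats


def table_recheck_autovac(table, cached_snapshot, collector):
    """Return (needs_vacuum, snapshot), doing at most one forced re-read."""
    # Probe 1: cheap check against the cached (possibly stale) snapshot.
    # If another worker already vacuumed the table, even stale stats say so.
    if not cached_snapshot.get(table, True):
        return False, cached_snapshot
    # Probe 2: only now force the expensive stats re-read.
    fresh = collector.reload()
    return fresh.get(table, True), fresh
```

The point of the flag in the PoC is exactly this: workers rechecking many small tables mostly hit probe 1 and skip the repeated full stats refresh, which is what dominates with a large number of tables.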
Re: [HACKERS] logical decoding of two-phase transactions
On Mon, Nov 30, 2020 at 7:17 PM Amit Kapila wrote:
> On Mon, Nov 30, 2020 at 2:36 PM Ajin Cherian wrote:
>
> Sure, but you can see in your example above it got skipped due to
> start_decoding_at, not due to DecodingContextReady.  So, the problem
> as mentioned by me previously was how we distinguish those cases,
> because it can skip due to start_decoding_at during restart as well,
> when we would have already sent the prepare to the subscriber.
>
> One idea could be that the subscriber skips the transaction if it
> sees the transaction is already prepared.

To skip it, we need to send the GID in the begin message and then, on
the subscriber side, check if the prepared xact already exists; if so,
set a flag.  The flag needs to be set in begin/start_stream and reset in
stop_stream/commit/abort.  Using the flag, we can skip the entire
contents of the prepared xact.  On the ReorderBuffer side, we also need
to get and set the GID in the txn even when we skip it, because we need
to send the same at commit time.  In this solution, we won't be able to
send it during a normal start_stream because by that time we won't know
the GID, and I think that won't be required.  Note that this is only
required when we skipped sending the prepare; otherwise, we just need to
send Commit-Prepared at commit time.

Another way to solve this problem on the publisher side is to maintain,
in some file at the slot level, whether we have sent the prepare for a
particular txn.  Basically, after sending the prepare, we need to update
the slot information on disk to indicate that the particular GID was
sent (we can probably store the GID and the LSN of the Prepare).  Then,
next time, whenever we have to skip a prepare for whatever reason, we
can check the existence of persistent information on disk for that GID:
if it exists, then we need to send just Commit Prepared, otherwise the
entire transaction.

We can remove this information during or after CheckPointSnapBuild;
basically, we can remove the information for all GIDs that are after the
cutoff LSN computed via ReplicationSlotsComputeLogicalRestartLSN.  Now,
we could even think of removing this information after Commit Prepared,
but I am not sure if that is correct, because we can't lose this
information unless start_decoding_at (or restart_lsn) has moved past the
commit LSN.

Now, to persist this information, there are multiple possibilities:
(a) maintain a flexible array of GIDs at the end of
ReplicationSlotPersistentData, (b) have a separate state file per slot
for prepared xacts, (c) have a separate state file for each prepared
xact per slot.  With (a), during an upgrade from the previous version
there could be a problem because the previous data won't match the new
data, but I am not sure if we keep slot info intact after an upgrade.  I
think (c) would be simplest, but OTOH, having many such files (in case
there are many prepared xacts) per slot might not be a good idea.

One more thing that needs to be thought about: when we are sending the
entire xact at commit time, will we send the prepare separately?
Because, if we don't send it separately, then later allowing the PREPARE
on the master to wait for the prepare via subscribers won't be possible.

Thoughts?

--
With Regards,
Amit Kapila.
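To make the subscriber-side idea concrete, here is an illustrative Python sketch (not PostgreSQL code; the ApplyWorker class and its callback names are invented stand-ins for the apply-worker logic) of the skip flag set in begin/start_stream and reset afterwards:

```python
# Sketch of the "skip already-prepared transaction" idea: the publisher
# sends the GID in the begin message; the apply worker sets a skip flag
# if that GID is already prepared locally, and ignores the transaction's
# contents until the flag is reset.

class ApplyWorker:
    def __init__(self, prepared_gids):
        self.prepared = set(prepared_gids)  # GIDs already prepared locally
        self.skip = False
        self.applied = []

    def begin_prepare(self, gid):
        # Set in begin/start_stream: skip if this xact is already prepared.
        self.skip = gid in self.prepared

    def change(self, row):
        if not self.skip:
            self.applied.append(row)        # apply only non-skipped changes

    def prepare(self, gid):
        if not self.skip:
            self.prepared.add(gid)          # record the newly prepared xact
        self.skip = False                   # reset in stop_stream/commit/abort

    def commit_prepared(self, gid):
        self.prepared.discard(gid)          # finish the two-phase commit
```

With this, replaying a prepare that was already delivered before a restart is harmless: the contents are skipped and only the later Commit Prepared takes effect.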
Re: PG vs LLVM 12 on seawasp, next round
Hi,

On 2020-12-01 17:35:49 +1300, Thomas Munro wrote:
> Since seawasp's bleeding edge LLVM installation moved to "trunk
> 20201114 c8f4e06b 12.0.0" ~16 days ago, it has been red.  Further
> updates didn't help it and it's now on "trunk 20201127 6ee22ca6
> 12.0.0".  I wonder if there is something in Fabien's scripting that
> needs to be tweaked, perhaps a symlink name or similar.  I don't
> follow LLVM development but I found my way to a commit[1] around the
> right time that mentions breaking up the OrcJIT library, so *shrug*
> maybe that's a clue.
>
> +ERROR: could not load library
> "/home/fabien/pg/build-farm-11/buildroot/HEAD/pgsql.build/tmp_install/home/fabien/pg/build-farm-11/buildroot/HEAD/inst/lib/postgresql/llvmjit.so":
> libLLVMOrcJIT.so.12git: cannot open shared object file: No such file
> or directory

It's a change in how LLVM dependencies are declared internally.
Previously the 'native' component was - unintentionally - transitively
included via the 'orcjit' component, but now that's not the case
anymore.  The attached patch should fix it, I think?
Greetings,

Andres Freund

diff --git i/config/llvm.m4 w/config/llvm.m4
index a5f4a9af448..3a75cd8b4df 100644
--- i/config/llvm.m4
+++ w/config/llvm.m4
@@ -76,6 +76,7 @@ AC_DEFUN([PGAC_LLVM_SUPPORT],
         debuginfodwarf) pgac_components="$pgac_components $pgac_component";;
         orcjit) pgac_components="$pgac_components $pgac_component";;
         passes) pgac_components="$pgac_components $pgac_component";;
+        native) pgac_components="$pgac_components $pgac_component";;
         perfjitevents) pgac_components="$pgac_components $pgac_component";;
       esac
     done;
diff --git i/configure w/configure
index ffcd0c5b1d4..2a03ed0a018 100755
--- i/configure
+++ w/configure
@@ -5168,6 +5168,7 @@ fi
         debuginfodwarf) pgac_components="$pgac_components $pgac_component";;
         orcjit) pgac_components="$pgac_components $pgac_component";;
         passes) pgac_components="$pgac_components $pgac_component";;
+        native) pgac_components="$pgac_components $pgac_component";;
         perfjitevents) pgac_components="$pgac_components $pgac_component";;
       esac
     done;
Re: TAP test utility module 'PG_LSN.pm'
On Tue, Dec 01, 2020 at 03:14:06PM +0900, Fujii Masao wrote:
> You mean the same function as the commit 9bae7e4cde provided?

Completely forgot about this one, thanks.  /me hides
--
Michael
Re: TAP test utility module 'PG_LSN.pm'
On 2020/12/01 14:58, Michael Paquier wrote:
> On Tue, Dec 01, 2020 at 12:03:41PM +0800, Craig Ringer wrote:
> > I'd like to share the attached PG_LSN.pm module that I use when
> > writing TAP tests.  I suggest that it be considered for inclusion
> > in core.
> >
> > It defines a Perl datatype PG_LSN with operator support, so you can
> > write things like
> >
> >     cmp_ok($got_lsn, "<", $expected_lsn, "testname")
> >
> > in TAP tests and get sensible results without any concern for LSN
> > representation details, locale, etc.  You can subtract LSNs to get
> > a byte difference too.
>
> In my experience, any TAP tests making use of LSN operations can just
> let the backend do the maths, so I am not much a fan of duplicating
> that stuff in a perl module.

Agreed.

> Wouldn't it be better to add an equivalent of your add() function in
> the backend then?

You mean the same function as the commit 9bae7e4cde provided?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Re: TAP test utility module 'PG_LSN.pm'
On Tue, 1 Dec 2020 at 13:58, Michael Paquier wrote:
> On Tue, Dec 01, 2020 at 12:03:41PM +0800, Craig Ringer wrote:
> > I'd like to share the attached PG_LSN.pm module that I use when
> > writing TAP tests.  I suggest that it be considered for inclusion
> > in core.
> >
> > It defines a Perl datatype PG_LSN with operator support, so you can
> > write things like
> >
> >     cmp_ok($got_lsn, "<", $expected_lsn, "testname")
> >
> > in TAP tests and get sensible results without any concern for LSN
> > representation details, locale, etc.  You can subtract LSNs to get
> > a byte difference too.
>
> In my experience, any TAP tests making use of LSN operations can just
> let the backend do the maths, so I am not much a fan of duplicating
> that stuff in a perl module.  Wouldn't it be better to add an
> equivalent of your add() function in the backend then?  That could
> also get tested in the main regression test suite.

I find it convenient not to have as much log spam.  Also, IIRC I needed
it at some point where the target backend(s) were down while I was
testing something related to a backend that was still in recovery and
not yet available for querying.

I don't really mind though, I'm just sharing what I have found useful.
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On Mon, Nov 30, 2020 at 11:43:08PM -0600, Justin Pryzby wrote:
> I eyeballed the patch and rebased the rest of the series on top of it
> to play with.  Looks fine - thanks.

Cool, thanks.

> FYI, the commit messages have the proper "author" for attribution.  I
> proposed and implemented the grammar changes in 0002, and implemented
> INDEX_TABLESPACE, but I'm a reviewer for the main patches.

Well, my impression is that both of you kept exchanging patches,
touching and reviewing each other's patches (note that Alexey has also
sent a rebase of 0002 just yesterday), so I think that it is fair to say
that both of you should be listed as authors and credited as such in the
release notes for this one.
--
Michael
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila wrote:
> On Mon, Nov 30, 2020 at 3:14 AM Noah Misch wrote:
> >
> > On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:
> > > Thanks, I have pushed the last patch.  Let's wait for a day or so
> > > to see the buildfarm reports.
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
> > failed the new 015_stream.pl test with the subscriber looping like
> > this:
> >
> > 2020-09-08 11:22:49.848 UTC [13959252:1] LOG: logical replication
> > apply worker for subscription "tap_sub" has started
> > 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR: could not open
> > temporary file "16393-510.changes.0" from BufFile
> > "16393-510.changes": No such file or directory
> > 2020-09-08 11:22:54.055 UTC [7602182:1] LOG: logical replication
> > apply worker for subscription "tap_sub" has started
> > 2020-09-08 11:22:54.101 UTC [31785284:4] LOG: background worker
> > "logical replication worker" (PID 13959252) exited with exit code 1
> > 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR: could not open
> > temporary file "16393-510.changes.0" from BufFile
> > "16393-510.changes": No such file or directory
> > ...
> >
> > What happened there?
>
> What is going on here is that the expected streaming file is missing.
> Normally, the first time we send a stream of changes (some percentage
> of transaction changes) we create the streaming file, and then in
> respective streams we just keep on writing into that file the changes
> we receive from the publisher, and on commit, we read that file and
> apply all the changes.
>
> The above kind of error can happen due to the following reasons:
> (a) the first time we sent the stream and created the file, and that
> got removed before the second stream reached the subscriber; (b) from
> the publisher side, we never sent the indication that it is the first
> stream and the subscriber directly tries to open the file, thinking it
> is already there.
>
> Now, the publisher and subscriber logs don't directly indicate any of
> the above problems, but I have some observations.
>
> The subscriber log indicates that before the apply worker exits due
> to an error the new apply worker gets started.  We delete the
> streaming-related temporary files on proc_exit, so one possibility
> could have been that the new apply worker has created the streaming
> file which the old apply worker has removed, but that is not possible
> because we always create these temp files by having the procid in the
> path.

Yeah, and I have tried to test along this line: basically, after the
streaming has started I have set binary=on.  Now, using gdb, I have made
the worker wait before it deletes the temp file, and meanwhile the new
worker started and it worked properly as expected.

> The other thing I observed in the code is that we can mark the
> transaction as streamed (via ReorderBufferTruncateTxn) if we try to
> stream a transaction that has no changes the first time we try to
> stream the transaction.  This would lead to symptom (b), because the
> second time, when there are more changes, we would stream the changes
> as it is not the first time.  However, this shouldn't happen because
> we never pick up a transaction to stream which has no changes.  I can
> try to fix the code here such that we don't mark the transaction as
> streamed unless we have streamed at least one change, but I don't see
> how it is related to this particular test failure.

Yeah, this can be improved, but as you mentioned, we never select an
empty transaction for streaming, so this case should not occur.  I will
perform some testing/review around this and report.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: BUG #16663: DROP INDEX did not free up disk space: idle connection hold file marked as deleted
On Tue, Dec 01, 2020 at 04:06:48PM +1300, Thomas Munro wrote:
> I felt on balance it was a "bug", since it causes operational
> difficulties for people and was clearly not our intended behaviour,
> and I announced this intention 6 weeks ago.

Oops, sorry for missing this discussion for such a long time :/

> Of course I'll be happy to revert it from the back-branches if that's
> the consensus.  Any other opinions?

If there are no other opinions, I am also fine to rely on your judgment.
--
Michael
Re: Improving spin-lock implementation on ARM.
Alexander Korotkov writes:
> 2) None of the patches considered in this thread give a clear
> advantage for PostgreSQL built with LSE.

Yeah, I think so.

> To further confirm this let's wait for Kunpeng 920 tests by Krunal
> Bauskar and Amit Khandekar.  Also it would be nice if someone will
> run benchmarks similar to [1] on Apple M1.

I did what I could in this department.  It's late and I'm not going to
have time to run read/write benchmarks before bed, but here are some
results for the "pgbench -S" cases.  I tried to match your testing
choices, but could not entirely:

* Configure options are --enable-debug, --disable-cassert, no other
special configure options or CFLAG choices.

* I have not been able to find a way to make Apple's compiler not use
the LSE spinlock instructions, so all of these correspond to your LSE
cases.

* I used shared_buffers = 1GB, because this machine only has 16GB RAM so
32GB is clearly out of reach.  Also I used pgbench scale factor 100 not
1000.  Since we're trying to measure contention effects not I/O speed, I
don't think a huge test case is appropriate.

* I still haven't gotten pgbench to work with -j settings above 128, so
these runs use -j equal to half -c.  Shouldn't really affect conclusions
about scaling.  (BTW, I see a similar limitation on macOS Catalina
x86_64, so whatever that is, it's not new.)

* Otherwise, the first plot shows the median of three results from
"pgbench -S -M prepared -T 120 -c $n -j $j", as you had it.  The
right-hand plot shows all three of the values in yerrorbars format, just
to give a sense of the noise level.

Clearly, there is something weird going on at -c 4.  There's a cluster
of results around 180K TPS, and another cluster around 210-220K TPS, and
nothing in between.  I suspect that the scheduler is doing something
bogus with sometimes putting pgbench onto the slow cores.

Anyway, I believe that the apparent gap between HEAD and the other
curves at -c 4 is probably an artifact: HEAD had two 180K-ish results
and one 220K-ish result, while the other curves had the reverse, so the
medians are different but there's probably not any non-chance effect
there.

Bottom line is that these patches don't appear to do much of anything on
M1, as you surmised.

regards, tom lane
Re: TAP test utility module 'PG_LSN.pm'
On Tue, Dec 01, 2020 at 12:03:41PM +0800, Craig Ringer wrote:
> I'd like to share the attached PG_LSN.pm module that I use when
> writing TAP tests.  I suggest that it be considered for inclusion in
> core.
>
> It defines a Perl datatype PG_LSN with operator support, so you can
> write things like
>
>     cmp_ok($got_lsn, "<", $expected_lsn, "testname")
>
> in TAP tests and get sensible results without any concern for LSN
> representation details, locale, etc.  You can subtract LSNs to get a
> byte difference too.

In my experience, any TAP tests making use of LSN operations can just
let the backend do the maths, so I am not much a fan of duplicating
that stuff in a perl module.  Wouldn't it be better to add an
equivalent of your add() function in the backend then?  That could
also get tested in the main regression test suite.
--
Michael
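For readers unfamiliar with the representation issue being discussed: a PostgreSQL LSN prints as two hex halves of a 64-bit byte position, "XXXXXXXX/XXXXXXXX", which is why naive string comparison in a TAP test is wrong.  A minimal Python sketch of what the proposed PG_LSN.pm provides (parse, compare, byte difference); the function names here are illustrative, not the module's API:

```python
# Parse the textual "hi/lo" LSN form into an integer byte position, after
# which comparison and byte-difference are ordinary integer arithmetic.

def parse_lsn(text):
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def format_lsn(value):
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"

def lsn_diff(a, b):
    """Byte difference between two textual LSNs."""
    return parse_lsn(a) - parse_lsn(b)
```

Comparing "1/0" with "0/FFFFFFFF" as strings would order them backwards; parsed as integers, "1/0" is exactly one byte past "0/FFFFFFFF".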
RE: [PATCH] Keeps tracking the uniqueness with UniqueKey
Hi

I looked into the patch again and have some comments.

1.
+	Size		oid_cmp_len = sizeof(Oid) * ind1->ncolumns;
+
+	return ind1->ncolumns == ind2->ncolumns &&
+		ind1->unique == ind2->unique &&
+		memcmp(ind1->indexkeys, ind2->indexkeys, sizeof(int) * ind1->ncolumns) == 0 &&
+		memcmp(ind1->opfamily, ind2->opfamily, oid_cmp_len) == 0 &&
+		memcmp(ind1->opcintype, ind2->opcintype, oid_cmp_len) == 0 &&
+		memcmp(ind1->sortopfamily, ind2->sortopfamily, oid_cmp_len) == 0 &&
+		equal(get_tlist_exprs(ind1->indextlist, true),
+			  get_tlist_exprs(ind2->indextlist, true));

The length of sortopfamily, opfamily and opcintype seems to be
->nkeycolumns, not ->ncolumns.  I checked the function
get_relation_info, where the IndexOptInfo is initialized.  (If there are
more places where the length can change, please correct me.)

2.
+	COPY_SCALAR_FIELD(ncolumns);
+	COPY_SCALAR_FIELD(nkeycolumns);
+	COPY_SCALAR_FIELD(unique);
+	COPY_SCALAR_FIELD(immediate);
+	/* We just need to know if it is NIL or not */
+	COPY_SCALAR_FIELD(indpred);
+	COPY_SCALAR_FIELD(predOK);
+	COPY_POINTER_FIELD(indexkeys, from->ncolumns * sizeof(int));
+	COPY_POINTER_FIELD(indexcollations, from->ncolumns * sizeof(Oid));
+	COPY_POINTER_FIELD(opfamily, from->ncolumns * sizeof(Oid));
+	COPY_POINTER_FIELD(opcintype, from->ncolumns * sizeof(Oid));
+	COPY_POINTER_FIELD(sortopfamily, from->ncolumns * sizeof(Oid));
+	COPY_NODE_FIELD(indextlist);

The same as 1: should use nkeycolumns, if I am right.

3.
+	foreach(lc, newnode->indextlist)
+	{
+		TargetEntry *tle = lfirst_node(TargetEntry, lc);
+		/* Index on expression is ignored */
+		Assert(IsA(tle->expr, Var));
+		tle->expr = (Expr *) find_parent_var(appinfo, (Var *) tle->expr);
+		newnode->indexkeys[idx] = castNode(Var, tle->expr)->varattno;
+		idx++;
+	}

The counter variable 'idx' can be replaced by foreach_current_index().

Best regards,
houzj
Re: proposal: unescape_text function
On Mon, Nov 30, 2020 at 22:56, Pavel Stehule wrote:
> On Mon, Nov 30, 2020 at 22:15, Pavel Stehule wrote:
> > On Mon, Nov 30, 2020 at 14:14, Peter Eisentraut
> > <peter.eisentr...@enterprisedb.com> wrote:
> > > On 2020-11-29 18:36, Pavel Stehule wrote:
> > > > > I don't really get the point of this function.  There is
> > > > > AFAICT no function to produce this escaped format, and it's
> > > > > not a recognized interchange format.  So under what
> > > > > circumstances would one need to use this?
> > > >
> > > > Some corporate data can be in CSV format with escaped unicode
> > > > characters.  Without this function it is not possible to decode
> > > > these files without an external application.
> > >
> > > I would like some supporting documentation on this.  So far we
> > > only have one stackoverflow question, and then this
> > > implementation, and they are not even the same format.  My worry
> > > is that if there is no precise specification, then people are
> > > going to want to add things in the future, and there will be no
> > > way to analyze such requests in a principled way.
> >
> > I checked this and it is "prefix backslash-u hex" used by Java,
> > JavaScript or RTF -
> > https://billposer.org/Software/ListOfRepresentations.html
> >
> > In some languages (Python), there is a decoder "unicode-escape".
> > Java has a method escapeJava for conversion from unicode to ascii.
> > I can imagine that these data come from Java systems exported to
> > 8bit strings - so this implementation can be accepted as
> > referential.  This format is used by
> > https://docs.oracle.com/javase/8/docs/technotes/tools/unix/native2ascii.html
> > too.
> >
> > Postgres can decode this format too, and the patch is based on the
> > Postgres implementation.  I just implemented a different interface.
> >
> > Currently the decode function does only the text->bytea
> > transformation.  Maybe more generic functions "decode_text" and
> > "encode_text" for similar cases would be better (here we need a
> > text->text transformation).  But it looks like overengineering now.
> >
> > Maybe we introduce a new encoding "ascii" and we can implement new
> > conversions "ascii_to_utf8" and "utf8_to_ascii".  It looks like the
> > cleanest solution.  What do you think about it?
>
> a better name for the new encoding can be "unicode-escape" than
> "ascii".  We use the "to_ascii" function for a different use case.
>
>     set client_encoding to unicode-escape;
>     copy tab from xxx;
>     ...
>
> but it doesn't help when only a few columns from the table are in
> unicode-escape format.

Probably the most complete solution consists of two steps:

1. introducing a new encoding - "ascii_unicode_escape" - with related
conversions

2. introducing two new functions - text_escape and text_unescape - with
two parameters: the source text and a conversion name

    select text_convert_to('Тимати', 'ascii_unicode_escape')
    \u0422\u0438\u043c\u0430\u0442\u0438 .. result is text

    select text_convert_from('\u0422\u0438\u043c\u0430\u0442\u0438', 'ascii_unicode_escape')
    ┌──────────┐
    │ ?column? │
    ╞══════════╡
    │ Тимати   │
    └──────────┘
    (1 row)

Regards,

Pavel
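For reference, the "prefix backslash-u hex" format under discussion is exactly what Python's built-in "unicode-escape" codec (mentioned in the thread) handles; a minimal sketch, with function names chosen only for illustration:

```python
# Decode/encode the \uXXXX ("backslash-u hex") format using Python's
# unicode_escape codec.  The input is restricted to ASCII first, since
# unicode_escape operates on byte strings.

def unescape_text(s):
    """Decode \\uXXXX escapes to text, e.g. '\\u0422' -> 'Т'."""
    return s.encode("ascii").decode("unicode_escape")

def escape_text(s):
    """Encode text into ASCII-only \\uXXXX escapes."""
    return s.encode("unicode_escape").decode("ascii")
```

This reproduces the thread's example: the escaped form '\u0422\u0438\u043c\u0430\u0442\u0438' round-trips to 'Тимати' and back.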
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On Tue, Dec 01, 2020 at 11:46:55AM +0900, Michael Paquier wrote:
> On Mon, Nov 30, 2020 at 05:12:42PM +0300, Alexey Kondratov wrote:
> > Thanks.  I have rebased the remaining patches on top of 873ea9ee to
> > use 'utility_option_list' instead of 'common_option_list'.
>
> Thanks, that helps a lot.  I have gone through 0002, and tweaked it
> as the attached (note that this patch is also interesting for another
> thing in development: backend-side reindex filtering of
> collation-sensitive indexes).  Does that look right to you?

I eyeballed the patch and rebased the rest of the series on top of it to
play with.  Looks fine - thanks.

FYI, the commit messages have the proper "author" for attribution.  I
proposed and implemented the grammar changes in 0002, and implemented
INDEX_TABLESPACE, but I'm a reviewer for the main patches.

--
Justin
RE: [POC] Fast COPY FROM command for the table with foreign partitions
From: Amit Langote > Andrey's original patch had the flag to, as I understand it, make the > partitioning case work correctly. When inserting into a > non-partitioned table, there's only one relation to care about. In > that case, CopyFrom() can use either the new COPY interface or the > INSERT interface for the entire operation when talking to a foreign > target relation's FDW driver. With partitions, that has to be > considered separately for each partition. What complicates the matter > further is that while the original target relation (the root > partitioned table in the partitioning case) is fully initialized in > CopyFrom(), partitions are lazily initialized by ExecFindPartition(). Yeah, I felt it a bit confusing to see the calls to Begin/EndForeignInsert() in both CopyFrom() and ExecInitRoutingInfo(). > Note that the initialization of a given target relation can also > optionally involve calling the FDW to perform any pre-COPY > initializations. So if a given partition is a foreign table, whether > the copy operation was initialized using the COPY interface or the > INSERT interface is determined away from CopyFrom(). Andrey created > ri_usesMultiInsert to remember which was used so that CopyFrom() can > use the correct interface during the subsequent interactions with the > partition's driver. > > Now, it does not seem outright impossible to do this without the flag, > but maybe Andrey thinks it is good for readability? If it is > confusing from a modularity standpoint, maybe we should rethink that. > That said, I still think that there should be a way for CopyFrom() to > tell ExecFindPartition() which FDW interface to initialize a given > foreign table partition's copy operation with -- COPY if the copy > allows multi-insert, INSERT if not. Maybe the multi_insert parameter > I mentioned earlier would serve that purpose. I agree with your idea of adding multi_insert argument to ExecFindPartition() to request a multi-insert-capable partition. 
At first, I thought ExecFindPartition() was used for all operations (insert/delete/update/select), so I found it odd to add a multi_insert argument. But ExecFindPartition() is used only for insert, so a multi_insert argument seems okay. Regards Takayuki Tsunakawa
Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2
On Mon, Nov 30, 2020 at 02:29:29PM +0100, Daniel Gustafsson wrote: > Yeah, that's along the lines of what I was thinking of. Hmm. I have looked at that, and first thought about having a reference to the resowner directly in pg_cryptohash_ctx, but that's not a good plan for two reasons: - common/cryptohash.h would get knowledge of that, requiring bundling in a bunch of dependencies. - There is no need for that in the non-OpenSSL case. So, instead, I have been thinking about using an extra context layer only for cryptohash_openssl.c with a structure saved as pg_cryptohash_context->data that stores the information about EVP_MD_CTX* and the resource owner. Then, I was thinking about storing pg_cryptohash_ctx directly in the resowner EVP array and just calling pg_cryptohash_free() from resowner.c without the need of an extra routine. I have not tested this idea but that should work. What's your take? In parallel, I have spent more time today polishing and reviewing 0001 (indented, adjusted a couple of areas and also added brackets and extra comments as you suggested) and tested it on Linux and Windows, with and without OpenSSL down to 1.0.1, the oldest version supported on HEAD. So I'd like to apply the attached first and sort out the resowner stuff as a next step. -- Michael From 5ffd653336f1c41043635fad3618cf0bae7b1cec Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Tue, 1 Dec 2020 13:21:44 +0900 Subject: [PATCH v6] Rework SHA2 and crypto hash APIs This will make it easier to switch to EVP for the OpenSSL SHA2 layer. Note that the layer introduced here is generalized for the purpose of a future integration with HMAC, MD5, and even more.
--- src/include/common/checksum_helper.h | 13 +- src/include/common/cryptohash.h | 40 src/include/common/scram-common.h | 17 +- src/include/common/sha2.h | 89 +--- src/include/replication/backup_manifest.h | 3 +- src/backend/libpq/auth-scram.c| 113 ++ src/backend/replication/backup_manifest.c | 25 ++- src/backend/replication/basebackup.c | 24 ++- src/backend/utils/adt/cryptohashes.c | 53 +++-- src/common/Makefile | 6 +- src/common/checksum_helper.c | 79 +-- src/common/cryptohash.c | 190 + src/common/cryptohash_openssl.c | 197 ++ src/common/scram-common.c | 173 ++- src/common/sha2.c | 23 +- .../common/sha2.h => common/sha2_int.h} | 38 +--- src/common/sha2_openssl.c | 102 - src/bin/pg_verifybackup/parse_manifest.c | 15 +- src/bin/pg_verifybackup/pg_verifybackup.c | 24 ++- src/interfaces/libpq/fe-auth-scram.c | 118 ++- contrib/pgcrypto/internal-sha2.c | 188 - src/tools/msvc/Mkvcbuild.pm | 3 +- src/tools/pgindent/typedefs.list | 2 + 23 files changed, 955 insertions(+), 580 deletions(-) create mode 100644 src/include/common/cryptohash.h create mode 100644 src/common/cryptohash.c create mode 100644 src/common/cryptohash_openssl.c copy src/{include/common/sha2.h => common/sha2_int.h} (73%) delete mode 100644 src/common/sha2_openssl.c diff --git a/src/include/common/checksum_helper.h b/src/include/common/checksum_helper.h index 48b0745dad..b07a34e7e4 100644 --- a/src/include/common/checksum_helper.h +++ b/src/include/common/checksum_helper.h @@ -14,6 +14,7 @@ #ifndef CHECKSUM_HELPER_H #define CHECKSUM_HELPER_H +#include "common/cryptohash.h" #include "common/sha2.h" #include "port/pg_crc32c.h" @@ -41,10 +42,10 @@ typedef enum pg_checksum_type typedef union pg_checksum_raw_context { pg_crc32c c_crc32c; - pg_sha224_ctx c_sha224; - pg_sha256_ctx c_sha256; - pg_sha384_ctx c_sha384; - pg_sha512_ctx c_sha512; + pg_cryptohash_ctx *c_sha224; + pg_cryptohash_ctx *c_sha256; + pg_cryptohash_ctx *c_sha384; + pg_cryptohash_ctx *c_sha512; } pg_checksum_raw_context; /* @@ -66,8 +67,8 @@ 
typedef struct pg_checksum_context extern bool pg_checksum_parse_type(char *name, pg_checksum_type *); extern char *pg_checksum_type_name(pg_checksum_type); -extern void pg_checksum_init(pg_checksum_context *, pg_checksum_type); -extern void pg_checksum_update(pg_checksum_context *, const uint8 *input, +extern int pg_checksum_init(pg_checksum_context *, pg_checksum_type); +extern int pg_checksum_update(pg_checksum_context *, const uint8 *input, size_t len); extern int pg_checksum_final(pg_checksum_context *, uint8 *output); diff --git a/src/include/common/cryptohash.h b/src/include/common/cryptohash.h new file mode 100644 index 00..0e4a6631a3 --- /dev/null +++ b/src/include/common/cryptohash.h @@ -0,0 +1,40 @@ +/*- + * + * cryptohash.h +
Re: Add statistics to pg_stat_wal view for wal related parameter tuning
On 2020/11/26 16:07, Masahiro Ikeda wrote: On 2020-11-25 20:19, Fujii Masao wrote: On 2020/11/19 16:31, Masahiro Ikeda wrote: On 2020-11-17 11:46, Fujii Masao wrote: On 2020/11/16 16:35, Masahiro Ikeda wrote: On 2020-11-12 14:58, Fujii Masao wrote: On 2020/11/06 10:25, Masahiro Ikeda wrote: On 2020-10-30 11:50, Fujii Masao wrote: On 2020/10/29 17:03, Masahiro Ikeda wrote: Hi, Thanks for your comments and advice. I updated the patch. On 2020-10-21 18:03, Kyotaro Horiguchi wrote: At Tue, 20 Oct 2020 16:11:29 +0900, Masahiro Ikeda wrote in On 2020-10-20 12:46, Amit Kapila wrote: > I see that we also need to add extra code to capture these stats (some > of which is in performance-critical path especially in > XLogInsertRecord) which again makes me a bit uncomfortable. It might > be that it is all fine as it is very important to collect these stats > at cluster-level in spite that the same information can be gathered at > statement-level to help customers but I don't see a very strong case > for that in your proposal. We should avoid that duplication as possible even if the both number are important. Also about performance, I thought there are few impacts because it increments stats in memory. If I can implement to reuse pgWalUsage's value which already collects these stats, there is no impact in XLogInsertRecord. For example, how about pg_stat_wal() calculates the accumulated value of wal_records, wal_fpi, and wal_bytes to use pgWalUsage's value? I don't think that works, but it would work that pgstat_send_wal() takes the difference of that values between two successive calls. WalUsage prevWalUsage; ... pgstat_send_wal() { .. /* fill in some values using pgWalUsage */ WalStats.m_wal_bytes = pgWalUsage.wal_bytes - prevWalUsage.wal_bytes; WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records; WalStats.m_wal_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi; ... 
pgstat_send(&WalStats, sizeof(WalStats)); /* remember the current numbers */ prevWalUsage = pgWalUsage; Thanks for Horiguchi-san's advice, I changed to reuse pgWalUsage which is already defined and eliminates the extra overhead. + /* fill in some values using pgWalUsage */ + WalStats.m_wal_bytes = pgWalUsage.wal_bytes - prevWalUsage.wal_bytes; + WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records; + WalStats.m_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi; It's better to use WalUsageAccumDiff() here? Yes, thanks. I fixed it. prevWalUsage needs to be initialized with pgWalUsage? + if (AmWalWriterProcess()){ + WalStats.m_wal_write_walwriter++; + } + else + { + WalStats.m_wal_write_backend++; + } I think that it's better not to separate m_wal_write_xxx into two for walwriter and other processes. Instead, we can use one m_wal_write_xxx counter and make pgstat_send_wal() send also the process type to the stats collector. Then the stats collector can accumulate the counters per process type if necessary. If we adopt this approach, we can easily extend pg_stat_wal so that any fields can be reported per process type. I'll remove the above source code because these counters are not useful. On 2020-10-30 12:00, Fujii Masao wrote: On 2020/10/20 11:31, Masahiro Ikeda wrote: Hi, I think we need to add some statistics to pg_stat_wal view. Although there are some parameter related WAL, there are few statistics for tuning them. I think it's better to provide the following statistics. Please let me know your comments. ``` postgres=# SELECT * from pg_stat_wal; -[ RECORD 1 ]---+-- wal_records | 2000224 wal_fpi | 47 wal_bytes | 248216337 wal_buffers_full | 20954 wal_init_file | 8 wal_write_backend | 20960 wal_write_walwriter | 46 wal_write_time | 51 wal_sync_backend | 7 wal_sync_walwriter | 8 wal_sync_time | 0 stats_reset | 2020-10-20 11:04:51.307771+09 ``` 1. 
Basic statistics of WAL activity
- wal_records: Total number of WAL records generated
- wal_fpi: Total number of WAL full page images generated
- wal_bytes: Total amount of WAL bytes generated

To understand DB performance, first, we will check the performance trends for the entire database instance. For example, if the number of wal_fpi becomes higher, users may tune "wal_compression", "checkpoint_timeout" and so on. Although users can check the above statistics via EXPLAIN, auto_explain, autovacuum and pg_stat_statements now, if users want to see the performance trends for the entire database, they must recalculate the statistics. I think it is useful to add the sum of the basic statistics.

2. WAL segment file creation
- wal_init_file: Total number of WAL segment files created.

Creating a new WAL file may have an impact on the performance of a write-heavy workload
Re: autovac issue with large number of tables
Hi, On Mon, Nov 30, 2020 at 8:59 PM Fujii Masao wrote: > > > > On 2020/11/30 10:43, Masahiko Sawada wrote: > > On Sun, Nov 29, 2020 at 10:34 PM Kasahara Tatsuhito > > wrote: > >> > >> Hi, Thanks for you comments. > >> > >> On Fri, Nov 27, 2020 at 9:51 PM Fujii Masao > >> wrote: > >>> > >>> > >>> > >>> On 2020/11/27 18:38, Kasahara Tatsuhito wrote: > Hi, > > On Fri, Nov 27, 2020 at 1:43 AM Fujii Masao > wrote: > > > > > > > > On 2020/11/26 10:41, Kasahara Tatsuhito wrote: > >> On Wed, Nov 25, 2020 at 8:46 PM Masahiko Sawada > >> wrote: > >>> > >>> On Wed, Nov 25, 2020 at 4:18 PM Kasahara Tatsuhito > >>> wrote: > > Hi, > > On Wed, Nov 25, 2020 at 2:17 PM Masahiko Sawada > wrote: > > > > On Fri, Sep 4, 2020 at 7:50 PM Kasahara Tatsuhito > > wrote: > >> > >> Hi, > >> > >> On Wed, Sep 2, 2020 at 2:10 AM Kasahara Tatsuhito > >> wrote: > I wonder if we could have table_recheck_autovac do two probes of > the stats > data. First probe the existing stats data, and if it shows the > table to > be already vacuumed, return immediately. If not, *then* force a > stats > re-read, and check a second time. > >>> Does the above mean that the second and subsequent > >>> table_recheck_autovac() > >>> will be improved to first check using the previous refreshed > >>> statistics? > >>> I think that certainly works. > >>> > >>> If that's correct, I'll try to create a patch for the PoC > >> > >> I still don't know how to reproduce Jim's troubles, but I was able > >> to reproduce > >> what was probably a very similar problem. > >> > >> This problem seems to be more likely to occur in cases where you > >> have > >> a large number of tables, > >> i.e., a large amount of stats, and many small tables need VACUUM at > >> the same time. > >> > >> So I followed Tom's advice and created a patch for the PoC. 
> >> This patch will enable a flag in the table_recheck_autovac function to use
> >> the existing stats next time if VACUUM (or ANALYZE) has already been done
> >> by another worker on the check after the stats have been updated.
> >> If the tables continue to require VACUUM after the refresh, then a refresh
> >> will be required instead of using the existing statistics.
> >>
> >> I did a simple test with HEAD and HEAD + this PoC patch.
> >> The tests were conducted in two cases.
> >> (I changed a few configurations; see attached scripts)
> >>
> >> 1. Normal VACUUM case
> >>   - SET autovacuum = off
> >>   - CREATE tables with 100 rows
> >>   - DELETE 90 rows for each table
> >>   - SET autovacuum = on and restart PostgreSQL
> >>   - Measure the time it takes for all tables to be VACUUMed
> >>
> >> 2. Anti-wraparound VACUUM case
> >>   - CREATE blank tables
> >>   - SELECT all of these tables (to generate stats)
> >>   - SET autovacuum_freeze_max_age to low values and restart PostgreSQL
> >>   - Consume a lot of XIDs by using txid_current()
> >>   - Measure the time it takes for all tables to be VACUUMed
> >>
> >> For each test case, the following results were obtained by changing
> >> autovacuum_max_workers parameters to 1, 2, 3(def), 5 and 10.
> >> Also changing num of tables to 1000, 5000, 1 and 2.
> >>
> >> Due to the poor VM environment (2 VCPU/4 GB), the results are a
> >> little unstable, but I think it's enough to ask for a trend.
> >>
> >> ===
> >> [1. Normal VACUUM case]
> >> tables:1000
> >>   autovacuum_max_workers 1:  (HEAD) 20 sec VS (with patch) 20 sec
> >>   autovacuum_max_workers 2:  (HEAD) 18 sec VS (with patch) 16 sec
> >>   autovacuum_max_workers 3:  (HEAD) 18 sec VS (with patch) 16 sec
> >>   autovacuum_max_workers 5:  (HEAD) 19 sec VS (with patch) 17 sec
> >>   autovacuum_max_workers 10: (HEAD) 19 sec VS (with patch) 17 sec
> >>
> >> tables:5000
> >>   autovacuum_max_workers 1:  (HEAD) 77 sec VS (with patch) 78 sec
> >>   autovacuum_max_workers 2:  (HEAD) 61 sec VS (with patch) 43 sec
> >>   autovacuum_max_workers 3:  (HEAD) 38 sec
PG vs LLVM 12 on seawasp, next round
Hi Since seawasp's bleeding edge LLVM installation moved to "trunk 20201114 c8f4e06b 12.0.0" ~16 days ago, it has been red. Further updates didn't help it and it's now on "trunk 20201127 6ee22ca6 12.0.0". I wonder if there is something in Fabien's scripting that needs to be tweaked, perhaps a symlink name or similar. I don't follow LLVM development but I found my way to a commit[1] around the right time that mentions breaking up the OrcJIT library, so *shrug* maybe that's a clue. +ERROR: could not load library "/home/fabien/pg/build-farm-11/buildroot/HEAD/pgsql.build/tmp_install/home/fabien/pg/build-farm-11/buildroot/HEAD/inst/lib/postgresql/llvmjit.so": libLLVMOrcJIT.so.12git: cannot open shared object file: No such file or directory [1] https://github.com/llvm/llvm-project/commit/1d0676b54c4e3a517719220def96dfdbc26d8048
TAP test utility module 'PG_LSN.pm'
Hi all I'd like to share the attached PG_LSN.pm module that I use when writing TAP tests. I suggest that it be considered for inclusion in core. It defines a Perl datatype PG_LSN with operator support, so you can write things like cmp_ok($got_lsn, "<", $expected_lsn, "testname") in TAP tests and get sensible results without any concern for LSN representation details, locale, etc. You can subtract LSNs to get a byte difference too. It's small but I've found it handy. PG_LSN.pm Description: Perl program
Re: Printing backtrace of postgres processes
Andres Freund writes: > It should be quite doable to emit such backtraces directly to stderr, > instead of using appendStringInfoString()/elog(). No, please no. (1) On lots of logging setups (think syslog), anything that goes to stderr is just going to wind up in the bit bucket. I realize that we have that issue already for memory context dumps on OOM errors, but that doesn't make it a good thing. (2) You couldn't really write "to stderr", only to fileno(stderr), creating issues about interleaving of the output with regular stderr output. For instance it's quite likely that the backtrace would appear before stderr output that had actually been emitted earlier, which'd be tremendously confusing. (3) This isn't going to do anything good for my concerns about interleaved output from different processes, either. regards, tom lane
Re: [HACKERS] logical decoding of two-phase transactions
On Tue, Dec 1, 2020 at 7:55 AM Ajin Cherian wrote: > > On Tue, Dec 1, 2020 at 12:46 AM Amit Kapila wrote: > > > > > Sure, but you can see in your example above it got skipped due to > > start_decoding_at not due to DecodingContextReady. So, the problem as > > mentioned by me previously was how we distinguish those cases because > > it can skip due to start_decoding_at during restart as well when we > > would have already sent the prepare to the subscriber. > > The distinguishing factor is that at restart, the Prepare does satisfy > DecodingContextReady (because the snapshot is consistent then). > In both cases, the prepare is prior to start_decoding_at, but when the > prepare is before a consistent point, > it does not satisfy DecodingContextReady. > I think it won't be true when we reuse some already serialized snapshot from some other slot. It is possible that we wouldn't have encountered such a serialized snapshot while creating a slot but later during replication, we might use it because by that time some other slot has serialized the one at that point. -- With Regards, Amit Kapila.
Re: runtime error copying oids field
Alvaro, et al: Please let me know how to proceed with the patch. Running test suite with the patch showed no regression. Cheers On Mon, Nov 30, 2020 at 3:24 PM Zhihong Yu wrote: > Hi, > See attached patch which is along the line Alvaro outlined. > > Cheers > > On Mon, Nov 30, 2020 at 3:01 PM Alvaro Herrera > wrote: > >> On 2020-Nov-30, Zhihong Yu wrote: >> >> > This was the line runtime error was raised: >> > >> > memcpy(part_oids, partdesc->oids, sizeof(Oid) * nparts); >> > >> > From RelationBuildPartitionDesc we can see that: >> > >> > if (nparts > 0) >> > { >> > PartitionBoundInfo boundinfo; >> > int*mapping; >> > int next_index = 0; >> > >> > result->oids = (Oid *) palloc0(nparts * sizeof(Oid)); >> > >> > The cause was oids field was not assigned due to nparts being 0. >> > This is verified by additional logging added just prior to the memcpy >> call. >> > >> > I want to get the community's opinion on whether a null check should be >> > added prior to the memcpy() call. >> >> As far as I understand, we do want to avoid memcpy's of null pointers; >> see [1]. >> >> In this case I think it'd be sane to skip the complete block, not just >> the memcpy, something like >> >> diff --git a/src/backend/commands/indexcmds.c >> b/src/backend/commands/indexcmds.c >> index ca24620fd0..d35deb433a 100644 >> --- a/src/backend/commands/indexcmds.c >> +++ b/src/backend/commands/indexcmds.c >> @@ -1163,15 +1163,17 @@ DefineIndex(Oid relationId, >> >> if (partitioned) >> { >> + PartitionDesc partdesc; >> + >> /* >> * Unless caller specified to skip this step (via ONLY), >> process each >> * partition to make sure they all contain a >> corresponding index. >> * >> * If we're called internally (no stmt->relation), >> recurse always. 
>> */ >> - if (!stmt->relation || stmt->relation->inh) >> + partdesc = RelationGetPartitionDesc(rel); >> + if ((!stmt->relation || stmt->relation->inh) && >> partdesc->nparts > 0) >> { >> - PartitionDesc partdesc = >> RelationGetPartitionDesc(rel); >> int nparts = partdesc->nparts; >> Oid*part_oids = >> palloc(sizeof(Oid) * nparts); >> boolinvalidate_parent = false; >> >> [1] >> https://www.postgresql.org/message-id/flat/20200904023648.GB3426768%40rfd.leadboat.com >> >
Confusing behavior of psql's \e
If you quit the editor without saving, the current query buffer or the last executed SQL statement gets run. This can be annoying and disruptive, and it requires you to empty the file and save it if you don't want to execute anything. But when editing a script, it is a clear POLA violation:

test=> \! cat q.sql
SELECT 99;
test=> SELECT 42;
 ?column?
----------
       42
(1 row)
test=> \e q.sql
[quit the editor without saving]
 ?column?
----------
       42
(1 row)

This is pretty bad: you either have to re-run the previous statement or you have to empty your script file. Both are unappealing. I have been annoyed about this myself, and I have heard other people complain about it, so I propose to clear the query buffer if the editor exits without modifying the file. This behavior is much more intuitive for me. Yours, Laurenz Albe From 209e4a0ab51f88a82e1fc0d4e4efd24d38a7d26c Mon Sep 17 00:00:00 2001 From: Laurenz Albe Date: Tue, 1 Dec 2020 04:28:50 +0100 Subject: [PATCH] Discard query buffer if editor is quit in \e Before, the current query buffer or the previous query was executed when you quit the editor without saving. This was frequently annoying, but downright confusing if \e was used to edit a script file, because it would then execute the previous command rather than the script file. --- doc/src/sgml/ref/psql-ref.sgml | 4 +++- src/bin/psql/command.c | 2 ++ 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index 221a967bfe..002ed38fe7 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -1949,7 +1949,9 @@ testdb=> -The new contents of the query buffer are then re-parsed according to +If the editor is quit without modifying the file, the query buffer +is cleared. +Otherwise, the new contents of the query buffer are re-parsed according to the normal rules of psql, treating the whole buffer as a single line.
Any complete queries are immediately executed; that is, if the query buffer contains or ends with a diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c index c7a83d5dfc..ffc5d209bc 100644 --- a/src/bin/psql/command.c +++ b/src/bin/psql/command.c @@ -3838,6 +3838,8 @@ do_edit(const char *filename_arg, PQExpBuffer query_buf, fclose(stream); } } + else + resetPQExpBuffer(query_buf); /* remove temp file */ if (!filename_arg) -- 2.26.2
Re: Printing backtrace of postgres processes
On Tue, 1 Dec 2020 at 11:31, Andres Freund wrote: > > Hi, > > On 2020-11-30 13:35:46 +0800, Craig Ringer wrote: > > I find that when I most often want a backtrace of a running, live > > backend, it's because the backend is doing something that isn't > > passing a CHECK_FOR_INTERRUPTS() so it's not responding to signals. So > > it wouldn't help if a backend is waiting on an LWLock, busy in a > > blocking call to some loaded library, a blocking syscall, etc. But > > there are enough other times I want live backtraces, and I'm not the > > only one whose needs matter. > > Random thought: Wonder if it could be worth adding a conditionally > compiled mode where we track what the longest time between two > CHECK_FOR_INTERRUPTS() calls is (with some extra logic for client > IO). > > Obviously the regression tests don't tend to hit the worst cases of > CFR() less code, but even if they did, we currently wouldn't know from > running the regression tests. We can probably determine that just as well with a perf or systemtap run on an --enable-dtrace build. Just tag CHECK_FOR_INTERRUPTS() with a SDT marker then record the timings. It might be convenient to have it built-in I guess, but if we tag the site and do the timing/tracing externally we don't have to bother about conditional compilation and special builds.
Re: Printing backtrace of postgres processes
Hi, On 2020-11-30 13:35:46 +0800, Craig Ringer wrote: > I find that when I most often want a backtrace of a running, live > backend, it's because the backend is doing something that isn't > passing a CHECK_FOR_INTERRUPTS() so it's not responding to signals. So > it wouldn't help if a backend is waiting on an LWLock, busy in a > blocking call to some loaded library, a blocking syscall, etc. But > there are enough other times I want live backtraces, and I'm not the > only one whose needs matter. Random thought: Wonder if it could be worth adding a conditionally compiled mode where we track what the longest time between two CHECK_FOR_INTERRUPTS() calls is (with some extra logic for client IO). Obviously the regression tests don't tend to hit the worst cases of CFR() less code, but even if they did, we currently wouldn't know from running the regression tests. Greetings, Andres Freund
Re: Printing backtrace of postgres processes
Hi, On 2020-11-22 01:25:08 -0500, Tom Lane wrote: > Surely this is *utterly* unsafe. You can't do that sort of stuff in > a signal handler. That's of course true for the current implementation - but I don't think it's a fundamental constraint. With a bit of care backtrace() and backtrace_symbols() itself can be signal safe: > backtrace() and backtrace_symbols_fd() don't call malloc() explicitly, > but they are part of libgcc, which gets loaded dynamically when first > used. Dynamic loading usually triggers a call to malloc(3). If you need > certain calls to these two functions to not allocate memory (in signal > handlers, for example), you need to make sure libgcc is loaded beforehand. It should be quite doable to emit such backtraces directly to stderr, instead of using appendStringInfoString()/elog(). Or even use a static buffer. It does have quite some appeal to be able to debug production workloads where queries can't be cancelled etc. And knowing that backtraces reliably work in case of SIGQUIT etc is also nice... > I would like to see some discussion of the security implications > of such a feature, as well. ("There aren't any" is the wrong > answer.) +1 Greetings, Andres Freund
Re: Improving spin-lock implementation on ARM.
On Tue, 1 Dec 2020 at 02:31, Alexander Korotkov wrote: > On Mon, Nov 30, 2020 at 7:00 AM Krunal Bauskar > wrote: > > 3. Problem with GCC approach is still a lot of distro don't support gcc > 9.4 as default. > > To use this approach: > > * PGSQL will have to roll out its packages using gcc-9.4+ only so > that they are compatible with all aarch64 machines > > * but this continues to affect all other users who tend to build > pgsql using standard distro based compiler. (unless they upgrade compiler). > > I think if a user, who runs PostgreSQL on a multicore machine with > high-concurrent load, can take care about installing the appropriate > package/using the appropriate compiler (especially if we publish > explicit instructions for that). In the same way such advanced users > tune Linux kernel etc. > > BTW, how do you get that required gcc version is 9.4? I've managed to > use LSE with gcc 9.3. > Did they backport it to 9.3? I am just looking at the gcc guide. https://gcc.gnu.org/gcc-9/changes.html GCC 9.4 / Target Specific Changes / AArch64: "The option -moutline-atomics has been added to aid deployment of the Large System Extensions (LSE)" > > -- > Regards, > Alexander Korotkov > -- Regards, Krunal Bauskar
Re: Huge memory consumption on partitioned table with FKs
On Mon, Nov 30, 2020 at 9:48 PM Tom Lane wrote: > Corey Huinker writes: > > Given that we're already looking at these checks, I was wondering if this > > might be the time to consider implementing these checks by directly > > scanning the constraint index. > > Yeah, maybe. Certainly ri_triggers is putting a huge amount of effort > into working around the SPI/parser/planner layer, to not a lot of gain. > > However, it's not clear to me that that line of thought will work well > for the statement-level-trigger approach. In that case you might be > dealing with enough tuples to make a different plan advisable. > > regards, tom lane > Bypassing SPI would probably mean that we stay with row level triggers, and the cached query plan would go away, perhaps replaced by an already-looked-up-this-tuple hash sorta like what the cached nested loops effort is doing. I've been meaning to give this a try when I got some spare time. This may inspire me to try again.
Re: Printing backtrace of postgres processes
On Tue, 1 Dec 2020 at 07:04, Tom Lane wrote: > I'd feel better about it if the mechanism had you specify exactly > one target process, and were restricted to a superuser requestor. Er, rather. I actually assumed the former was the case already, not having looked closely yet.
Re: BUG #16663: DROP INDEX did not free up disk space: idle connection hold file marked as deleted
On Tue, Dec 1, 2020 at 3:55 PM Michael Paquier wrote: > On Mon, Nov 30, 2020 at 06:59:40PM +1300, Thomas Munro wrote: > > So I fixed that, by adding a return value to do_truncate() and > > checking it. That's the version I plan to commit tomorrow, unless > > there are further comments or objections. I've also attached a > > version suitable for REL_11_STABLE and earlier branches (with a name > > that cfbot should ignore), where things are slightly different. In > > those branches, the register_forget_request() logic is elsewhere. > > Hmm. Sorry for arriving late at the party. But is that really > something suitable for a backpatch? Sure, it is not optimal to not > truncate all the segments when a transaction dropping a relation > commits, but this was not completely broken either. I felt on balance it was a "bug", since it causes operational difficulties for people and was clearly not our intended behaviour, and I announced this intention 6 weeks ago. Of course I'll be happy to revert it from the back-branches if that's the consensus. Any other opinions?
Re: BUG #16663: DROP INDEX did not free up disk space: idle connection hold file marked as deleted
On Mon, Nov 30, 2020 at 06:59:40PM +1300, Thomas Munro wrote: > So I fixed that, by adding a return value to do_truncate() and > checking it. That's the version I plan to commit tomorrow, unless > there are further comments or objections. I've also attached a > version suitable for REL_11_STABLE and earlier branches (with a name > that cfbot should ignore), where things are slightly different. In > those branches, the register_forget_request() logic is elsewhere. Hmm. Sorry for arriving late at the party. But is that really something suitable for a backpatch? Sure, it is not optimal to not truncate all the segments when a transaction dropping a relation commits, but this was not completely broken either. -- Michael signature.asc Description: PGP signature
Re: BUG #16663: DROP INDEX did not free up disk space: idle connection hold file marked as deleted
On Mon, Nov 30, 2020 at 6:59 PM Thomas Munro wrote: > On Wed, Nov 25, 2020 at 8:00 AM Pavel Borisov wrote: > > The new status of this patch is: Ready for Committer > ... That's the version I plan to commit tomorrow, unless > there are further comments or objections. ... Done, and back-patched. I thought a bit more about the fact that we fail to unlink higher-numbered segments in certain error cases, potentially leaving stray files behind. As far as I can see, nothing we do in this code-path is going to be a bullet-proof solution to that problem. One simple idea would be for the checkpointer to refuse to unlink segment 0 (thereby allowing the relfilenode to be recycled) until it has scanned the parent directory for any related files that shouldn't be there. > While looking at trace output, I figured we should just use > truncate(2) on non-Windows, on the master branch only. It's not like > it really makes much difference, but I don't see why we shouldn't > allow ourselves to use ancient standardised Unix syscalls when we can. Also pushed, but only to master.
Re: Huge memory consumption on partitioned table with FKs
Corey Huinker writes: > Given that we're already looking at these checks, I was wondering if this > might be the time to consider implementing these checks by directly > scanning the constraint index. Yeah, maybe. Certainly ri_triggers is putting a huge amount of effort into working around the SPI/parser/planner layer, to not a lot of gain. However, it's not clear to me that that line of thought will work well for the statement-level-trigger approach. In that case you might be dealing with enough tuples to make a different plan advisable. regards, tom lane
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On Mon, Nov 30, 2020 at 05:12:42PM +0300, Alexey Kondratov wrote: > Thanks. I have rebased the remaining patches on top of 873ea9ee to use > 'utility_option_list' instead of 'common_option_list'. Thanks, that helps a lot. I have gone through 0002, and tweaked it as the attached (note that this patch is also interesting for another thing in development: backend-side reindex filtering of collation-sensitive indexes). Does that look right to you? These are mostly matters of consistency with the other commands using DefElem, but I think that it is important to get things right: - Having the list of options in parsenodes.h becomes incorrect, because these are now only used at execution time, like VACUUM. So I have moved that to cluster.h and index.h. - Let's use an enum for REINDEX, like the others. - Having parse_reindex_params() in utility.c is wrong for something aimed at being used only for REINDEX, so I have moved that to indexcmds.c, and renamed the routine to be more consistent with the rest. I think that we could do more here by having an ExecReindex() that does all the work based on object types, but I have left that out for now to keep the change minimal. - Switched one of the existing tests to stress CONCURRENTLY within parentheses. - Indented the whole thing. A couple of extra things below. * CLUSTER [VERBOSE] [ USING ] + * CLUSTER [VERBOSE] [(options)] [ USING ] This line is wrong, and should be: CLUSTER [ (options) ] [ USING ] +CLUSTER [VERBOSE] [ ( option +CLUSTER [VERBOSE] [ ( option [, ...] ) ] The docs in cluster.sgml are wrong as well, you can have VERBOSE as a single option or as a parenthesized option, but never both at the same time. On the contrary, psql completion got that right.
I was first a bit surprised that you would not allow the parenthesized set for the case where a relation is not specified in the command, but I agree that this does not seem worth the extra complexity now as this thread aims at being able to use TABLESPACE which makes little sense database-wide. -VERBOSE +VERBOSE [ boolean ] Forgot about CONCURRENTLY as an option here, as this becomes possible. -- Michael diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h index f4559b09d7..c041628049 100644 --- a/src/include/catalog/index.h +++ b/src/include/catalog/index.h @@ -29,6 +29,15 @@ typedef enum INDEX_DROP_SET_DEAD } IndexStateFlagsAction; +/* options for REINDEX */ +typedef enum ReindexOption +{ + REINDEXOPT_VERBOSE = 1 << 0, /* print progress info */ + REINDEXOPT_REPORT_PROGRESS = 1 << 1, /* report pgstat progress */ + REINDEXOPT_MISSING_OK = 1 << 2, /* skip missing relations */ + REINDEXOPT_CONCURRENTLY = 1 << 3 /* concurrent mode */ +} ReindexOption; + /* state info for validate_index bulkdelete callback */ typedef struct ValidateIndexState { diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h index e05884781b..7cfb37c9b2 100644 --- a/src/include/commands/cluster.h +++ b/src/include/commands/cluster.h @@ -14,11 +14,19 @@ #define CLUSTER_H #include "nodes/parsenodes.h" +#include "parser/parse_node.h" #include "storage/lock.h" #include "utils/relcache.h" -extern void cluster(ClusterStmt *stmt, bool isTopLevel); +/* options for CLUSTER */ +typedef enum ClusterOption +{ + CLUOPT_RECHECK = 1 << 0, /* recheck relation state */ + CLUOPT_VERBOSE = 1 << 1 /* print progress info */ +} ClusterOption; + +extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel); extern void cluster_rel(Oid tableOid, Oid indexOid, int options); extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid, bool recheck, LOCKMODE lockmode); diff --git a/src/include/commands/defrem.h b/src/include/commands/defrem.h index 
7a079ef07f..1133ae1143 100644 --- a/src/include/commands/defrem.h +++ b/src/include/commands/defrem.h @@ -34,6 +34,7 @@ extern ObjectAddress DefineIndex(Oid relationId, bool check_not_in_use, bool skip_build, bool quiet); +extern int ReindexParseOptions(ParseState *pstate, ReindexStmt *stmt); extern void ReindexIndex(RangeVar *indexRelation, int options, bool isTopLevel); extern Oid ReindexTable(RangeVar *relation, int options, bool isTopLevel); extern void ReindexMultipleTables(const char *objectName, ReindexObjectType objectKind, diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index d1f9ef29ca..d6a6969b0d 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -3196,18 +3196,12 @@ typedef struct AlterSystemStmt * Cluster Statement (support pbrown's cluster index implementation) * -- */ -typedef enum ClusterOption -{ - CLUOPT_RECHECK = 1 << 0, /* recheck relation state */ - CLUOPT_VERBOSE = 1 << 1 /* print progress info */ -} ClusterOption; - typedef struct ClusterStmt { NodeTag type; RangeVar *relation; /* relation being indexed, or NULL if all */ char *indexname;
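The ReindexOption and ClusterOption enums in the patch above use the bit-flag pattern (`1 << n` values) so several options can be packed into a single `int` options word. A minimal, self-contained illustration of how such flags are combined and tested — the enum values mirror the patch, but the helper function is a hypothetical stand-in, not PostgreSQL code:

```c
#include <assert.h>
#include <stdbool.h>

/* Bit-flag options, with values as in the patch's ReindexOption enum */
typedef enum ReindexOption
{
	REINDEXOPT_VERBOSE = 1 << 0,		/* print progress info */
	REINDEXOPT_REPORT_PROGRESS = 1 << 1,	/* report pgstat progress */
	REINDEXOPT_MISSING_OK = 1 << 2,		/* skip missing relations */
	REINDEXOPT_CONCURRENTLY = 1 << 3	/* concurrent mode */
} ReindexOption;

/* Hypothetical helper: is a given flag set in an options word? */
static bool
reindex_option_set(int options, ReindexOption flag)
{
	return (options & flag) != 0;
}
```

Because each flag occupies a distinct bit, callers can OR several options together (e.g. `REINDEXOPT_VERBOSE | REINDEXOPT_CONCURRENTLY`) and pass them through the existing `int options` parameters unchanged.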
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Kyotaro Horiguchi > We are relying on the "fact" that the first lseek() call of a > (startup) process tells the truth. We added an assertion so that we > make sure that the cached value won't be cleared during recovery. A > possible remaining danger would be closing of an smgr object of a live > relation just after a file extension failure. I think we are thinking > that that doesn't happen during recovery. Although it seems to me > true, I'm not confident. > > If that's true, we don't even need to look at the "cached" flag at all > and always be able to rely on the returned value from smgrnblocks() > during recovery. Otherwise, we need to avoid the danger situation. Hmm, I've gotten to think that smgrnblocks() doesn't need the cached parameter, too. DropRel*Buffers() can just check InRecovery. Regarding the only concern about smgrclose() by startup process, I was afraid of the cache invalidation by CacheInvalidateSmgr(), but startup process doesn't receive shared inval messages. So, it doesn't call smgrclose*() due to shared cache invalidation. [InitRecoveryTransactionEnvironment()] /* * Initialize shared invalidation management for Startup process, being * careful to register ourselves as a sendOnly process so we don't need to * read messages, nor will we get signaled when the queue starts filling * up. */ SharedInvalBackendInit(true); Kirk-san, Can you check to see if smgrclose() and its friends are not called during recovery using the regression test? Regards Takayuki Tsunakawa
Re: Huge memory consumption on partitioned table with FKs
> > I think this can be solved easily in the patch, by having > ri_BuildQueryKey() compare the parent's fk_attnums to the parent; if > they are equal then use the parent's constraint_id, otherwise use the > child constraint. That way, the cache entry is reused in the common > case where they are identical. > Somewhat of a detour, but in reviewing the patch for Statement-Level RI checks, Andres and I observed that SPI made for a large portion of the RI overhead. Given that we're already looking at these checks, I was wondering if this might be the time to consider implementing these checks by directly scanning the constraint index.
Re: Re: [PATCH] BUG FIX: Core dump could happen when VACUUM FULL in standalone mode
I think you are right after reading code in compute_parallel_vacuum_workers() :) From: Tom Lane Sent: December 1, 2020 2:54 To: Yulin PEI Cc: Masahiko Sawada ; pgsql-hackers@lists.postgresql.org Subject: Re: Re: [PATCH] BUG FIX: Core dump could happen when VACUUM FULL in standalone mode Yulin PEI writes: > Yes, I agree because (IsNormalProcessingMode() ) means that current process > is not in bootstrap mode and postmaster process will not build index. > So my new modified patch is attached. This is a good catch, but the proposed fix still seems pretty random and unlike how it's done elsewhere. It seems to me that since index_build() is relying on plan_create_index_workers() to assess parallel safety, that's where to check IsUnderPostmaster. Moreover, the existing code in compute_parallel_vacuum_workers (which gets this right) associates the IsUnderPostmaster check with the initial check on max_parallel_maintenance_workers. So I think that the right fix is to adopt the compute_parallel_vacuum_workers coding in plan_create_index_workers, and thereby create a model for future uses of max_parallel_maintenance_workers to follow. regards, tom lane
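The coding Tom recommends pairs the IsUnderPostmaster test with the initial max_parallel_maintenance_workers check, so that standalone (single-user) mode always computes zero workers before anything else is considered. A simplified sketch of that gating logic — the function name and the `under_postmaster` flag are stand-ins for illustration, not PostgreSQL's actual symbols:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for PostgreSQL's IsUnderPostmaster global */
static bool under_postmaster = true;

/*
 * Sketch of the guard pattern: parallel workers can only be launched by
 * a postmaster, so standalone mode must yield zero workers up front,
 * in the same test that checks the GUC-style worker limit.
 */
static int
compute_maintenance_workers(int max_parallel_maintenance_workers,
							int requested)
{
	if (!under_postmaster || max_parallel_maintenance_workers == 0)
		return 0;
	return requested < max_parallel_maintenance_workers
		? requested : max_parallel_maintenance_workers;
}
```

Putting the postmaster check at the same spot as the worker-limit check is what makes this a reusable model: any future caller consulting the limit gets the standalone-mode behavior for free.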
Re: [HACKERS] logical decoding of two-phase transactions
On Tue, Dec 1, 2020 at 12:46 AM Amit Kapila wrote: > So what caused it to skip due to start_decoding_at? Because the commit > where the snapshot became consistent is after Prepare. Does it happen > due to the below code in SnapBuildFindSnapshot() where we bump > start_decoding_at. > > { > ... > if (running->oldestRunningXid == running->nextXid) > { > if (builder->start_decoding_at == InvalidXLogRecPtr || > builder->start_decoding_at <= lsn) > /* can decode everything after this */ > builder->start_decoding_at = lsn + 1; I think the reason is that in the function DecodingContextFindStartpoint(), the code loops till it finds the consistent snapshot. Then once consistent snapshot is found, it sets slot->data.confirmed_flush = ctx->reader->EndRecPtr; This will be used as the start_decoding_at when the slot is restarted for decoding. > Sure, but you can see in your example above it got skipped due to > start_decoding_at not due to DecodingContextReady. So, the problem as > mentioned by me previously was how we distinguish those cases because > it can skip due to start_decoding_at during restart as well when we > would have already sent the prepare to the subscriber. The distinguishing factor is that at restart, the Prepare does satisfy DecodingContextReady (because the snapshot is consistent then). In both cases, the prepare is prior to start_decoding_at, but when the prepare is before a consistent point, it does not satisfy DecodingContextReady. Which is why I suggested using the check DecodingContextReady to mark the prepare as 'Not decoded". regards, Ajin Cherian Fujitsu Australia
Re: Fix generate_useful_gather_paths for parallel unsafe pathkeys
On Sun, Nov 29, 2020 at 4:20 PM Tomas Vondra wrote: > > On 11/29/20 3:26 PM, James Coleman wrote: > > On Mon, Nov 23, 2020 at 8:35 AM James Coleman wrote: > >> > >> On Sun, Nov 22, 2020 at 4:59 PM Tomas Vondra > >> wrote: > >>> ... > >>> Thanks for the patches, I'll take a closer look next week. The 0002 > >>> patch is clearly something we need to backpatch, not sure about 0001 as > >>> it essentially enables new behavior in some cases (Sort on unsorted > >>> paths below Gather Merge). > >> > >> Thanks for taking a look. > >> > >> I had the same question re: backporting. On the one hand it will > >> change behavior (this is always a bit of a gray area since in one > >> sense bugs change behavior), but IMO it's not a new feature, because > >> the code clearly intended to have that feature in the first place (it > >> creates gather merges on top of a full sort). So I'd lean towards > >> considering it a bug fix, but I'm also not going to make that a hill > >> to die on. > >> > >> One semi-related question: do you think we should add a comment to > >> prepare_sort_from_pathkeys explaining that modifying it may mean > >> find_em_expr_for_rel needs to be updated also? > > > > Here's an updated patch series. > > > > 0001 and 0002 as before, but with some minor cleanup. > > > > 0001 - seems fine > > 0002 - Should we rename the "parallel" parameter to something more > descriptive, like "require_parallel_safe"? Done. > > 0003 disallows SRFs in these sort expressions (per discussion over at [1]). > > > > OK, but I'd move the define from tlist.c to tlist.h, not optimizer.h. IS_SRF_CALL doesn't really seem to be particularly specific conceptually to target lists (and of course now we're using it in equivclass.c), so I'd tried to put it somewhere more general. Is moving it to tlist.h mostly to minimize changes? Or do you think it fits more naturally with the tlist code (I might be missing something)? 
> > 0004 removes the search through the target for matching volatile > expressions (per discussion at [2]). > > OK > > 0005 adds the comment I mentioned above clarifying these two functions > are linked. > > Makes sense. I wonder if we need to be more specific about how > consistent those two places need to be. Do they need to match 1:1, or > how do people in the future decide which restrictions need to be in both > places? At this point I'm not sure I'd have a good way to describe that: one is a clear superset of the other (and we already talk in the comments about the other being a subset), but it's really going to be about "what do we want to allow to be sorted proactively"...hmm, that thought made me take a swing at it; let me know what you think. James From 3cc9d13869e25b546bf113d57e5015b5ab2c2abf Mon Sep 17 00:00:00 2001 From: jcoleman Date: Wed, 25 Nov 2020 15:46:00 -0500 Subject: [PATCH v3 3/5] Disallow SRFs in proactive sort So they can be evaluated at the proper time as determined by make_sort_input_target. --- src/backend/optimizer/path/equivclass.c| 8 src/backend/optimizer/util/tlist.c | 5 - src/include/optimizer/optimizer.h | 5 + src/test/regress/expected/incremental_sort.out | 12 src/test/regress/sql/incremental_sort.sql | 2 ++ 5 files changed, 27 insertions(+), 5 deletions(-) diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c index 75cb1750a8..8c3ef98d73 100644 --- a/src/backend/optimizer/path/equivclass.c +++ b/src/backend/optimizer/path/equivclass.c @@ -835,6 +835,14 @@ find_em_expr_usable_for_sorting_rel(PlannerInfo *root, EquivalenceClass *ec, Rel if (require_parallel_safe && !is_parallel_safe(root, (Node *) em_expr)) continue; + + /* + * Disallow SRFs so that all of them can be evaluated at the correct + * time as determined by make_sort_input_target. 
+ */ + if (IS_SRF_CALL((Node *) em_expr)) + continue; + /* * As long as the expression isn't volatile then * prepare_sort_from_pathkeys is able to generate a new target entry, diff --git a/src/backend/optimizer/util/tlist.c b/src/backend/optimizer/util/tlist.c index 02a3c6b165..01cea102ea 100644 --- a/src/backend/optimizer/util/tlist.c +++ b/src/backend/optimizer/util/tlist.c @@ -21,11 +21,6 @@ #include "optimizer/tlist.h" -/* Test if an expression node represents a SRF call. Beware multiple eval! */ -#define IS_SRF_CALL(node) \ - ((IsA(node, FuncExpr) && ((FuncExpr *) (node))->funcretset) || \ - (IsA(node, OpExpr) && ((OpExpr *) (node))->opretset)) - /* * Data structures for split_pathtarget_at_srfs(). To preserve the identity * of sortgroupref items even if they are textually equal(), what we track is diff --git a/src/include/optimizer/optimizer.h b/src/include/optimizer/optimizer.h index 3e4171056e..1e135652b9 100644 --- a/src/include/optimizer/optimizer.h +++ b/src/include/optimizer/optimizer.h @@ -24,6 +24,11 @@ #include "nodes/parsenodes.h" +/* Test if an ex
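The "Beware multiple eval!" warning on the IS_SRF_CALL macro being moved above is the standard caveat for function-like macros that expand their argument more than once: an argument with side effects gets executed once per expansion. A small, self-contained demonstration of the hazard (the macro and counter here are illustrative, unrelated to PostgreSQL's actual node types):

```c
#include <assert.h>

/* A macro that, like IS_SRF_CALL, expands its argument twice */
#define IS_NONZERO_AND_NOT_NEG1(x) (((x) != 0) && ((x) != -1))

/* Counter exposing how many times the macro argument is evaluated */
static int eval_count = 0;

static int
counted(int v)
{
	eval_count++;
	return v;
}
```

Passing `counted(5)` to the macro runs the function twice (both `!=` tests fire), while `counted(0)` runs it only once because `&&` short-circuits — exactly the kind of surprise the comment warns callers about.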
Re: VACUUM (DISABLE_PAGE_SKIPPING on)
On Fri, Nov 20, 2020 at 8:47 PM Simon Riggs wrote: > > On Fri, 20 Nov 2020 at 10:15, Simon Riggs wrote: > > > > On Fri, 20 Nov 2020 at 01:40, Masahiko Sawada wrote: > > > > > > On Thu, Nov 19, 2020 at 8:02 PM Simon Riggs wrote: > > > > > > > > On Wed, 18 Nov 2020 at 17:59, Robert Haas wrote: > > > > > > > > > > On Wed, Nov 18, 2020 at 12:54 PM Simon Riggs > > > > > wrote: > > > > > > Patches attached. > > > > > > 1. vacuum_anti_wraparound.v2.patch > > > > > > 2. vacuumdb_anti_wrap.v1.patch - depends upon (1) > > > > > > > > > > I don't like the use of ANTI_WRAPAROUND as a name for this new option. > > > > > Wouldn't it make more sense to call it AGGRESSIVE? Or maybe something > > > > > else, but I dislike anti-wraparound. > > > > > > > > -1 for using the term AGGRESSIVE, which seems likely to offend people. > > > > I'm sure a more descriptive term exists. > > > > > > Since we use the term aggressive scan in the docs, I personally don't > > > feel unnatural about that. But since this option also disables index > > > cleanup when not enabled explicitly, I’m concerned a bit if user might > > > get confused. I came up with some names like FREEZE_FAST and > > > FREEZE_MINIMAL but I'm not sure these are better. > > > > FREEZE_FAST seems good. > > > > > BTW if this option also disables index cleanup for faster freezing, > > > why don't we disable heap truncation as well? > > > > Good idea > > Patch attached, using the name "FAST_FREEZE" instead. > Thank you for updating the patch. Here are some comments on the patch. 
- if (params->options & VACOPT_DISABLE_PAGE_SKIPPING) + if (params->options & VACOPT_DISABLE_PAGE_SKIPPING || + params->options & VACOPT_FAST_FREEZE) I think we need to update the following comment that is above this change as well: /* * We request an aggressive scan if the table's frozen Xid is now older * than or equal to the requested Xid full-table scan limit; or if the * table's minimum MultiXactId is older than or equal to the requested * mxid full-table scan limit; or if DISABLE_PAGE_SKIPPING was specified. */ This mentions only DISABLE_PAGE_SKIPPING now. Or the second idea is to set both params.freeze_table_age and params.multixact_freeze_table_age to 0 at ExecVacuum() instead of getting aggressive turned on here. Considering the consistency between FREEZE and FREEZE_FAST, we might want to take the second option. --- + if (fast_freeze && + params.index_cleanup == VACOPT_TERNARY_DEFAULT) + params.index_cleanup = VACOPT_TERNARY_DISABLED; + + if (fast_freeze && + params.truncate == VACOPT_TERNARY_DEFAULT) + params.truncate = VACOPT_TERNARY_DISABLED; + + if (fast_freeze && freeze) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), +errmsg("cannot specify both FREEZE and FAST_FREEZE options on VACUUM"))); + I guess that you disallow enabling both FREEZE and FAST_FREEZE because it's contradictory, which makes sense to me. But it seems to me that enabling FAST_FREEZE, INDEX_CLEANUP, and TRUNCATE is also contradictory because it will no longer be “fast”. The purpose of this option is to advance relfrozenxid as fast as possible by disabling index cleanup, heap truncation etc. Is there any use case where a user wants to enable these options (FAST_FREEZE, INDEX_CLEANUP, and TRUNCATE) at the same time? If not, probably it’s better to either disallow it or have FAST_FREEZE overwrite these two settings even if the user specifies them explicitly. Regards, -- Masahiko Sawada EnterpriseDB: https://www.enterprisedb.com/
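Distilled from the patch fragment quoted above, the option handling comes down to two rules: FAST_FREEZE defaults INDEX_CLEANUP and TRUNCATE to off (when the user left them unset), and it conflicts with FREEZE. A condensed sketch of that resolution logic, using simplified stand-in types rather than the actual ExecVacuum code:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified analogue of PostgreSQL's VacOptTernaryValue */
typedef enum { TERNARY_DEFAULT, TERNARY_DISABLED, TERNARY_ENABLED } Ternary;

typedef struct VacOpts
{
	bool	freeze;
	bool	fast_freeze;
	Ternary index_cleanup;
	Ternary truncate;
} VacOpts;

/*
 * Resolve FAST_FREEZE interactions as in the patch fragment:
 * returns false (the "error" case) when FREEZE and FAST_FREEZE are
 * combined; otherwise disables index cleanup and heap truncation
 * unless the user set them explicitly.
 */
static bool
resolve_fast_freeze(VacOpts *opts)
{
	if (opts->fast_freeze && opts->freeze)
		return false;			/* cannot specify both */
	if (opts->fast_freeze && opts->index_cleanup == TERNARY_DEFAULT)
		opts->index_cleanup = TERNARY_DISABLED;
	if (opts->fast_freeze && opts->truncate == TERNARY_DEFAULT)
		opts->truncate = TERNARY_DISABLED;
	return true;
}
```

Note that under these rules an explicit INDEX_CLEANUP or TRUNCATE still survives alongside FAST_FREEZE — which is exactly the combination Sawada-san questions above as contradictory.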
RE: Disable WAL logging to speed up data loading
From: Kyotaro Horiguchi > We're emitting only redo logs. So I think theoretically we don't need > anything other than the shutdown checkpoint record because we don't > perform recovery and checkpoint record is required at startup. > > RM_XLOG_ID: > XLOG_FPI_FOR_HINT - not needed? > XLOG_FPI - not needed? > > XLOG_CHECKPOINT_SHUTDOWN - must have > > So how about the followings? > XLOG_CHECKPOINT_ONLINE > XLOG_NOOP > XLOG_NEXTOID > XLOG_SWITCH > XLOG_BACKUP_END > XLOG_PARAMETER_CHANGE > XLOG_RESTORE_POINT > XLOG_FPW_CHANGE > XLOG_END_OF_RECOVERY > > > RM_XACT_ID: > XLOG_XACT_COMMIT > XLOG_XACT_PREPARE > XLOG_XACT_ABORT > XLOG_XACT_COMMIT_PREPARED > XLOG_XACT_ABORT_PREPARED > XLOG_XACT_ASSIGNMENT > XLOG_XACT_INVALIDATIONS > > Do we need all of these? The records that need to be emitted even when wal_level = none are: RM_XLOG_ID: - XLOG_CHECKPOINT_SHUTDOWN - XLOG_PARAMETER_CHANGE RM_XACT_ID: - XLOG_XACT_PREPARE XLOG_PARAMETER_CHANGE is necessary to detect during archive recovery that wal_level was changed from >= replica to none, thus failing the archive recovery. This is for the fix discussed in this thread to change the error level from WARNING to FATAL. > Yeah, although it's enough only to restrict non-harmful records > practically, if we find that only a few kinds of records are needed, > maybe it's cleaner to allow only required record type(s). > > Maybe it's right that if we can filter-out records looking only rmid, > since the xlog facility doesn't need to know about record types of a > resource manager. But if we need to finer-grained control on the > record types, I'm afraid that that's wrong. However, if we need only > the XLOG_CHECKPOINT_SHUTDOWN record, it might be better to let > XLogInsert filter records rather than inserting that filtering code to > all the caller sites. Agreed. As the kind of WAL records to be emitted is very limited, I think XLogInsert() can filter them where the current patch does. (OTOH, the boot strap mode filters WAL coarsely as follows. 
I thought we might follow this coarse RMID-based filtering as a pessimistic safeguard against new kinds of WAL records that are necessary even in wal_level = none.) /* * In bootstrap mode, we don't actually log anything but XLOG resources; * return a phony record pointer. */ if (IsBootstrapProcessingMode() && rmid != RM_XLOG_ID) { XLogResetInsertion(); EndPos = SizeOfXLogLongPHD; /* start of 1st chkpt record */ return EndPos; } > I don't dislike "none" since it seems to me practically "none". It > seems rather correct if we actually need only the shutdown checkpoint > record. > > "unrecoverable" is apparently misleading. "crash_unsafe" is precise > but seems somewhat alien being among "logical", "replica" and > "minimal". OK, I'm happy with "none" too. We can describe in the manual that some limited amount of WAL is emitted. Regards Takayuki Tsunakawa
Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY
On Tue, Nov 03, 2020 at 08:56:06PM -0300, Alvaro Herrera wrote: > Here's an updated version of this patch. > > Apart from rebasing to current master, I made the following changes: > > * On the first transaction (the one that marks the partition as > detached), the partition is locked with ShareLock rather than > ShareUpdateExclusiveLock. This means that DML is not allowed anymore. > This seems a reasonable restriction to me; surely by the time you're > detaching the partition you're not inserting data into it anymore. I don't think it's an issue with your patch, but FYI that sounds like something I had to do recently: detach *all* partitions of various tables to promote their partition key column from timestamp to timestamptz. And we insert directly into child tables, not routed via parent. I don't doubt your patch is still useful, but not to us. So the documentation should be clear about that. > * ALTER TABLE .. DETACH PARTITION FINALIZE now waits for concurrent old > snapshots, as previously discussed. This should ensure that the user > doesn't just cancel the wait during the second transaction by Ctrl-C and > run FINALIZE immediately afterwards, which I claimed would bring > inconsistency. > > * Avoid creating the CHECK constraint if an identical one already > exists. > > (Note I do not remove the constraint on ATTACH. That seems pointless.) > > Still to do: test this using the new hook added by 6f0b632f083b. tab complete? > +* Ensure that foreign keys still hold after this detach. This keeps > lock > +* on the referencing tables, which prevent concurrent transactions > from keeps locks or which prevents > +++ b/doc/src/sgml/ref/alter_table.sgml > @@ -947,6 +950,24 @@ WITH ( MODULUS class="parameter">numeric_literal, REM >attached to the target table's indexes are detached. Any triggers that >were created as clones of those in the target table are removed. 
> > + > + If CONCURRENTLY is specified, this process runs in > two > + transactions in order to avoid blocking other sessions that might be > accessing > + the partitioned table. During the first transaction, > + SHARE UPDATE EXCLUSIVE is taken in both parent > table and missing "lock" taken *on* ? > + partition, and the partition is marked detached; at that point, the > transaction probably "its partition," > + If FINALIZE is specified, complete actions of a > + previous DETACH CONCURRENTLY invocation that > + was cancelled or crashed. say "actions are completed" or: If FINALIZE is specified, a previous DETACH that was cancelled or interrupted is completed. > + if (!inhdetached && detached) > + ereport(ERROR, > + > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("cannot complete > detaching partition \"%s\"", > + childname), > + errdetail("There's no partial > concurrent detach in progress."))); maybe say "partially-complete" or remove "partial" > + * the partition being detached? Putting them on the partition > being > + * detached would be wrong, since they'd become "lost" after > the but after *that* ? > + * Concurrent mode has to work harder; first we add a new constraint to > the > + * partition that matches the partition constraint, if there isn't a > matching > + * one already. The reason for this is that the planner may have made > + * optimizations that depend on the constraint. XXX Isn't it > sufficient to > + * invalidate the partition's relcache entry? Ha. I suggested this years ago. https://www.postgresql.org/message-id/20180601221428.gu5...@telsasoft.com |. The docs say: if detaching/re-attach a partition, should first ADD CHECK to | avoid a slow ATTACH operation. Perhaps DETACHing a partition could | implicitly CREATE a constraint which is usable when reATTACHing? 
> + * Then we close our existing transaction, and in a new one wait for > + * all process to catch up on the catalog updates we've done so far; at processes > + * We don't need to concern ourselves with waiting for a lock > the > + * partition itself, since we will acquire AccessExclusiveLock > below. lock *on* ? > + * If asked to, wait until existing snapshots are gone. This is > important > + * in the second transaction of DETACH PARTITION CONCURRENTLY is > canceled: s/in/if/ > +++ b/src/bin/psql/describe.c > - printfPQExpBuffer(&tmpbuf, _("Partition of: %s %s"), > parent_name, > - partdef); > + printfPQExpBuffer(&tmpbuf, _("Partition of: %s %s%s"), > parent_name, > +
Re: Why does create_gather_merge_plan need make_sort?
On Sun, Nov 29, 2020 at 4:10 PM Tomas Vondra wrote: > > > > On 11/29/20 3:44 PM, James Coleman wrote: > > On Mon, Nov 23, 2020 at 8:19 AM James Coleman wrote: > >> > >> .. > >> > >> So I'm +1 on turning it into an ERROR log instead, since it seems to > >> me that encountering this case would almost certainly represent a bug > >> in path generation. > > > > Here's a patch to do that. > > > > Thanks. Isn't the comment incomplete? Should it mention just parallel > safety or also volatility? Volatility makes it parallel unsafe, and I'm not sure I want to get into listing every possible issue here, but in the interest of making it more obviously representative of the kinds of issues we might encounter, I've tweaked it to mention volatility also, as well as refer to the issues as "examples" of such concerns. James From b4b08b70d8961ea29587412e9a2ef4dd39111ff0 Mon Sep 17 00:00:00 2001 From: jcoleman Date: Sun, 29 Nov 2020 09:38:59 -0500 Subject: [PATCH v2] Error if gather merge paths aren't sufficiently sorted --- src/backend/optimizer/plan/createplan.c | 14 -- 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c index 40abe6f9f6..5ecf9f4065 100644 --- a/src/backend/optimizer/plan/createplan.c +++ b/src/backend/optimizer/plan/createplan.c @@ -1793,13 +1793,15 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path) &gm_plan->nullsFirst); - /* Now, insert a Sort node if subplan isn't sufficiently ordered */ + /* + * All gather merge paths should have already guaranteed the necessary sort + * order either by adding an explicit sort node or by using presorted input. + * We can't simply add a sort here on additional pathkeys, because we can't + * guarantee the sort would be safe. For example, expressions may be + * volatile or otherwise parallel unsafe. 
+ */ if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys)) - subplan = (Plan *) make_sort(subplan, gm_plan->numCols, - gm_plan->sortColIdx, - gm_plan->sortOperators, - gm_plan->collations, - gm_plan->nullsFirst); + elog(ERROR, "gather merge input not sufficiently sorted"); /* Now insert the subplan under GatherMerge. */ gm_plan->plan.lefttree = subplan; -- 2.17.1
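The pathkeys_contained_in() test in the patch asks whether the required sort order is already guaranteed by the subpath — i.e. whether the required sort keys form a leading prefix of the keys the subpath is sorted by. Conceptually (ignoring PostgreSQL's canonical pathkey sharing and using plain ints as stand-ins for pathkeys), that prefix check looks like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Conceptual analogue of pathkeys_contained_in(): true when the first
 * list of sort keys is a leading prefix of the second, meaning input
 * sorted by keys2 is automatically sorted by keys1 as well.
 */
static bool
keys_contained_in(const int *keys1, size_t n1,
				  const int *keys2, size_t n2)
{
	if (n1 > n2)
		return false;
	for (size_t i = 0; i < n1; i++)
		if (keys1[i] != keys2[i])
			return false;
	return true;
}
```

When this check fails in create_gather_merge_plan, the patch now raises an ERROR instead of quietly adding a Sort, on the grounds that every gather merge path should have arranged the ordering at path-generation time.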
Re: remove spurious CREATE INDEX CONCURRENTLY wait
On Mon, Nov 30, 2020 at 04:15:27PM -0300, Alvaro Herrera wrote: > In the interest of showing progress, I'm going to mark this CF item as > committed, and I'll submit the remaining pieces in a new thread. Thanks for splitting! -- Michael
Re: Huge memory consumption on partitioned table with FKs
At Mon, 30 Nov 2020 21:03:45 -0300, Alvaro Herrera wrote in > On 2020-Nov-26, Kyotaro Horiguchi wrote: > > > This shares RI_ConstraintInfo cache by constraints that shares the > > same parent constraints. But you forgot that the cache contains some > > members that can differ among partitions. > > > > Consider the case of attaching a partition that have experienced a > > column deletion. *I* think it's the direction. After an off-list discussion, we confirmed that even in that case the patch works as is because fk_attnum (or contuple.conkey) always stores key attnums compatible to the topmost parent when conparent has a valid value (assuming the current usage of fk_attnum), but I still feel uneasy about relying on that unclear behavior. > I would embed all this knowledge in ri_BuildQueryKey though, without > adding the new function ri_GetParentConstOid. I don't think that > function has meaningful abstraction value, and instead it would make what I > suggest more difficult. It seems to me reasonable. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: vac_update_datfrozenxid will raise "wrong tuple length" if pg_database tuple contains toast attribute.
On Sun, Nov 29, 2020 at 5:27 PM Michael Paquier wrote: > > One thing that strikes me as unwise is that we could run into similar > > problems with vac_update_relstats() in the future, and there have been > > recent talks about having more toast tables like for pg_class. That > > is not worth caring about on stable branches because it is not an > > active problem there, but we could do something better on HEAD. > > For now, I have added just a comment at the top of > heap_inplace_update() to warn callers. > I am wondering if there is some way to assert this aspect, but there seems to be no way. So, yes, having at least a comment is good for now. Junfeng and Ashwin also mentioned to me off-list that their patch used > a second copy for performance reasons, but I don't see why this could > become an issue as we update each pg_database row only once a job is > done. So I'd like to apply something like the attached on HEAD, > comments are welcome. Yes, the attached patch looks good to me for PostgreSQL. Thanks Michael. (In Greenplum, due to per-table dispatch to segments, during database-wide vacuum this function gets called per table instead of only at the end, hence we were trying to be conservative.)
Re: [DOC] Document concurrent index builds waiting on each other
On Mon, Nov 30, 2020 at 4:53 PM Alvaro Herrera wrote: > > On 2020-Sep-30, Michael Paquier wrote: > > > + > > + CREATE INDEX (including the > > CONCURRENTLY > > + option) commands are included when VACUUM calculates > > what > > + dead tuples are safe to remove even on tables other than the one being > > indexed. > > + > > FWIW, this is true as well for REINDEX CONCURRENTLY because both use > > the same code paths for index builds and validation, with basically > > the same waiting phases. But is CREATE INDEX the correct place for > > that? Wouldn't it be better to tell about such things on the VACUUM > > doc? > > Yeah, I think it might be more sensible to document this in > maintenance.sgml, as part of the paragraph that discusses removing > tuples "to save space". But making it inline with the rest of the flow, > it seems to distract from higher-level considerations, so I suggest to > make it a footnote instead. I have mixed feelings about wholesale moving it; users aren't likely to read the vacuum doc when considering how running CIC might impact their system, though I do understand why it otherwise fits there. Even if the primary details are in the vacuum, I tend to think a reference note (or link to the vacuum docs) in the create index docs would be useful. The principle here is that 1.) vacuum is automatic/part of the background of the system, not just something people trigger manually, and 2.) we ought to document things where the user action triggering the behavior is documented. > I'm not sure on the wording to use; what about this? The wording seems fine to me. This is a replacement for what was 0002 earlier? And 0001 from earlier still seems to be a useful standalone patch? James
Re: error_severity of brin work item
On Mon, Nov 30, 2020 at 08:47:32PM -0300, Alvaro Herrera wrote: > The more I look at this, the less I like it. This would set a precedent > that any action that can be initiated from an autovac work-item has a > requirement of silently being discarded when it referenced a > non-existent relation. My original request was to change to INFO, which is what vacuum proper does at vacuum_open_relation(). I realize that still means that new work item handlers would have to do the LockOid, try_open dance - maybe it could be factored out. I noticed this on multiple servers immediately after changing a nagios script to look for all ERRORs (rather than a specific list of log_messages) and for a longer time period. > I'd rather have the code that drops the index go through the list of > work-items and delete those that reference that relation. > > I'm not sure if this is something that ought to be done in index_drop(); > One objection might be that if the drop is rolled back, the work-items > are lost. Should it be done in an AtCommit hook or something like that ? -- Justin
Re: enable_incremental_sort changes query behavior
On Tue, Nov 3, 2020 at 4:39 PM Tomas Vondra wrote: > I've pushed the 0001 part, i.e. the main fix. Not sure about the other > parts (comments and moving the code back to postgres_fdw) yet. I noticed the CF entry [1] got moved to the next CF; I'm thinking this entry should be marked as committed since the fix for the initial bug reported on this thread has been pushed. We have the parallel safety issue outstanding, but there's a separate thread and patch for that, so I'll make a new CF entry for that. I can mark it as committed, but I'm not sure how to "undo" (or if that's desirable) the move to the next CF. Thoughts? James 1: https://commitfest.postgresql.org/30/2754/
Re: Huge memory consumption on partitioned table with FKs
On 2020-Nov-26, Kyotaro Horiguchi wrote:
> This shares the RI_ConstraintInfo cache among constraints that share the
> same parent constraints.  But you forgot that the cache contains some
> members that can differ among partitions.
>
> Consider the case of attaching a partition that has experienced a
> column deletion.

I think this can be solved easily in the patch, by having
ri_BuildQueryKey() compare the constraint's fk_attnums to its parent's;
if they are equal then use the parent's constraint_id, otherwise use the
child constraint.  That way, the cache entry is reused in the common
case where they are identical.

I would embed all this knowledge in ri_BuildQueryKey though, without
adding the new function ri_GetParentConstOid.  I don't think that
function adds meaningful abstraction value, and instead it would make
what I suggest more difficult.
Consider parallel for lateral subqueries with limit
I've been investigating parallelizing certain correlated subqueries, and
during that work stumbled across the fact that
set_rel_consider_parallel disallows parallel query on what seems like a
fairly simple case. Consider this query:

select t.unique1 from tenk1 t
join lateral (select t.unique1 from tenk1 offset 0) l on true;

Currently set_rel_consider_parallel sets consider_parallel=false on the
subquery rel because it has a limit/offset. That restriction makes a lot
of sense when we have a subquery whose results conceptually need to be
"shared" (or at least be the same) across multiple workers (indeed the
relevant comment in that function notes that cases where we could prove
a unique ordering would also qualify, but punts on implementing that due
to complexity). But if the subquery is LATERAL, then no such conceptual
restriction applies. If we change the code slightly to allow considering
parallel query even in the face of LIMIT/OFFSET for LATERAL subqueries,
then our query above changes from the following plan:

 Nested Loop
   Output: t.unique1
   ->  Gather
         Output: t.unique1
         Workers Planned: 2
         ->  Parallel Index Only Scan using tenk1_unique1 on public.tenk1 t
               Output: t.unique1
   ->  Gather
         Output: NULL::integer
         Workers Planned: 2
         ->  Parallel Index Only Scan using tenk1_hundred on public.tenk1
               Output: NULL::integer

to this plan:

 Gather
   Output: t.unique1
   Workers Planned: 2
   ->  Nested Loop
         Output: t.unique1
         ->  Parallel Index Only Scan using tenk1_unique1 on public.tenk1 t
               Output: t.unique1
         ->  Index Only Scan using tenk1_hundred on public.tenk1
               Output: NULL::integer

The code change itself is quite simple (1 line). As far as I can tell we
don't need to expressly check parallel safety of the limit/offset
expressions; that appears to happen elsewhere (and that makes sense
since the RTE_RELATION case doesn't check those clauses either). If I'm
missing something about the safety of this (or any other issue), I'd
appreciate the feedback.
James

From 0aff5f1b5e35e37a311c01e9f53caf6e088e8d43 Mon Sep 17 00:00:00 2001
From: jcoleman
Date: Mon, 30 Nov 2020 11:36:35 -0500
Subject: [PATCH v1] Allow parallel LATERAL subqueries with LIMIT/OFFSET

The code that determined whether or not a rel should be considered for
parallel query excluded subqueries with LIMIT/OFFSET. That's correct in
the general case: as the comment notes that'd mean we have to guarantee
ordering (and claims it's not worth checking that) for results to be
consistent across workers. However there's a simpler case that hasn't
been considered: LATERAL subqueries with LIMIT/OFFSET don't fall under
the same reasoning since they're executed (when not converted to a JOIN)
per tuple anyway, so consistency of results across workers isn't a
factor.
---
 src/backend/optimizer/path/allpaths.c         |  4 +++-
 src/test/regress/expected/select_parallel.out | 15 +++
 src/test/regress/sql/select_parallel.sql      |  6 ++
 3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 84a69b064a..3c9313b5a9 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -686,11 +686,13 @@ set_rel_consider_parallel(PlannerInfo *root, RelOptInfo *rel,
			 * inconsistent results at the top-level.  (In some cases, where
			 * the result is ordered, we could relax this restriction.  But it
			 * doesn't currently seem worth expending extra effort to do so.)
+			 * LATERAL is an exception: LIMIT/OFFSET is safe to execute within
+			 * workers since the sub-select is executed per tuple
			 */
			{
				Query	   *subquery = castNode(Query, rte->subquery);

-				if (limit_needed(subquery))
+				if (!rte->lateral && limit_needed(subquery))
					return;
			}
			break;
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index 9b0c418db7..9ba40ca2c5 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -1042,6 +1042,21 @@ explain (costs off)
    Filter: (stringu1 ~~ '%'::text)
 (11 rows)
 
+-- ...unless it's LATERAL
+savepoint settings;
+set parallel_tuple_cost=0;
+explain (costs off) select t.unique1 from tenk1 t
+join lateral (select t.unique1 from tenk1 offset 0) l on true;
+                           QUERY PLAN
+-
+ Gather
+   Workers Planned: 4
+   ->  Nested Loop
+         ->  Parallel Index Only Scan using tenk1_unique1 on tenk1 t
+         ->  Index Only Scan using tenk1_hundred on tenk1
+(5 rows)
+
+rollback to savepoint settings;
 -- to increase the parallel query test coverage
 SAVEPOINT settings;
 SET LOCAL force_parallel_mode = 1;
diff
Re: error_severity of brin work item
The more I look at this, the less I like it.  This would set a precedent
that any action that can be initiated from an autovac work-item has a
requirement of silently being discarded when it references a
non-existent relation.

I'd rather have the code that drops the index go through the list of
work-items and delete those that reference that relation.

I'm not sure if this is something that ought to be done in index_drop();
One objection might be that if the drop is rolled back, the work-items
are lost.  It's the easiest, though; and work-items are supposed to be
lossy anyway, and vacuum would fix the lack of summarization eventually.
So, not pretty, but not all that bad.  (Hopefully rolled-back drops are
not all that common.)
Re: Online verification of checksums
On 11/30/20 9:27 AM, Stephen Frost wrote: Greetings, * Michael Paquier (mich...@paquier.xyz) wrote: On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote: * Magnus Hagander (mag...@hagander.net) wrote: On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier wrote: But here the checksum is broken, so while the offset is something we can rely on how do you make sure that the LSN is fine? A broken checksum could perfectly mean that the LSN is actually *not* fine if the page header got corrupted. Of course that could be the case, but it gets to be a smaller and smaller chance by checking that the LSN read falls within reasonable bounds. FWIW, I find that scary. There's ultimately different levels of 'scary' and the risk here that something is actually wrong following these checks strikes me as being on the same order as random bits being flipped in the page and still getting a valid checksum (which is entirely possible with our current checksum...), or maybe even less. I would say a lot less. First you'd need to corrupt one of the eight bytes that make up the LSN (pretty likely since corruption will probably affect the entire block) and then it would need to be updated to a value that falls within the current backup range, a 1 in 16 million chance if a terabyte of WAL is generated during the backup. Plus, the corruption needs to happen during the backup since we are going to check for that, and the corrupted LSN needs to be ascending, and the LSN originally read needs to be within the backup range (another 1 in 16 million chance) since pages written before the start backup checkpoint should not be torn. So as far as I can see there are more likely to be false negatives from the checksum itself. It would also be easy to add a few rounds of checks, i.e. test if the LSN ascends but stays in the backup LSN range N times. Honestly, I'm much more worried about corruption zeroing the entire page. 
I don't know how likely that is, but I know none of our proposed solutions would catch it. Andres, since you brought this issue up originally perhaps you'd like to weigh in? Regards, -- -David da...@pgmasters.net
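As a side note, the "1 in 16 million" figure above is easy to sanity-check (a quick sketch assuming a terabyte of WAL generated during the backup, i.e. a 2^40-byte window out of the 64-bit LSN space):

```python
# Chance that a uniformly random (corrupted) 8-byte LSN happens to fall
# inside the LSN range covered by ~1 TB of WAL written during the backup.
wal_range = 1 << 40   # 1 TB of WAL -> width of the backup's LSN window
lsn_space = 1 << 64   # LSNs are 64-bit byte positions in the WAL stream
odds = lsn_space // wal_range
print(f"1 in {odds:,}")  # -> 1 in 16,777,216
```

The same arithmetic applies to the second condition mentioned above (the originally read LSN falling within the backup range), which is why the combined chance is so small.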
Re: Add Information during standby recovery conflicts
On Tue, Dec 1, 2020 at 3:25 AM Alvaro Herrera wrote: > > On 2020-Dec-01, Fujii Masao wrote: > > > + if (proc) > > + { > > + if (nprocs == 0) > > + appendStringInfo(&buf, "%d", > > proc->pid); > > + else > > + appendStringInfo(&buf, ", %d", > > proc->pid); > > + > > + nprocs++; > > > > What happens if all the backends in wait_list have gone? In other words, > > how should we handle the case where nprocs == 0 (i.e., nprocs has not been > > incrmented at all)? This would very rarely happen, but can happen. > > In this case, since buf.data is empty, at least there seems no need to log > > the list of conflicting processes in detail message. > > Yes, I noticed this too; this can be simplified by changing the > condition in the ereport() call to be "nprocs > 0" (rather than > wait_list being null), otherwise not print the errdetail. (You could > test buf.data or buf.len instead, but that seems uglier to me.) +1 Maybe we can also improve the comment of this function from: + * This function also reports the details about the conflicting + * process ids if *wait_list is not NULL. to " This function also reports the details about the conflicting process ids if exist" or something. Regards, -- Masahiko Sawada EnterpriseDB: https://www.enterprisedb.com/
Re: runtime error copying oids field
Hi,
See attached patch which is along the line Alvaro outlined.

Cheers

On Mon, Nov 30, 2020 at 3:01 PM Alvaro Herrera wrote:
> On 2020-Nov-30, Zhihong Yu wrote:
>
> > This was the line runtime error was raised:
> >
> > memcpy(part_oids, partdesc->oids, sizeof(Oid) * nparts);
> >
> > From RelationBuildPartitionDesc we can see that:
> >
> > if (nparts > 0)
> > {
> >     PartitionBoundInfo boundinfo;
> >     int        *mapping;
> >     int         next_index = 0;
> >
> >     result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
> >
> > The cause was oids field was not assigned due to nparts being 0.
> > This is verified by additional logging added just prior to the memcpy call.
> >
> > I want to get the community's opinion on whether a null check should be
> > added prior to the memcpy() call.
>
> As far as I understand, we do want to avoid memcpy's of null pointers;
> see [1].
>
> In this case I think it'd be sane to skip the complete block, not just
> the memcpy, something like
>
> diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
> index ca24620fd0..d35deb433a 100644
> --- a/src/backend/commands/indexcmds.c
> +++ b/src/backend/commands/indexcmds.c
> @@ -1163,15 +1163,17 @@ DefineIndex(Oid relationId,
>
>  	if (partitioned)
>  	{
> +		PartitionDesc partdesc;
> +
>  		/*
>  		 * Unless caller specified to skip this step (via ONLY), process each
>  		 * partition to make sure they all contain a corresponding index.
>  		 *
>  		 * If we're called internally (no stmt->relation), recurse always.
>  		 */
> -		if (!stmt->relation || stmt->relation->inh)
> +		partdesc = RelationGetPartitionDesc(rel);
> +		if ((!stmt->relation || stmt->relation->inh) &&
> +			partdesc->nparts > 0)
>  		{
> -			PartitionDesc partdesc = RelationGetPartitionDesc(rel);
>  			int			nparts = partdesc->nparts;
>  			Oid		   *part_oids = palloc(sizeof(Oid) * nparts);
>  			bool		invalidate_parent = false;
>
> [1]
> https://www.postgresql.org/message-id/flat/20200904023648.GB3426768%40rfd.leadboat.com

v01-check-nparts-for-defining-idx.patch
Description: Binary data
Re: 回复: [PATCH] BUG FIX: Core dump could happen when VACUUM FULL in standalone mode
On Tue, Dec 1, 2020 at 3:54 AM Tom Lane wrote: > > Yulin PEI writes: > > Yes, I agree because (IsNormalProcessingMode() ) means that current process > > is not in bootstrap mode and postmaster process will not build index. > > So my new modified patch is attached. > > This is a good catch, but the proposed fix still seems pretty random > and unlike how it's done elsewhere. It seems to me that since > index_build() is relying on plan_create_index_workers() to assess > parallel safety, that's where to check IsUnderPostmaster. Moreover, > the existing code in compute_parallel_vacuum_workers (which gets > this right) associates the IsUnderPostmaster check with the initial > check on max_parallel_maintenance_workers. So I think that the > right fix is to adopt the compute_parallel_vacuum_workers coding > in plan_create_index_workers, and thereby create a model for future > uses of max_parallel_maintenance_workers to follow. +1 Regards, -- Masahiko Sawada EnterpriseDB: https://www.enterprisedb.com/
Re: Printing backtrace of postgres processes
Craig Ringer writes: >> I would like to see some discussion of the security implications >> of such a feature, as well. ("There aren't any" is the wrong >> answer.) > If the stack only goes to the log, I actually don't think there are > significant security implications beyond those we already have with > our existing backtrace printing features. We already trust anyone who > can read the log almost completely, and we can already emit stacks to > the log. But I'd still want it to be gated superuser-only, or a role > that's GRANTable by superuser only by default, since it exposes > arbitrary internals of the server. The concerns that I had were that the patch as submitted provides a mechanism that causes ALL processes in the system to dump backtraces, not a targeted request; and that it allows any user to issue such requests at an unbounded rate. That seems like a really easy route to denial of service. There's also a question of whether you'd even get intelligible results from dozens of processes simultaneously dumping many-line messages to the same place. (This might work out all right if you're using our syslogger, but it probably would not with any other logging technology.) I'd feel better about it if the mechanism had you specify exactly one target process, and were restricted to a superuser requestor. I'm not excited about adding on frammishes like letting one process extract another's stack trace. I think that just adds more points of failure, which is a bad thing in a feature that you're only going to care about when things are a bit pear-shaped already. regards, tom lane
Re: runtime error copying oids field
On 2020-Nov-30, Zhihong Yu wrote:

> This was the line runtime error was raised:
>
> memcpy(part_oids, partdesc->oids, sizeof(Oid) * nparts);
>
> From RelationBuildPartitionDesc we can see that:
>
> if (nparts > 0)
> {
>     PartitionBoundInfo boundinfo;
>     int        *mapping;
>     int         next_index = 0;
>
>     result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
>
> The cause was oids field was not assigned due to nparts being 0.
> This is verified by additional logging added just prior to the memcpy call.
>
> I want to get the community's opinion on whether a null check should be
> added prior to the memcpy() call.

As far as I understand, we do want to avoid memcpy's of null pointers;
see [1].

In this case I think it'd be sane to skip the complete block, not just
the memcpy, something like

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index ca24620fd0..d35deb433a 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1163,15 +1163,17 @@ DefineIndex(Oid relationId,
 
 	if (partitioned)
 	{
+		PartitionDesc partdesc;
+
 		/*
 		 * Unless caller specified to skip this step (via ONLY), process each
 		 * partition to make sure they all contain a corresponding index.
 		 *
 		 * If we're called internally (no stmt->relation), recurse always.
 		 */
-		if (!stmt->relation || stmt->relation->inh)
+		partdesc = RelationGetPartitionDesc(rel);
+		if ((!stmt->relation || stmt->relation->inh) && partdesc->nparts > 0)
 		{
-			PartitionDesc partdesc = RelationGetPartitionDesc(rel);
 			int			nparts = partdesc->nparts;
 			Oid		   *part_oids = palloc(sizeof(Oid) * nparts);
 			bool		invalidate_parent = false;

[1] https://www.postgresql.org/message-id/flat/20200904023648.GB3426768%40rfd.leadboat.com
Re: [BUG] orphaned function
"Drouvot, Bertrand" writes:
> here is a scenario that produces an orphaned function (means it does not
> belong to any namespace):
> [ drop schema before committing function creation ]

Hm.  Historically we've not been too fussed about preventing such race
conditions, and I wonder just how far it is sane to take it.  Should we
acquire lock on the function's argument/result data types?  Its
language?

Given the precedent of RangeVarGetAndCheckCreationNamespace, I'm willing
to accept this patch's goals as stated.  But it feels like we need some
clearer shared understanding of which things we are willing to bother
with preventing races for, and which we aren't.

> Please find attached a patch that adds a LockDatabaseObject() call in
> QualifiedNameGetCreationNamespace() to prevent such orphaned situations.

I don't think that actually succeeds in preventing the race, it'll just
delay your process till the deletion is committed.  But you already have
the namespaceId.  Note the complex retry logic in
RangeVarGetAndCheckCreationNamespace; without something on the same
order, you're not closing the hole in any meaningful way.

Likely what this patch should do is refactor that function so that its
guts can be used for other object-creation scenarios.  (The fact that
this is so far from trivial is one reason I'm not in a hurry to extend
this sort of protection to other dependencies.)

			regards, tom lane
runtime error copying oids field
Hi,
In our testPgRegressTrigger test log, I saw the following (this was for
a relatively old version of PG):

197859 [ts-1] ../../../../../../src/postgres/src/backend/commands/indexcmds.c:1062:22: runtime error: null pointer passed as argument 2, which is declared to never be null
197860 [ts-1] /opt/yb-build/brew/linuxbrew-20181203T161736v9/include/string.h:43:28: note: nonnull attribute specified here
197861 [ts-1]     #0 0xacbd0f in DefineIndex $YB_SRC_ROOT/src/postgres/src/backend/commands/../../../../../../src/postgres/src/backend/commands/indexcmds.c:1062:4
197862 [ts-1]     #1 0x11441e0 in ProcessUtilitySlow $YB_SRC_ROOT/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/utility.c:1436:7
197863 [ts-1]     #2 0x114141f in standard_ProcessUtility $YB_SRC_ROOT/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/utility.c:962:4
197864 [ts-1]     #3 0x1140b65 in YBProcessUtilityDefaultHook $YB_SRC_ROOT/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/utility.c:3574:3
197865 [ts-1]     #4 0x7f47d4950eac in pgss_ProcessUtility $YB_SRC_ROOT/src/postgres/contrib/pg_stat_statements/../../../../../src/postgres/contrib/pg_stat_statements/pg_stat_statements.c:1120:4

This was the line where the runtime error was raised:

    memcpy(part_oids, partdesc->oids, sizeof(Oid) * nparts);

From RelationBuildPartitionDesc we can see that:

    if (nparts > 0)
    {
        PartitionBoundInfo boundinfo;
        int        *mapping;
        int         next_index = 0;

        result->oids = (Oid *) palloc0(nparts * sizeof(Oid));

The cause was that the oids field was never assigned because nparts was
0.  This is verified by additional logging added just prior to the
memcpy call.

I want to get the community's opinion on whether a null check should be
added prior to the memcpy() call.

Cheers
Re: proposal: unescape_text function
po 30. 11. 2020 v 22:15 odesílatel Pavel Stehule napsal: > > > po 30. 11. 2020 v 14:14 odesílatel Peter Eisentraut < > peter.eisentr...@enterprisedb.com> napsal: > >> On 2020-11-29 18:36, Pavel Stehule wrote: >> > >> > I don't really get the point of this function. There is AFAICT no >> > function to produce this escaped format, and it's not a recognized >> > interchange format. So under what circumstances would one need to >> > use this? >> > >> > >> > Some corporate data can be in CSV format with escaped unicode >> > characters. Without this function it is not possible to decode these >> > files without external application. >> >> I would like some supporting documentation on this. So far we only have >> one stackoverflow question, and then this implementation, and they are >> not even the same format. My worry is that if there is not precise >> specification, then people are going to want to add things in the >> future, and there will be no way to analyze such requests in a >> principled way. >> >> > I checked this and it is "prefix backslash-u hex" used by Java, > JavaScript or RTF - > https://billposer.org/Software/ListOfRepresentations.html > > In some languages (Python), there is decoder "unicode-escape". Java has a > method escapeJava, for conversion from unicode to ascii. I can imagine so > these data are from Java systems exported to 8bit strings - so this > implementation can be accepted as referential. This format is used by > https://docs.oracle.com/javase/8/docs/technotes/tools/unix/native2ascii.html > tool too. > > Postgres can decode this format too, and the patch is based on Postgres > implementation. I just implemented a different interface. > > Currently decode function does only text->bytea transformation. Maybe a > more generic function "decode_text" and "encode_text" for similar cases can > be better (here we need text->text transformation). But it looks like > overengineering now. 
> Maybe we introduce new encoding "ascii" and we can implement new
> conversions "ascii_to_utf8" and "utf8_to_ascii". It looks like the most
> clean solution. What do you think about it?

A better name for the new encoding might be "unicode-escape" rather than
"ascii"; we already use the "to_ascii" function for a different use case.

set client_encoding to unicode-escape;
copy tab from xxx;
...

but it doesn't help when only a few columns from the table are in
unicode-escape format.

> Regards
>
> Pavel
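For what it's worth, the "prefix backslash-u hex" format discussed above is exactly what Python's unicode_escape codec handles, which makes it easy to check the expected behavior of such a decoder:

```python
import codecs

# Decode \uXXXX escapes of the kind produced by Java's native2ascii /
# escapeJava: the unicode_escape codec turns ASCII input with escapes
# back into the original text.
escaped = "Post\\u0067reSQL caf\\u00e9"          # ASCII input with escapes
decoded = codecs.decode(escaped.encode("ascii"), "unicode_escape")
print(decoded)  # -> PostgreSQL café
```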
Re: [DOC] Document concurrent index builds waiting on each other
On 2020-Sep-30, Michael Paquier wrote:

> + 
> + CREATE INDEX (including the CONCURRENTLY
> + option) commands are included when VACUUM calculates what
> + dead tuples are safe to remove even on tables other than the one being indexed.
> + 
> FWIW, this is true as well for REINDEX CONCURRENTLY because both use
> the same code paths for index builds and validation, with basically
> the same waiting phases.  But is CREATE INDEX the correct place for
> that?  Wouldn't it be better to tell about such things on the VACUUM
> doc?

Yeah, I think it might be more sensible to document this in
maintenance.sgml, as part of the paragraph that discusses removing
tuples "to save space".  But making it inline with the rest of the flow,
it seems to distract from higher-level considerations, so I suggest to
make it a footnote instead.

I'm not sure on the wording to use; what about this?

From 6c9ad72e4e61dbf05f34146cb67706dd675a38f0 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Mon, 30 Nov 2020 18:50:06 -0300
Subject: [PATCH v5] Note CIC and RC in vacuum's doc

Per James Coleman
---
 doc/src/sgml/maintenance.sgml | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 4d8ad754f8..d68d7f936e 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -159,7 +159,17 @@
 concurrency control (MVCC, see ): the row version
 must not be deleted while it is still potentially visible to other
 transactions. But eventually, an outdated or deleted row version is no
-longer of interest to any transaction. The space it occupies must then be
+longer of interest to any transaction.
+
+ 
+  Note that VACUUM needs to retain tuples that are
+  nominally visible to transactions running
+  CREATE INDEX CONCURRENTLY or
+  REINDEX CONCURRENTLY, even when run on tables
+  other than the tables being indexed.
+ 
+
+The space it occupies must then be
 reclaimed for reuse by new rows, to avoid unbounded growth of disk
 space requirements. This is done by running VACUUM.
-- 
2.20.1
Re: support IncrementalSortPath type in outNode()
"Hou, Zhijie" writes: > I found IncrementalSortPath type is not analyzed in outNode. Indeed, that's supposed to be supported. Pushed, thanks for the patch! regards, tom lane
Re: proposal: unescape_text function
On Mon, Nov 30, 2020 at 2:14 PM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> On 2020-11-29 18:36, Pavel Stehule wrote:
> >
> >     I don't really get the point of this function.  There is AFAICT no
> >     function to produce this escaped format, and it's not a recognized
> >     interchange format.  So under what circumstances would one need to
> >     use this?
> >
> > Some corporate data can be in CSV format with escaped unicode
> > characters. Without this function it is not possible to decode these
> > files without an external application.
>
> I would like some supporting documentation on this.  So far we only have
> one stackoverflow question, and then this implementation, and they are
> not even the same format.  My worry is that if there is no precise
> specification, then people are going to want to add things in the
> future, and there will be no way to analyze such requests in a
> principled way.

I checked this and it is the "prefix backslash-u hex" format used by
Java, JavaScript or RTF -
https://billposer.org/Software/ListOfRepresentations.html

In some languages (Python), there is a decoder "unicode-escape". Java has
a method escapeJava for conversion from unicode to ascii. I can imagine
that these data come from Java systems exported to 8-bit strings, so this
implementation can be accepted as a reference. This format is used by the
https://docs.oracle.com/javase/8/docs/technotes/tools/unix/native2ascii.html
tool too.

Postgres can decode this format too, and the patch is based on the
Postgres implementation. I just implemented a different interface.

Currently the decode function does only the text->bytea transformation.
Maybe more generic functions "decode_text" and "encode_text" for similar
cases would be better (here we need a text->text transformation). But it
looks like overengineering now.

Maybe we introduce a new encoding "ascii" and implement new conversions
"ascii_to_utf8" and "utf8_to_ascii". It looks like the cleanest
solution. What do you think about it?
Regards Pavel
Re: Printing LSN made easy
Hi,

On 2020-11-29 12:10:21 -0500, Tom Lane wrote:
> Agreed.  snprintf.c is meant to implement a recognized standard
> (ok, %m is a GNU extension, but it's still pretty standard).
> I'm not on board with putting PG-only extensions in there.

I wonder if there's something we could layer on the elog.c level
instead. But I also don't like the idea of increasing the use of wrapper
functions that need to allocate string buffers that then need to get
copied in turn.

We could do something like
  errmsg("plain string arg: %s, conv string arg: %s", somestr, estr_lsn(lsn))
where estr_lsn() wouldn't do any conversion directly, instead setting up
a callback in ErrorData that knows how to do the type specific
conversion. Then during EVALUATE_MESSAGE() we'd evaluate the message
piecemeal and call the output callbacks for each arg, using the
StringInfo.

There are two main issues with something roughly like this:
1) We'd need to do format string parsing somewhere above snprintf.c,
   which isn't free.
2) Without relying on C11 / _Generic() some ugly macro hackery would be
   needed to have an argument-number indexed state.

If we did rely on _Generic() we probably could do better, even getting
rid of the need for something like estr_lsn().

Greetings,

Andres Freund
Re: Improving spin-lock implementation on ARM.
On Mon, Nov 30, 2020 at 7:00 AM Krunal Bauskar wrote:
> 3. Problem with GCC approach is still a lot of distros don't support
>    gcc 9.4 as default.
>    To use this approach:
>    * PGSQL will have to roll out its packages using gcc-9.4+ only so that
>      they are compatible with all aarch64 machines
>    * but this continues to affect all other users who tend to build pgsql
>      using a standard distro based compiler (unless they upgrade the
>      compiler).

I think a user who runs PostgreSQL on a multicore machine with a highly
concurrent load can take care of installing the appropriate
package/using the appropriate compiler (especially if we publish
explicit instructions for that), in the same way such advanced users
tune the Linux kernel etc.

BTW, how did you determine that the required gcc version is 9.4?  I've
managed to use LSE with gcc 9.3.

--
Regards,
Alexander Korotkov
Re: Improving spin-lock implementation on ARM.
On Mon, Nov 30, 2020 at 9:21 PM Tom Lane wrote: > Alexander Korotkov writes: > > I tend to think that LSE is enabled by default in Apple's clang based > > on your previous message[1]. In order to dispel the doubts could you > > please provide assembly of SpinLockAcquire for following clang > > options. > > "-O2" > > "-O2 -march=armv8-a+lse" > > "-O2 -march=armv8-a" > > Huh. Those options make exactly zero difference to the code generated > for SpinLockAcquire/SpinLockRelease; it's the same as I showed upthread, > for either the HEAD definition of TAS() or the CAS patch's version. > > So now I'm at a loss as to the reason for the performance difference > I got. -march=armv8-a+lse does make a difference to code generation > someplace, because the overall size of the postgres executable changes > by 16kB or so. One might argue that the performance difference is due > to better code elsewhere than the spinlocks ... but the test I'm running > is basically just > > while (count-- > 0) > { > XLogGetLastRemovedSegno(); > > CHECK_FOR_INTERRUPTS(); > } > > so it's hard to see where a non-spinlock-related code change would come > in. That loop itself definitely generates the same code either way. > > I did find this interesting output from "clang -v": > > -target-cpu vortex -target-feature +v8.3a -target-feature +fp-armv8 > -target-feature +neon -target-feature +crc -target-feature +crypto > -target-feature +fullfp16 -target-feature +ras -target-feature +lse > -target-feature +rdm -target-feature +rcpc -target-feature +zcm > -target-feature +zcz -target-feature +sha2 -target-feature +aes > > whereas adding -march=armv8-a+lse changes that to just > > -target-cpu vortex -target-feature +neon -target-feature +lse -target-feature > +zcm -target-feature +zcz > > On the whole, that would make one think that -march=armv8-a+lse > should produce worse code than the default settings. Great, thanks. So, I think the following hypothesis isn't disproved yet. 
1) On ARM with LSE support, PostgreSQL built with LSE is faster than PostgreSQL built without LSE. Even if the latter is patched with anything considered in this thread. 2) None of the patches considered in this thread give a clear advantage for PostgreSQL built with LSE. To further confirm this let's wait for Kunpeng 920 tests by Krunal Bauskar and Amit Khandekar. Also it would be nice if someone will run benchmarks similar to [1] on Apple M1. Links 1. https://www.postgresql.org/message-id/CAPpHfdsGqVd6EJ4mr_RZVE5xSiCNBy4MuSvdTrKmTpM0eyWGpg%40mail.gmail.com -- Regards, Alexander Korotkov
Re: [doc] remove reference to pg_dump pre-8.1 switch behaviour
Michael Paquier writes: > So this comes down to 5 items, as per the attached. Thoughts? These items look fine to me, except this bit seems a bit awkward: + Note that the delayed indexing technique used for GIN + (see for details) makes this advice + less necessary, but for very large updates it may still be best to + drop and recreate the index. Less necessary than what? Maybe instead write When fastupdate is enabled (see ...), the penalty is much less than when it is not. But for very large updates it may still be best to drop and recreate the index. regards, tom lane
Re: range_agg
On Mon, Nov 30, 2020 at 10:35 PM Alexander Korotkov wrote: > On Sun, Nov 29, 2020 at 11:53 PM Paul A Jungwirth > wrote: > > > > On Sun, Nov 29, 2020 at 11:43 AM Alexander Korotkov > > wrote: > > > Thank you. Could you please, update doc/src/sgml/catalogs.sgml, > > > because pg_type and pg_range catalogs are updated. > > > > Attached! :-) > > You're quick, thank you. Please, also take a look at cfbot failure > https://travis-ci.org/github/postgresql-cfbot/postgresql/builds/746623942 > I've tried to reproduce it, but didn't manage yet. Got it. type_sanity test fails on any platform, you just need to repeat "make check" till it fails. The failed query checked consistency of range types, but it didn't take into account ranges of domains and ranges of records, which are exercised by multirangetypes test running in parallel. We could teach this query about such kinds of ranges, but I think that would be overkill, because we're not going to introduce such builtin ranges yet. So, I'm going to just move multirangetypes test into another group of parallel tests. -- Regards, Alexander Korotkov
Re: Change JOIN tutorial to focus more on explicit joins
On Mon, Nov 30, 2020 at 1:15 PM Jürgen Purtz wrote: > On 30.11.20 20:45, Anastasia Lubennikova wrote: > > As far as I see something got committed and now the discussion is stuck > in arguing about parenthesis. > > FWIW, I think it is a matter of personal taste. Maybe we can compromise > on simply leaving this part unchanged. > > With or without parenthesis is a little more than a personal taste, but > it's a very tiny detail. I'm happy with either of the two variants. > > Sorry, I managed to overlook the most recent patch. I admitted my use of parentheses was incorrect and I don't see anyone else defending them. Please remove them. Minor typos: "the database compare" -> needs an "s" (compares) "In this case, the definition how to compare their rows." -> remove, redundant with the first sentence "The results from the older implicit syntax, and the newer explicit JOIN/ON syntax, are identical" -> move the commas around to what is shown here David J.
Re: [patch] [doc] Introduce view updating options more succinctly
The following review has been posted through the commitfest application: make installcheck-world: not tested Implements feature: not tested Spec compliant: not tested Documentation: tested, passed I wonder why this patch didn't get a review during the CF. This minor improvement looks good to me, so I mark it Ready for Committer. The new status of this patch is: Ready for Committer
Re: Change JOIN tutorial to focus more on explicit joins
On 30.11.20 20:45, Anastasia Lubennikova wrote: As far as I see something got committed and now the discussion is stuck in arguing about parenthesis. FWIW, I think it is a matter of personal taste. Maybe we can compromise on simply leaving this part unchanged. With or without parenthesis is a little more than a personal taste, but it's a very tiny detail. I'm happy with either of the two variants. -- J. Purtz
Re: [DOC] Document concurrent index builds waiting on each other
On 2020-Nov-30, Anastasia Lubennikova wrote: > The commitfest is nearing the end and I wonder what is this discussion > waiting for. > It looks like the proposed patch received its fair share of review, so > I mark it as ReadyForCommitter and lay responsibility for the final > decision on them. I'll get these pushed now, thanks for the reminder.
Re: [DOC] Document concurrent index builds waiting on each other
Status update for a commitfest entry. The commitfest is nearing the end and I wonder what this discussion is waiting for. It looks like the proposed patch received its fair share of review, so I mark it as ReadyForCommitter and lay responsibility for the final decision on them. The new status of this patch is: Ready for Committer
{CREATE INDEX, REINDEX} CONCURRENTLY improvements
Hello, In a previous thread [1], we added smarts so that processes running CREATE INDEX CONCURRENTLY would not wait for each other. One remaining piece is adding the same to REINDEX CONCURRENTLY. I've attached patch 0002 here which does that. Why 0002, you ask? That's because preparatory patch 0001 simplifies the ReindexRelationConcurrently somewhat by adding a struct for the indexes that are going to be processed, instead of just a list of Oids. This is a good change in itself because it lets us get rid of duplicative open/close of the index rels in order to obtain some info that's already known at the start. The other thing is that it'd be good if we can make VACUUM also ignore Xmin of processes doing CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY, when possible. I have two possible ideas to handle this, about which I'll post later. [1] https://postgr.es/m/20200810233815.GA18970@alvherre.pgsql -- Álvaro Herrera Valdivia, Chile >From 623a460b791dd873ae5daf6a0cd4e8f446a772f8 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Mon, 30 Nov 2020 16:01:13 -0300 Subject: [PATCH 1/2] create ReindexIndexInfo --- src/backend/commands/indexcmds.c | 130 --- src/tools/pgindent/typedefs.list | 1 + 2 files changed, 70 insertions(+), 61 deletions(-) diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index ca24620fd0..b1ce83e1dd 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -114,6 +114,16 @@ typedef struct ReindexErrorInfo char relkind; } ReindexErrorInfo; +/* + * Index to process in ReindexRelationConcurrently + */ +typedef struct ReindexIndexInfo +{ + Oid indexId; + Oid tableId; + Oid amId; +} ReindexIndexInfo; + /* * CheckIndexCompatible * Determine whether an existing index definition is compatible with a @@ -3132,10 +3142,16 @@ ReindexRelationConcurrently(Oid relationOid, int options) get_rel_name(cellOid; else { + ReindexIndexInfo *idx; + /* Save the list of relation OIDs in private context */ oldcontext = 
MemoryContextSwitchTo(private_context); - indexIds = lappend_oid(indexIds, cellOid); + idx = palloc(sizeof(ReindexIndexInfo)); + idx->indexId = cellOid; + /* other fields set later */ + + indexIds = lappend(indexIds, idx); MemoryContextSwitchTo(oldcontext); } @@ -3172,13 +3188,18 @@ ReindexRelationConcurrently(Oid relationOid, int options) get_rel_name(cellOid; else { + ReindexIndexInfo *idx; + /* * Save the list of relation OIDs in private * context */ oldcontext = MemoryContextSwitchTo(private_context); - indexIds = lappend_oid(indexIds, cellOid); + idx = palloc(sizeof(ReindexIndexInfo)); + idx->indexId = cellOid; + indexIds = lappend(indexIds, idx); + /* other fields set later */ MemoryContextSwitchTo(oldcontext); } @@ -3197,6 +3218,7 @@ ReindexRelationConcurrently(Oid relationOid, int options) Oid heapId = IndexGetRelation(relationOid, (options & REINDEXOPT_MISSING_OK) != 0); Relation heapRelation; +ReindexIndexInfo *idx; /* if relation is missing, leave */ if (!OidIsValid(heapId)) @@ -3247,7 +3269,10 @@ ReindexRelationConcurrently(Oid relationOid, int options) * Save the list of relation OIDs in private context. Note * that invalid indexes are allowed here. 
*/ -indexIds = lappend_oid(indexIds, relationOid); +idx = palloc(sizeof(ReindexIndexInfo)); +idx->indexId = relationOid; +indexIds = lappend(indexIds, idx); +/* other fields set later */ MemoryContextSwitchTo(oldcontext); break; @@ -3306,31 +3331,36 @@ ReindexRelationConcurrently(Oid relationOid, int options) foreach(lc, indexIds) { char *concurrentName; - Oid indexId = lfirst_oid(lc); + ReindexIndexInfo *idx = lfirst(lc); + ReindexIndexInfo *newidx; Oid newIndexId; Relation indexRel; Relation heapRel; Relation newIndexRel; LockRelId *lockrelid; - indexRel = index_open(indexId, ShareUpdateExclusiveLock); + indexRel = index_open(idx->indexId, ShareUpdateExclusiveLock); heapRel = table_open(indexRel->rd_index->indrelid, ShareUpdateExclusiveLock); + idx->tableId = RelationGetRelid(heapRel); + idx->amId = indexRel->rd_rel->relam; + /* This function shouldn't be called for temporary relations. */ if (indexRel->rd_rel->relpersistence == RELPERSISTENCE_TEMP) elog(ERROR, "cannot reindex a temporary table concurrently"); pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, - RelationGetRelid(heapRel)); + idx->tableId); + progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY; progress_vals[1] = 0; /* initializing */ - progress_vals[2] = indexId; - progress_vals[3] = indexRel->rd_rel->relam; + progress_vals[2] =
Re: Change JOIN tutorial to focus more on explicit joins
Status update for a commitfest entry. The commitfest is nearing the end and this thread was inactive for a while. As far as I see something got committed and now the discussion is stuck in arguing about parenthesis. FWIW, I think it is a matter of personal taste. Maybe we can compromise on simply leaving this part unchanged. If you are planning to continue working on it, please move it to the next CF.
Re: range_agg
On Sun, Nov 29, 2020 at 11:53 PM Paul A Jungwirth wrote: > > On Sun, Nov 29, 2020 at 11:43 AM Alexander Korotkov > wrote: > > Thank you. Could you please, update doc/src/sgml/catalogs.sgml, > > because pg_type and pg_range catalogs are updated. > > Attached! :-) You're quick, thank you. Please, also take a look at cfbot failure https://travis-ci.org/github/postgresql-cfbot/postgresql/builds/746623942 I've tried to reproduce it, but didn't manage yet. -- Regards, Alexander Korotkov
Re: remove spurious CREATE INDEX CONCURRENTLY wait
In the interest of showing progress, I'm going to mark this CF item as committed, and I'll submit the remaining pieces in a new thread. Thanks!
Re: Add docs stub for recovery.conf
On Mon, Nov 30, 2020 at 11:42 AM Bruce Momjian wrote: > > The downside is you end up with X-1 dummy sections just to allow for > references to old syntax, and you then have to find them all and remove > them when you implement the proper solution. I have no intention of > applying such an X-1 fix. > > X = 2; seems like a strong objection for such a minor issue. The status quo seems worse than that. David J.
Re: [PATCH] remove deprecated v8.2 containment operators
Justin Pryzby writes: > I think this is waiting on me to provide a patch for the contrib/ modules with > an update script removing the SQL operators, and documenting their deprecation. Right. We can remove the SQL operators, but not (yet) the C code support. I'm not sure that the docs change would do more than remove any existing mentions of the operators. regards, tom lane
Re: [PATCH] remove deprecated v8.2 containment operators
On Mon, Nov 30, 2020 at 09:51:12PM +0300, Anastasia Lubennikova wrote: > On 16.11.2020 23:55, Justin Pryzby wrote: > > On Fri, Nov 13, 2020 at 10:03:43AM -0500, Stephen Frost wrote: > > > * Magnus Hagander (mag...@hagander.net) wrote: > > > > On Thu, Nov 12, 2020 at 11:28 PM Tom Lane wrote: > > > > > > The changes to the contrib modules appear to be incomplete in some > > > > > > ways. > > > > > >In cube, hstore, and seg, there are no changes to the extension > > > > > > scripts to remove the operators. All you're doing is changing the C > > > > > > code to no longer recognize the strategy, but that doesn't explain > > > > > > what > > > > > > will happen if the operator is still used. In intarray, by > > > > > > contrast, > > > > > > you're editing an existing extension script, but that should be > > > > > > done by > > > > > > an upgrade script instead. > > > > > In the contrib modules, I'm afraid what you gotta do is remove the > > > > > SQL operator definitions but leave the opclass code support in place. > > > > > That's because there's no guarantee that users will update the > > > > > extension's > > > > > SQL version immediately, so a v14 build of the .so might still be used > > > > > with the old SQL definitions. It's not clear how much window we need > > > > > give for people to do that update, but I don't think "zero" is an > > > > > acceptable answer. > > > > Based on my experience from the field, the answer is "never". > > > > > > > > As in, most people have no idea they are even *supposed* to do such an > > > > upgrade, so they don't do it. Until we solve that problem, I think > > > > we're basically stuck with keeping them "forever". (and even if/when > > > > we do, "zero" is probably not going to cut it, no) > > > Yeah, this is a serious problem and one that we should figure out a way > > > to fix or at least improve on- maybe by having pg_upgrade say something > > > about extensions that could/should be upgraded..? 
> > I think what's needed are: > > > > 1) a way to *warn* users about deprecation. CREATE EXTENSION might give an > > elog(WARNING), but it's probably not enough. It only happens once, and if > > it's > > in pg_restore/pg_upgrade, it be wrapped by vendor upgrade scripts. It > > needs to > > be more like first function call in every session. Or more likely, put it > > in > > documentation for 10 years. > > > > 2) a way to *enforce* it, like making CREATE EXTENSION fail when run > > against an > > excessively old server, including by pg_restore/pg_upgrade (which ought to > > also > > handle it in --check). > > > > Are there any contrib for which (1) is done and we're anywhere near doing > > (2) ? > > Status update for a commitfest entry. > > The commitfest is nearing the end and this thread is "Waiting on Author". As I think this is waiting on me to provide a patch for the contrib/ modules with update script removing the SQL operators, and documentating their deprecation. Is that right ? -- Justin
Re: [PATCH] remove deprecated v8.2 containment operators
Anastasia Lubennikova writes: > Status update for a commitfest entry. > The commitfest is nearing the end and this thread is "Waiting on > Author". As far as I see we don't have a patch here and discussion is a > bit stuck. > So, I am planning to return it with feedback. Any objections? AFAICS, the status is (1) core-code changes are committed; (2) proposed edits to contrib modules need significant rework; (3) there was also a bit of discussion about inventing a mechanism to prod users to update out-of-date extensions. Now, (3) is far outside the scope of this patch, and I do not think it should block finishing (2). We need a new patch for (2), but there's no real doubt as to what it should contain -- Justin just needs to turn the crank. You could either move this to the next CF in state WoA, or close it RWF. But the patch did make progress in this CF, so I'd tend to lean to the former. regards, tom lane
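Item (2) above amounts to an extension upgrade script per contrib module that drops the deprecated operator spellings while leaving the C-level opclass support in place, so that a new .so still works against older SQL definitions. A hypothetical sketch for cube (the file name and version numbers are assumptions, not the eventual committed script):

```sql
-- Hypothetical cube--1.4--1.5.sql upgrade script (version numbers assumed).
-- Drop the deprecated v8.2 containment operator spellings; the modern
-- <@ and @> operators and the C opclass support functions are untouched.
DROP OPERATOR @ (cube, cube);
DROP OPERATOR ~ (cube, cube);
```

Users who never run ALTER EXTENSION ... UPDATE keep the old operators, which is exactly the "window" problem discussed upthread.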
Re: 回复: [PATCH] BUG FIX: Core dump could happen when VACUUM FULL in standalone mode
Yulin PEI writes: > Yes, I agree because (IsNormalProcessingMode() ) means that current process > is not in bootstrap mode and postmaster process will not build index. > So my new modified patch is attached. This is a good catch, but the proposed fix still seems pretty random and unlike how it's done elsewhere. It seems to me that since index_build() is relying on plan_create_index_workers() to assess parallel safety, that's where to check IsUnderPostmaster. Moreover, the existing code in compute_parallel_vacuum_workers (which gets this right) associates the IsUnderPostmaster check with the initial check on max_parallel_maintenance_workers. So I think that the right fix is to adopt the compute_parallel_vacuum_workers coding in plan_create_index_workers, and thereby create a model for future uses of max_parallel_maintenance_workers to follow. regards, tom lane diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 247f7d4625..1a94b58f8b 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -6375,8 +6375,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid) double reltuples; double allvisfrac; - /* Return immediately when parallelism disabled */ - if (max_parallel_maintenance_workers == 0) + /* + * We don't allow performing parallel operation in standalone backend or + * when parallelism is disabled. + */ + if (!IsUnderPostmaster || max_parallel_maintenance_workers == 0) return 0; /* Set up largely-dummy planner state */
Re: [PATCH] remove deprecated v8.2 containment operators
On 16.11.2020 23:55, Justin Pryzby wrote: On Fri, Nov 13, 2020 at 10:03:43AM -0500, Stephen Frost wrote: * Magnus Hagander (mag...@hagander.net) wrote: On Thu, Nov 12, 2020 at 11:28 PM Tom Lane wrote: The changes to the contrib modules appear to be incomplete in some ways. In cube, hstore, and seg, there are no changes to the extension scripts to remove the operators. All you're doing is changing the C code to no longer recognize the strategy, but that doesn't explain what will happen if the operator is still used. In intarray, by contrast, you're editing an existing extension script, but that should be done by an upgrade script instead. In the contrib modules, I'm afraid what you gotta do is remove the SQL operator definitions but leave the opclass code support in place. That's because there's no guarantee that users will update the extension's SQL version immediately, so a v14 build of the .so might still be used with the old SQL definitions. It's not clear how much window we need give for people to do that update, but I don't think "zero" is an acceptable answer. Based on my experience from the field, the answer is "never". As in, most people have no idea they are even *supposed* to do such an upgrade, so they don't do it. Until we solve that problem, I think we're basically stuck with keeping them "forever". (and even if/when we do, "zero" is probably not going to cut it, no) Yeah, this is a serious problem and one that we should figure out a way to fix or at least improve on- maybe by having pg_upgrade say something about extensions that could/should be upgraded..? I think what's needed are: 1) a way to *warn* users about deprecation. CREATE EXTENSION might give an elog(WARNING), but it's probably not enough. It only happens once, and if it's in pg_restore/pg_upgrade, it be wrapped by vendor upgrade scripts. It needs to be more like first function call in every session. Or more likely, put it in documentation for 10 years. 
2) a way to *enforce* it, like making CREATE EXTENSION fail when run against an excessively old server, including by pg_restore/pg_upgrade (which ought to also handle it in --check). Are there any contrib for which (1) is done and we're anywhere near doing (2) ? Status update for a commitfest entry. The commitfest is nearing the end and this thread is "Waiting on Author". As far as I see we don't have a patch here and discussion is a bit stuck. So, I am planning to return it with feedback. Any objections? -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Add docs stub for recovery.conf
On Mon, Nov 30, 2020 at 11:31:35AM -0700, David G. Johnston wrote: > On Mon, Nov 30, 2020 at 11:25 AM Bruce Momjian wrote: > > On Mon, Nov 30, 2020 at 10:11:04AM +0800, Craig Ringer wrote: > > Can we please just address this docs issue? If you don't like my > solution > can > > you please supply a patch that you feel addresses the problem? Or > clearly > state > > that you don't think there is a problem, and do so in a way that > actually > > addresses the specific points I have raised about what's wrong with the > status > > quo? > > If we know there are X problems, and we fix one of them one way, then > later fix the rest another way, we have to undo the first fix. If you > don't want to fix all X, then let's wait until someone does want to fix > them all. > > IMO there is only the original problem with an acceptable solution presented > that can be committed without downside. If that has to be undone because > someone else in the future decides on a different solution that happens to > touch this too, fine, it can be changed again. The downside is you end up with X-1 dummy sections just to allow for references to old syntax, and you then have to find them all and remove them when you implement the proper solution. I have no intention of applying such an X-1 fix. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Improve handling of parameter differences in physical replication
On 2020-11-20 16:47, Sergei Kornilov wrote: Hmm... Good question. How about putting CheckForStandbyTrigger() in a wait loop, but reporting FATAL with an appropriate message, such as "promotion is not possible because of insufficient parameter settings"? Also it suits me if we only document that we ignore promote here. I don't think this is an important case. And yes, it's not easy to allow promotion, since we have already updated control file. Probably we need pause only after we reached consistency? Here is an updated patch that implements both of these points. I have used hot standby active instead of reached consistency. I guess arguments could be made either way, but the original use case really cared about hot standby. -- Peter Eisentraut 2ndQuadrant, an EDB company https://www.2ndquadrant.com/ From 2d93c2918af7ff186aa4a6a2327286b6eb1a2859 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Mon, 30 Nov 2020 19:21:40 +0100 Subject: [PATCH v6] Pause recovery for insufficient parameter settings When certain parameters are changed on a physical replication primary, this is communicated to standbys using the XLOG_PARAMETER_CHANGE WAL record. The standby then checks whether its own settings are at least as big as the ones on the primary. If not, the standby shuts down with a fatal error. This patch changes this behavior for hot standbys to pause recovery at that point instead. That allows read traffic on the standby to continue while database administrators figure out next steps. When recovery is unpaused, the server shuts down (as before). The idea is to fix the parameters while recovery is paused and then restart when there is a maintenance window. 
Discussion: https://www.postgresql.org/message-id/flat/4ad69a4c-cc9b-0dfe-0352-8b1b0cd36...@2ndquadrant.com --- doc/src/sgml/high-availability.sgml | 51 +++-- src/backend/access/transam/xlog.c | 59 ++--- 2 files changed, 93 insertions(+), 17 deletions(-) diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index 19d7bd2b28..71f7f58cb0 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -2120,18 +2120,14 @@ Administrator's Overview -The setting of some parameters on the standby will need reconfiguration -if they have been changed on the primary. For these parameters, -the value on the standby must -be equal to or greater than the value on the primary. -Therefore, if you want to increase these values, you should do so on all -standby servers first, before applying the changes to the primary server. -Conversely, if you want to decrease these values, you should do so on the -primary server first, before applying the changes to all standby servers. -If these parameters -are not set high enough then the standby will refuse to start. -Higher values can then be supplied and the server -restarted to begin recovery again. These parameters are: +The settings of some parameters determine the size of shared memory for +tracking transaction IDs, locks, and prepared transactions. These shared +memory structures must be no smaller on a standby than on the primary in +order to ensure that the standby does not run out of shared memory during +recovery. For example, if the primary had used a prepared transaction but +the standby had not allocated any shared memory for tracking prepared +transactions, then recovery could not continue until the standby's +configuration is changed. 
The parameters affected are: @@ -2160,6 +2156,37 @@ Administrator's Overview + +The easiest way to ensure this does not become a problem is to have these +parameters set on the standbys to values equal to or greater than on the +primary. Therefore, if you want to increase these values, you should do +so on all standby servers first, before applying the changes to the +primary server. Conversely, if you want to decrease these values, you +should do so on the primary server first, before applying the changes to +all standby servers. Keep in mind that when a standby is promoted, it +becomes the new reference for the required parameter settings for the +standbys that follow it. Therefore, to avoid this becoming a problem +during a switchover or failover, it is recommended to keep these settings +the same on all standby servers. + + + +The WAL tracks changes to these parameters on the +primary. If a hot standby processes WAL that indicates that the current +value on the primary is higher than its own value, it will log a warning +and pause recovery, for example: + +WARNING: hot standby is not possible because of insufficient parameter settings +DETAIL: max_connections = 80 is a lower setting than on the primary server, where its value was 100. +
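As context for the patch above: the XLOG_PARAMETER_CHANGE check covers the GUCs that size shared-memory structures for transaction tracking, locks, and prepared transactions. A standby configuration sketch (values are illustrative only, not recommendations):

```
# postgresql.conf on a hot standby -- illustrative values only.
# Each of these must be >= the value in effect on the primary; otherwise,
# with this patch, the standby logs a warning and pauses recovery instead
# of shutting down with a FATAL error.
max_connections = 100
max_worker_processes = 8
max_wal_senders = 10
max_prepared_transactions = 0
max_locks_per_transaction = 64
```

The recommended practice in the doc change applies here: raise values on all standbys first, lower them on the primary first, and keep settings uniform so a promotion does not change the reference values.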
Re: Add docs stub for recovery.conf
On Mon, Nov 30, 2020 at 11:25 AM Bruce Momjian wrote: > On Mon, Nov 30, 2020 at 10:11:04AM +0800, Craig Ringer wrote: > > Can we please just address this docs issue? If you don't like my > solution can > > you please supply a patch that you feel addresses the problem? Or > clearly state > > that you don't think there is a problem, and do so in a way that actually > > addresses the specific points I have raised about what's wrong with the > status > > quo? > > If we know there are X problems, and we fix one of them one way, then > later fix the rest another way, we have to undo the first fix. If you > don't want to fix all X, then let's wait until someone does want to fix > them all. > > IMO there is only the original problem with an acceptable solution presented that can be committed without downside. If that has to be undone because someone else in the future decides on a different solution that happens to touch this too, fine, it can be changed again. David J.
Re: Cost overestimation of foreign JOIN
On 30.11.2020 22:38, Tom Lane wrote: Andrey Lepikhov writes: Maybe it is needed to swap lines 2908 and 2909 (see attachment)? No; as explained in the comment immediately above here, we're assuming that the join conditions will be applied on the cross product of the input relations. Thank you. Now it is clear to me. Now admittedly, that's a worst-case assumption, since it amounts to expecting that the remote server will do the join in the dumbest possible nested-loop way. If the remote can use a merge or hash join, for example, the cost is likely to be a lot less. My goal is scaling Postgres across a set of identical servers running the same Postgres instances. If one server uses a hash-join node for the join, I think it is most likely that the other server will also use a hash join for it (you may remember that I also use the statistics-copying technique to provide up-to-date statistics on partitions). Tests show good results with such an approach, but maybe this is my special case. But it is not the job of this code path to outguess the remote planner. It's certainly not appropriate to invent an unprincipled cost estimate as a substitute for trying to guess that. Agreed. If you're unhappy with the planning results you get for this, why don't you have use_remote_estimate turned on? I have a mixed load model. For large queries the additional remote estimation queries are acceptable, but for many simple SELECTs that touch a small portion of the data the latency has increased significantly. And I don't know how to switch the use_remote_estimate setting in such cases. -- regards, Andrey Lepikhov Postgres Professional
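One partial answer to the mixed-workload complaint above: postgres_fdw's use_remote_estimate option can be set per foreign server or per foreign table, so remote EXPLAIN costing can be enabled only for the relations involved in large joins. There is no per-query switch. The object names below are hypothetical:

```sql
-- Per-object granularity for use_remote_estimate (names are hypothetical).
-- Leave it off at the server level to keep small SELECTs cheap...
ALTER SERVER shard1 OPTIONS (ADD use_remote_estimate 'false');

-- ...but turn it on for the large tables where join cost matters.
ALTER FOREIGN TABLE big_fact_table OPTIONS (ADD use_remote_estimate 'true');
```

This doesn't help when the same table serves both query classes, which is the residual problem described in the message above.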
Re: Add docs stub for recovery.conf
On Mon, Nov 30, 2020 at 10:11:04AM +0800, Craig Ringer wrote: > Can we please just address this docs issue? If you don't like my solution can > you please supply a patch that you feel addresses the problem? Or clearly > state > that you don't think there is a problem, and do so in a way that actually > addresses the specific points I have raised about what's wrong with the status > quo? If we know there are X problems, and we fix one of them one way, then later fix the rest another way, we have to undo the first fix. If you don't want to fix all X, then let's wait until someone does want to fix them all. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Add Information during standby recovery conflicts
On 2020-Dec-01, Fujii Masao wrote: > + if (proc) > + { > + if (nprocs == 0) > + appendStringInfo(&buf, "%d", proc->pid); > + else > + appendStringInfo(&buf, ", %d", > proc->pid); > + > + nprocs++; > > What happens if all the backends in wait_list have gone? In other words, > how should we handle the case where nprocs == 0 (i.e., nprocs has not been > incremented at all)? This would very rarely happen, but can happen. > In this case, since buf.data is empty, at least there seems no need to log > the list of conflicting processes in the detail message. Yes, I noticed this too; this can be simplified by changing the condition in the ereport() call to be "nprocs > 0" (rather than wait_list being null), and otherwise not print the errdetail. (You could test buf.data or buf.len instead, but that seems uglier to me.)
Re: Improving spin-lock implementation on ARM.
Alexander Korotkov writes: > I tend to think that LSE is enabled by default in Apple's clang based > on your previous message[1]. In order to dispel the doubts could you > please provide assembly of SpinLockAcquire for following clang > options. > "-O2" > "-O2 -march=armv8-a+lse" > "-O2 -march=armv8-a" Huh. Those options make exactly zero difference to the code generated for SpinLockAcquire/SpinLockRelease; it's the same as I showed upthread, for either the HEAD definition of TAS() or the CAS patch's version. So now I'm at a loss as to the reason for the performance difference I got. -march=armv8-a+lse does make a difference to code generation someplace, because the overall size of the postgres executable changes by 16kB or so. One might argue that the performance difference is due to better code elsewhere than the spinlocks ... but the test I'm running is basically just while (count-- > 0) { XLogGetLastRemovedSegno(); CHECK_FOR_INTERRUPTS(); } so it's hard to see where a non-spinlock-related code change would come in. That loop itself definitely generates the same code either way. I did find this interesting output from "clang -v": -target-cpu vortex -target-feature +v8.3a -target-feature +fp-armv8 -target-feature +neon -target-feature +crc -target-feature +crypto -target-feature +fullfp16 -target-feature +ras -target-feature +lse -target-feature +rdm -target-feature +rcpc -target-feature +zcm -target-feature +zcz -target-feature +sha2 -target-feature +aes whereas adding -march=armv8-a+lse changes that to just -target-cpu vortex -target-feature +neon -target-feature +lse -target-feature +zcm -target-feature +zcz On the whole, that would make one think that -march=armv8-a+lse should produce worse code than the default settings. regards, tom lane
Re: Add Information during standby recovery conflicts
On 2020/11/30 16:26, Masahiko Sawada wrote: On Mon, Nov 30, 2020 at 3:46 PM Drouvot, Bertrand wrote: Hi, On 11/30/20 4:41 AM, Masahiko Sawada wrote: On Sun, Nov 29, 2020 at 3:47 PM Drouvot, Bertrand wrote: Hi Alvaro, On 11/28/20 6:36 PM, Alvaro Herrera wrote: Hi Bertrand, On 2020-Nov-28, Drouvot, Bertrand wrote: + if (nprocs > 0) + { + ereport(LOG, + (errmsg("recovery still waiting after %ld.%03d ms: %s", + msecs, usecs, _(get_recovery_conflict_desc(reason))), + (errdetail_log_plural("Conflicting process: %s.", + "Conflicting processes: %s.", + nprocs, buf.data; + } + else + { + ereport(LOG, + (errmsg("recovery still waiting after %ld.%03d ms: %s", + msecs, usecs, _(get_recovery_conflict_desc(reason); + } + + pfree(buf.data); + } + else + ereport(LOG, + (errmsg("recovery still waiting after %ld.%03d ms: %s", + msecs, usecs, _(get_recovery_conflict_desc(reason); +} Another trivial stylistic point is that you can collapse all these ereport calls into one, with something like ereport(LOG, errmsg("recovery still waiting after ...", opts), waitlist != NULL ? errdetail_log_plural("foo bar baz", opts) : 0); where the "waitlist" has been constructed beforehand, or is set to NULL if there's no process list. Nice! + switch (reason) + { + case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN: + reasonDesc = gettext_noop("for recovery conflict on buffer pin"); + break; Pure bikeshedding after discussing this with my pillow: I think I'd get rid of the initial "for" in these messages. both comments implemented in the new attached version. Thank you for updating the patch! + /* Also, set deadlock timeout for logging purpose if necessary */ + if (log_recovery_conflict_waits && !need_log) + { + timeouts[cnt].id = STANDBY_TIMEOUT; + timeouts[cnt].type = TMPARAM_AFTER; + timeouts[cnt].delay_ms = DeadlockTimeout; + cnt++; + } You changed to 'need_log' but this condition seems not correct. I think we need to set this timeout when both log_recovery_conflict and need_log is true. Nice catch! 
In fact it behaves correctly; it's just the 'need_log' name that is misleading: I renamed it to 'already_logged' in the new attached version. Thanks! I'd prefer 'need_log' because we can check it using the affirmative form in that condition, which would make the code a bit more readable. But I'd like to leave it to the committers. I've marked this patch as "Ready for Committer". I'm still in the middle of the review, but please let me share my current review comments. + /* Set wait start timestamp if logging is enabled */ + if (log_recovery_conflict_waits) + waitStart = GetCurrentTimestamp(); This seems to cause even the primary server to call GetCurrentTimestamp() if logging is enabled. To avoid unnecessary GetCurrentTimestamp() calls, shouldn't we add "InHotStandby" into the if-condition? + initStringInfo(&buf); + + if (wait_list) Isn't it better to call initStringInfo() only when wait_list is not NULL? For example, we can update the code so that it's executed when nprocs == 0. + if (proc) + { + if (nprocs == 0) + appendStringInfo(&buf, "%d", proc->pid); + else + appendStringInfo(&buf, ", %d", proc->pid); + + nprocs++; What happens if all the backends in wait_list have gone? In other words, how should we handle the case where nprocs == 0 (i.e., nprocs has not been incremented at all)? This would happen very rarely, but it can happen. In this case, since buf.data is empty, there seems to be no need to log the list of conflicting processes in the detail message. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
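Fujii's two review points above (initialize the buffer lazily, and skip the detail message when every waiting backend has already gone) can be sketched outside the tree. The following is a simplified, self-contained model only: a plain char buffer stands in for PostgreSQL's StringInfo, the function name is invented, and a pid of 0 models a backend that has already exited.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Build a comma-separated list of conflicting backend PIDs, skipping
 * entries whose backend is already gone (pid == 0 in this sketch).
 * Returns the number of PIDs appended; 0 tells the caller to emit the
 * LOG message without any errdetail, matching the review suggestion. */
static int
build_conflict_pid_list(const int *pids, int npids, char *buf, size_t buflen)
{
    int     nprocs = 0;
    size_t  used = 0;

    buf[0] = '\0';
    for (int i = 0; i < npids; i++)
    {
        if (pids[i] == 0)       /* backend has already exited */
            continue;
        used += (size_t) snprintf(buf + used, buflen - used,
                                  nprocs == 0 ? "%d" : ", %d", pids[i]);
        nprocs++;
    }
    return nprocs;
}
```

With this shape, the rare all-backends-gone case falls out naturally: the caller checks the return value and logs only the "recovery still waiting" message without a detail line.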
Re: Add Information during standby recovery conflicts
On 2020/11/20 18:17, Drouvot, Bertrand wrote: Hi, On 11/17/20 4:44 PM, Fujii Masao wrote: Thanks for updating the patch! Here are review comments. + Controls whether a log message is produced when the startup process + is waiting longer than deadlock_timeout + for recovery conflicts. But a log message can also be produced when a backend is waiting for a recovery conflict. Right? If yes, this description needs to be corrected. Thanks for looking at it! I don't think so; only the startup process should write those new log messages. What makes you think that would not be the case? Probably that was my misunderstanding of the patch. Sorry for the noise. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Autovacuum on partitioned table (autoanalyze)
On 2020-Nov-10, Kyotaro Horiguchi wrote: > On second thought about the reason for the "toprel_oid": it is perhaps > to avoid "wrongly" propagated values to ancestors after a manual > ANALYZE on a partitioned table. But the same happens after an > autoanalyze iteration if some of the ancestors of a leaf relation are > analyzed before the leaf relation in an autoanalyze iteration. That > can trigger an unnecessary analyze for some of the ancestors. I'm not sure I understand this point. I think we should only trigger this on analyzes of *leaf* partitions, not intermediate partitioned relations. That way you would never get these unnecessary analyzes. Am I missing something? (So with my proposal in the previous email, we would send the list of ancestor relations after analyzing a leaf partition. Whenever we analyze a non-leaf, the list of ancestors is sent as an empty list.) > > > Regarding the counters in pg_stat_all_tables: maybe some of these should > > > be null rather than zero? Or else you should make a 0001 patch to fully > > > implement this view, with all relevant counters, not just > > > n_mod_since_analyze, last_*analyze, and *analyze_count. These are specifically misleading: > > > > > > last_vacuum | > > > last_autovacuum | > > > n_ins_since_vacuum | 0 > > > vacuum_count | 0 > > > autovacuum_count | 0 > > > > > I haven't modified this part yet, but did you mean that we should set > > the vacuum-related counters to null > > because partitioned tables are not vacuumed? > > Perhaps because partitioned tables *cannot* be vacuumed. I'm not sure > what is the best way here. Showing null seems reasonable but I'm not > sure that doesn't break anything. I agree that showing NULLs for the vacuum columns is better. 
Perhaps the most reasonable way to do this is use -1 as an indicator that NULL ought to be returned from pg_stat_get_vacuum_count() et al, and add a boolean in PgStat_TableCounts next to t_truncated, maybe "t_nullvacuum" that says to store -1 instead of 0 in pgstat_recv_tabstat.
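The -1-as-NULL convention Alvaro sketches could look roughly like the following. This is a hedged stand-in, not the actual pgstat code: the struct and function names are invented, standing in for what pg_stat_get_vacuum_count() and PgStat_TableCounts would do after such a change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the proposed convention: the stats machinery stores -1 in
 * vacuum_count for relations that cannot be vacuumed (partitioned
 * tables), and the SQL-callable accessor maps that sentinel to NULL
 * instead of reporting a misleading 0. */
typedef struct TabStatSketch
{
    int64_t vacuum_count;       /* -1 means "not applicable" */
} TabStatSketch;

/* Returns false when the SQL function should return NULL. */
static bool
stat_get_vacuum_count(const TabStatSketch *tab, int64_t *result)
{
    if (tab->vacuum_count < 0)
        return false;           /* partitioned table: report NULL */
    *result = tab->vacuum_count;
    return true;
}
```

The design point is that a sentinel inside the counter avoids widening the stats message format with a separate per-column null flag, at the cost of reserving -1.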
Re: Cost overestimation of foreign JOIN
Andrey Lepikhov writes: > Maybe it is needed to swap lines 2908 and 2909 (see attachment)? No; as explained in the comment immediately above here, we're assuming that the join conditions will be applied on the cross product of the input relations. Now admittedly, that's a worst-case assumption, since it amounts to expecting that the remote server will do the join in the dumbest possible nested-loop way. If the remote can use a merge or hash join, for example, the cost is likely to be a lot less. But it is not the job of this code path to outguess the remote planner. It's certainly not appropriate to invent an unprincipled cost estimate as a substitute for trying to guess that. If you're unhappy with the planning results you get for this, why don't you have use_remote_estimate turned on? regards, tom lane
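To illustrate Tom's point numerically: with use_remote_estimate off, the local estimate charges the join quals against the full cross product of the inputs, so the assumed remote cost grows multiplicatively with both input sizes. The function and cost constant below are purely illustrative, not postgres_fdw's actual formula.

```c
#include <assert.h>

/* Worst-case assumption described above: every join qual is evaluated
 * over the full cross product of the input relations, i.e. the remote
 * server does the dumbest possible nested loop.  A remote hash or merge
 * join would be far cheaper, but this code path deliberately does not
 * try to outguess the remote planner. */
static double
worst_case_join_qual_cost(double outer_rows, double inner_rows,
                          double cost_per_qual_eval)
{
    return outer_rows * inner_rows * cost_per_qual_eval;
}
```

Doubling either input doubles the estimate, which is why the overestimation Andrey observes is expected behaviour here, and why turning on use_remote_estimate (asking the remote planner directly) is the supported fix.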
Re: Notes on physical replica failover with logical publisher or subscriber
Hi Craig, On 2020-11-30 06:59, Craig Ringer wrote: https://wiki.postgresql.org/wiki/Logical_replication_and_physical_standby_failover Thank you for sharing these notes. I have not dealt a lot with physical/logical replication interoperability, so most of these problems were new to me. One point from the wiki page seems clear enough to me: ``` Logical slots can fill pg_wal and can't benefit from archiving. Teach the logical decoding page read callback how to use the restore_command to retrieve WAL segs temporarily if they're not found in pg_wal... ``` It does not look like a big deal to teach the logical decoding process to use restore_command, but I have some doubts about how everything will perform once we have started getting WAL from the archive for decoding purposes. If we have started using restore_command, it means the subscriber has lagged long enough to exceed max_slot_wal_keep_size. Taking into account that getting WAL files from the archive has an additional overhead, and that the primary continues generating (and archiving) new segments, there is a possibility for the primary to end up doing this double duty forever --- archive a WAL file at first, then fetch it back for decoding when requested. Another problem is that there may be several active decoders, IIRC, so they had better communicate in order to avoid fetching the same segment twice. I tried to address many of these issues with failover slots, but I am not trying to beat that dead horse now. I know that at least some people here are of the opinion that effort shouldn't go into logical/physical replication interoperation anyway - that we should instead address the remaining limitations in logical replication so that it can provide complete HA capabilities without use of physical replication. So for now I'm just trying to save others who go looking into these issues some time and warn them about some of the less obvious booby-traps. 
Another point worth adding about logical replication's ability to support a logical-only HA system --- a logical equivalent of pg_rewind. At least I did not notice anything about it after a brief reading of the wiki page. IIUC, currently there is no way to quickly return the ex-primary (ex-logical publisher) to the HA cluster without doing a pg_basebackup, is there? It seems that we have the same problem here as with physical replication --- the ex-primary may accept some xacts after promotion of the new primary, so their histories diverge and the old primary should be rewound before being returned as a standby (subscriber). Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
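Teaching the decoding page-read callback to run restore_command, as the wiki page suggests, would start from the same %f/%p substitution that archive recovery already documents for the restore_command setting. Below is a minimal sketch of just that template expansion; actually executing the command and the multi-decoder coordination worried about above are left out, and the function name is invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Expand the documented restore_command placeholders: %f is the WAL
 * file name being requested, %p is the path to copy it to.  Any other
 * character is copied through unchanged.  Output is truncated safely
 * if it would exceed outlen. */
static void
expand_restore_command(const char *template, const char *walfile,
                       const char *targetpath, char *out, size_t outlen)
{
    size_t used = 0;

    out[0] = '\0';
    for (const char *p = template; *p && used + 1 < outlen; p++)
    {
        if (p[0] == '%' && (p[1] == 'f' || p[1] == 'p'))
        {
            const char *sub = (p[1] == 'f') ? walfile : targetpath;

            used += (size_t) snprintf(out + used, outlen - used, "%s", sub);
            p++;                /* skip the 'f' or 'p' */
        }
        else
        {
            out[used++] = *p;
            out[used] = '\0';
        }
    }
}
```

A decoder that fails to read a segment from pg_wal could expand the configured command this way and run it to stage the segment temporarily; the open questions raised above (repeated double duty, duplicate fetches by concurrent decoders) are about what happens around this step, not the expansion itself.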
Re: Re: parallel distinct union and aggregate support patch
> 1. > +#define BATCH_SORT_MAX_BATCHES 512 > > Did you decide this number based on some experiment or is there some > analysis behind selecting this number? When there are too few batches, a single process that works too slowly causes an unbalanced load. When there are too many batches, file descriptors are opened and closed frequently. > 2. > +BatchSortState* ExecInitBatchSort(BatchSort *node, EState *estate, int > eflags) > +{ > + BatchSortState *state; > + TypeCacheEntry *typentry; > > + for (i=0;i<node->numGroupCols;++i) > + { > ... > + InitFunctionCallInfoData(*fcinfo, flinfo, 1, attr->attcollation, NULL, > NULL); > + fcinfo->args[0].isnull = false; > + state->groupFuns = lappend(state->groupFuns, fcinfo); > + } > > From the variable naming, it appeared like the batch sort is dependent > upon the grouping node. I think instead of using the names > numGroupCols and groupFuns we need to use names that are more relevant > to the batch sort, something like numSortKey. Not all data types support both sorting and hashing calculations (user-defined data types, for example). We do not need every column to support hash calculation when we batch, so I used two separate sets of variables. > 3. > + if (eflags & (EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)) > + { > + /* for now, we only using in group aggregate */ > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("not support execute flag(s) %d for group sort", eflags))); > + } > > Instead of ereport, you should just put an Assert for the unsupported > flag or elog. In fact, this is an unfinished feature; BatchSort should eventually support these flags as well, and additions are welcome. > 4. > + state = makeNode(BatchSortState); > + state->ps.plan = (Plan*) node; > + state->ps.state = estate; > + state->ps.ExecProcNode = ExecBatchSortPrepare; > > I think the main executor entry function should be named ExecBatchSort > instead of ExecBatchSortPrepare, it will look more consistent with the > other executor machinery. 
The job of the ExecBatchSortPrepare function is to preprocess the data (batch and pre-sort); when that work is done, it calls "ExecSetExecProcNode(pstate, ExecBatchSort)" so that subsequent calls go straight to the ExecBatchSort function. Splitting the work into two functions has another advantage: we do not have to check on every call whether the tuplesort is ready yet, which avoids a small per-call overhead. And I think this code is clearer.
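The pattern described above — a first-call entry point that does one-time work and then swaps itself out — can be modeled in miniature. This is a hedged sketch: the struct and the "tuple" counter are stand-ins, and ExecSetExecProcNode is reduced to a direct pointer assignment.

```c
#include <assert.h>

/* Minimal model of the two-phase executor-node pattern: the node starts
 * with a "prepare" entry point that does the one-time work (batching
 * and pre-sorting, in the patch), then swaps the entry point so later
 * calls skip the readiness check entirely. */
struct SortStateSketch;
typedef int (*ExecProcFunc) (struct SortStateSketch *);

typedef struct SortStateSketch
{
    ExecProcFunc ExecProcNode;  /* current entry point for this node */
    int          prepared;      /* times the one-time work ran */
    int          fetched;       /* "tuples" handed out so far */
} SortStateSketch;

/* Steady-state function: no readiness check needed by construction. */
static int
ExecBatchSortSketch(SortStateSketch *state)
{
    return ++state->fetched;    /* pretend to return the next tuple */
}

/* First-call function: do the one-time work, swap the entry point,
 * then delegate to the steady-state function for the first tuple. */
static int
ExecBatchSortPrepareSketch(SortStateSketch *state)
{
    state->prepared++;                          /* batch + pre-sort */
    state->ExecProcNode = ExecBatchSortSketch;  /* the swap */
    return ExecBatchSortSketch(state);
}
```

Callers always invoke the node through state->ExecProcNode, so after the first call the prepare path is never taken again — which is the "subtle performance" benefit the author mentions.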