Re: [HACKERS] pg_dump getBlobs query broken for 7.3 servers
On 2016/10/07 11:47, Amit Langote wrote: > Just noticed that the getBlobs() query does not work for a 7.3 server > (maybe <= 7.3) due to the following change in commit 23f34fa4 [1]: > > else if (fout->remoteVersion >= 70100) > appendPQExpBufferStr(blobQry, > - "SELECT DISTINCT loid, NULL::oid, NULL::oid" > + "SELECT DISTINCT loid, NULL::oid, NULL, " > + "NULL AS rlomacl, NULL AS initlomacl, " > + "NULL AS initrlomacl " > " FROM pg_largeobject"); > else > appendPQExpBufferStr(blobQry, > - "SELECT oid, NULL::oid, NULL::oid" > + "SELECT oid, NULL::oid, NULL, " > + "NULL AS rlomacl, NULL AS initlomacl, " > + "NULL AS initrlomacl " > " FROM pg_class WHERE relkind = 'l'"); > > The following error is reported by the server: > > pg_dump: [archiver (db)] query failed: ERROR: Unable to identify an > ordering operator '<' for type '"unknown"' > Use an explicit ordering operator or modify the query > pg_dump: [archiver (db)] query was: SELECT DISTINCT loid, NULL::oid, NULL, > NULL AS rlomacl, NULL AS initlomacl, NULL AS initrlomacl FROM pg_largeobject > > I could fix that using the attached patch. Forgot to mention that it needs to be fixed in both HEAD and 9.6. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench vs. wait events
Robert, This contention on WAL reminds me of another scenario I've heard about that was similar. To fix things what happened was that anyone that the first person to block would be responsible for writing out all buffers for anyone blocked behind "him". The for example if you have many threads, A, B, C, D If while A is writing to WAL and hold the lock, then B arrives, B of course blocks, then C comes along and blocks as well, then D. Finally A finishes its write and then Now you have two options for resolving this, either 1) A drops its lock, B picks up the lock... B writes its buffer and then drops the lock. Then C gets the lock, writes its buffer, drops the lock, then finally D gets the lock, writes its buffer and then drops the lock. 2) A then writes out B's, C's, and D's buffers, then A drops the lock, B, C and D wake up, note that their respective buffers are written and just return. This greatly speeds up the system. (just be careful to make sure A doesn't do "too much work" otherwise you can get a sort of livelock if too many threads are blocked behind it, generally only issue one additional flush on behalf of other threads, do not "loop until the queue is empty") I'm not sure if this is actually possible with the way WAL is implemented, (or perhaps if this strategy is already implemented) but it's definitely worth if not done already as it can speed things up enormously. On 10/6/16 11:38 AM, Robert Haas wrote: Hi, I decided to do some testing on hydra (IBM-provided community resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using the newly-enhanced wait event stuff to try to get an idea of what we're waiting for during pgbench. I did 30-minute pgbench runs with various configurations, but all had max_connections = 200, shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit = off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9, log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints = on. During each run, I ran this psql script in another window and captured the output: \t select wait_event_type, wait_event from pg_stat_activity where pid != pg_backend_pid() \watch 0.5 Then, I used a little shell-scripting to count up the number of times each wait event occurred in the output. First, I tried scale factor 3000 with 32 clients and got these results: 1 LWLockTranche | buffer_mapping 9 LWLockNamed | CLogControlLock 14 LWLockNamed | ProcArrayLock 16 Lock| tuple 25 LWLockNamed | CheckpointerCommLock 49 LWLockNamed | WALBufMappingLock 122 LWLockTranche | clog 182 Lock| transactionid 287 LWLockNamed | XidGenLock 1300 Client | ClientRead 1375 LWLockTranche | buffer_content 3990 Lock| extend 21014 LWLockNamed | WALWriteLock 28497 | 58279 LWLockTranche | wal_insert tps = 1150.803133 (including connections establishing) What jumps out here is, at least to me, is that there is furious contention on both the wal_insert locks and on WALWriteLock. Apparently, the system simply can't get WAL on disk fast enough to keep up with this workload. Relation extension locks and buffer_content locks also are also pretty common, both ahead of ClientRead, a relatively uncommon wait event on this test. The load average on the system was only about 3 during this test, indicating that most processes are in fact spending most of their time off-CPU. The first thing I tried was switching to unlogged tables, which produces these results: 1 BufferPin | BufferPin 1 LWLockTranche | lock_manager 2 LWLockTranche | buffer_mapping 8 LWLockNamed | ProcArrayLock 9 LWLockNamed | CheckpointerCommLock 9 LWLockNamed | CLogControlLock 11 LWLockTranche | buffer_content 37 LWLockTranche | clog 153 Lock| tuple 388 LWLockNamed | XidGenLock 827 Lock| transactionid 1267 Client | ClientRead 20631 Lock| extend 91767 | tps = 1223.239416 (including connections establishing) If you don't look at the TPS number, these results look like a vast improvement. The overall amount of time spent not waiting for anything is now much higher, and the problematic locks have largely disappeared from the picture. However, the load average now shoots up to about 30, because most of the time that the backends are "not waiting for anything" they are in fact in kernel wait state D; that is, they're stuck doing I/O. This suggests that we might want to consider advertising a wait state when a backend is doing I/O, so we can measure this sort of thing. Next, I tried lowering the scale factor to something that fits in shared buffers. Here are the results at scale factor 300: 14 Lock| tu
[HACKERS] pg_dump getBlobs query broken for 7.3 servers
Just noticed that the getBlobs() query does not work for a 7.3 server (maybe <= 7.3) due to the following change in commit 23f34fa4 [1]: else if (fout->remoteVersion >= 70100) appendPQExpBufferStr(blobQry, - "SELECT DISTINCT loid, NULL::oid, NULL::oid" + "SELECT DISTINCT loid, NULL::oid, NULL, " + "NULL AS rlomacl, NULL AS initlomacl, " + "NULL AS initrlomacl " " FROM pg_largeobject"); else appendPQExpBufferStr(blobQry, - "SELECT oid, NULL::oid, NULL::oid" + "SELECT oid, NULL::oid, NULL, " + "NULL AS rlomacl, NULL AS initlomacl, " + "NULL AS initrlomacl " " FROM pg_class WHERE relkind = 'l'"); The following error is reported by the server: pg_dump: [archiver (db)] query failed: ERROR: Unable to identify an ordering operator '<' for type '"unknown"' Use an explicit ordering operator or modify the query pg_dump: [archiver (db)] query was: SELECT DISTINCT loid, NULL::oid, NULL, NULL AS rlomacl, NULL AS initlomacl, NULL AS initrlomacl FROM pg_largeobject I could fix that using the attached patch. Thanks, Amit [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=23f34fa4ba358671adab16773e79c17c92cbc870 diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index 299e887..a1165fa 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -2881,15 +2881,15 @@ getBlobs(Archive *fout) username_subquery); else if (fout->remoteVersion >= 70100) appendPQExpBufferStr(blobQry, - "SELECT DISTINCT loid, NULL::oid, NULL, " - "NULL AS rlomacl, NULL AS initlomacl, " - "NULL AS initrlomacl " + "SELECT DISTINCT loid, NULL::oid, NULL::oid, " + "NULL::oid AS rlomacl, NULL::oid AS initlomacl, " + "NULL::oid AS initrlomacl " " FROM pg_largeobject"); else appendPQExpBufferStr(blobQry, - "SELECT oid, NULL::oid, NULL, " - "NULL AS rlomacl, NULL AS initlomacl, " - "NULL AS initrlomacl " + "SELECT oid, NULL::oid, NULL::oid, " + "NULL::oid AS rlomacl, NULL::oid AS initlomacl, " + "NULL::oid AS initrlomacl " " FROM pg_class WHERE relkind = 'l'"); res = ExecuteSqlQuery(fout, blobQry->data, PGRES_TUPLES_OK); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw : altering foreign table not invalidating prepare statement execution plan.
On 2016/10/07 10:26, Amit Langote wrote: On 2016/10/06 21:55, Etsuro Fujita wrote: On 2016/10/06 20:17, Amit Langote wrote: On 2016/10/05 20:45, Etsuro Fujita wrote: I noticed that we were wrong. Your patch was modified so that dependencies on FDW-related objects would be extracted from a given plan in create_foreignscan_plan (by Ashutosh) and then in set_foreignscan_references by me, but that wouldn't work well for INSERT cases. To fix that, I'd like to propose that we collect the dependencies from the given rte in add_rte_to_flat_rtable, instead. I see. So, doing it from set_foreignscan_references() fails to capture plan dependencies in case of INSERT because it won't be invoked at all unlike the UPDATE/DELETE case. Right. If some writable FDW consulted foreign table/server/FDW options in AddForeignUpdateTarget, which adds the extra junk columns for UPDATE/DELETE to the targetList of the given query tree, the rewritten query tree would also need to be invalidated. But I don't think such an FDW really exists because that routine in a practical FDW wouldn't change such columns depending on those options. I had second thoughts about that; since the possibility wouldn't be zero, I added to extract_query_dependencies_walker the same code I added to add_rte_to_flat_rtable. And here, since AddForeignUpdateTargets() could possibly utilize foreign options which would cause *query tree* dependencies. It's possible that add_rte_to_flat_rtable may not be called before an option change, causing invalidation of any cached objects created based on the changed options. So, must record dependencies from extract_query_dependencies as well. Right. I think this (v4) patch is in the best shape so far. Thanks for the review! Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw : altering foreign table not invalidating prepare statement execution plan.
On 2016/10/06 21:55, Etsuro Fujita wrote: > On 2016/10/06 20:17, Amit Langote wrote: >> On 2016/10/05 20:45, Etsuro Fujita wrote: >>> On 2016/10/05 14:09, Ashutosh Bapat wrote: IMO, maintaining that extra function and the risk of bugs because of not keeping those two functions in sync outweigh the small not-so-frequent gain. >>> The inefficiency wouldn't be negligible in the case where there are large >>> numbers of cached plans. So, I'd like to introduce a helper function that >>> checks the dependency list for the generic plan, to eliminate most of the >>> duplication. > >> I too am inclined to have a PlanCacheForeignCallback(). >> >> Just one minor comment: > > Thanks for the comments! > > I noticed that we were wrong. Your patch was modified so that > dependencies on FDW-related objects would be extracted from a given plan > in create_foreignscan_plan (by Ashutosh) and then in > set_foreignscan_references by me, but that wouldn't work well for INSERT > cases. To fix that, I'd like to propose that we collect the dependencies > from the given rte in add_rte_to_flat_rtable, instead. I see. So, doing it from set_foreignscan_references() fails to capture plan dependencies in case of INSERT because it won't be invoked at all unlike the UPDATE/DELETE case. > Attached is an updated version, in which I removed the > PlanCacheForeignCallback and adjusted regression tests a bit. > >>> If some writable FDW consulted foreign table/server/FDW options in >>> AddForeignUpdateTarget, which adds the extra junk columns for >>> UPDATE/DELETE to the targetList of the given query tree, the rewritten >>> query tree would also need to be invalidated. But I don't think such an >>> FDW really exists because that routine in a practical FDW wouldn't change >>> such columns depending on those options. > > I had second thoughts about that; since the possibility wouldn't be zero, > I added to extract_query_dependencies_walker the same code I added to > add_rte_to_flat_rtable. And here, since AddForeignUpdateTargets() could possibly utilize foreign options which would cause *query tree* dependencies. It's possible that add_rte_to_flat_rtable may not be called before an option change, causing invalidation of any cached objects created based on the changed options. So, must record dependencies from extract_query_dependencies as well. > After all, the updated patch is much like your version, but I think your > patch, which modified extract_query_dependencies_walker only, is not > enough because a generic plan can have more dependencies than its query > tree (eg, consider inheritance). Agreed. I didn't know at the time that extract_query_dependencies is only able to capture dependencies using the RTEs in the *rewritten* query tree; it wouldn't have gone through the planner at that point. I think this (v4) patch is in the best shape so far. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] VACUUM's ancillary tasks
On Thu, Oct 6, 2016 at 2:46 PM, Robert Haas wrote: > > > Also, I think that doing more counts which get amalgamated into the same > > threshold, rather than introducing another parameter, is a bad thing. I > > have insert-mostly, most of the time, tables which are never going to > > benefit from index-only-scans, and I don't want to pay the cost of them > > getting vacuumed just because of some iOn Thu, Oct 6, 2016 at 3:56 PM, > Jeff Janes wrote: > >> Sure, I could handle each case separately, but the goal of this patch > > nserts, when I will never get a > > benefit out of it. I could turn autovacuum off for those tables, but > then I > > would have to remember to manually intervene on the rare occasion they do > > get updates or deletes. I want to be able to manually pick which tables > I > > sign up for this feature (and then forget it), not manually pick which > ones > > to exempt from it and have to constantly review that. > > Of course, if you do that, then what will happen is eventually it will > be time to advance relfrozenxid for that relation, and you'll get a > giant soul-crushing VACUUM of doom, rather than a bunch of small, > hopefully-innocuous VACUUMs. I think I will get the soul-crushing vacuum of doom anyway, if the database lasts that long. For one reason, in my case while updates and deletes are rare, they are common enough that the frozen vm bits from early vacuums will be mostly cleared before the vacuum of doom comes around. For a second reason, I don't think the frozen bit in the vm actually gets set by much of anything other than a autovacuum-for-wraparound or a manual VACUUM FREEZE. In commit 37484ad2aacef5ec7, you changed the way that frozen tuples were represented, so that we could make freezing more aggressive without losing forensic evidence. But I don't think we ever did anything to actually make the freezing more aggressive. When I applied the up-thread patch so that pgbench_history gets autovac, those autovacs don't actually cause any pages to get frozen until the wrap around kicks in, even when all the tuples on the early pages should be well beyond vacuum_freeze_min_age. Cheers, Jeff
Re: [HACKERS] pgbench vs. wait events
On Fri, Oct 7, 2016 at 3:38 AM, Robert Haas wrote: > I decided to do some testing on hydra (IBM-provided community > resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using > the newly-enhanced wait event stuff to try to get an idea of what > we're waiting for during pgbench. I did 30-minute pgbench runs with > various configurations, but all had max_connections = 200, > shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit = > off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9, > log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints = > on. During each run, I ran this psql script in another window and > captured the output: Nice. Thanks for sharing. > \t > select wait_event_type, wait_event from pg_stat_activity where pid != > pg_backend_pid() > \watch 0.5 > > Then, I used a little shell-scripting to count up the number of times > each wait event occurred in the output. Or an INSERT SELECT on an unlogged table? No need of extra maths then. > First, I tried scale factor > 3000 with 32 clients and got these results: > > 1 LWLockTranche | buffer_mapping > [...] > 21014 LWLockNamed | WALWriteLock > 28497 | > 58279 LWLockTranche | wal_insert > > tps = 1150.803133 (including connections establishing) > > What jumps out here is, at least to me, is that there is furious > contention on both the wal_insert locks and on WALWriteLock. > Apparently, the system simply can't get WAL on disk fast enough to > keep up with this workload. Relation extension locks and > buffer_content locks also are also pretty common, both ahead of > ClientRead, a relatively uncommon wait event on this test. The load > average on the system was only about 3 during this test, indicating > that most processes are in fact spending most of their time off-CPU. Increasing the number of WAL insert slots would help? With your tests this is at 8 all the time. > The first thing I tried was switching to unlogged tables, which > produces these results: > > 1 BufferPin | BufferPin > 1 LWLockTranche | lock_manager > 2 LWLockTranche | buffer_mapping > 8 LWLockNamed | ProcArrayLock > 9 LWLockNamed | CheckpointerCommLock > 9 LWLockNamed | CLogControlLock > 11 LWLockTranche | buffer_content > 37 LWLockTranche | clog > 153 Lock| tuple > 388 LWLockNamed | XidGenLock > 827 Lock| transactionid >1267 Client | ClientRead > 20631 Lock| extend > 91767 | > > tps = 1223.239416 (including connections establishing) > > If you don't look at the TPS number, these results look like a vast > improvement. The overall amount of time spent not waiting for > anything is now much higher, and the problematic locks have largely > disappeared from the picture. However, the load average now shoots up > to about 30, because most of the time that the backends are "not > waiting for anything" they are in fact in kernel wait state D; that > is, they're stuck doing I/O. This suggests that we might want to > consider advertising a wait state when a backend is doing I/O, so we > can measure this sort of thing. Maybe something in fd.c.. It may be a good idea to actually have a look at a perf output on Linux to see where this contention is happening, use this conclusion to decide the place of a wait point, and then see if on other OSes share a similar pattern. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Oct 7, 2016 at 8:05 AM, Peter Geoghegan wrote: > Your customer > databases might feature far more use of Japanese collations, for > example, which might be an important factor. Not really :) -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Sun, Oct 2, 2016 at 6:48 PM, Michael Paquier wrote: > Okay, moved to next CF. I may look at it finally I got some use-cases > for it, similar to yours I guess.. Let me know how that goes. One thing I've definitely noticed is that the tool is good at finding corner-case bugs, so even if you can only test a small fraction of the number of databases that I've been able to test, there could still be significant value in your performing your own exercise. Your customer databases might feature far more use of Japanese collations, for example, which might be an important factor. (Well, the use of Arabic code points turned out to be an important factor in the question of whether or not en_US.utf8 could have been affected by the Glibc + abbreviated keys bug.) -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: About CMake v2
Stas Kelvich wrote: On 17 Sep 2016, at 20:21, Yury Zhuravlev wrote: Michael Paquier wrote: ... Tried to generate Xcode project out of cmake, build fails on genbki.pl: can't locate Catalog.pm (which itself lives in src/backend/catalog/Catalog.pm) Can you try again? On my Xcode 7.3 / ElCapitan everything is working correctly. -- Yury Zhuravlev Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench vs. wait events
On Thu, Oct 6, 2016 at 4:40 PM, Jeff Janes wrote: > Scale factor 3000 obviously doesn't fit in shared_buffers. But does it fit > in RAM? That is, are the backends doing real IO, or they just doing fake IO > to the kernel's fs cache? That's a good question. [rhaas@hydra ~]$ free -g total used free sharedbuffers cached Mem:61 26 34 0 0 24 -/+ buffers/cache: 2 58 Swap: 19 4 15 rhaas=# select pg_size_pretty(pg_relation_size('pgbench_accounts')); pg_size_pretty 38 GB (1 row) rhaas=# select pg_size_pretty(pg_database_size(current_database())); pg_size_pretty 44 GB (1 row) That's pretty tight, especially since I now notice Andres also left a postmaster running on this machine back in April, with shared_buffers=8GB. 44GB for this database plus 8GB for shared_buffers plus 8GB for the other postmaster's shared_buffers leaves basically no slack, so it was probably doing quite a bit of real I/O, especially after the database got a bit of bloat. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] VACUUM's ancillary tasks
On Thu, Oct 6, 2016 at 3:56 PM, Jeff Janes wrote: >> Sure, I could handle each case separately, but the goal of this patch >> (as hinted at in the Subject) is to generalize all the different tasks >> we've been giving to VACUUM. The only missing piece is what the first >> patch addresses; which is insert-mostly tables never getting vacuumed. > > I don't buy the argument that we should do this in order to be "general". > Those other pieces are present to achieve a specific job, not out of > generality. +1. > If we want to have something to vacuum insert-mostly tables for the sake of > the index-only-scans, then I don't think we can ignore the fact that the > visibility map is page based, not tuple based. If we insert 10,000 tuples > into a full table and they all go into 100 pages at the end, that is very > different than inserting 10,000 tuples which each go into a separate page > (because each page has that amount of freespace). In one case you have > 10,000 tuples not marked as all visible, in the other case you have > 1,000,000 not marked as all visible. +1. > Also, I think that doing more counts which get amalgamated into the same > threshold, rather than introducing another parameter, is a bad thing. I > have insert-mostly, most of the time, tables which are never going to > benefit from index-only-scans, and I don't want to pay the cost of them > getting vacuumed just because of some inserts, when I will never get a > benefit out of it. I could turn autovacuum off for those tables, but then I > would have to remember to manually intervene on the rare occasion they do > get updates or deletes. I want to be able to manually pick which tables I > sign up for this feature (and then forget it), not manually pick which ones > to exempt from it and have to constantly review that. Of course, if you do that, then what will happen is eventually it will be time to advance relfrozenxid for that relation, and you'll get a giant soul-crushing VACUUM of doom, rather than a bunch of small, hopefully-innocuous VACUUMs. I've been wondering if would make sense to trigger vacuums based on inserts based on a *fixed* number of pages, rather than a percentage of the table. Say, for example, whenever we have 64MB of not-all-visible pages, we VACUUM. There are some difficulties: 1. We don't want to vacuum too early. For example, a bulk load may vastly inflate the size of a table, but vacuuming it while the load is in progress will be useless. You can maybe avoid that problem by basing this on statistics only reported at the end of the transaction, but there's another, closely-related issue: vacuuming right after the transaction completes may be useless, too, if there are old snapshots still in existence that render the newly-inserted tuples not-all-visible. 2. We don't want to trigger index vacuuming for a handful of dead tuples, or at least not too often. I've previously suggested requiring a certain minimum number of dead index tuples that would be required before index vacuuming occurs; prior to that, we'd just truncate to dead line pointers and leave those for the next vacuum cycle. This might need some refinement - should it be page-based? - but the basic idea still seems sound. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Switch to unnamed POSIX semaphores as our preferred sema code?
Robert Haas writes: > Alternatively, get a bigger box. :-) So what's it take to get access to hydra? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] autonomous transactions
On 6 October 2016 at 21:27, Robert Haas wrote: >> * The labelling "Autonomous Transaction" is a simple coat of paint, >> which can easily be transferred to a better implementation if one >> comes. If one doesn't then its better to have something than nothing. >> So I suggest we commit Background Transactions first and then in a >> fairly thin commit, implement Autonomous Transactions on top of it for >> now and if we get a better one, switch it over. > > I think we should implement background transactions and call them > background transactions. That allows us to expose additional > functionality which is useful, like the ability to kick something off > and check back later for the results. There's no reason to call it > background transactions and also call it autonomous transactions: one > feature doesn't need two names. For myself, I don't care what you call it. I just want to be able to invoke it by saying PRAGMA AUTONOMOUS_TRANSACTION; and I know many others do also. If a better implementation emerges I would happily replace this one with another. I'm happy to also invoke it via an alternate mechanism or API, so that it can continue to be used even if the above mechanism changes. We have no need to wait for the perfect solution, even assuming we would ever agree that just one exists. -- Simon Riggshttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench vs. wait events
On Thu, Oct 6, 2016 at 11:38 AM, Robert Haas wrote: > Hi, > > I decided to do some testing on hydra (IBM-provided community > resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using > the newly-enhanced wait event stuff to try to get an idea of what > we're waiting for during pgbench. I did 30-minute pgbench runs with > various configurations, but all had max_connections = 200, > shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit = > off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9, > log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints = > on. During each run, I ran this psql script in another window and > captured the output: > > \t > select wait_event_type, wait_event from pg_stat_activity where pid != > pg_backend_pid() > \watch 0.5 > > Then, I used a little shell-scripting to count up the number of times > each wait event occurred in the output. First, I tried scale factor > 3000 with 32 clients and got these results: > Scale factor 3000 obviously doesn't fit in shared_buffers. But does it fit in RAM? That is, are the backends doing real IO, or they just doing fake IO to the kernel's fs cache? Cheers, Jeff
Re: [HACKERS] autonomous transactions
On Thu, Oct 6, 2016 at 5:56 AM, Simon Riggs wrote: > Just to point out that I've actually written this approach already. > The patch is available, Autonomous Subtransactions. > We discussed it in Ottawa and it was rejected. (I thought Robert was > there, but Serge and Tom definitely were). Where is the patch? > See other posts in this thread by Serge and Craig to explain why. I don't think the posts on Craig and Serge explain why that approach was rejected or would be a bad idea. > We have various approaches... summarised in chronological order of > their suggestion > > 1. Use additional PGXACTs - rejected because it wouldn't provide enough room Of course, a background worker uses a PGXACT too and a lot more, so if you think extra PGXACTs are bad, you should *really* think background workers are bad. > 2. Use Autonomous SubTransactions - rejected because the semantics are > different to what we might expect from ATs In what way? I think the questions of how you implement it and what the semantics are are largely orthogonal questions. To which proposal is this referring? > 3. Use background transactions (this thread) Sure. > 4. Use pause and resume so we don't use up too many PGXACTs I don't know what "pause and resume" means. > * The labelling "Autonomous Transaction" is a simple coat of paint, > which can easily be transferred to a better implementation if one > comes. If one doesn't then its better to have something than nothing. > So I suggest we commit Background Transactions first and then in a > fairly thin commit, implement Autonomous Transactions on top of it for > now and if we get a better one, switch it over. I think we should implement background transactions and call them background transactions. That allows us to expose additional functionality which is useful, like the ability to kick something off and check back later for the results. There's no reason to call it background transactions and also call it autonomous transactions: one feature doesn't need two names. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] VACUUM's ancillary tasks
On Sat, Oct 1, 2016 at 1:34 PM, Vik Fearing wrote: > > Sure, I could handle each case separately, but the goal of this patch > (as hinted at in the Subject) is to generalize all the different tasks > we've been giving to VACUUM. The only missing piece is what the first > patch addresses; which is insert-mostly tables never getting vacuumed. > I don't buy the argument that we should do this in order to be "general". Those other pieces are present to achieve a specific job, not out of generality. If we want to have something to vacuum insert-mostly tables for the sake of the index-only-scans, then I don't think we can ignore the fact that the visibility map is page based, not tuple based. If we insert 10,000 tuples into a full table and they all go into 100 pages at the end, that is very different than inserting 10,000 tuples which each go into a separate page (because each page has that amount of freespace). In one case you have 10,000 tuples not marked as all visible, in the other case you have 1,000,000 not marked as all visible. Also, I think that doing more counts which get amalgamated into the same threshold, rather than introducing another parameter, is a bad thing. I have insert-mostly, most of the time, tables which are never going to benefit from index-only-scans, and I don't want to pay the cost of them getting vacuumed just because of some inserts, when I will never get a benefit out of it. I could turn autovacuum off for those tables, but then I would have to remember to manually intervene on the rare occasion they do get updates or deletes. I want to be able to manually pick which tables I sign up for this feature (and then forget it), not manually pick which ones to exempt from it and have to constantly review that. Cheers, Jeff
Re: [HACKERS] PostgreSQL - Weak DH group
Re: Heikki Linnakangas 2016-10-06 > I propose the attached patch. It gives up on trying to deal with multiple > key lengths (as noted earlier, OpenSSL just always passed keylength=1024, so > that was useless). Instead of using the callback, it just sets fixed DH > parameters with SSL_CTX_set_tmp_dh(), like we do for the ECDH curve. The DH > parameters are loaded from a file called "dh_params.pem" (instead of > "dh1024.pem"), if present, otherwise the built-in 2048 bit parameters are > used. Shouldn't this be a GUC pointing to a configurable location like ssl_cert_file? This way, people reading the security section of the default postgresql.conf would notice that there's something (new) to configure. (And I wouldn't want to start creating symlinks from PGDATA to /etc/ssl/something again...) Thanks, Christoph -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Switch to unnamed POSIX semaphores as our preferred sema code?
On Thu, Oct 6, 2016 at 9:46 AM, Tom Lane wrote: > Can anyone think of a test case that would stress semaphore operations > more heavily, without being unrealistic? I think it's going to be pretty hard to come up with a non-artificial test case that has exhibits meaningful lwlock contention on an 8-core system. If you go back to 9.1, before we had fast-path locking, you can do it, because the relation locks and vxid locks do cause noticeable contention on the lock manager locks in that version. Alternatively, you might try something like "pgbench -n -S -c $N -j $N" with a scale factor that doesn't fit in shared buffers. This probably won't produce significant contention because there are 128 LWLocks and only 8 cores, but you could reduce the number of buffer mapping LWLocks to, say, 4 and then you'd probably hit it fairly hard. Alternatively, get a bigger box. :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] pgbench vs. wait events
Hi, I decided to do some testing on hydra (IBM-provided community resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using the newly-enhanced wait event stuff to try to get an idea of what we're waiting for during pgbench. I did 30-minute pgbench runs with various configurations, but all had max_connections = 200, shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit = off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9, log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints = on. During each run, I ran this psql script in another window and captured the output: \t select wait_event_type, wait_event from pg_stat_activity where pid != pg_backend_pid() \watch 0.5 Then, I used a little shell-scripting to count up the number of times each wait event occurred in the output. First, I tried scale factor 3000 with 32 clients and got these results: 1 LWLockTranche | buffer_mapping 9 LWLockNamed | CLogControlLock 14 LWLockNamed | ProcArrayLock 16 Lock| tuple 25 LWLockNamed | CheckpointerCommLock 49 LWLockNamed | WALBufMappingLock 122 LWLockTranche | clog 182 Lock| transactionid 287 LWLockNamed | XidGenLock 1300 Client | ClientRead 1375 LWLockTranche | buffer_content 3990 Lock| extend 21014 LWLockNamed | WALWriteLock 28497 | 58279 LWLockTranche | wal_insert tps = 1150.803133 (including connections establishing) What jumps out here is, at least to me, is that there is furious contention on both the wal_insert locks and on WALWriteLock. Apparently, the system simply can't get WAL on disk fast enough to keep up with this workload. Relation extension locks and buffer_content locks also are also pretty common, both ahead of ClientRead, a relatively uncommon wait event on this test. The load average on the system was only about 3 during this test, indicating that most processes are in fact spending most of their time off-CPU. The first thing I tried was switching to unlogged tables, which produces these results: 1 BufferPin | BufferPin 1 LWLockTranche | lock_manager 2 LWLockTranche | buffer_mapping 8 LWLockNamed | ProcArrayLock 9 LWLockNamed | CheckpointerCommLock 9 LWLockNamed | CLogControlLock 11 LWLockTranche | buffer_content 37 LWLockTranche | clog 153 Lock| tuple 388 LWLockNamed | XidGenLock 827 Lock| transactionid 1267 Client | ClientRead 20631 Lock| extend 91767 | tps = 1223.239416 (including connections establishing) If you don't look at the TPS number, these results look like a vast improvement. The overall amount of time spent not waiting for anything is now much higher, and the problematic locks have largely disappeared from the picture. However, the load average now shoots up to about 30, because most of the time that the backends are "not waiting for anything" they are in fact in kernel wait state D; that is, they're stuck doing I/O. This suggests that we might want to consider advertising a wait state when a backend is doing I/O, so we can measure this sort of thing. Next, I tried lowering the scale factor to something that fits in shared buffers. Here are the results at scale factor 300: 14 Lock| tuple 22 LWLockTranche | lock_manager 39 LWLockNamed | WALBufMappingLock 331 LWLockNamed | CLogControlLock 461 LWLockNamed | ProcArrayLock 536 Lock| transactionid 552 Lock| extend 716 LWLockTranche | buffer_content 763 LWLockNamed | XidGenLock 2113 LWLockNamed | WALWriteLock 6190 LWLockTranche | wal_insert 25002 Client | ClientRead 78466 | tps = 27651.562835 (including connections establishing) Obviously, there's a vast increase in TPS, and the backends seem to spend most of their time actually doing work. ClientRead is now the overwhelmingly dominant wait event, although wal_insert and WALWriteLock contention is clearly still a significant problem. Contention on other locks is apparently quite rare. Notice that client reads are really significant here - more than 20% of the time we sample what a backend is doing, it's waiting for the next query. It seems unlikely that performance on this workload can be improved very much by optimizing anything other than WAL writing, because no other wait event appears in more than 1% of the samples. It's not clear how much of the WAL-related stuff is artificial lock contention and how much of it is the limited speed at which the disk can rotate. However, I was curious about what's going on with CLogControlLock, persuant to the other thread on that topic, so I reran this with unlogged tables. At scale factor 300, 32 clie
[HACKERS] FSM corruption leading to errors
I investigated a bug report from one of our customers and it looked very similar to previous bug reports here [1], [2], [3] (and probably more). In these reports, the error looks something like this: ERROR: could not read block 28991 in file "base/16390/572026": read only 0 of 8192 bytes I traced it to the following code in MarkBufferDirtyHint(). The function returns without setting the DIRTY bit on the standby: 3413 /* 3414 * If we're in recovery we cannot dirty a page because of a hint. 3415 * We can set the hint, just not dirty the page as a result so the 3416 * hint is lost when we evict the page or shutdown. 3417 * 3418 * See src/backend/storage/page/README for longer discussion. 3419 */ 3420 if (RecoveryInProgress()) 3421 return; 3422 freespace.c freely uses MarkBufferDirtyHint() whenever changes are made to the FSM. I think that's usually alright because FSM changes are not WAL logged and if FSM ever returns a block with less free space than the caller needs, the caller is usually prepared to update the FSM and request for a new block. But if it returns a block that is outside the size of the relation, then we've a trouble. The very next ReadBuffer() fails to handle such a block and throws the error. When a relation is truncated, the FSM is truncated too to remove references to the heap blocks that are being truncated. But since the FSM buffer may not be marked DIRTY on the standby, if the buffer gets evicted from the buffer cache, the on-disk copy of the FSM page may be left with references to the truncated heap pages. When the standby is later promoted to be the master, and an insert/update is attempted to the table, the FSM may return a block that is outside the valid range of the relation. That results in the said error. Once this was clear, it was easy to put together a fully reproducible test case. See the attached script; you'll need to adjust to your environment. This affects all releases starting 9.3 and the script can reproduce the problem on all these releases. I believe the fix is very simple. The FSM change during truncation is critical and the buffer must be marked by MarkBufferDirty() i.e. those changes must make to the disk. I think it's alright not to WAL log them because XLOG_SMGR_TRUNCATE will redo() them if a crash occurs. But it must not be lost across a checkpoint. Also, since it happens only during relation truncation, I don't see any problem from performance perspective. What bothers me is how to fix the problem for already affected standbys. If the FSM for some table is already corrupted at the standby, users won't notice it until the standby is promoted to be the new master. If the standby starts throwing errors suddenly after failover, it will be a very bad situation for the users, like we noticed with our customers. The fix is simple and users can just delete the FSM (and VACUUM the table), but that doesn't sound nice and they would not know until they see the problem. One idea is to always check if the block returned by the FSM is outside the range and discard such blocks after setting the FSM (attached patch does that). The problem with that approach is that RelationGetNumberOfBlocks() is not really cheap and invoking it everytime FSM is consulted may not be a bright idea. Can we cache that value in the RelationData or some such place (BulkInsertState?) and use that as a hint for quickly checking if the block is (potentially) outside the range and discard it? Any other ideas? The other concern I've and TBH that's what I initially thought as the real problem, until I saw RecoveryInProgress() specific code, is: can this also affect stand-alone masters? The comments at MarkBufferDirtyHint() made me think so: 3358 * 3. This function does not guarantee that the buffer is always marked dirty 3359 *(due to a race condition), so it cannot be used for important changes. So I was working with a theory that somehow updates to the FSM page are lost because the race mentioned in the comment actually kicks in. But I'm not sure if the race is only possible when the caller is holding a SHARE lock on the buffer. When the FSM is truncated, the caller holds an EXCLUSIVE lock on the FSM buffer. So probably we're safe. I could not reproduce the issue on a stand-alone master. But probably worth checking. It might also be a good idea to inspect other callers of MarkBufferDirtyHint() and see if any of them is vulnerable, especially from standby perspective. I did one round, and couldn't see another problem. Thanks, Pavan [1] https://www.postgresql.org/message-id/CAJakt-8%3DaXa-F7uFeLAeSYhQ4wFuaX3%2BytDuDj9c8Gx6S_ou%3Dw%40mail.gmail.com [2] https://www.postgresql.org/message-id/20160601134819.30392.85...@wrigleys.postgresql.org [3] https://www.postgresql.org/message-id/AMSPR06MB504CD8FE8AA30D4B7C958AAE39E0%40AMSPR06MB504.eurprd06.prod.outlook.com -- Pav
Re: [HACKERS] Fast AT ADD COLUMN with DEFAULTs
On 10/6/16, Serge Rielau wrote: >> On Oct 6, 2016, at 9:20 AM, Tom Lane wrote: >> Vitaly Burovoy writes: >>> But what I discover for myself is that we have pg_attrdef separately >>> from the pg_attribute. Why? >> >> The core reason for that is that the default expression needs to be >> a separate object from the column for purposes of dependency analysis. >> For example, if you have a column whose default is "foo()", then the >> default expression depends on the function foo(), but the column should >> not: if you drop the function, only the default expression ought to >> be dropped, not the column. >> >> Because of this, the default expression needs to have its own OID >> (to be stored in pg_depend) and it's convenient to store it in a >> separate catalog so that the classoid can identify it as being a >> default expression rather than some other kind of object. > > Good to know. > >> If we were going to allow these missing_values or creation_defaults >> or whatever they're called to be general expressions, then they would >> need >> to have their own OIDs for dependency purposes. That would lead me to >> think that the best representation is to put them in their own rows in >> pg_attrdef, perhaps adding a boolean column to pg_attrdef to distinguish >> regular defaults from these things. Or maybe they even need their own >> catalog, depending on whether you think dependency analysis would want >> to distinguish them from regular defaults using just the classed. >> >> Now, as I just pointed out in another mail, realistically we're probably >> going to restrict the feature to simple constants, which'd mean they will >> depend only on the column's type and can never need any dependencies of >> their own. So we could take the shortcut of just storing them in a new >> column in pg_attribute. I agree with you. >> But maybe that's shortsighted and we'll >> eventually wish we'd done them as full-fledged separate objects. I don't think so. If we try to implement non-blocking adding columns with volatile defaults (and for instance update old rows in the background), we can end up with the next situation: CREATE TABLE a(i bigint PRIMARY KEY); INSERT INTO a SELECT generate_series(1,100); ALTER TABLE a ADD COLUMN b bigserial CHECK (b BETWEEN 1 AND 100); For indexes (even unique) created concurrently similar troubles are solved with a "not valid" mark, but what to do with a heap if we try to do it in the background? >> But on the third hand ... once one of these is in place, how could you >> drop it separately from the column? That would amount to a change in the >> column's stored data, which is not what one would expect from dropping >> a separate object. So maybe it's senseless to think that these things >> could ever be distinct objects. But that definitely leads to the >> conclusion that they're constants and nothing else. > I cannot follow this reasoning. > Let’s look past what PG does today: > For each row (whether that’s necessary or not) we evaluate the expression, > compute the value and > store it in the rewritten table. > We do not record dependencies on the “pedigree” of the value. > It happened to originate from the DEFAULT expression provided with the ADD > COLUMN, > but that is not remembered anywhere. > All we remember is the value - in each row. > So the only change that is proposed here - when it comes right down to it - > is to remember the value once only (IFF it is provably the same for each row) > and thus avoid the need to rewrite the table. > So I see no reason to impose any restriction other than “evaluated value is > provably the same for every row”. Tom says the same thing. The expression at the end should be a value if it allows to avoid rewriting table. > Regarding the location of storage. > I did start of using pg_attrdef, but ran into some snags. > My approach was to add the value as an extra column (rather than an extra > row). > That caused trouble since a SET DEFAULT operation is decomposed into a DROP > and a SET and > preserving the value across such operations did not come naturally. I'm sorry for making you be confused. The best way is to use an extra column in the pg_attribute to store serialized value. > If we were to use extra rows instead that issue would be solved, assuming we > add a “default kind” sort of column. > It would dictate the storage format though which may be considered overkill > for a a constant. -- Best regards, Vitaly Burovoy -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast AT ADD COLUMN with DEFAULTs
> On Oct 6, 2016, at 9:20 AM, Tom Lane wrote: > > Vitaly Burovoy writes: >> But what I discover for myself is that we have pg_attrdef separately >> from the pg_attribute. Why? > > The core reason for that is that the default expression needs to be > a separate object from the column for purposes of dependency analysis. > For example, if you have a column whose default is "foo()", then the > default expression depends on the function foo(), but the column should > not: if you drop the function, only the default expression ought to > be dropped, not the column. > > Because of this, the default expression needs to have its own OID > (to be stored in pg_depend) and it's convenient to store it in a > separate catalog so that the classoid can identify it as being a > default expression rather than some other kind of object. Good to know. > > If we were going to allow these missing_values or creation_defaults > or whatever they're called to be general expressions, then they would need > to have their own OIDs for dependency purposes. That would lead me to > think that the best representation is to put them in their own rows in > pg_attrdef, perhaps adding a boolean column to pg_attrdef to distinguish > regular defaults from these things. Or maybe they even need their own > catalog, depending on whether you think dependency analysis would want > to distinguish them from regular defaults using just the classed. > > Now, as I just pointed out in another mail, realistically we're probably > going to restrict the feature to simple constants, which'd mean they will > depend only on the column's type and can never need any dependencies of > their own. So we could take the shortcut of just storing them in a new > column in pg_attribute. But maybe that's shortsighted and we'll > eventually wish we'd done them as full-fledged separate objects. > > But on the third hand ... once one of these is in place, how could you > drop it separately from the column? That would amount to a change in the > column's stored data, which is not what one would expect from dropping > a separate object. So maybe it's senseless to think that these things > could ever be distinct objects. But that definitely leads to the > conclusion that they're constants and nothing else. I cannot follow this reasoning. Let’s look past what PG does today: For each row (whether that’s necessary or not) we evaluate the expression, compute the value and store it in the rewritten table. We do not record dependencies on the “pedigree” of the value. It happened to originate from the DEFAULT expression provided with the ADD COLUMN, but that is not remembered anywhere. All we remember is the value - in each row. So the only change that is proposed here - when it comes right down to it - is to remember the value once only (IFF it is provably the same for each row) and thus avoid the need to rewrite the table. So I see no reason to impose any restriction other than “evaluated value is provably the same for every row”. Regarding the location of storage. I did start of using pg_attrdef, but ran into some snags. My approach was to add the value as an extra column (rather than an extra row). That caused trouble since a SET DEFAULT operation is decomposed into a DROP and a SET and preserving the value across such operations did not come naturally. If we were to use extra rows instead that issue would be solved, assuming we ad a “default kind” sort of column. It would dictate the storage format though which may be considered overkill for a a constant. Cheers Serge -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast AT ADD COLUMN with DEFAULTs
On 10/6/16, Vitaly Burovoy wrote: > Ough. I made a mistake about pg_attribute because I forgot about the > pg_attrdef. > If we do not merge these tables, the pg_attrdef is the best place to > store evaluated expression as a constant the same way defaults are > stored in adbin. Oops. While I was writing the previous email, Tom explained necessity of the pg_attrdef. With that explanation it is obvious that I was wrong and a value for pre-existing rows should be in a new column in the pg_attribute. All the other thoughts from my previous email stand good. -- Best regards, Vitaly Burovoy -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast AT ADD COLUMN with DEFAULTs
On 10/6/16, Serge Rielau wrote: >> On Oct 6, 2016, at 9:01 AM, Tom Lane wrote: >> BTW, it also occurs to me that there are going to be good implementation >> reasons for restricting it to be a hard constant, not any sort of >> expression. We are likely to need to be able to insert the value in >> low-level code where general expression evaluation is impractical. >> > Yes, the padding must happen primarily in the getAttr() routines. > Clearly we do not want to evaluate an expression there. > But what speaks against evaluating the expression before we store it? > After all we seem to all agree that this only works if the expression > computes to the same constant all the time. > > If we do not want to store an “untyped” datum straight in pg_attribute as a > BYTEA (my current approach) Ough. I made a mistake about pg_attribute because I forgot about the pg_attrdef. If we do not merge these tables, the pg_attrdef is the best place to store evaluated expression as a constant the same way defaults are stored in adbin. > we could store the pretty printed version of the constant It is a wrong way. It ruins commands like "ALTER COLUMN ... TYPE ... USING" > and evaluate that when we build the tuple descriptor. > This happens when we load the relation into the relcache. > > Anyway, I’m jumping ahead and it’s perhaps best to let the code speak for > itself once I have the WIP patch ready so we have something concrete to > discuss -- Best regards, Vitaly Burovoy -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast AT ADD COLUMN with DEFAULTs
Vitaly Burovoy writes: > But what I discover for myself is that we have pg_attrdef separately > from the pg_attribute. Why? The core reason for that is that the default expression needs to be a separate object from the column for purposes of dependency analysis. For example, if you have a column whose default is "foo()", then the default expression depends on the function foo(), but the column should not: if you drop the function, only the default expression ought to be dropped, not the column. Because of this, the default expression needs to have its own OID (to be stored in pg_depend) and it's convenient to store it in a separate catalog so that the classoid can identify it as being a default expression rather than some other kind of object. If we were going to allow these missing_values or creation_defaults or whatever they're called to be general expressions, then they would need to have their own OIDs for dependency purposes. That would lead me to think that the best representation is to put them in their own rows in pg_attrdef, perhaps adding a boolean column to pg_attrdef to distinguish regular defaults from these things. Or maybe they even need their own catalog, depending on whether you think dependency analysis would want to distinguish them from regular defaults using just the classoid. Now, as I just pointed out in another mail, realistically we're probably going to restrict the feature to simple constants, which'd mean they will depend only on the column's type and can never need any dependencies of their own. So we could take the shortcut of just storing them in a new column in pg_attribute. But maybe that's shortsighted and we'll eventually wish we'd done them as full-fledged separate objects. But on the third hand ... once one of these is in place, how could you drop it separately from the column? That would amount to a change in the column's stored data, which is not what one would expect from dropping a separate object. So maybe it's senseless to think that these things could ever be distinct objects. But that definitely leads to the conclusion that they're constants and nothing else. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast AT ADD COLUMN with DEFAULTs
On 10/6/16, Tom Lane wrote: > Serge Rielau writes: >>> On Oct 6, 2016, at 5:25 AM, Vitaly Burovoy >>> wrote: Which makes me think we should call this missing_value or absent_value Be honest Simon Rigg's wrote that words. so its clear that it is not a "default" it is the value we use for rows that do not have any value stored for them. > >> I like Tom’s “creation default”. Another one could be “initial default”. >> But that, too, can be misread. > > Something based on missing_value/absent_value could work for me too. > > If we name it something involving "default", that definitely increases > the possibility for confusion with the regular user-settable default. > > Also worth thinking about here is that the regular default expression > affects what will be put into future inserted rows, whereas this thing > affects the interpretation of past rows. So it's really quite a different > animal. That's kind of leading me away from calling it creation_default. > > BTW, it also occurs to me that there are going to be good implementation > reasons for restricting it to be a hard constant, not any sort of > expression. We are likely to need to be able to insert the value in > low-level code where general expression evaluation is impractical. Yes, I mentioned that it should be evaluated and stored as a value because user functions can be changed (besides the speed reason), that's why I like the "value" in its name. The "default" is usually identified with expressions, not values (which are particular cases of expressions). Serge mentioned the phrase "pre-existing rows", which makes me think about something like "pre_existing_value" -- Best regards, Vitaly Burovoy -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast AT ADD COLUMN with DEFAULTs
> On Oct 6, 2016, at 9:01 AM, Tom Lane wrote: > > BTW, it also occurs to me that there are going to be good implementation > reasons for restricting it to be a hard constant, not any sort of > expression. We are likely to need to be able to insert the value in > low-level code where general expression evaluation is impractical. > Yes, the padding must happen primarily in the getAttr() routines. Clearly we do not want to evaluate an expression there. But what speaks against evaluating the expression before we store it? After all we seem to all agree that this only works if the expression computes to the same constant all the time. If we do not want to store an “untyped” datum straight in pg_attribute as a BYTEA (my current approach) we could store the pretty printed version of the constant and evaluate that when we build the tuple descriptor. This happens when we load the relation into the relcache. Anyway, I’m jumping ahead and it’s perhaps best to let the code speak for itself once I have the WIP patch ready so we have something concrete to discuss Cheers Serge -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] memory leak in e94568ecc10 (pre-reading in external sort)
On Thu, Oct 6, 2016 at 8:44 AM, Peter Geoghegan wrote: > Besides, what I propose to do is really exactly the same as what you > also want to do, except it avoids actually changing state->maxTapes. > We'd just pass down what you propose to assign to state->maxTapes > directly, which differs (and not just in the common case where there > are inactive tapes -- it's always at least off-by-one). Right? What I mean is that you should pass down numTapes alongside numInputTapes. The function init_tape_buffers() could either have an additional argument (numTapes), or derive numTapes itself. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast AT ADD COLUMN with DEFAULTs
Serge Rielau writes: >> On Oct 6, 2016, at 5:25 AM, Vitaly Burovoy wrote: >>> Which makes me think we should call this missing_value or absent_value >>> so its clear that it is not a "default" it is the value we use for >>> rows that do not have any value stored for them. > I like Tomâs âcreation defaultâ. Another one could be âinitial > defaultâ. > But that, too, can be misread. Something based on missing_value/absent_value could work for me too. If we name it something involving "default", that definitely increases the possibility for confusion with the regular user-settable default. Also worth thinking about here is that the regular default expression affects what will be put into future inserted rows, whereas this thing affects the interpretation of past rows. So it's really quite a different animal. That's kind of leading me away from calling it creation_default. BTW, it also occurs to me that there are going to be good implementation reasons for restricting it to be a hard constant, not any sort of expression. We are likely to need to be able to insert the value in low-level code where general expression evaluation is impractical. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] memory leak in e94568ecc10 (pre-reading in external sort)
On Thu, Oct 6, 2016 at 12:00 AM, Heikki Linnakangas wrote: > This is related to earlier the discussion with Peter G, on whether we should > change state->maxTapes to reflect the actual number of tape that were used, > when that's less than maxTapes. I think his confusion about the loop in > init_tape_buffers(), in > CAM3SWZQ7=fcy1iuz6jnzuunnktag6uitc1i-donxscp-9zs...@mail.gmail.com would > also have been avoided, if we had done that. So I think we should reconsider > that. -1 on that from me. I don't think that you should modify a variable that is directly linkable to Knuth's description of polyphase merge -- doing so seems like a bad idea. state->maxTapes (Knuth's T) really is something that is established pretty early on, and doesn't change. While the fix you pushed was probably a good idea anyway, I still think you should not use state->maxTapes to exhaustively call LogicalTapeAssignReadBufferSize() on every tape, even non-active tapes. That's the confusing part. It's not as if your need for the number of input tapes (the number of maybe-active tapes) is long lived; you just need to instruct logtape.c on buffer sizing once, at the start of mergeruns(). Besides, what I propose to do is really exactly the same as what you also want to do, except it avoids actually changing state->maxTapes. We'd just pass down what you propose to assign to state->maxTapes directly, which differs (and not just in the common case where there are inactive tapes -- it's always at least off-by-one). Right? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast AT ADD COLUMN with DEFAULTs
> On Oct 6, 2016, at 5:25 AM, Vitaly Burovoy wrote: > > On 10/6/16, Simon Riggs wrote: >> On 6 October 2016 at 04:43, Serge Rielau wrote: > Or should I compose some sort of a design document? >> >> Having read this thread, I'm a little unclear as to what you're >> writing now, though there's definitely good ideas here. >> >> I think it would be beneficial to write up a single coherent >> description of this, including behaviour and a small sketch of >> implementation, just so everyone knows what this is. No design doc, >> but a summary. > > At the moment I think it can also be a good idea to post the current > patch as a Proposal or a WIP to get initial feedback. I can do that - Accompanied by a posting sized overview. > > Yes, it works for stable "now()" but does not work for volatile > functions like "random()", "uuid_generate_v4()" or default for serial > columns. The only possible way I can see is to check an expression has > only "T_Const"s, static and stable functions. In such case the > expression can be evaluated and the result be saved as a value for > absented attributes of a tuple. In the other case save NULL there and > rewrite the table. Agreed. I think DEFAULT as-is does the job nicely function wise. One can always decompose the ADD COLUMN into two steps within the same transaction if the initial column value for pre-existing rows does not match the default for new or updated rows. AT Just needs a performance boost for large tables where that’s reasonably possible. >> Which makes me think we should call this missing_value or absent_value >> so its clear that it is not a "default" it is the value we use for >> rows that do not have any value stored for them. > > It is definitely a default for a user, it is not a regular default internally. > I'm not a native speaker, "absent_value" can be mixed up with a NULL. > As for me the best phrase is "pre-add-column-default", but it is > impossible to use it as a column name. :-( > It is still an open question. I like Tom’s “creation default”. Another one could be “initial default”. But that, too, can be misread. Cheers Serge Rielau Salesforce.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Indexes
On Wed, Oct 5, 2016 at 10:22 PM, Robert Haas wrote: > On Tue, Oct 4, 2016 at 12:36 AM, Amit Kapila wrote: >> I think one way to avoid the risk of deadlock in above scenario is to >> take the cleanup lock conditionally, if we get the cleanup lock then >> we will delete the items as we are doing in patch now, else it will >> just mark the tuples as dead and ensure that it won't try to remove >> tuples that are moved-by-split. Now, I think the question is how will >> these dead tuples be removed. We anyway need a separate mechanism to >> clear dead tuples for hash indexes as during scans we are marking the >> tuples as dead if corresponding tuple in heap is dead which are not >> removed later. This is already taken care in btree code via >> kill_prior_tuple optimization. So I think clearing of dead tuples can >> be handled by a separate patch. > > That seems like it could work. The hash scan code will need to be > made smart enough to ignore any tuples marked dead, if it isn't > already. > It already takes care of ignoring killed tuples in below code, though the way is much less efficient as compare to btree. Basically, it fetches the matched tuple and then check if it is dead where as in btree while matching the key, it does the same check. It might be efficient to do it before matching the hashkey, but I think it is a matter of separate patch. hashgettuple() { .. /* * Skip killed tuples if asked to. */ if (scan->ignore_killed_tuples) } -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PostgreSQL - Weak DH group
On 10/05/2016 09:57 PM, Heikki Linnakangas wrote: On 10/05/2016 05:15 PM, Nicolas Guini wrote: We are working with Postgres 9.3.14 and executing nmap we found that it is using “weak DH group” (nmap –script ssl-dh-params). Weak = 1024 bits. Yeah, it seems that we're a bit behind the times on this... This issue is similar to what this post explains about using weak DH parameters: http://www.usefuljs.net/2016/09/29/imperfect-forward-secrecy/ The blog post points out that, as counterintuitive as it sounds, the SSL_CTX_set_tmp_dh_callback() callback should ignore the keylength argument, and always return a DH group with 2048 bits, or stronger. As you pointed out, that's not what our callback does. I propose the attached patch. It gives up on trying to deal with multiple key lengths (as noted earlier, OpenSSL just always passed keylength=1024, so that was useless). Instead of using the callback, it just sets fixed DH parameters with SSL_CTX_set_tmp_dh(), like we do for the ECDH curve. The DH parameters are loaded from a file called "dh_params.pem" (instead of "dh1024.pem"), if present, otherwise the built-in 2048 bit parameters are used. I removed the code for generating DH parameters on-the-fly altogether. The OpenSSL man page clearly says that that should not be done, because generating the parameters takes a long time. And because OpenSSL always passed keylength=1024, AFAICS that had been always dead code. If we want to get really fancy, we could generate the parameters the first time you turn ssl=on, and store them in $PGDATA. But the generation takes a very long time, so the admin might think it's stuck. I note that supplying custom DH parameters in the file is completely undocumented :-(. We should add a note in the docs on how to generate the custom DH parameters with openssl. Also, there's no easy way of telling if your custom parameters are actually been used. If you copy the params file in $PGDATA, but you have a typo in the name, you won't notice. Perhaps we should print a line in the log when the file is loaded. I'm afraid of back-patching this, out of fear that older clients would stop working, or would downgrade to an even weaker cipher. You could fix it by putting weaker 1024 bit params in dh_params.pem, so maybe we could do it if we put a warning and instructions for doing that in the release notes. Thoughts? - Heikki >From 69f74e3cced093c9ae2cce830e31c3fd76a8a06b Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Thu, 6 Oct 2016 16:22:29 +0300 Subject: [PATCH 1/1] Always use 2048 bit DH parameters for OpenSSL ephemeral DH ciphers. 1024 bits is considered weak these days, but OpenSSL always passes 1024 as the key length to the tmp_dh callback. All the code to handle other key lengths as in fact dead. To remedy those issues: * Only include hard-coded 2048-bit parameters. * Set the parameters directly with SSL_CTX_set_tmp_dh(), without the callback * The file for the DH parameters is now called "dh_params.pem", instead of "dh1024.pem" (the files for other key lengths, dh512.pem, dh2048.pem, etc. were never actually used) Per report by Nicolas Guini Discussion: --- src/backend/libpq/be-secure-openssl.c | 198 +- 1 file changed, 48 insertions(+), 150 deletions(-) diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c index 668f217..fc7dd6a 100644 --- a/src/backend/libpq/be-secure-openssl.c +++ b/src/backend/libpq/be-secure-openssl.c @@ -65,18 +65,19 @@ #include "tcop/tcopprot.h" #include "utils/memutils.h" +/* name of the file containing DH parameters */ +#define DH_PARAMS_FILE "dh_params.pem" static int my_sock_read(BIO *h, char *buf, int size); static int my_sock_write(BIO *h, const char *buf, int size); static BIO_METHOD *my_BIO_s_socket(void); static int my_SSL_set_fd(Port *port, int fd); -static DH *load_dh_file(int keylength); +static DH *load_dh_file(void); static DH *load_dh_buffer(const char *, size_t); -static DH *generate_dh_parameters(int prime_len, int generator); -static DH *tmp_dh_cb(SSL *s, int is_export, int keylength); static int verify_cb(int, X509_STORE_CTX *); static void info_cb(const SSL *ssl, int type, int args); +static void initialize_dh(void); static void initialize_ecdh(void); static const char *SSLerrmessage(unsigned long ecode); @@ -94,14 +95,14 @@ static SSL_CTX *SSL_context = NULL; * sessions even if the static private key is compromised, * so we are *highly* motivated to ensure that we can use * EDH even if the DBA... or an attacker... deletes the - * $DataDir/dh*.pem files. + * $DataDir/dh_params.pem file. * * We could refuse SSL connections unless a good DH parameter * file exists, but some clients may quietly renegotiate an * unsecured connection without fully informing the user. * Very uncool. * - * Alternatively, the backend could attempt to load these files + * Alternative
Re: [HACKERS] Switch to unnamed POSIX semaphores as our preferred sema code?
I wrote: > Although in normal cases the semaphore code paths aren't very heavily > exercised in our code, I was able to get a measurable performance > difference by building with --disable-spinlocks, so that spinlocks are > emulated with semaphores. On an 8-core RHEL6 machine, "pgbench -S -c 20 > -j 20" seems to be about 4% faster with unnamed semaphores than SysV > semaphores. It'd be good to replicate that test on some higher-end > hardware, but provisionally I'd say unnamed semaphores are faster. I realized that the above test is probably bogus, or at least not very relevant to real-world Postgres performance. A key performance aspect of Linux futexes is that uncontended lock acquisitions, as well as releases that don't need to wake anyone, don't enter the kernel at all. However, in PG's normal use of semaphores, neither scenario occurs very often; processes lock their semaphores only after determining that they need to wait, and release semaphores only when it's known they'll waken a sleeper. The futex fast-path cases can occur only in the race condition that someone else awakens a would-be waiter before it actually reaches its semop call. However, uncontended locks and releases *are* very common for spinlocks. This means that testing with --disable-spinlocks will show a futex performance benefit that's totally irrelevant for real cases. Based on that analysis, I abandoned testing with --disable-spinlocks and instead tried to measure the actual speed of contended heavyweight lock acquisition/release. I used pgbench -f lockscript.sql -c 20 -j 20 -T 60 bench with the script being begin; lock table pgbench_accounts; commit; I got speeds between 10500 and 10800 TPS with either semaphore API; if there's any difference at all, it's below the noise level for this test scenario. So I'm now thinking there's basically no performance consideration here, and the point of switching would just be to get out from under SysV kernel resource limits. (Again though, this applies only to Linux --- the other thread I cited suggests things might be quite different on FreeBSD for instance.) Can anyone think of a test case that would stress semaphore operations more heavily, without being unrealistic? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw : altering foreign table not invalidating prepare statement execution plan.
On 2016/10/06 20:17, Amit Langote wrote: On 2016/10/05 20:45, Etsuro Fujita wrote: On 2016/10/05 14:09, Ashutosh Bapat wrote: I wrote: So, I added a new callback function for those caches that is much like PlanCacheFuncCallback but skips checking the list for the query tree. IMO, maintaining that extra function and the risk of bugs because of not keeping those two functions in sync outweigh the small not-so-frequent gain. The inefficiency wouldn't be negligible in the case where there are large numbers of cached plans. So, I'd like to introduce a helper function that checks the dependency list for the generic plan, to eliminate most of the duplication. I too am inclined to have a PlanCacheForeignCallback(). Just one minor comment: Thanks for the comments! I noticed that we were wrong. Your patch was modified so that dependencies on FDW-related objects would be extracted from a given plan in create_foreignscan_plan (by Ashutosh) and then in set_foreignscan_references by me, but that wouldn't work well for INSERT cases. To fix that, I'd like to propose that we collect the dependencies from the given rte in add_rte_to_flat_rtable, instead. Attached is an updated version, in which I removed the PlanCacheForeignCallback and adjusted regression tests a bit. If some writable FDW consulted foreign table/server/FDW options in AddForeignUpdateTarget, which adds the extra junk columns for UPDATE/DELETE to the targetList of the given query tree, the rewritten query tree would also need to be invalidated. But I don't think such an FDW really exists because that routine in a practical FDW wouldn't change such columns depending on those options. I had second thoughts about that; since the possibility wouldn't be zero, I added to extract_query_dependencies_walker the same code I added to add_rte_to_flat_rtable. After all, the updated patch is much like your version, but I think your patch, which modified extract_query_dependencies_walker only, is not enough because a generic plan can have more dependencies than its query tree (eg, consider inheritance). Best regards, Etsuro Fujita foreign_plan_cache_inval_v4.patch Description: binary/octet-stream -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] psql autocomplete works not good in USER command in 9.6
On Thu, Oct 6, 2016 at 9:17 PM, Дмитрий Воронин wrote: > I find, that autocomplete does not works good when I type in psql some USER > commands: > > CREATE USER ... > ALTER USER ... > > psql's autocomplete returns database users and "pg_signal_backend" function, > I suppose, that is not correct. > > I try to undestand reasons and make a patch if those problem has not been > solved yet. I don't see any problems here. CREATE/ALTER USER have for a pretty long time tab-completed with the existing role names. 9.6 is not really new to that. What's new though in 9.6 is the introduction of pg_signal_backend in the list as a system role. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_rewind hangs if --source-server is used and syncrep is enabled
On Thu, Oct 6, 2016 at 7:37 PM, Heikki Linnakangas wrote: > Committed, thanks! I moved the call to where we establish the connection, > that felt slightly more natural. Thanks for the commit. Indeed that's better with the other sanity checks. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast AT ADD COLUMN with DEFAULTs
On 10/6/16, Simon Riggs wrote: > On 6 October 2016 at 04:43, Serge Rielau wrote: Or should I compose some sort of a design document? > > Having read this thread, I'm a little unclear as to what you're > writing now, though there's definitely good ideas here. > > I think it would be beneficial to write up a single coherent > description of this, including behaviour and a small sketch of > implementation, just so everyone knows what this is. No design doc, > but a summary. At the moment I think it can also be a good idea to post the current patch as a Proposal or a WIP to get initial feedback. > It would be very useful to be able to do this... > ALTER TABLE foo ADD last_updated_timestamp timestamp default > current_timestamp > so that it generates a constant value and stores that for all prior > rows, but then generates a new value for future rows. Yes, it works for stable "now()" but does not work for volatile functions like "random()", "uuid_generate_v4()" or default for serial columns. The only possible way I can see is to check an expression has only "T_Const"s, static and stable functions. In such case the expression can be evaluated and the result be saved as a value for absented attributes of a tuple. In the other case save NULL there and rewrite the table. > Which makes me think we should call this missing_value or absent_value > so its clear that it is not a "default" it is the value we use for > rows that do not have any value stored for them. It is definitely a default for a user, it is not a regular default internally. I'm not a native speaker, "absent_value" can be mixed up with a NULL. As for me the best phrase is "pre-add-column-default", but it is impossible to use it as a column name. :-( It is still an open question. (I remember funny versions in a discussion[1] when people tried to choose a name for a function reversed to pg_size_pretty...) [1] https://www.postgresql.org/message-id/flat/cafj8prd-tgodknxdygecza4on01_urqprwf-8ldkse-6bdh...@mail.gmail.com -- Best regards, Vitaly Burovoy -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] psql autocomplete works not good in USER command in 9.6
Hello, I find, that autocomplete does not works good when I type in psql some USER commands: CREATE USER ... ALTER USER ... psql's autocomplete returns database users and "pg_signal_backend" function, I suppose, that is not correct. I try to undestand reasons and make a patch if those problem has not been solved yet. -- Best regards, Dmitry Voronin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] autonomous transactions
On 06/10/16 11:56, Simon Riggs wrote: > > * The labelling "Autonomous Transaction" is a simple coat of paint, > which can easily be transferred to a better implementation if one > comes. If one doesn't then its better to have something than nothing. > So I suggest we commit Background Transactions first and then in a > fairly thin commit, implement Autonomous Transactions on top of it for > now and if we get a better one, switch it over. > +1 to this -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast AT ADD COLUMN with DEFAULTs
On 6 October 2016 at 04:43, Serge Rielau wrote: >>> Or should I compose some sort of a design document? Having read this thread, I'm a little unclear as to what you're writing now, though there's definitely good ideas here. I think it would be beneficial to write up a single coherent description of this, including behaviour and a small sketch of implementation, just so everyone knows what this is. No design doc, but a summary. It would be very useful to be able to do this... ALTER TABLE foo ADD last_updated_timestamp timestamp default current_timestamp so that it generates a constant value and stores that for all prior rows, but then generates a new value for future rows. Which makes me think we should call this missing_value or absent_value so its clear that it is not a "default" it is the value we use for rows that do not have any value stored for them. -- Simon Riggshttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw : altering foreign table not invalidating prepare statement execution plan.
Thanks to both of you for taking this up and sorry about the delay in responding. On 2016/10/05 20:45, Etsuro Fujita wrote: > On 2016/10/05 14:09, Ashutosh Bapat wrote: >>> I think it would be a bit inefficient to use PlanCacheFuncCallback as the >>> inval callback function for those caches, because that checks the >>> inval-item >>> list for the rewritten query tree, but any updates on such catalogs >>> wouldn't >>> affect the query tree. > >> I am not sure about that. Right now it seems that only the plans are >> affected, but can we say that for all FDWs? > > If some writable FDW consulted foreign table/server/FDW options in > AddForeignUpdateTarget, which adds the extra junk columns for > UPDATE/DELETE to the targetList of the given query tree, the rewritten > query tree would also need to be invalidated. But I don't think such an > FDW really exists because that routine in a practical FDW wouldn't change > such columns depending on those options. > >>> So, I added a new callback function for those caches >>> that is much like PlanCacheFuncCallback but skips checking the list for >>> the >>> query tree. > >> I am not sure that the inefficiency because of an extra loop on >> plansource->invalItems is a good reason to duplicate most of the code >> in PlanCacheFuncCallback(). IMO, maintaining that extra function and >> the risk of bugs because of not keeping those two functions in sync >> outweigh the small not-so-frequent gain. The name of function may be >> changed to be more generic, instead of current one referring Func. > > The inefficiency wouldn't be negligible in the case where there are large > numbers of cached plans. So, I'd like to introduce a helper function that > checks the dependency list for the generic plan, to eliminate most of the > duplication. I too am inclined to have a PlanCacheForeignCallback(). Just one minor comment: +static void +CheckGenericPlanDependencies(CachedPlanSource *plansource, + int cacheid, uint32 hashvalue) +{ +if (plansource->gplan && plansource->gplan->is_valid) +{ How about move the if() to the callers so that: +/* + * Check the dependency list for the generic plan. + */ +if (plansource->gplan && plansource->gplan->is_valid) +CheckGenericPlanDependencies(plansource, cacheid, hashvalue); That will reduce the indentation level within the function definition. By the way, one wild idea may be to have a fixed number of callbacks based on their behavior, rather than which catalogs they are meant for. For example, have 2 suitably named callbacks: first that invalidates both CachedPlanSource and CachedPlan, second that invalidates only CachedPlan. Use the appropriate one whenever a new case arises where we decide to mark plans dependent on still other types of objects. Just an idea... Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_rewind hangs if --source-server is used and syncrep is enabled
On 10/06/2016 02:24 AM, Michael Paquier wrote: On Wed, Oct 5, 2016 at 11:53 PM, Michael Banck wrote: My colleague Christoph Berg pointed out that pg_rewind could just set synchronous_commit = local before creating the temporary table, which indeed works, proof-of-concept patch attached Even synchronous_commit = off would not matter much, and we could just use that for performance reasons. The temporary table used in this context is just used to track the delta chunks of blocks, so this solution sounds better to me. I'll patch 9.4's pg_rewind similarly to what is decided here, and we could as well use an extra PQexec call to bring more clarity for the code, now an extra round-trip could be a big deal where network lag matters, but compared to the COPY chunks sent out that would not matter much I guess. Committed, thanks! I moved the call to where we establish the connection, that felt slightly more natural. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] autonomous transactions
On 7 September 2016 at 20:46, Robert Haas wrote: > On Sat, Sep 3, 2016 at 7:09 AM, Simon Riggs wrote: >> On 2 September 2016 at 09:45, Robert Haas wrote: >>> On Wed, Aug 31, 2016 at 7:20 AM, Peter Eisentraut >>> wrote: I would like to propose the attached patch implementing autonomous transactions for discussion and review. >>> >>> I'm pretty skeptical of this approach. Like Greg Stark, Serge Rielau, >>> and Constantin Pan, I had expected that autonomous transactions should >>> be implemented inside of a single backend, without relying on workers. >> >> The problem is that we have limits on the number of concurrent >> transactions, which is currently limited by the number of procs. > > True. I believe this issue has been discussed before, and I think > it's a solvable issue. I believe that an autonomous transaction could > be made to appear to the rest of the system (outside the affected > backend) as if it were a subtransaction, but then get committed before > the containing transaction gets committed. This would avoid needing > more PGPROCs (but getting deadlock detection to work is hard). Just to point out that I've actually written this approach already. The patch is available, Autonomous Subtransactions. We discussed it in Ottawa and it was rejected. (I thought Robert was there, but Serge and Tom definitely were). See other posts in this thread by Serge and Craig to explain why. I take your point that startup time may be not as good as it can be. But what Peter has works and is useful, whatever we call it. We have various approaches... summarised in chronological order of their suggestion 1. Use additional PGXACTs - rejected because it wouldn't provide enough room 2. Use Autonomous SubTransactions - rejected because the semantics are different to what we might expect from ATs 3. Use background transactions (this thread) 4. Use pause and resume so we don't use up too many PGXACTs What I see is that there are some valid and useful features here and we should be developing them further because they have other benefits. * Autonomous Subtransactions might not work for Autonomous Transactions, but they are useful for incremental loading of large data, which then allows easier logical decoding. So that sounds like we should do that anyway, even if they are not ATs. They can also be used within parallel tasks. * Background Transactions are useful and we should have them. * The labelling "Autonomous Transaction" is a simple coat of paint, which can easily be transferred to a better implementation if one comes. If one doesn't then its better to have something than nothing. So I suggest we commit Background Transactions first and then in a fairly thin commit, implement Autonomous Transactions on top of it for now and if we get a better one, switch it over. -- Simon Riggshttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Transactions involving multiple postgres foreign servers
On 2016/10/06 17:45, Ashutosh Bapat wrote: > On Thu, Oct 6, 2016 at 1:34 PM, Masahiko Sawada wrote: >> On Thu, Oct 6, 2016 at 1:41 PM, Ashutosh Bapat >> wrote: My understanding is that basically the local server can not return COMMIT to the client until 2nd phase is completed. >>> >>> If we do that, the local server may not return to the client at all, >>> if the foreign server crashes and never comes up. Practically, it may >>> take much longer to finish a COMMIT, depending upon how long it takes >>> for the foreign server to reply to a COMMIT message. >> >> Yes, I think 2PC behaves so, please refer to [1]. >> To prevent local server stops forever due to communication failure., >> we could provide the timeout on coordinator side or on participant >> side. > > This too, looks like a heuristic and shouldn't be the default > behaviour and hence not part of the first version of this feature. At any rate, the coordinator should not return to the client until after the 2nd phase is completed, which was the original point. If COMMIT taking longer is an issue, then it could be handled with one of the approaches mentioned so far (even if not in the first version), but no version of this feature should really return COMMIT to the client only after finishing the first phase. Am I missing something? I am saying this because I am assuming that this feature means the client itself does not invoke 2PC, even knowing that there are multiple servers involved, but rather rely on the involved FDW drivers and related core code handling it transparently. I may have misunderstood the feature though, apologies if so. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Transactions involving multiple postgres foreign servers
On Thu, Oct 6, 2016 at 1:34 PM, Masahiko Sawada wrote: > On Thu, Oct 6, 2016 at 1:41 PM, Ashutosh Bapat > wrote: No, the COMMIT returns after the first phase. It can not wait for all the foreign servers to complete their second phase >>> >>> Hm, it sounds like it's same as normal commit (not 2PC). >>> What's the difference? >>> >>> My understanding is that basically the local server can not return >>> COMMIT to the client until 2nd phase is completed. >> >> >> If we do that, the local server may not return to the client at all, >> if the foreign server crashes and never comes up. Practically, it may >> take much longer to finish a COMMIT, depending upon how long it takes >> for the foreign server to reply to a COMMIT message. > > Yes, I think 2PC behaves so, please refer to [1]. > To prevent local server stops forever due to communication failure., > we could provide the timeout on coordinator side or on participant > side. > This too, looks like a heuristic and shouldn't be the default behaviour and hence not part of the first version of this feature. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Transactions involving multiple postgres foreign servers
On Thu, Oct 6, 2016 at 1:41 PM, Ashutosh Bapat wrote: >>> >>> No, the COMMIT returns after the first phase. It can not wait for all >>> the foreign servers to complete their second phase >> >> Hm, it sounds like it's same as normal commit (not 2PC). >> What's the difference? >> >> My understanding is that basically the local server can not return >> COMMIT to the client until 2nd phase is completed. > > > If we do that, the local server may not return to the client at all, > if the foreign server crashes and never comes up. Practically, it may > take much longer to finish a COMMIT, depending upon how long it takes > for the foreign server to reply to a COMMIT message. Yes, I think 2PC behaves so, please refer to [1]. To prevent local server stops forever due to communication failure., we could provide the timeout on coordinator side or on participant side. > >> Otherwise the next transaction can see data that is not committed yet >> on remote server. > > 2PC doesn't guarantee transactional consistency all by itself. It only > guarantees that all legs of a distributed transaction are either all > rolled back or all committed. IOW, it guarantees that a distributed > transaction is not rolled back on some nodes and committed on the > other node. > Providing a transactionally consistent view is a very hard problem. > Trying to solve all those problems in a single patch would be very > difficult and the amount of changes required may be really huge. Then > there are many possible consistency definitions when it comes to > consistency of distributed system. I have not seen a consensus on what > kind of consistency model/s we want to support in PostgreSQL. That's > another large debate. We have had previous attempts where people have > tried to complete everything in one go and nothing has been completed > yet. Yes, providing a atomic visibility is hard problem, and it's a separated issue[2]. > 2PC implementation OR guaranteeing that all the legs of a transaction > commit or roll back, is an essential block of any kind of distributed > transaction manager. So, we should at least support that one, before > attacking further problems. I agree. [1]https://en.wikipedia.org/wiki/Two-phase_commit_protocol [2]http://www.bailis.org/papers/ramp-sigmod2014.pdf Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Feature suggestion: ON CONFLICT DO NOTHING RETURNING which returns existing row data
> On 5 Oct 2016, at 8:11 PM, Pantelis Theodosiou wrote: > > This can be solved by chaining modifying CTEs. > > Something like this (not tested) that can work with multiple rows inserted: Thanks for the suggestion, but it was actually slower than our current implementation, I believe due to always looking up t1’s id in that join rather than only doing it when we didn’t get an id back from the insert. My hope with this feature suggestion / request was that we wouldn’t have to do that subsequent lookup at all, as pg would just give it back to us. Maybe it would be a win if we were inserting multiple rows, but this code is actually in a trigger on a dummy table that we COPY data in to - thus it can’t be rewritten as a rule or a multi-row insert like that. Thanks Tom -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] memory leak in e94568ecc10 (pre-reading in external sort)
On 10/06/2016 07:50 AM, Tomas Vondra wrote: it seems e94568ecc10 has a pretty bad memory leak. A simple Oops, fixed, thanks for the report! To be precise, this wasn't a memory leak, just a gross overallocation of memory. The new code in tuplesort.c assumes that it's harmless to call LogicalTapeRewind() on never-used tapes, but in fact, it allocated the read buffer for the tape. I fixed that by changing LogicalTapeRewind() to not allocate it, if the tape was completely empty. This is related to earlier the discussion with Peter G, on whether we should change state->maxTapes to reflect the actual number of tape that were used, when that's less than maxTapes. I think his confusion about the loop in init_tape_buffers(), in CAM3SWZQ7=fcy1iuz6jnzuunnktag6uitc1i-donxscp-9zs...@mail.gmail.com would also have been avoided, if we had done that. So I think we should reconsider that. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers