Re: [HACKERS] Pluggable storage
On Fri, Oct 20, 2017 at 5:47 AM, Amit Kapila wrote:
> I think what we need here is a way to register satisfies function
> (SnapshotSatisfiesFunc) in SnapshotData for different storage engines.

I don't see how that helps very much.  SnapshotSatisfiesFunc takes a
HeapTuple as an argument, it cares in detail about that tuple's xmin,
xmax, and infomask, and it sets hint bits.  All of that is bad, because
an alternative storage engine is likely to use a different format than
HeapTuple and to not have hint bits (or at least not in the same form
we have them now).  Also, it doesn't necessarily have a Boolean answer
to the question "can this snapshot see this tuple?".  It may be more
like "given this TID, what tuple, if any, can I see there?" or "given
this tuple, what version of it would I see with this snapshot?".

Another thing to consider is that, if we could replace the satisfies
function, it would probably break some existing code.  There are
multiple places in the code that compare snapshot->satisfies to
HeapTupleSatisfiesHistoricMVCC and HeapTupleSatisfiesMVCC.

I think the storage API should just leave snapshots alone.  If a
storage engine wants to call HeapTupleSatisfiesVisibility() with that
snapshot, it can do so.  Otherwise it can switch on
snapshot->satisfies and handle each case however it likes.  I don't
see how generalizing a Snapshot for other storage engines really buys
us anything except complexity and the danger of reducing performance
for the existing heap.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Implementing pg_receivewal --no-sync
On Wed, Oct 25, 2017 at 6:07 AM, Michael Paquier wrote:
> Hi all,
>
> After thinking a bit on the subject, I have decided to submit a patch
> to do $subject. This makes pg_receivewal more consistent with
> pg_basebackup. This option is mainly useful for testing, something
> that becomes way more doable since support for --endpos has been
> added.
>
> Unsurprisingly, --synchronous and --no-sync are incompatible options.

+   By default, pg_receivewal flushes a WAL segment's
+   contents each time a feedback message is sent to the server depending
+   on the interval of time defined by
+   --status-interval.

IMHO, it's okay to remove the part 'depending on the --status-interval'.

+   This option causes
+   pg_receivewal to not issue such flushes waiting,

Did you mean 'to not issue such flush waitings'?

+       [ 'pg_receivewal', '-D', $stream_dir, '--synchronous', '--no-sync' ],
+       'failure if --synchronous specified without --no-sync');

s/without/with

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel Hash take II
On Tue, Oct 24, 2017 at 10:10 PM, Thomas Munro wrote:
> Here is an updated patch set that does that ^.

It's a bit hard to understand what's going on with the v21 patch set I
posted yesterday because EXPLAIN ANALYZE doesn't tell you anything
interesting.  Also, if you apply the multiplex_gather patch[1] I
posted recently and set multiplex_gather to off then it doesn't tell
you anything at all, because the leader has no hash table (I suppose
that could happen with unpatched master given sufficiently bad
timing).  Here's a new version with an extra patch that adds some
basic information about load balancing to EXPLAIN ANALYZE, inspired by
what commit bf11e7ee did for Sort.  Example output:

enable_parallel_hash = on, multiplex_gather = on:

 ->  Parallel Hash (actual rows=100 loops=3)
       Buckets: 131072  Batches: 16
       Leader:   Shared Memory Usage: 3552kB  Hashed: 396120  Batches Probed: 7
       Worker 0: Shared Memory Usage: 3552kB  Hashed: 276640  Batches Probed: 6
       Worker 1: Shared Memory Usage: 3552kB  Hashed: 327240  Batches Probed: 6
       ->  Parallel Seq Scan on simple s (actual rows=33 loops=3)

 ->  Parallel Hash (actual rows=1000 loops=8)
       Buckets: 131072  Batches: 256
       Leader:   Shared Memory Usage: 2688kB  Hashed: 1347720  Batches Probed: 36
       Worker 0: Shared Memory Usage: 2688kB  Hashed: 1131360  Batches Probed: 33
       Worker 1: Shared Memory Usage: 2688kB  Hashed: 1123560  Batches Probed: 38
       Worker 2: Shared Memory Usage: 2688kB  Hashed: 1231920  Batches Probed: 38
       Worker 3: Shared Memory Usage: 2688kB  Hashed: 1272720  Batches Probed: 34
       Worker 4: Shared Memory Usage: 2688kB  Hashed: 1234800  Batches Probed: 33
       Worker 5: Shared Memory Usage: 2688kB  Hashed: 1294680  Batches Probed: 37
       Worker 6: Shared Memory Usage: 2688kB  Hashed: 1363240  Batches Probed: 35
       ->  Parallel Seq Scan on big s2 (actual rows=125 loops=8)

enable_parallel_hash = on, multiplex_gather = off (ie no leader participation):

 ->  Parallel Hash (actual rows=100 loops=2)
       Buckets: 131072  Batches: 16
       Worker 0: Shared Memory Usage: 3520kB  Hashed: 475920  Batches Probed: 9
       Worker 1: Shared Memory Usage: 3520kB  Hashed: 524080  Batches Probed: 8
       ->  Parallel Seq Scan on simple s (actual rows=50 loops=2)

enable_parallel_hash = off, multiplex_gather = on:

 ->  Hash (actual rows=100 loops=3)
       Buckets: 131072  Batches: 16
       Leader:   Memory Usage: 3227kB
       Worker 0: Memory Usage: 3227kB
       Worker 1: Memory Usage: 3227kB
       ->  Seq Scan on simple s (actual rows=100 loops=3)

enable_parallel_hash = off, multiplex_gather = off:

 ->  Hash (actual rows=100 loops=2)
       Buckets: 131072  Batches: 16
       Worker 0: Memory Usage: 3227kB
       Worker 1: Memory Usage: 3227kB
       ->  Seq Scan on simple s (actual rows=100 loops=2)

parallelism disabled (traditional single-line output, unchanged):

 ->  Hash (actual rows=100 loops=1)
       Buckets: 131072  Batches: 16  Memory Usage: 3227kB
       ->  Seq Scan on simple s (actual rows=100 loops=1)

(It actually says "Tuples Hashed", not "Hashed", but I edited the
above to fit on a standard punchcard.)

Thoughts?

[1] https://www.postgresql.org/message-id/CAEepm%3D2U%2B%2BLp3bNTv2Bv_kkr5NE2pOyHhxU%3DG0YTa4ZhSYhHiw%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com

parallel-hash-v22.patchset.tgz
Description: GNU Zip compressed data
Re: [HACKERS] Current int & float overflow checking is slow.
On 2017-10-25 07:33:46 +0200, Robert Haas wrote:
> On Tue, Oct 24, 2017 at 9:28 PM, Tom Lane wrote:
> > I don't like changing well-defined, user-visible query behavior for
> > no other reason than a performance gain (of a size that hasn't even
> > been shown to be interesting, btw). Will we change it back in another
> > ten years if the performance tradeoff changes?

That part of the argument seems unconvincing.  It's not like the
overflow check is likely to ever have been beneficial performancewise,
nor is it remotely likely for that to ever be the case.

> > Also, if I recall the old discussion properly, one concern was getting
> > uniform behavior across different platforms. I'm worried that if we do
> > what Andres suggests, we'll get behavior that is not only different but
> > platform-specific. Now, to the extent that you believe that every modern
> > platform implements edge-case IEEE float behavior the same way, that worry
> > may be obsolete. But I don't think I believe that.
>
> Yeah, those are reasonable concerns.

I agree.  I'm not really sure what the right way is here.  I do
however think it's worth discussing ways to address the performance
penalty due to the overflow checks, and one obvious way to do so is
not to play.

It'd be interesting to write the overflow-checking addition in x86
inline asm, and see how much better that gets - just so we know the
maximum we can reach with that.  The problem with the C99 stuff seems
to be the external function calls.  With either, one problem would be
that we'd have to reset the overflow register before doing math, which
isn't free either - otherwise some external function could have left
it set to on.

Greetings,

Andres Freund
Re: [HACKERS] unique index violation after pg_upgrade to PG10
On Tue, Oct 24, 2017 at 10:20 PM, Justin Pryzby wrote:
> I think you must have compared these:

Yes, I did.  My mistake.

> On Tue, Oct 24, 2017 at 03:11:44PM -0500, Justin Pryzby wrote:
>> ts=# SELECT * FROM bt_page_items(get_raw_page('sites_idx', 1));
>>
>> itemoffset | 48
>> ctid       | (1,37)
>> itemlen    | 32
>> nulls      | f
>> vars       | t
>> data       | 1b 43 52 43 4c 4d 54 2d 43 45 4d 53 30 0b 31 31 31 31 00 00 00 00 00 00
> ...
>> itemoffset | 37
>> ctid       | (0,97)
>> itemlen    | 24
>> nulls      | f
>> vars       | t
>> data       | 1b 43 52 43 4c 4d 54 2d 43 45 4d 53 30 03 00 00
>
> ..but note those are both items in sites_idx (48 and 37, which I seem to have
> pasted out of order)..  I included both because I'm not confident I know what
> the "index tid=(1,37)" referred to, but I gather it means item at offset=37
> (and not item with ctid=(1,37).)
>
> | [pryzbyj@database amcheck]$ psql --port 5678 ts -c "SELECT bt_index_check('sites_idx'::regclass::oid, heapallindexed=>True)"
> | ERROR:  high key invariant violated for index "sites_idx"
> | DETAIL:  Index tid=(1,37) points to heap tid=(0,97) page lsn=0/0.

This means that the item at (1,37) in the index is not <= the high
key, which is located at (1,1).  (The high key comes first, despite
being an upper bound rather than a lower bound, per the pageinspect
docs.)

I find it suspicious that the page lsn is 0 here, btw.

> ts=# SELECT * FROM page_header(get_raw_page('sites_idx', 1));
>  lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
> -----+----------+-------+-------+-------+---------+----------+---------+-----------
>  0/0 |        0 |     0 |   872 |  1696 |    8176 |     8192 |       4 |         0
>
> Here is its heap page:
>
> ts=# SELECT * FROM heap_page_items(get_raw_page('sites', 0)) WHERE lp=97;
>  lp | lp_off | lp_flags | lp_len | t_xmin |  t_xmax  | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff |  t_bits  | t_oid |                 t_data
> ----+--------+----------+--------+--------+----------+----------+--------+-------------+------------+--------+----------+-------+----------------------------------------
>  97 |    968 |        1 |     52 |  21269 | 33567444 |        0 | (3,27) |        8204 |       2307 |     32 | 11101001 |       | \x70001b4352434c4d542d43454d5330030303
>
> Which I see ends with 0303 vs ..

Looks like I was accidentally right, then -- the heap and index do
differ.

You might try this tool I published recently, to get a better sense of
details like this:

https://github.com/petergeoghegan/pg_hexedit

(Don't do so with a data directory that you cannot afford to corrupt
again, though!)

> Maybe this is relevant ?
>
> ts=# SELECT * FROM heap_page_items(get_raw_page('sites', 3)) WHERE lp=27;
>  lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
> ----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+--------
>  27 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |

This looks like an LP_DEAD item.

-- 
Peter Geoghegan
Re: [HACKERS] Current int & float overflow checking is slow.
On Tue, Oct 24, 2017 at 9:28 PM, Tom Lane wrote:
> I don't like changing well-defined, user-visible query behavior for
> no other reason than a performance gain (of a size that hasn't even
> been shown to be interesting, btw). Will we change it back in another
> ten years if the performance tradeoff changes?
>
> Also, if I recall the old discussion properly, one concern was getting
> uniform behavior across different platforms. I'm worried that if we do
> what Andres suggests, we'll get behavior that is not only different but
> platform-specific. Now, to the extent that you believe that every modern
> platform implements edge-case IEEE float behavior the same way, that worry
> may be obsolete. But I don't think I believe that.

Yeah, those are reasonable concerns.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] unique index violation after pg_upgrade to PG10
On Tue, Oct 24, 2017 at 02:57:47PM -0700, Peter Geoghegan wrote:
> On Tue, Oct 24, 2017 at 1:11 PM, Justin Pryzby wrote:
> > ..which I gather just verifies that the index is corrupt, not sure if there's
> > anything else to do with it?  Note, we've already removed the duplicate rows.
>
> Yes, the index itself is definitely corrupt -- this failed before the
> new "heapallindexed" check even started. Though it looks like that
> would have failed too, if you got that far, since the index points to
> a row that does not contain the same data. (I only know this because
> you dumped the heap tuple and the index tuple.)

I think you must have compared these:

On Tue, Oct 24, 2017 at 03:11:44PM -0500, Justin Pryzby wrote:
> ts=# SELECT * FROM bt_page_items(get_raw_page('sites_idx', 1));
>
> itemoffset | 48
> ctid       | (1,37)
> itemlen    | 32
> nulls      | f
> vars       | t
> data       | 1b 43 52 43 4c 4d 54 2d 43 45 4d 53 30 0b 31 31 31 31 00 00 00 00 00 00
...
> itemoffset | 37
> ctid       | (0,97)
> itemlen    | 24
> nulls      | f
> vars       | t
> data       | 1b 43 52 43 4c 4d 54 2d 43 45 4d 53 30 03 00 00

..but note those are both items in sites_idx (48 and 37, which I seem to have
pasted out of order)..  I included both because I'm not confident I know what
the "index tid=(1,37)" referred to, but I gather it means item at offset=37
(and not item with ctid=(1,37).)

| [pryzbyj@database amcheck]$ psql --port 5678 ts -c "SELECT bt_index_check('sites_idx'::regclass::oid, heapallindexed=>True)"
| ERROR:  high key invariant violated for index "sites_idx"
| DETAIL:  Index tid=(1,37) points to heap tid=(0,97) page lsn=0/0.

ts=# SELECT * FROM page_header(get_raw_page('sites_idx', 1));
 lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----+----------+-------+-------+-------+---------+----------+---------+-----------
 0/0 |        0 |     0 |   872 |  1696 |    8176 |     8192 |       4 |         0

Here is its heap page:

ts=# SELECT * FROM heap_page_items(get_raw_page('sites', 0)) WHERE lp=97;
 lp | lp_off | lp_flags | lp_len | t_xmin |  t_xmax  | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff |  t_bits  | t_oid |                 t_data
----+--------+----------+--------+--------+----------+----------+--------+-------------+------------+--------+----------+-------+----------------------------------------
 97 |    968 |        1 |     52 |  21269 | 33567444 |        0 | (3,27) |        8204 |       2307 |     32 | 11101001 |       | \x70001b4352434c4d542d43454d5330030303

Which I see ends with 0303 vs ..

t_infomask=2307=2048+256+3 =>
#define HEAP_HASNULL            0x0001  /* has null attribute(s) */
#define HEAP_HASVARWIDTH        0x0002  /* has variable-width attribute(s) */
#define HEAP_XMIN_COMMITTED     0x0100  /* t_xmin committed */
#define HEAP_XMAX_INVALID       0x0800  /* t_xmax invalid/aborted */

t_infomask2=8204 => 8192+12 =>
#define HEAP_KEYS_UPDATED       0x2000  /* tuple was updated and key cols modified, or tuple deleted */

Maybe this is relevant ?

ts=# SELECT * FROM heap_page_items(get_raw_page('sites', 3)) WHERE lp=27;
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+--------
 27 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |

Justin
Re: [HACKERS] Remove secondary checkpoint
On Tue, Oct 24, 2017 at 7:24 PM, Tsunakawa, Takayuki wrote:
> (3)
> Should we change the default value of max_wal_size from 1 GB to a smaller
> size?  I vote for "no" for performance.

The default just changed in v10, so changing it again could be
confusing; I agree with your position.

-- 
Michael
Re: [HACKERS] Remove secondary checkpoint
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Simon Riggs
> This
> * reduces disk space requirements on master
> * removes a minor bug in fast failover
> * simplifies code

I welcome this patch.  I was wondering why PostgreSQL retains the
previous checkpoint.  (Although unrelated to this, I've also been
wondering why PostgreSQL flushes WAL to disk when writing a page in
the shared buffer, because PostgreSQL doesn't use WAL for undo.)

Here are my review comments.

(1)

-     * a) we keep WAL for two checkpoint cycles, back to the "prev" checkpoint.
+     * a) we keep WAL for only one checkpoint cycle (prior to PG11 we kept
+     *    WAL for two checkpoint cycles to allow us to recover from the
+     *    secondary checkpoint if the first checkpoint failed, though we
+     *    only did this on the master anyway, not on standby. Keeping just
+     *    one checkpoint simplifies processing and reduces disk space in
+     *    many smaller databases.)

      /*
-      * The last valid checkpoint record required for a streaming
-      * recovery exists in neither standby nor the primary.
+      * We used to attempt to go back to a secondary checkpoint
+      * record here, but only when not in standby_mode. We now
+      * just fail if we can't read the last checkpoint because
+      * this allows us to simplify processing around checkpoints.
       */

      if (fast_promote)
      {
-         checkPointLoc = ControlFile->prevCheckPoint;
+         /*
+          * It appears to be a bug that we used to use prevCheckpoint here
+          */
+         checkPointLoc = ControlFile->checkPoint;

-     XLByteToSeg(PriorRedoPtr, _logSegNo, wal_segment_size);
+     /* Trim from the last checkpoint, not the last - 1 */
+     XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);

I suggest removing references to the secondary checkpoint (the old
behavior) from the comments.  I'm afraid those could confuse new
developers.

(2)

@@ -8110,10 +8100,6 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
          ereport(LOG,
                  (errmsg("invalid primary checkpoint link in control file")));
          break;
@@ -8135,10 +8121,6 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
          ereport(LOG,
                  (errmsg("invalid primary checkpoint record")));
          break;
@@ -8154,10 -8136,6 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
          ereport(LOG,
                  (errmsg("invalid resource manager ID in primary checkpoint record")));
          break;

etc, etc.  "primary checkpoint" should be just "checkpoint", now that
the primary/secondary concept will disappear.

(3)

Should we change the default value of max_wal_size from 1 GB to a
smaller size?  I vote for "no" for performance.

Regards
Takayuki Tsunakawa
Re: [HACKERS] Remove secondary checkpoint
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Michael Paquier
> On Tue, Oct 24, 2017 at 5:58 PM, Tsunakawa, Takayuki wrote:
> > If the latest checkpoint record is unreadable (the WAL
> > segment/block/record is corrupt?), recovery from the previous checkpoint
> > would also stop at the latest checkpoint. And we don't need to replay the
> > WAL records between the previous checkpoint and the latest one, because
> > their changes are already persisted when the latest checkpoint was taken.
> > So, the user should just do pg_resetxlog and start the database server when
> > the recovery cannot find the latest checkpoint record and PANICs?
>
> Not necessarily. If a failure is detected when reading the last checkpoint,
> as you say recovery would begin at the checkpoint prior to that and would stop
> when reading the record of the last checkpoint; still one could use a
> recovery.conf with restore_command to fetch segments from a different
> source, like a static archive, as only the local segment may be corrupted.

Oh, you are right.  "If the crash recovery fails, perform recovery
from backup."  Anyway, I agree that the secondary checkpoint isn't
necessary.

Regards
Takayuki Tsunakawa
Re: [HACKERS] Remove secondary checkpoint
On Tue, Oct 24, 2017 at 5:58 PM, Tsunakawa, Takayuki wrote:
> If the latest checkpoint record is unreadable (the WAL segment/block/record
> is corrupt?), recovery from the previous checkpoint would also stop at the
> latest checkpoint. And we don't need to replay the WAL records between the
> previous checkpoint and the latest one, because their changes are already
> persisted when the latest checkpoint was taken. So, the user should just do
> pg_resetxlog and start the database server when the recovery cannot find the
> latest checkpoint record and PANICs?

Not necessarily.  If a failure is detected when reading the last
checkpoint, as you say recovery would begin at the checkpoint prior to
that and would stop when reading the record of the last checkpoint;
still, one could use a recovery.conf with restore_command to fetch
segments from a different source, like a static archive, as only the
local segment may be corrupted.

-- 
Michael
Re: [HACKERS] Remove secondary checkpoint
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tom Lane
> Doesn't it also make crash recovery less robust?  The whole point
> of that mechanism is to be able to cope if the latest checkpoint
> record is unreadable.

If the latest checkpoint record is unreadable (the WAL
segment/block/record is corrupt?), recovery from the previous
checkpoint would also stop at the latest checkpoint.  And we don't
need to replay the WAL records between the previous checkpoint and the
latest one, because their changes are already persisted when the
latest checkpoint was taken.  So, the user should just do pg_resetxlog
and start the database server when the recovery cannot find the latest
checkpoint record and PANICs?

Regards
Takayuki Tsunakawa
[HACKERS] Implementing pg_receivewal --no-sync
Hi all,

After thinking a bit on the subject, I have decided to submit a patch
to do $subject.  This makes pg_receivewal more consistent with
pg_basebackup.  This option is mainly useful for testing, something
that becomes way more doable since support for --endpos has been
added.

Unsurprisingly, --synchronous and --no-sync are incompatible options.

Thanks,
-- 
Michael

pg_receivewal_nosync.patch
Description: Binary data
Re: [HACKERS] per-session errors after interrupting CLUSTER pg_attrdef
On Fri, Oct 20, 2017 at 9:01 AM, Justin Pryzby wrote:
> This was briefly scary but seems to have been limited to my psql session (no
> other errors logged).  Issue with catcache (?)
>
> I realized that the backup job I'd kicked off was precluding the CLUSTER from
> running, but that CLUSTER was still holding lock and stalling everything else
> under the sun.
>
> ts=# \df qci_add*
> ERROR:  could not read block 8 in file "base/16400/999225102": read only 0 of 8192 bytes
> ts=# \dt+ pg_att
>
> ts=# \dt+ pg_attrdef
> ERROR:  could not read block 8 in file "base/16400/999225102": read only 0 of 8192 bytes

Perhaps that's too late, but which relation is this relfilenode
actually referring to?

-- 
Michael
Re: [HACKERS] unique index violation after pg_upgrade to PG10
On Tue, Oct 24, 2017 at 02:57:47PM -0700, Peter Geoghegan wrote:
> Maybe you could try verifying a different index on the same table with
> "heapallindexed", too. Perhaps that would fail in a more interesting
> way.

The only other index seems fine:

[pryzbyj@database ~]$ psql --port 5678 ts -txc "SELECT bt_index_parent_check('sites_pkey'::regclass::oid, heapallindexed=>True)"
bt_index_parent_check |

[pryzbyj@database ~]$ psql --port 5678 ts -txc "SELECT bt_index_check('sites_pkey'::regclass::oid, heapallindexed=>True)"
bt_index_check |

> You're using LVM snapshots -- I hope that you're aware that they're
> not guaranteed to be consistent across logical volumes. There are a
> few different ways that they could cause corruption like this if you
> weren't careful. (In general, I wouldn't recommend using LVM snapshots
> as any kind of backup solution.)

Right, I asked a coworker to create the snapshot before trying to fix
the immediate problem, as a forensic measure.  We have postgres on
just one LVM LV.

Justin
Re: [HACKERS] unique index violation after pg_upgrade to PG10
On Tue, Oct 24, 2017 at 01:48:55PM -0500, Kenneth Marshall wrote:
> I just dealt with a similar problem with pg_repack and a PostgreSQL 9.5 DB,
> the exact same error.  It seemed to be caused by a tuple visibility issue
> that allowed the "working" unique index to be built, even though a duplicate
> row existed.  Then the next pg_repack would fail with the error you got.

FTR, I was able to run the repack script several times without issue,
hitting that table each time.

Justin
Re: [HACKERS] Remove secondary checkpoint
On Mon, Oct 23, 2017 at 11:20 PM, Simon Riggs wrote:
> Remove the code that maintained two checkpoint's WAL files and all
> associated stuff.
>
> Try to avoid breaking anything else
>
> This
> * reduces disk space requirements on master
> * removes a minor bug in fast failover
> * simplifies code

In short, yeah!  I had a close look at the patch and noticed a couple
of issues.

+       else
        /*
-        * The last valid checkpoint record required for a streaming
-        * recovery exists in neither standby nor the primary.
+        * We used to attempt to go back to a secondary checkpoint
+        * record here, but only when not in standby_mode. We now
+        * just fail if we can't read the last checkpoint because
+        * this allows us to simplify processing around checkpoints.
         */
            ereport(PANIC,
                    (errmsg("could not locate a valid checkpoint record")));
-       }

Using brackets in this case is more elegant style IMO.  OK, there is
one line, but the comment is long so the block becomes
confusingly-shaped.

 /* Version identifier for this pg_control format */
-#define PG_CONTROL_VERSION 1003
+#define PG_CONTROL_VERSION 1100

Yes, this had to happen anyway in this release cycle.

backup.sgml describes the following:

    to higher segment numbers.  It's assumed that segment files whose
    contents precede the checkpoint-before-last are no longer of
    interest and can be recycled.

However this is not true anymore with this patch.

The documentation of pg_control_checkpoint() is not updated.  func.sgml
lists the tuple fields returned by this function.

You forgot to update pg_control_checkpoint in pg_proc.h.  prior_lsn is
still listed in the list of arguments returned.  And you need to update
as well the output list of types.

  * whichChkpt identifies the checkpoint (merely for reporting purposes).
- * 1 for "primary", 2 for "secondary", 0 for "other" (backup_label)
+ * 1 for "primary", 0 for "other" (backup_label)
+ * 2 for "secondary" is no longer supported
  */

I would suggest to just remove the reference to "secondary".

-    * Delete old log files (those no longer needed even for previous
-    * checkpoint or the standbys in XLOG streaming).
+    * Delete old log files and recycle them
     */

Here that's more "Delete or recycle old log files".  Recycling of a
WAL segment is its renaming into a newer segment.

You forgot to update this formula in xlog.c:

    distance = (2.0 + CheckPointCompletionTarget) * CheckPointDistanceEstimate;
    /* add 10% for good measure. */
    distance *= 1.10;

You can guess easily where the mistake is.

-       checkPointLoc = ControlFile->prevCheckPoint;
+       /*
+        * It appears to be a bug that we used to use prevCheckpoint here
+        */
+       checkPointLoc = ControlFile->checkPoint;

Er, no.  This is correct because we expect the prior checkpoint to
still be present in the event of a failure detecting the latest
checkpoint.

-- 
Michael
Re: [HACKERS] unique index violation after pg_upgrade to PG10
On Tue, Oct 24, 2017 at 1:11 PM, Justin Pryzby wrote:
> ..which I gather just verifies that the index is corrupt, not sure if there's
> anything else to do with it?  Note, we've already removed the duplicate rows.

Yes, the index itself is definitely corrupt -- this failed before the
new "heapallindexed" check even started.  Though it looks like that
would have failed too, if you got that far, since the index points to
a row that does not contain the same data.  (I only know this because
you dumped the heap tuple and the index tuple.)

Maybe you could try verifying a different index on the same table with
"heapallindexed", too.  Perhaps that would fail in a more interesting
way.

I don't know how pg_repack works in any detail, but I have a hard time
imagining it causing corruption like this, where a single B-Tree page
is corrupt (the high key invariant fails), possibly because of a torn
page (possibly due to recovery not replaying all the WAL needed, for
whatever reason).

You're using LVM snapshots -- I hope that you're aware that they're
not guaranteed to be consistent across logical volumes.  There are a
few different ways that they could cause corruption like this if you
weren't careful.  (In general, I wouldn't recommend using LVM
snapshots as any kind of backup solution.)

-- 
Peter Geoghegan
Re: [HACKERS] Domains and arrays and composites, oh my
I wrote:
> Anyway, PFA an updated patch that also fixes some conflicts with the
> already-committed arrays-of-domains patch.

I realized that the pending patch for jsonb_build_object doesn't actually
have any conflict with what I needed to touch here, so I went ahead and
fixed the JSON functions that needed fixing, along with hstore's
populate_record.

I ended up rewriting the argument-metadata-collection portions of
populate_record_worker and populate_recordset_worker rather heavily,
because I didn't like them at all: aside from not working for domains over
composite, they were pretty inefficient (redoing a lot of work on each
call for no good reason), and they were randomly different from each
other, resulting in json{b}_populate_recordset rejecting some cases that
worked in json{b}_populate_record.

I've also updated the documentation.  I think that this patch version is
done so far as the core code and contrib are concerned.  The PLs need
varying amounts of work, but as I said earlier, I think it would be better
to tackle those in separate patches instead of continuing to enlarge the
footprint of the core patch.  So, barring objection, I'd like to go ahead
and commit this.

			regards, tom lane

diff --git a/contrib/hstore/hstore_io.c b/contrib/hstore/hstore_io.c
index d828401..e999a8e 100644
*** a/contrib/hstore/hstore_io.c
--- b/contrib/hstore/hstore_io.c
*************** typedef struct RecordIOData
*** 752,757 ****
--- 752,759 ----
  {
  	Oid			record_type;
  	int32		record_typmod;
+ 	/* this field is used only if target type is domain over composite: */
+ 	void	   *domain_info;	/* opaque cache for domain checks */
  	int			ncolumns;
  	ColumnIOData columns[FLEXIBLE_ARRAY_MEMBER];
  } RecordIOData;
*************** hstore_from_record(PG_FUNCTION_ARGS)
*** 780,788 ****
  		Oid			argtype = get_fn_expr_argtype(fcinfo->flinfo, 0);

  		/*
! 		 * have no tuple to look at, so the only source of type info is the
! 		 * argtype.  The lookup_rowtype_tupdesc call below will error out if we
! 		 * don't have a known composite type oid here.
  		 */
  		tupType = argtype;
  		tupTypmod = -1;
--- 782,792 ----
  		Oid			argtype = get_fn_expr_argtype(fcinfo->flinfo, 0);

  		/*
! 		 * We have no tuple to look at, so the only source of type info is the
! 		 * argtype --- which might be domain over composite, but we don't care
! 		 * here, since we have no need to be concerned about domain
! 		 * constraints.  The lookup_rowtype_tupdesc_domain call below will
! 		 * error out if we don't have a known composite type oid here.
  		 */
  		tupType = argtype;
  		tupTypmod = -1;
*************** hstore_from_record(PG_FUNCTION_ARGS)
*** 793,804 ****
  	{
  		rec = PG_GETARG_HEAPTUPLEHEADER(0);

! 		/* Extract type info from the tuple itself */
  		tupType = HeapTupleHeaderGetTypeId(rec);
  		tupTypmod = HeapTupleHeaderGetTypMod(rec);
  	}

! 	tupdesc = lookup_rowtype_tupdesc(tupType, tupTypmod);
  	ncolumns = tupdesc->natts;

  	/*
--- 797,811 ----
  	{
  		rec = PG_GETARG_HEAPTUPLEHEADER(0);

! 		/*
! 		 * Extract type info from the tuple itself -- this will work even for
! 		 * anonymous record types.
! 		 */
  		tupType = HeapTupleHeaderGetTypeId(rec);
  		tupTypmod = HeapTupleHeaderGetTypMod(rec);
  	}

! 	tupdesc = lookup_rowtype_tupdesc_domain(tupType, tupTypmod, false);
  	ncolumns = tupdesc->natts;

  	/*
*************** hstore_populate_record(PG_FUNCTION_ARGS)
*** 943,951 ****
  		rec = NULL;

  		/*
! 		 * have no tuple to look at, so the only source of type info is the
! 		 * argtype.  The lookup_rowtype_tupdesc call below will error out if we
! 		 * don't have a known composite type oid here.
  		 */
  		tupType = argtype;
  		tupTypmod = -1;
--- 950,958 ----
  		rec = NULL;

  		/*
! 		 * We have no tuple to look at, so the only source of type info is the
! 		 * argtype.  The lookup_rowtype_tupdesc_domain call below will error
! 		 * out if we don't have a known composite type oid here.
  		 */
  		tupType = argtype;
  		tupTypmod = -1;
*************** hstore_populate_record(PG_FUNCTION_ARGS)
*** 957,963 ****
  		if (PG_ARGISNULL(1))
  			PG_RETURN_POINTER(rec);

! 		/* Extract type info from the tuple itself */
  		tupType = HeapTupleHeaderGetTypeId(rec);
  		tupTypmod = HeapTupleHeaderGetTypMod(rec);
  	}
--- 964,973 ----
  		if (PG_ARGISNULL(1))
  			PG_RETURN_POINTER(rec);

! 		/*
! 		 * Extract type info from the tuple itself -- this will work even for
! 		 * anonymous record types.
! 		 */
  		tupType = HeapTupleHeaderGetTypeId(rec);
  		tupTypmod = HeapTupleHeaderGetTypMod(rec);
  	}
*************** hstore_populate_record(PG_FUNCTION_ARGS)
*** 975,981 ****
  	if (HS_COUNT(hs) == 0 && rec)
  		PG_RETURN_POINTER(rec);

! 	tupdesc = lookup_rowtype_tupdesc(tupType, tupTypmod);
  	ncolumns = tupdesc->natts;

  	if (rec)
--- 985,995 ----
  	if (HS_COUNT(hs) == 0 && rec)
  		PG_RETURN_POINTER(rec);

! 	/*
! 	 * Lookup the input record's tupdesc.  For the moment, w
Re: [HACKERS] unique index violation after pg_upgrade to PG10
On Tue, Oct 24, 2017 at 12:31:49PM -0700, Peter Geoghegan wrote:
> On Tue, Oct 24, 2017 at 11:48 AM, Kenneth Marshall wrote:
> >> Really ? pg_repack "found" and was victim to the duplicate keys, and rolled
> >> back its work. The CSV logs clearly show that our application INSERTed rows
> >> which are duplicates.
> >>
> >> [pryzbyj@database ~]$ rpm -qa pg_repack10
> >> pg_repack10-1.4.2-1.rhel6.x86_64
> >>
> >> Justin
> >
> > Hi Justin,
> >
> > I just dealt with a similar problem with pg_repack and a PostgreSQL 9.5 DB,
> > the exact same error. It seemed to caused by a tuple visibility issue that
> > allowed the "working" unique index to be built, even though a duplicate row
> > existed. Then the next pg_repack would fail with the error you got. In our
> > case I needed the locality of reference to keep the DB performance acceptable
> > and it was not a critical issue if there was a duplicate. We would remove the
> > duplicates if we had a failure. We never had a problem with the NO pg_repack
> > scenarios.
>
> A new, enhanced version of the corruption detection tool amcheck is
> now available, and has both apt + yum packages available:
>
> https://github.com/petergeoghegan/amcheck
>
> Unlike the version in Postgres 10, this enhanced version (V1.2) has
> "heapallindexed" verification, which is really what you want here. If
> you install it, and run it against the unique index in question (with
> "heapallindexed" verification), that could help. It might provide a
> more useful diagnostic message.
>
> This is very new, so do let me know how you get on if you try it out.

I started an instance connected to a copy of the LVM snapshot I saved:

[pryzbyj@database ~]$ sudo -u postgres /usr/pgsql-10/bin/postmaster -c port=5678 -D /mnt/10/data

[pryzbyj@database amcheck]$ psql --port 5678 ts -c "SELECT bt_index_check('sites_idx'::regclass::oid, heapallindexed=>True)"
ERROR:  high key invariant violated for index "sites_idx"
DETAIL:  Index tid=(1,37) points to heap tid=(0,97) page lsn=0/0.

[pryzbyj@database amcheck]$ psql --port 5678 ts -c "SELECT bt_index_parent_check('sites_idx'::regclass::oid, heapallindexed=>True)"
ERROR:  high key invariant violated for index "sites_idx"
DETAIL:  Index tid=(1,37) points to heap tid=(0,97) page lsn=0/0.

ts=# SELECT * FROM page_header(get_raw_page('sites_idx', 1));
lsn       | 0/0
checksum  | 0
flags     | 0
lower     | 872
upper     | 1696
special   | 8176
pagesize  | 8192
version   | 4
prune_xid | 0

ts=# SELECT * FROM page_header(get_raw_page('sites', 0));
lsn       | 1FB/AC5A4908
checksum  | 0
flags     | 5
lower     | 436
upper     | 464
special   | 8192
pagesize  | 8192
version   | 4
prune_xid | 0

ts=# SELECT * FROM bt_page_items(get_raw_page('sites_idx', 1));
itemoffset | 48
ctid       | (1,37)
itemlen    | 32
nulls      | f
vars       | t
data       | 1b 43 52 43 4c 4d 54 2d 43 45 4d 53 30 0b 31 31 31 31 00 00 00 00 00 00

itemoffset | 37
ctid       | (0,97)
itemlen    | 24
nulls      | f
vars       | t
data       | 1b 43 52 43 4c 4d 54 2d 43 45 4d 53 30 03 00 00

..which I gather just verifies that the index is corrupt; not sure if there's
anything else to do with it?  Note, we've already removed the duplicate rows.

Justin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] unique index violation after pg_upgrade to PG10
On Tue, Oct 24, 2017 at 11:48 AM, Kenneth Marshall wrote: >> Really ? pg_repack "found" and was victim to the duplicate keys, and rolled >> back its work. The CSV logs clearly show that our application INSERTed rows >> which are duplicates. >> >> [pryzbyj@database ~]$ rpm -qa pg_repack10 >> pg_repack10-1.4.2-1.rhel6.x86_64 >> >> Justin > > Hi Justin, > > I just dealt with a similar problem with pg_repack and a PostgreSQL 9.5 DB, > the exact same error. It seemed to caused by a tuple visibility issue that > allowed the "working" unique index to be built, even though a duplicate row > existed. Then the next pg_repack would fail with the error you got. In our > case I needed the locality of reference to keep the DB performance acceptable > and it was not a critical issue if there was a duplicate. We would remove the > duplicates if we had a failure. We never had a problem with the NO pg_repack > scenarios. A new, enhanced version of the corruption detection tool amcheck is now available, and has both apt + yum packages available: https://github.com/petergeoghegan/amcheck Unlike the version in Postgres 10, this enhanced version (V1.2) has "heapallindexed" verification, which is really what you want here. If you install it, and run it against the unique index in question (with "heapallindexed" verification), that could help. It might provide a more useful diagnostic message. This is very new, so do let me know how you get on if you try it out. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Current int & float overflow checking is slow.
Robert Haas writes:
> On Tue, Oct 24, 2017 at 4:36 PM, Tom Lane wrote:
>> Yeah, but I lost the argument.  For better or worse, our expected
>> behavior is now that we throw errors.  You don't get to change that
>> just because it would save a few cycles.

> I don't know that we can consider the results of a discussion in 2006
> to be binding policy for the indefinite future.  A lot of things get
> relitigated more than once per decade on this mailing list, and if we
> know things now that we didn't know then (e.g. that one choice has a
> far more severe performance consequence than the other) that's
> reasonable justification for deciding to change our mind.

I don't like changing well-defined, user-visible query behavior for no
other reason than a performance gain (of a size that hasn't even been
shown to be interesting, btw).  Will we change it back in another ten
years if the performance tradeoff changes?

Also, if I recall the old discussion properly, one concern was getting
uniform behavior across different platforms.  I'm worried that if we do
what Andres suggests, we'll get behavior that is not only different but
platform-specific.  Now, to the extent that you believe that every modern
platform implements edge-case IEEE float behavior the same way, that
worry may be obsolete.  But I don't think I believe that.

			regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] unique index violation after pg_upgrade to PG10
On Tue, Oct 24, 2017 at 01:30:19PM -0500, Justin Pryzby wrote:
> On Tue, Oct 24, 2017 at 01:27:14PM -0500, Kenneth Marshall wrote:
> > On Tue, Oct 24, 2017 at 01:14:53PM -0500, Justin Pryzby wrote:
> > > Note:
> > > I run a script which does various combinations of ANALYZE/VACUUM (FULL/ANALYZE)
> > > following the upgrade, and a script runs nightly with REINDEX and pg_repack
> > > (and a couple of CLUSTER), so you should assume that any combination of those
> > > maintenance commands have been run.
> > >
> > > In our reindex/repack log I found the first error due to duplicates:
> > > Tue Oct 24 01:27:53 MDT 2017: sites: sites_idx(repack non-partitioned)...
> > > WARNING: Error creating index "public"."index_61764": ERROR: could not create unique index "index_61764"
> > > DETAIL: Key (site_office, site_location)=(CRCLMT-DOEMS0, 1120) is duplicated.
> > > WARNING: Skipping index swapping for "sites", since no new indexes built
> > > WARNING: repack failed for "sites_idx"
> > > reindex: warning, dropping invalid/unswapped index: index_61764
> >
> > Hi Justin,
> >
> > This sounds like a pg_repack bug and not a PostgreSQL bug. What version are
> > you running?
>
> Really?  pg_repack "found" and was victim to the duplicate keys, and rolled
> back its work.  The CSV logs clearly show that our application INSERTed rows
> which are duplicates.
>
> [pryzbyj@database ~]$ rpm -qa pg_repack10
> pg_repack10-1.4.2-1.rhel6.x86_64
>
> Justin

Hi Justin,

I just dealt with a similar problem with pg_repack and a PostgreSQL 9.5 DB,
the exact same error.  It seemed to be caused by a tuple visibility issue
that allowed the "working" unique index to be built, even though a
duplicate row existed.  Then the next pg_repack run would fail with the
error you got.  In our case I needed the locality of reference to keep the
DB performance acceptable, and it was not a critical issue if there was a
duplicate.  We would remove the duplicates if we had a failure.  We never
had a problem in the scenarios without pg_repack.

Regards,
Ken

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] unique index violation after pg_upgrade to PG10
On Tue, Oct 24, 2017 at 01:27:14PM -0500, Kenneth Marshall wrote: > On Tue, Oct 24, 2017 at 01:14:53PM -0500, Justin Pryzby wrote: > > Note: > > I run a script which does various combinations of ANALYZE/VACUUM > > (FULL/ANALYZE) > > following the upgrade, and a script runs nightly with REINDEX and pg_repack > > (and a couple of CLUSTER), so you should assume that any combination of > > those > > maintenance commands have been run. > > > > In our reindex/repack log I found the first error due to duplicates: > > Tue Oct 24 01:27:53 MDT 2017: sites: sites_idx(repack non-partitioned)... > > WARNING: Error creating index "public"."index_61764": ERROR: could not > > create unique index "index_61764" > > DETAIL: Key (site_office, site_location)=(CRCLMT-DOEMS0, 1120) is > > duplicated. > > WARNING: Skipping index swapping for "sites", since no new indexes built > > WARNING: repack failed for "sites_idx" > > reindex: warning, dropping invalid/unswapped index: index_61764 > > > > Hi Justin, > > This sounds like a pg_repack bug and not a PostgreSQL bug. What version are > you running? Really ? pg_repack "found" and was victim to the duplicate keys, and rolled back its work. The CSV logs clearly show that our application INSERTed rows which are duplicates. [pryzbyj@database ~]$ rpm -qa pg_repack10 pg_repack10-1.4.2-1.rhel6.x86_64 Justin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] unique index violation after pg_upgrade to PG10
On Tue, Oct 24, 2017 at 01:14:53PM -0500, Justin Pryzby wrote: > I upgrade another instance to PG10 yesterday and this AM found unique key > violations. > > Our application is SELECTing FROM sites WHERE site_location=$1, and if it > doesn't find one, INSERTs one (I know that's racy and not ideal). We ended up > with duplicate sites, despite a unique index. We removed the duplicate rows > and reindexed fine. This is just a heads up with all the detail I can fit in > a > mail (but there's more if needed). > ... > Note: > I run a script which does various combinations of ANALYZE/VACUUM > (FULL/ANALYZE) > following the upgrade, and a script runs nightly with REINDEX and pg_repack > (and a couple of CLUSTER), so you should assume that any combination of those > maintenance commands have been run. > > In our reindex/repack log I found the first error due to duplicates: > Tue Oct 24 01:27:53 MDT 2017: sites: sites_idx(repack non-partitioned)... > WARNING: Error creating index "public"."index_61764": ERROR: could not > create unique index "index_61764" > DETAIL: Key (site_office, site_location)=(CRCLMT-DOEMS0, 1120) is duplicated. > WARNING: Skipping index swapping for "sites", since no new indexes built > WARNING: repack failed for "sites_idx" > reindex: warning, dropping invalid/unswapped index: index_61764 > Hi Justin, This sounds like a pg_repack bug and not a PostgreSQL bug. What version are you running? Regards, Ken -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Current int & float overflow checking is slow.
On Tue, Oct 24, 2017 at 4:36 PM, Tom Lane wrote:
>> Does it? In plenty of cases getting infinity rather than an error is
>> just about as useful.
>> This was argued by a certain Tom Lane a few years back ;)
>> http://archives.postgresql.org/message-id/19208.1167246902%40sss.pgh.pa.us
>
> Yeah, but I lost the argument.  For better or worse, our expected
> behavior is now that we throw errors.  You don't get to change that
> just because it would save a few cycles.

I don't know that we can consider the results of a discussion in 2006
to be binding policy for the indefinite future.  A lot of things get
relitigated more than once per decade on this mailing list, and if we
know things now that we didn't know then (e.g. that one choice has a
far more severe performance consequence than the other), that's
reasonable justification for deciding to change our mind.  Also, it's
not like there were a million votes on one side vs. just you on the
other; reading the thread, it's not at all clear that you were in the
minority with that position.

That's not to say I necessarily support Andres's proposal.  Changing
query behavior is a big deal; we can't do it very often without causing
a lot of hassles for users (and maybe damaging our reputation for
stability in the process).  And it's not very clear to me that someone
who does a SUM(a * b) over many rows will be happy to get infinity
rather than an error.  It could be true, but I don't have the experience
to be sure of it -- and I'm a bit worried that if we change anything,
we'll only find out whether users like it after we cut the release.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] unique index violation after pg_upgrade to PG10
I upgraded another instance to PG10 yesterday and this AM found unique key
violations.

Our application is SELECTing FROM sites WHERE site_location=$1, and if it
doesn't find one, INSERTs one (I know that's racy and not ideal).  We
ended up with duplicate sites, despite a unique index.  We removed the
duplicate rows and reindexed fine.  This is just a heads up with all the
detail I can fit in a mail (but there's more if needed).

ts=# \d sites
 site_id       | integer | | not null | nextval('sites_site_id_seq'::regclass)
 site_office   | text    | | |
 site_location | text    | | |
[...]
Indexes:
    "sites_pkey" PRIMARY KEY, btree (site_id)
    "sites_idx" UNIQUE, btree (site_office, site_location)

ts=# SELECT site_office, site_location, count(*), min(site_id), max(site_id)
     FROM sites GROUP BY 1,2 HAVING count(*)>1 ORDER BY 1,2;
  site_office  | site_location | count | min | max
---------------+---------------+-------+-----+-----
 CRCLMT-DOEMS0 |               |     2 | 165 | 351
 CRCLMT-DOEMS0 | 1101          |     2 | 123 | 343
 CRCLMT-DOEMS0 | 1102          |     2 | 134 | 318
 CRCLMT-DOEMS0 | 1103          |     2 | 145 | 322
 CRCLMT-DOEMS0 | 1104          |     2 | 156 | 329

The duplicate site_ids mean this isn't an issue with row
version/visibility due to XIDs (right?).

ts=# SELECT 1 FROM sites WHERE site_office='CRCLMT-CEMS0' AND site_location='';
(0 rows)

ts=# SET enable_bitmapscan=off; SET enable_indexscan=off;
ts=# SELECT 1 FROM sites WHERE site_office='CRCLMT-CEMS0' AND site_location='';
-[ RECORD 1 ]
?column? | 1

So there's an issue with indices failing to return matching rows (and
thereby allowing insertion of duplicate, violating rows).  That's the only
table/index affected (and the only PG instance affected thus far).

Note regarding my pg_upgrade: 3 years ago, this was our first and smallest
customer, whom I upgraded off PG8.4 (to PG9.2 if I recall), and I did it
using pg_dump | pg_restore.  I believe, as a consequence, its postgres
database was in "C" locale and ASCII encoding.

So for the last few upgrades (9.3, .4, .5 and .6), I believe I've manually
used initdb --encoding followed by pg_upgrade (else it fails due to new
postgres/template DBs with different locale/encoding from the old ones).
This upgrade, I finally renamed the postgres DB (which has imported CSV
logs and one or two other things) and pg_dump|pg_restore'd it into a new
DB with UTF8 encoding, which allowed pg_upgrade to run without any special
initdb invocation.

I have an LVM snapshot and full CSV logs imported into a table.  I also
have a backup from 22:00 which doesn't have duplicate sites.  Those seem
to have been inserted by our application around 00:30.

These are the IDs which inserted duplicate rows:

postgres=# SELECT session_id, max(session_line) FROM postgres_log_2017_10_24_
           WHERE message LIKE 'statement: SELECT site_id FROM sites WHERE%' GROUP BY 1;
  session_id   | max
---------------+------
 59eedfb1.5cea |  714
 59eedfb5.5cf1 | 1741
(2 rows)

postgres=# SELECT log_time, session_id, session_line, left(message,333) FROM postgres_log
           WHERE (session_id='59eedfb1.5cea' OR session_id='59eedfb5.5cf1')
           AND (session_line<6 OR message LIKE '%INSERT INTO site%') ORDER BY 1,2,3;
-[ RECORD 4 ]+--
log_time     | 2017-10-24 00:37:37.888-06
session_id   | 59eedfb1.5cea
session_line | 4
left         | statement: SELECT site_id FROM sites WHERE
             | site_office = 'CRCLMT-DOEMS0' AND site_location = '1203'
-[ RECORD 5 ]+--
log_time     | 2017-10-24 00:37:37.89-06
session_id   | 59eedfb1.5cea
session_line | 5
left         | statement: INSERT INTO sites (site_office,site_location,site_alias)
             | VALUES ('CRCLMT-DOEMS0', '1203', (SELECT site_id FROM sites
             | WHERE site_office = 'CRCLMT-CEMS0' AND site_location = '1203'))

Note:
I run a script which does various combinations of ANALYZE/VACUUM
(FULL/ANALYZE) following the upgrade, and a script runs nightly with
REINDEX and pg_repack (and a couple of CLUSTER), so you should assume that
any combination of those maintenance commands have been run.

In our reindex/repack log I found the first error due to duplicates:

Tue Oct 24 01:27:53 MDT 2017: sites: sites_idx(repack non-partitioned)...
WARNING: Error creating index "public"."index_61764": ERROR: could not create unique index "index_61764"
DETAIL: Key (site_office, site_location)=
[HACKERS] Deadlock in ALTER SUBSCRIPTION REFRESH PUBLICATION
Parallel execution of ALTER SUBSCRIPTION REFRESH PUBLICATION at several
nodes may cause deadlock:

knizhnik  1480  0.0  0.1 1417532 16496 ? Ss 20:01 0:00 postgres: bgworker: logical replication worker for subscription 16589 sync 16720 waiting
knizhnik  1481  0.0  0.1 1417668 17668 ? Ss 20:01 0:00 postgres: knizhnik postgres 127.0.0.1(36378) idle
knizhnik  1484  0.0  0.1 1406952 16676 ? Ss 20:01 0:00 postgres: knizhnik postgres 127.0.0.1(56194) ALTER SUBSCRIPTION waiting
knizhnik  1486  0.0  0.1 1410040 21268 ? Ss 20:01 0:00 postgres: wal sender process knizhnik 127.0.0.1(36386) idle
knizhnik  1487  0.0  0.1 1417532 16736 ? Ss 20:01 0:00 postgres: bgworker: logical replication worker for subscription 16589 sync 16774
knizhnik  1489  0.0  0.1 1410040 21336 ? Ss 20:01 0:00 postgres: wal sender process knizhnik 127.0.0.1(36388) idle
knizhnik  1510  0.0  0.1 1406952 16820 ? Ss 20:01 0:00 postgres: knizhnik postgres 127.0.0.1(56228) ALTER SUBSCRIPTION waiting
knizhnik  1530  0.0  0.1 1406952 16736 ? Ss 20:01 0:00 postgres: knizhnik postgres 127.0.0.1(56260) ALTER SUBSCRIPTION waiting
knizhnik  1534  0.0  0.0 1417504 14360 ? Ss 20:01 0:00 postgres: bgworker: logical replication worker for subscription 16590
knizhnik  1537  0.0  0.1 1409164 16920 ? Ss 20:01 0:00 postgres: wal sender process knizhnik 127.0.0.1(45054) idle
knizhnik  1545  0.0  0.0 1417344 14576 ? Ss 20:01 0:00 postgres: bgworker: logical replication worker for subscription 16592
knizhnik  1546  0.0  0.1 1409168 16588 ? Ss 20:01 0:00 postgres: wal sender process knizhnik 127.0.0.1(56274) idle
knizhnik  1547  0.0  0.0 1417348 14332 ? Ss 20:01 0:00 postgres: bgworker: logical replication worker for subscription 16592 sync 16606
knizhnik  1549  0.0  0.0 1409004 14512 ? Ss 20:01 0:00 postgres: wal sender process knizhnik 127.0.0.1(56276) idle in transaction waiting
knizhnik  1553  0.0  0.0 1417348 14540 ? Ss 20:01 0:00 postgres: bgworker: logical replication worker for subscription 16592
knizhnik  1554  0.0  0.1 1409168 16596 ? Ss 20:01 0:00 postgres: wal sender process knizhnik 127.0.0.1(56280) idle
knizhnik  1555  0.0  0.0 1417348 14332 ? Ss 20:01 0:00 postgres: bgworker: logical replication worker for subscription 16592 sync 16660
knizhnik  1556  0.0  0.0 1409004 14520 ? Ss 20:01 0:00 postgres: wal sender process knizhnik 127.0.0.1(56282) idle in transaction waiting
knizhnik  1558  0.0  0.0 1417348 14460 ? Ss 20:01 0:00 postgres: bgworker: logical replication worker for subscription 16592 sync 16696
knizhnik  1560  0.0  0.0 1409004 14516 ? Ss 20:01 0:00 postgres: wal sender process knizhnik 127.0.0.1(56286) idle in transaction waiting
knizhnik  1562  0.0  0.0 1417348 14084 ? Ss 20:01 0:00 postgres: bgworker: logical replication worker for subscription 16592 sync 16732
knizhnik  1567  0.0  0.0 1409004 14516 ? Ss 20:01 0:00 postgres: wal sender process knizhnik 127.0.0.1(56288) idle in transaction waiting

knizhnik@knizhnik:~$ gdb -p 1556
(gdb) bt
#0  0x7f7b991569b3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1  0x0087185a in WaitEventSetWaitBlock (set=0x246d1e8, cur_timeout=-1, occurred_events=0x73a43580, nevents=1) at latch.c:1048
#2  0x00871716 in WaitEventSetWait (set=0x246d1e8, timeout=-1, occurred_events=0x73a43580, nevents=1, wait_event_info=50331652) at latch.c:1000
#3  0x00870e6c in WaitLatchOrSocket (latch=0x7f7b975daba4, wakeEvents=1, sock=-1, timeout=-1, wait_event_info=50331652) at latch.c:385
#4  0x00870d3e in WaitLatch (latch=0x7f7b975daba4, wakeEvents=1, timeout=0, wait_event_info=50331652) at latch.c:339
#5  0x0088995c in ProcSleep (locallock=0x24764d0, lockMethodTable=0xc17420 ) at proc.c:1255
#6  0x008828cb in WaitOnLock (locallock=0x24764d0, owner=0x2517c78) at lock.c:1702
#7  0x0088157e in LockAcquireExtended (locktag=0x73a439a0, lockmode=5, sessionLock=0 '\000', dontWait=0 '\000', reportMemoryError=1 '\001') at lock.c:998
#8  0x00880ba8 in LockAcquire (locktag=0x73a439a0, lockmode=5, sessionLock=0 '\000', dontWait=0 '\000') at lock.c:688
#9  0x0087f8dc in XactLockTableWait (xid=624, rel=0x0, ctid=0x0, oper=XLTW_None) at lmgr.c:587
#10 0x0082f524 in SnapBuildWaitSnapshot (running=0x246d1b0, cutoff=632) at snapbuild.c:1400
#11 0x0082f2f6 in SnapBuildFindSnapshot (builder=0x254b550, lsn=24410008, running=0x246d1b0) at snapbuild.c:1311
#12 0x0082ee67 in SnapBuildProcessRunningXacts (builder=0x254b550, lsn=24410008, running=0x246d1b0) at snapbuild.c:1106
#13 0x0081c458 in DecodeStandbyOp (ctx=0x25334b0, buf=0x73a43b10) at decode.c:314
#14 0x0081bf39 in LogicalDecodingProcessRecord (ctx=0x25334b0, record=0x25
Re: [HACKERS] CurTransactionContext freed before transaction COMMIT ???
> This sounds broken on its face --- if you want stuff to survive to
> top-level commit, you need to keep it in TopTransactionContext.
> CurTransactionContext might be a subtransaction's context that will go
> away at subtransaction commit/abort.

https://github.com/postgres/postgres/blob/master/src/backend/utils/mmgr/README

From the above README:

    CurTransactionContext --- this holds data that has to survive until the
    end of the current transaction, and in particular will be needed at
    top-level transaction commit.  When we are in a top-level transaction
    this is the same as TopTransactionContext, but in subtransactions it
    points to a child context.  It is important to understand that if a
    subtransaction aborts, its CurTransactionContext is thrown away after
    finishing the abort processing; but a committed subtransaction's
    CurTransactionContext is kept until top-level commit (unless of course
    one of the intermediate levels of subtransaction aborts).  This ensures
    that we do not keep data from a failed subtransaction longer than
    necessary.  Because of this behavior, you must be careful to clean up
    properly during subtransaction abort --- the subtransaction's state must
    be delinked from any pointers or lists kept in upper transactions, or
    you will have dangling pointers leading to a crash at top-level commit.
    An example of data kept here is pending NOTIFY messages, which are sent
    at top-level commit, but only if the generating subtransaction did not
    abort.

--> So even if a subtransaction is committed, the subtransaction's
CurTransactionContext is kept until top-level commit.

--> If a subtransaction is aborted, we handle it (clearing the data) via
RegisterSubXactCallback().

And in our particular use case (which gave the segmentation fault), we
didn't issue any subtransaction.  It was a single-insert transaction.
Even then it resulted in some kind of context free.

Regards
G. Sai Ram
Re: [HACKERS] Remove secondary checkpoint
On 24 October 2017 at 09:50, Tom Lane wrote:
> Simon Riggs writes:
>> Remove the code that maintained two checkpoint's WAL files and all
>> associated stuff.
>
>> Try to avoid breaking anything else
>
>> This
>> * reduces disk space requirements on master
>> * removes a minor bug in fast failover
>> * simplifies code
>
> Doesn't it also make crash recovery less robust?  The whole point
> of that mechanism is to be able to cope if the latest checkpoint
> record is unreadable.  If you want to toss that overboard, I think
> you need to make the case why we don't need it, not just post a
> patch removing it.  *Of course* the code is simpler without it.
> That's utterly irrelevant.  The code would be even simpler with
> no crash recovery at all ... but we're not going there.

Well, the mechanism has already been partially removed, since we don't
maintain two checkpoints on a standby.  So all I'm proposing is that we
remove the other half.

I've not seen it myself, nor can I find an example online, where the
primary checkpoint failed yet the secondary did not also fail from the
same cause.

If that is a real possibility, now that we have pg_waldump we can easily
search for a different checkpoint and start from there instead, which is
a more flexible approach.

If you didn't save your WAL and don't have any other form of backup,
relying on the secondary checkpoint is not exactly a safe bet.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Current int & float overflow checking is slow.
Andres Freund writes:
> On 2017-10-24 10:09:09 -0400, Tom Lane wrote:
>> There's an ancient saying that code can be arbitrarily fast if it
>> doesn't have to get the right answer.  I think this proposal falls
>> in that category.

> Does it? In plenty of cases getting infinity rather than an error is
> just about as useful.
> This was argued by a certain Tom Lane a few years back ;)
> http://archives.postgresql.org/message-id/19208.1167246902%40sss.pgh.pa.us

Yeah, but I lost the argument.  For better or worse, our expected
behavior is now that we throw errors.  You don't get to change that
just because it would save a few cycles.

>> SIGFPE isn't going to be easy to recover from, nor portable.

> Hm? A trivial hack implementing the above survives the regression test,
> with the exception of one output change because some functions currently
> do *not* check for overflow. What's the issue you're concerned about?

The real problem with it is that it's a process-wide setting, and would
for example probably break PL/R, or other libraries that are not
expecting to lose control to overflows.

			regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CurTransactionContext freed before transaction COMMIT ???
Gaddam Sai Ram writes: > We are implementing in-memory index. As a part of that, during > index callbacks, under CurTransactionContext, we cache all the DMLs of a > transaction in dlist(postgres's doubly linked list). > We registered transaction callback, and in transaction pre-commit > event(XACT_EVENT_PRE_COMMIT), we iterate through all cached DMLs(dlist) and > populate in my in-memory data structure. This sounds broken on its face --- if you want stuff to survive to top-level commit, you need to keep it in TopTransactionContext. CurTransactionContext might be a subtransaction's context that will go away at subtransaction commit/abort. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Current int & float overflow checking is slow.
On 2017-10-24 10:09:09 -0400, Tom Lane wrote:
> Andres Freund writes:
>> There's no comparable overflow handling to the above integer
>> intrinsics. But I think we can still do a lot better. Two very
>> different ways:
>
>> 1) Just give up on detecting overflows for floats. Generating inf in
>>    these cases actually seems entirely reasonable. We already don't
>>    detect them in a bunch of cases anyway. I can't quite parse the
>>    standard's language around this.
>
> There's an ancient saying that code can be arbitrarily fast if it
> doesn't have to get the right answer.  I think this proposal falls
> in that category.

Does it? In plenty of cases getting infinity rather than an error is
just about as useful.

This was argued by a certain Tom Lane a few years back ;)
http://archives.postgresql.org/message-id/19208.1167246902%40sss.pgh.pa.us

>> 2) Use platform specific float exception handling where available. We
>>    could at backend start, and in FloatExceptionHandler(), use
>>    feenableexcept() (windows has similar) to trigger SIGFPE on float
>>    overflow.
>
> SIGFPE isn't going to be easy to recover from, nor portable.

Hm? A trivial hack implementing the above survives the regression test,
with the exception of one output change because some functions currently
do *not* check for overflow.  What's the issue you're concerned about?

The portability indeed is a problem.

> I think what you actually want to do is *disable* SIGFPE (see
> feholdexcept), and then have individual functions use feclearexcept
> and fetestexcept.  These functions were standardized by C99 so
> they should be pretty widely available ... of course, whether they
> actually are widely portable remains to be seen.  Whether they're
> faster than what we're doing now also remains to be seen.

I tested it, and they're fairly slow on at least gcc-7 + glibc 2.24.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] CurTransactionContext freed before transaction COMMIT ???
Hello people, We are implementing an in-memory index. As a part of that, during index callbacks, under CurTransactionContext, we cache all the DMLs of a transaction in a dlist (postgres's doubly linked list). We registered a transaction callback, and in the transaction pre-commit event (XACT_EVENT_PRE_COMMIT), we iterate through all cached DMLs (the dlist) and populate our in-memory data structure. --> For a detailed explanation of our implementation, please look into the below thread. https://www.postgresql.org/message-id/15f4ea99b34.e69a4e1b633.8937474039794097334%40zohocorp.com --> It was working fine until I was greeted with a segmentation fault while accessing the dlist! --> On further examination I found that the dlist head is not NULL and is pointing to some address, but the container I extracted is pointing to 0x7f7f7f7f7f7f7f5f, and I was not able to access any members in my container. --> https://wiki.postgresql.org/wiki/Developer_FAQ#Why_are_my_variables_full_of_0x7f_bytes.3F From the above wiki, we came to the conclusion that the memory context in which that dlist was present was freed. And yes, CLOBBER_FREED_MEMORY is enabled by passing --enable-cassert to enable asserts in my build. --> Normally the memory context inside XACT_EVENT_PRE_COMMIT is TopTransactionContext, but in this particular case the memory context was 'MessageContext'. A little unusual! Not sure why! --> So basically CurTransactionContext shouldn't get freed before transaction COMMIT. --> So what has actually happened here??? Kindly help us with your insights!
For your reference, a code snippet of our implementation is pasted below:

/*** Code snippet starts ***/

dlist_head DMLInfo = {{NULL, NULL}};

Insert_callback()
{
    old_context = MemoryContextSwitchTo(CurTransactionContext);

    GraphIndex_InsertInfo *current;

    current = (GraphIndex_InsertInfo *) palloc(sizeof(GraphIndex_InsertInfo));
    current->tableOid = tableOid;
    current->subTransactionID = GetCurrentSubTransactionId();
    current->colValue = (long) values[0];    // Varies with column type
    current->ctid = ctid;

    dlist_push_head(&DMLInfo, &current->list_node);

    MemoryContextSwitchTo(old_context);
    return TRUE;
}

static void
GraphIndex_xactCallback(XactEvent event, void *arg)
{
    GraphIndex_Table *tableInfo;
    GraphIndex_InsertInfo *current;
    dlist_iter iter;

    switch (event)
    {
        case XACT_EVENT_PRE_PREPARE:
        case XACT_EVENT_PRE_COMMIT:
            PG_TRY();
            {
                if (DMLInfo.head.next)
                {
                    dlist_reverse_foreach(iter, &DMLInfo)
                    {
                        current = dlist_container(GraphIndex_InsertInfo, list_node, iter.cur);
                        DMLProcessingFunction(current);  // This is giving me the ERROR (while accessing members of current)
                    }
                    DMLInfo.head.next = NULL;
                }
            }
            PG_CATCH();
            {
                /* Do cleanup */
            }
            PG_END_TRY();
            break;

        case XACT_EVENT_ABORT:
        case XACT_EVENT_PARALLEL_ABORT:
            /* No cleanup necessary (no structure created) */
            break;

        default:
            return;
    }
}

/*** Code snippet ends ***/

Regards
G. Sai Ram
Re: [HACKERS] Current int & float overflow checking is slow.
Andres Freund writes: > There's no comparable overflow handling to the above integer > intrinsics. But I think we can still do a lot better. Two very different > ways: > 1) Just give up on detecting overflows for floats. Generating inf in >these cases actually seems entirely reasonable. We already don't >detect them in a bunch of cases anyway. I can't quite parse the >standard's language around this. There's an ancient saying that code can be arbitrarily fast if it doesn't have to get the right answer. I think this proposal falls in that category. > 2) Use platform specific float exception handling where available. We >could at backend start, and in FloatExceptionHandler(), us >feenableexcept() (windows has similar) to trigger SIGFPE on float >overflow. SIGFPE isn't going to be easy to recover from, nor portable. I think what you actually want to do is *disable* SIGFPE (see feholdexcept), and then have individual functions use feclearexcept and fetestexcept. These functions were standardized by C99 so they should be pretty widely available ... of course, whether they actually are widely portable remains to be seen. Whether they're faster than what we're doing now also remains to be seen. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Remove secondary checkpoint
On 2017-10-24 09:50:12 -0400, Tom Lane wrote: > Simon Riggs writes: > > Remove the code that maintained two checkpoint's WAL files and all > > associated stuff. > > > Try to avoid breaking anything else > > > This > > * reduces disk space requirements on master > > * removes a minor bug in fast failover > > * simplifies code > > Doesn't it also make crash recovery less robust? I think it does the contrary. The current mechanism is, in my opinion, downright dangerous: https://www.postgresql.org/message-id/20160201235854.go8...@awork2.anarazel.de Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Help needed regarding DSA based in-memory index!
Hi Munro, Thanks for cautioning us about possible memory leaks (during error cases) in the case of long-lived DSA segments (have a look at the below thread for more details). https://www.postgresql.org/message-id/CAEepm%3D3c4WAtSQG4tAF7Y_VCnO5cKh7KuFYZhpKbwGQOF%3DdZ4A%40mail.gmail.com Actually we are following an approach to avoid these DSA memory leaks. Let me explain our implementation; please validate and correct us in case we missed anything.

Implementation: Basically we have to put our index data into memory (index column value vs. ctid), which we get in the aminsert callback function. Coming to the implementation, in the aminsert callback function:

- We switch to CurTransactionContext.
- We cache the DMLs of a transaction in a dlist (global, per process). Even if different clients work in parallel, it won't be a problem, because every client gets one dlist in a separate process and it'll have its own CurTransactionContext.
- We have registered a transaction callback (using the RegisterXactCallback() function). During the pre-commit event (XACT_EVENT_PRE_COMMIT), we populate all the transaction-specific DMLs (from the dlist) into our in-memory index (DSA), obviously inside a PG_TRY/PG_CATCH block.
- In case we get some errors (because of dsa_allocate() or something else) while processing the dlist (while populating the in-memory index), we clean up, in the PG_CATCH block, the DSA memory that was allocated/used till that point.
- During other error cases, the transaction typically gets aborted and the PRE_COMMIT event is not called, so we don't touch DSA at that time. Hence there is no need to bother about leaks. Even the subtransaction case is handled, with subtransaction callbacks.
- CurTransactionContext (the dlist, basically) is automatically cleared after that particular transaction.

I want to know if this approach is good and works well in all cases. Kindly provide your feedback on this. Regards G. Sai Ram
Re: [HACKERS] Remove secondary checkpoint
Simon Riggs writes: > Remove the code that maintained two checkpoint's WAL files and all > associated stuff. > Try to avoid breaking anything else > This > * reduces disk space requirements on master > * removes a minor bug in fast failover > * simplifies code Doesn't it also make crash recovery less robust? The whole point of that mechanism is to be able to cope if the latest checkpoint record is unreadable. If you want to toss that overboard, I think you need to make the case why we don't need it, not just post a patch removing it. *Of course* the code is simpler without it. That's utterly irrelevant. The code would be even simpler with no crash recovery at all ... but we're not going there. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Transform for pl/perl
There are some points I should mention: 1. {"1":1}::jsonb is transformed into the HV {"1"=>"1"}, while ["1","2"]::jsonb is transformed into the AV ["1", "2"]. 2. If a numeric value appears in the jsonb, it will be transformed to an SVnv through a string (Numeric->String->SV->SVnv). Not the best solution, but as far as I understand this is the usual practice in PostgreSQL for serializing and de-serializing Numerics. 3. An SVnv is transformed into jsonb through a string (SVnv->String->Numeric). An example may also be helpful to understand the extension. So, as an example, the function "test" transforms incoming jsonb into Perl, transforms it back into jsonb, and returns it. create extension jsonb_plperl cascade; create or replace function test(val jsonb) returns jsonb transform for type jsonb language plperl as $$ return $_[0]; $$; select test('{"1":1,"example": null}'::jsonb); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Current int & float overflow checking is slow.
We already know this integer overflow checking is non-standard and compilers keep trying to optimize it out. Our only strategy to defeat that depends on compiler flags like -fwrapv that vary by compiler and may or may not be working on less well tested compilers. So if there's a nice, readable, and convenient way to portably use CPU flags, that would be brilliant. And I'm not too concerned if it doesn't run on VAX. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
On Tue, Oct 24, 2017 at 10:56 AM, Ivan Kartyshov wrote: > Hello. I made some bugfixes and rewrite the patch. > > Simon Riggs писал 2017-09-05 14:44: > >> As Alexander says, simply skipping truncation if standby is busy isn't >> a great plan. >> >> If we defer an action on standby replay, when and who will we apply >> it? What happens if the standby is shutdown or crashes while an action >> is pending. >> > > After crash standby server will go to the nearest checkout and will replay > all his wal`s to reach consistent state and then open for read-only load. > (all collected wal`s will be replayed in right way, I meant wal`s that > truncates table and wal`s that make table grow) --- a/src/backend/storage/lmgr/lock.c > +++ b/src/backend/storage/lmgr/lock.c > @@ -49,6 +49,8 @@ > #include "utils/ps_status.h" > #include "utils/resowner_private.h" > > +#include > +#include > > /* This configuration variable is used to set the lock table size */ > int max_locks_per_xact; /* set by guc.c */ Why do you need these includes? + FreeFakeRelcacheEntry(rel); > + } else > + { > + XLogFlush(lsn); > + } This code violates our coding style. According to it, "else" shouldn't share its line with braces. Also, patch definitely needs some general comment explaining what's going on. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] [POC] hash partitioning
On Tue, Oct 24, 2017 at 5:00 PM, Andres Freund wrote: > On 2017-10-24 12:43:12 +0530, amul sul wrote: >> I tried to get suggested SMHasher[1] test result for the hash_combine >> for 32-bit and 64-bit version. >> >> SMHasher works on hash keys of the form {0}, {0,1}, {0,1,2}... up to >> N=255, using 256-N as the seed, for the hash_combine testing we >> needed two hash value to be combined, for that, I've generated 64 >> and 128-bit hash using cityhash functions[2] for the given smhasher >> key then split in two part to test 32-bit and 64-bit hash_combine >> function respectively. Attached patch for SMHasher code changes & >> output of 32-bit and 64-bit hash_combine testing. Note that I have >> skipped speed test this test which is irrelevant here. >> >> By referring other hash function results [3], we can see that hash_combine >> test results are not bad either. >> >> Do let me know if current testing is not good enough or if you want me to do >> more testing, thanks. > > This looks very good! Both the tests you did, and the results for > hash_combineXX. I therefore think we can go ahead with that formulation > of hash_combine64? > Thanks, Andres. Yes we can, I've added your suggested hash_combine64 in the latest patch[1]. Regards, Amul 1] https://postgr.es/m/CAAJ_b97R2rJinGPAVmZZzpNV%3D-5BgYFxDfY9HYdM1bCYJFGmQw%40mail.gmail.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [POC] hash partitioning
On 2017-10-24 12:43:12 +0530, amul sul wrote: > I tried to get suggested SMHasher[1] test result for the hash_combine > for 32-bit and 64-bit version. > > SMHasher works on hash keys of the form {0}, {0,1}, {0,1,2}... up to > N=255, using 256-N as the seed, for the hash_combine testing we > needed two hash value to be combined, for that, I've generated 64 > and 128-bit hash using cityhash functions[2] for the given smhasher > key then split in two part to test 32-bit and 64-bit hash_combine > function respectively. Attached patch for SMHasher code changes & > output of 32-bit and 64-bit hash_combine testing. Note that I have > skipped speed test this test which is irrelevant here. > > By referring other hash function results [3], we can see that hash_combine > test results are not bad either. > > Do let me know if current testing is not good enough or if you want me to do > more testing, thanks. This looks very good! Both the tests you did, and the results for hash_combineXX. I therefore think we can go ahead with that formulation of hash_combine64? Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Transform for pl/perl
Hello. Please, check out jsonb transform (https://www.postgresql.org/docs/9.5/static/sql-createtransform.html) for pl/perl language I've implemented.diff --git a/contrib/Makefile b/contrib/Makefile index 8046ca4..53d44fe 100644 --- a/contrib/Makefile +++ b/contrib/Makefile @@ -75,9 +75,9 @@ ALWAYS_SUBDIRS += sepgsql endif ifeq ($(with_perl),yes) -SUBDIRS += hstore_plperl +SUBDIRS += hstore_plperl jsonb_plperl else -ALWAYS_SUBDIRS += hstore_plperl +ALWAYS_SUBDIRS += hstore_plperl jsonb_plperl endif ifeq ($(with_python),yes) diff --git a/contrib/jsonb_plperl/Makefile b/contrib/jsonb_plperl/Makefile new file mode 100644 index 000..8c427c5 --- /dev/null +++ b/contrib/jsonb_plperl/Makefile @@ -0,0 +1,40 @@ +# contrib/jsonb_plperl/Makefile + +MODULE_big = jsonb_plperl +OBJS = jsonb_plperl.o $(WIN32RES) +PGFILEDESC = "jsonb_plperl - jsonb transform for plperl" + +PG_CPPFLAGS = -I$(top_srcdir)/src/pl/plperl + +EXTENSION = jsonb_plperlu jsonb_plperl +DATA = jsonb_plperlu--1.0.sql jsonb_plperl--1.0.sql + +REGRESS = jsonb_plperl jsonb_plperlu + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/jsonb_plperl +top_builddir = ../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif + +# We must link libperl explicitly +ifeq ($(PORTNAME), win32) +# these settings are the same as for plperl +override CPPFLAGS += -DPLPERL_HAVE_UID_GID -Wno-comment +# ... see silliness in plperl Makefile ... +SHLIB_LINK += $(sort $(wildcard ../../src/pl/plperl/libperl*.a)) +else +rpathdir = $(perl_archlibexp)/CORE +SHLIB_LINK += $(perl_embed_ldflags) +endif + +# As with plperl we need to make sure that the CORE directory is included +# last, probably because it sometimes contains some header files with names +# that clash with some of ours, or with some that we include, notably on +# Windows. 
+override CPPFLAGS := $(CPPFLAGS) $(perl_embed_ccflags) -I$(perl_archlibexp)/CORE diff --git a/contrib/jsonb_plperl/expected/jsonb_plperl.out b/contrib/jsonb_plperl/expected/jsonb_plperl.out new file mode 100644 index 000..7a85361 --- /dev/null +++ b/contrib/jsonb_plperl/expected/jsonb_plperl.out @@ -0,0 +1,76 @@ +CREATE EXTENSION jsonb_plperl CASCADE; +NOTICE: installing required extension "plperl" +-- test hash -> jsonb +CREATE FUNCTION testHVToJsonb() RETURNS jsonb +LANGUAGE plperl +TRANSFORM FOR TYPE jsonb +AS $$ +$val = {a => 1, b => 'boo', c => undef}; +return $val; +$$; +SELECT testHVToJsonb(); + testhvtojsonb +- + {"a": 1, "b": "boo", "c": null} +(1 row) + +-- test array -> jsonb +CREATE FUNCTION testAVToJsonb() RETURNS jsonb +LANGUAGE plperl +TRANSFORM FOR TYPE jsonb +AS $$ +$val = [{a => 1, b => 'boo', c => undef}, {d => 2}]; +return $val; +$$; +SELECT testAVToJsonb(); +testavtojsonb +- + [{"a": 1, "b": "boo", "c": null}, {"d": 2}] +(1 row) + +-- test scalar -> jsonb +CREATE FUNCTION testSVToJsonb() RETURNS jsonb +LANGUAGE plperl +TRANSFORM FOR TYPE jsonb +AS $$ +$val = 1; +return $val; +$$; +SELECT testAVToJsonb(); +testavtojsonb +- + [{"a": 1, "b": "boo", "c": null}, {"d": 2}] +(1 row) + +-- test jsonb -> scalar -> jsonb +CREATE FUNCTION testSVToJsonb2(val jsonb) RETURNS jsonb +LANGUAGE plperl +TRANSFORM FOR TYPE jsonb +AS $$ +return $_[0]; +$$; +SELECT testSVToJsonb2('1'); + testsvtojsonb2 + + 1 +(1 row) + +SELECT testSVToJsonb2('[1,2,3]'); + testsvtojsonb2 + + [1, 2, 3] +(1 row) + +SELECT testSVToJsonb2('{"1":{"2":[3,4,5]},"2":3}'); + testsvtojsonb2 +- + {"1": {"2": [3, 4, 5]}, "2": 3} +(1 row) + +DROP EXTENSION plperl CASCADE; +NOTICE: drop cascades to 5 other objects +DETAIL: drop cascades to extension jsonb_plperl +drop cascades to function testhvtojsonb() +drop cascades to function testavtojsonb() +drop cascades to function testsvtojsonb() +drop cascades to function testsvtojsonb2(jsonb) diff --git a/contrib/jsonb_plperl/expected/jsonb_plperlu.out 
b/contrib/jsonb_plperl/expected/jsonb_plperlu.out new file mode 100644 index 000..6d4be1c --- /dev/null +++ b/contrib/jsonb_plperl/expected/jsonb_plperlu.out @@ -0,0 +1,33 @@ +CREATE EXTENSION jsonb_plperlu CASCADE; +NOTICE: installing required extension "plperlu" +-- test jsonb -> hash +CREATE FUNCTION testJsonbToHV(val jsonb) RETURNS jsonb +LANGUAGE plperlu +TRANSFORM FOR TYPE jsonb +AS $$ +return $_[0]; +$$; +SELECT testJsonbToHV('{"aa":"bb", "cc":null, "dd":2}'::jsonb); + testjsonbtohv +--- + {"aa": "bb", "cc": null, "dd": 2} +(1 row) + +-- test jsonb -> av +CREATE FUNCTION testJsonbToAV(val jsonb) RETURNS jsonb +LANGUAGE plperlu +TRANSFORM FOR TYPE jsonb +AS $$ +return $_[0]; +$$; +SELECT testJsonbToAV('["bb", nul
Re: [HACKERS] SIGSEGV in BRIN autosummarize
Alvaro Herrera wrote: > Before pushing, I'll give a look to the regular autovacuum path to see > if it needs a similar fix. Reading that one, my conclusion is that it doesn't have the same problem because the strings are allocated in AutovacuumMemCxt which is not reset by error recovery. This gave me the idea to use that context instead of TopTransactionContext to store the strings in workitem processing also. The patch I propose now is attached. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From 013e54c9df54b7a324a30beba22138a86417f34f Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Mon, 23 Oct 2017 18:55:12 +0200 Subject: [PATCH v2] Fix autovacuum work items --- src/backend/postmaster/autovacuum.c | 23 ++- 1 file changed, 10 insertions(+), 13 deletions(-) diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c index 776b1c0a9d..dc76728d71 100644 --- a/src/backend/postmaster/autovacuum.c +++ b/src/backend/postmaster/autovacuum.c @@ -2521,9 +2521,10 @@ deleted: { AutoVacuumWorkItem *workitem = &AutoVacuumShmem->av_workItems[i]; - if (!workitem->avw_used) - continue; - if (workitem->avw_active) + /* Item empty, already being processed, not this database? skip it */ + if (!workitem->avw_used || + workitem->avw_active || + workitem->avw_database != MyDatabaseId) continue; /* claim this one, and release lock while performing it */ @@ -2531,6 +2532,7 @@ deleted: LWLockRelease(AutovacuumLock); perform_work_item(workitem); + /* we intentially do not set did_vacuum here */ /* * Check for config changes before acquiring lock for further @@ -2601,11 +2603,9 @@ perform_work_item(AutoVacuumWorkItem *workitem) /* * Save the relation name for a possible error message, to avoid a catalog * lookup in case of an error. If any of these return NULL, then the -* relation has been dropped since last we checked; skip it. 
Note: they -* must live in a long-lived memory context because we call vacuum and -* analyze in different transactions. +* relation has been dropped since last we checked; skip it. */ - + MemoryContextSwitchTo(AutovacMemCxt); cur_relname = get_rel_name(workitem->avw_relation); cur_nspname = get_namespace_name(get_rel_namespace(workitem->avw_relation)); cur_datname = get_database_name(MyDatabaseId); @@ -2615,6 +2615,8 @@ perform_work_item(AutoVacuumWorkItem *workitem) autovac_report_workitem(workitem, cur_nspname, cur_datname); /* +* Have at it. +* * We will abort the current work item if something errors out, and * continue with the next one; in particular, this happens if we are * interrupted with SIGINT. Note that this means that the work item list @@ -2622,9 +2624,6 @@ perform_work_item(AutoVacuumWorkItem *workitem) */ PG_TRY(); { - /* have at it */ - MemoryContextSwitchTo(TopTransactionContext); - switch (workitem->avw_type) { case AVW_BRINSummarizeRange: @@ -2668,10 +2667,8 @@ perform_work_item(AutoVacuumWorkItem *workitem) } PG_END_TRY(); - /* We intentionally do not set did_vacuum here */ - - /* be tidy */ deleted2: + /* be tidy */ if (cur_datname) pfree(cur_datname); if (cur_nspname) -- 2.11.0 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Current int & float overflow checking is slow.
Hi, In analytics queries that involve a large amounts of integers and/or floats (i.e. a large percentage) it's quite easy to see the functions underlying the operators in profiles. Partially that's the function call overhead, but even *after* removing most of that via JITing, they're surprisingly expensive. Largely that's due to the overflow checks. For integers we currently do: #define SAMESIGN(a,b) (((a) < 0) == ((b) < 0)) /* * Overflow check. If the inputs are of different signs then their sum * cannot overflow. If the inputs are of the same sign, their sum had * better be that sign too. */ if (SAMESIGN(arg1, arg2) && !SAMESIGN(result, arg1)) ereport(ERROR, (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), errmsg("integer out of range"))); which means that we turn a single integer instruction into ~10, including a bunch of branches. All that despite the fact that most architectures have flag registers signalling integer overflow. It's just that C doesn't easily make that available. gcc exposes more efficient overflow detection via intrinsics: https://gcc.gnu.org/onlinedocs/gcc-7.1.0/gcc/Integer-Overflow-Builtins.html Using that turns the non-error path from int4pl from: 0x00826ec0 <+0>: mov0x20(%rdi),%rcx # arg1 0x00826ec4 <+4>: mov0x28(%rdi),%rdx # arg2 0x00826ec8 <+8>: mov%ecx,%esi 0x00826eca <+10>:lea(%rdx,%rcx,1),%eax # add # overflow check 0x00826ecd <+13>:shr$0x1f,%edx 0x00826ed0 <+16>:not%esi 0x00826ed2 <+18>:shr$0x1f,%esi 0x00826ed5 <+21>:cmp%dl,%sil 0x00826ed8 <+24>:je 0x826f30 0x00826eda <+26>:mov%eax,%edx 0x00826edc <+28>:shr$0x1f,%ecx 0x00826edf <+31>:shr$0x1f,%edx 0x00826ee2 <+34>:cmp%cl,%dl 0x00826ee4 <+36>:je 0x826f30 /* overflow error code */ 0x00826f30 <+112>: retq into 0x00826ec0 <+0>: mov0x28(%rdi),%rax # arg2 0x00826ec4 <+4>: add0x20(%rdi),%eax # arg1 + arg2 0x00826ec7 <+7>: jo 0x826ecc # jump if overflowed 0x00826ec9 <+9>: mov%eax,%eax # clear high bits 0x00826ecb <+11>:retq which, not that surprisingly, is faster. 
Not to speak of easier to read ;) Besides the fact that the code is faster, there's also the issue that the current way to do overflow checks is not actually correct C, and requires compiler flags like -fwrapv. For floating point it's even worse. /* * check to see if a float4/8 val has underflowed or overflowed */ #define CHECKFLOATVAL(val, inf_is_valid, zero_is_valid) \ do { \ if (isinf(val) && !(inf_is_valid)) \ ereport(ERROR, \ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), \ errmsg("value out of range: overflow"))); \ \ if ((val) == 0.0 && !(zero_is_valid)) \ ereport(ERROR, \ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), \ errmsg("value out of range: underflow"))); \ } while(0) result = arg1 + arg2; /* * There isn't any way to check for underflow of addition/subtraction * because numbers near the underflow value have already been rounded to * the point where we can't detect that the two values were originally * different, e.g. on x86, '1e-45'::float4 == '2e-45'::float4 == * 1.4013e-45. */ CHECKFLOATVAL(result, isinf(arg1) || isinf(arg2), true); The disassembled code for float4pl is: 0x0043ce90 <+0>: vmovss 0x20(%rdi),%xmm1 0x0043ce95 <+5>: vmovss 0x28(%rdi),%xmm2 0x0043ce9a <+10>:vmovss 0x2b6a7e(%rip),%xmm3# 0x6f3920 0x0043cea2 <+18>:vaddss %xmm1,%xmm2,%xmm0 0x0043cea6 <+22>:vmovaps %xmm0,%xmm4 0x0043ceaa <+26>:vandps %xmm3,%xmm4,%xmm4 0x0043ceae <+30>:vucomiss 0x2b6a4a(%rip),%xmm4# 0x6f3900 0x0043ceb6 <+38>:jbe0x43ced4 0x0043ceb8 <+40>:vandps %xmm3,%xmm1,%xmm1 0x0043cebc <+44>:vucomiss 0x2b6a3c(%rip),%xmm1# 0
Re: [HACKERS] Parallel Hash take II
On Tue, Sep 19, 2017 at 8:06 AM, Robert Haas wrote: > On Thu, Sep 14, 2017 at 10:01 AM, Thomas Munro > wrote: >> 3. Gather Merge and Parallel Hash Join may have a deadlock problem. > > [...] > > Thomas and I spent about an hour and a half brainstorming about this > just now. > > [...] > > First, as long as nbatches == 1, we can use a hash > table of up to size (participants * work_mem); if we have to switch to > multiple batches, then just increase the number of batches enough that > the current memory usage drops below work_mem. Second, following an > idea originally by Ashutosh Bapat whose relevance to this issue Thomas > Munro realized during our discussion, we can make all the batches > small enough to fit in work_mem (rather than participants * work_mem > as the current patch does) and spread them across the workers (in the > style of Parallel Append, including potentially deploying multiple > workers against the same batch if there are fewer batches than > workers). Here is an updated patch set that does that ^. Assorted details: 1. To avoid deadlocks, we can only wait at barriers when we know that all other attached backends have either arrived already or are actively executing the code preceding the barrier wait, so that progress is guaranteed. The rules is that executor nodes can remain attached to a barrier after they've emitted a tuple, which is useful for resource management (ie avoids inventing a separate reference counting scheme), but must never again wait for it. With that programming rule there can be no deadlock between executor nodes. 2. Multiple batches are processed at the same time in parallel, rather than being processed serially. Participants try to spread themselves out over different batches to reduce contention. 3. I no longer try to handle outer joins. I have an idea for how to do that while adhering to the above deadlock-avoidance programming rule, but I would like to consider that for a later patch. 4. 
I moved most of the parallel-aware code into ExecParallelHash*() functions that exist alongside the private hash table versions. This avoids uglifying the existing hash join code and introducing conditional branches all over the place, as requested by Andres. This made some of the earlier refactoring patches unnecessary. 5. Inner batch repartitioning, if required, is now completed up front for Parallel Hash. Rather than waiting until we try to load hash tables back into memory to discover that they don't fit, this version tracks the size of hash table contents while writing the batches out. That change has various pros and cons, but its main purpose is to remove dependencies between batches. 6. Outer batch repartitioning is now done up front too, if it's necessary. This removes the dependencies that exist between batch 0 and later batches, but means that outer batch 0 is now written to disk if for multi-batch joins. I don't see any way to avoid that while adhering to the deadlock avoidance rule stated above. If we've already started emitting tuples for batch 0 (by returning control to higher nodes) then we have no deadlock-free way to wait for the scan of the outer relation to finish, so we can't safely process later batches. Therefore we must buffer batch 0's tuples. This renders the skew optimisation useless. 7. There is now some book-keeping state for each batch. For typical cases the total space used is negligible but obviously you can contrive degenerate cases where it eats a lot of memory (say by creating a million batches, which is unlikely to work well for other reasons). I have some ideas on reducing their size, but on the other hand I also have some ideas on increasing it profitably... (this is the perfect place to put the Bloom filters discussed elsewhere that would make up for the loss of the skew optimisation, for selective joins; a subject for another day). 8. Barrier API extended slightly. (1) BarrierWait() is renamed to BarrierArriveAndWait(). 
(2) BarrierArriveAndDetach() is new. The new function is the mechanism required to get from PHJ_BATCH_PROBING to PHJ_BATCH_DONE without waiting, and corresponds to the operation known as Phaser.arriveAndDeregister() in Java (and maybe also barrier::arrive_and_drop() in the C++ concurrency TS and Clock.drop() in X10, not sure). 9. I got rid of PARALLEL_KEY_EXECUTOR_NODE_NTH(). Previously I wanted more than one reserved smh_toc key per executor node. Robert didn't like that. 10. I got rid of "LeaderGate". That earlier attempt at deadlock avoidance clearly missed the mark. (I think it probably defended against the Gather + full TupleQueue deadlock but not the GatherMerge-induced deadlock so it was a useless non-general solution.) 11. The previous incarnation of SharedTuplestore had a built-in concept of partitions, which allowed the number of partitions to be expanded any time but only allowed one partition to be read back in at a time. That worked for the earlier kind of sequ
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
Hello. I made some bugfixes and rewrote the patch.

Simon Riggs wrote on 2017-09-05 14:44:
> As Alexander says, simply skipping truncation if standby is busy isn't
> a great plan. If we defer an action on standby replay, when and who
> will we apply it? What happens if the standby is shutdown or crashes
> while an action is pending.

After a crash, the standby server will go back to the nearest checkpoint and replay all of its WAL to reach a consistent state, and only then open for read-only load. All collected WAL is replayed in the right order; that covers both the WAL records that truncate the table and those that make it grow.

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9a5fde0..68decab 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -31,6 +31,9 @@
 #include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
+#include "storage/lmgr.h"
+#include "storage/procarray.h"
+#include "access/transam.h"

 /*
  * We keep a list of all relations (represented as RelFileNode values)
@@ -498,50 +501,56 @@ smgr_redo(XLogReaderState *record)

 		reln = smgropen(xlrec->rnode, InvalidBackendId);

-		/*
-		 * Forcibly create relation if it doesn't exist (which suggests that
-		 * it was dropped somewhere later in the WAL sequence). As in
-		 * XLogReadBufferForRedo, we prefer to recreate the rel and replay the
-		 * log as best we can until the drop is seen.
-		 */
-		smgrcreate(reln, MAIN_FORKNUM, true);
-
-		/*
-		 * Before we perform the truncation, update minimum recovery point to
-		 * cover this WAL record. Once the relation is truncated, there's no
-		 * going back. The buffer manager enforces the WAL-first rule for
-		 * normal updates to relation files, so that the minimum recovery
-		 * point is always updated before the corresponding change in the data
-		 * file is flushed to disk. We have to do the same manually here.
-		 *
-		 * Doing this before the truncation means that if the truncation fails
-		 * for some reason, you cannot start up the system even after restart,
-		 * until you fix the underlying situation so that the truncation will
-		 * succeed. Alternatively, we could update the minimum recovery point
-		 * after truncation, but that would leave a small window where the
-		 * WAL-first rule could be violated.
-		 */
-		XLogFlush(lsn);
-
-		if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+		if (!ConditionalLockRelationOid(reln->smgr_rnode.node.relNode, AccessExclusiveLock))
 		{
-			smgrtruncate(reln, MAIN_FORKNUM, xlrec->blkno);
+			/*
+			 * Forcibly create relation if it doesn't exist (which suggests that
+			 * it was dropped somewhere later in the WAL sequence). As in
+			 * XLogReadBufferForRedo, we prefer to recreate the rel and replay the
+			 * log as best we can until the drop is seen.
+			 */
+			smgrcreate(reln, MAIN_FORKNUM, true);
+
+			/*
+			 * Before we perform the truncation, update minimum recovery point to
+			 * cover this WAL record. Once the relation is truncated, there's no
+			 * going back. The buffer manager enforces the WAL-first rule for
+			 * normal updates to relation files, so that the minimum recovery
+			 * point is always updated before the corresponding change in the data
+			 * file is flushed to disk. We have to do the same manually here.
+			 *
+			 * Doing this before the truncation means that if the truncation fails
+			 * for some reason, you cannot start up the system even after restart,
+			 * until you fix the underlying situation so that the truncation will
+			 * succeed. Alternatively, we could update the minimum recovery point
+			 * after truncation, but that would leave a small window where the
+			 * WAL-first rule could be violated.
+			 */
+			XLogFlush(lsn);

-			/* Also tell xlogutils.c about it */
-			XLogTruncateRelation(xlrec->rnode, MAIN_FORKNUM, xlrec->blkno);
-		}
+			if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+			{
+				smgrtruncate(reln, MAIN_FORKNUM, xlrec->blkno);
+
+				/* Also tell xlogutils.c about it */
+				XLogTruncateRelation(xlrec->rnode, MAIN_FORKNUM, xlrec->blkno);
+			}

-		/* Truncate FSM and VM too */
-		rel = CreateFakeRelcacheEntry(xlrec->rnode);
+			/* Truncate FSM and VM too */
+			rel = CreateFakeRelcacheEntry(xlrec->rnode);

-		if ((xlrec->flags & SMGR_TRUNCATE_FSM) != 0 &&
-			smgrexists(reln, FSM_FORKNUM))
-			FreeSpaceMapTruncateRel(rel, xlrec->blkno);
-		if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0 &&
-			smgrexists(reln, VISIBILITYMAP_FORKNUM))
-			visibilitymap_truncate(rel, xlrec->blkno);
+			if ((xlrec->flags & SMGR_TRUNCATE_FSM) != 0 &&
+				smgrexists(reln, FSM_FORKNUM))
+				FreeSpaceMapTruncateRel(rel, xlrec->blkno);
+			if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0 &&
+				smgrexists(reln, VISIBILITYMAP_FORKNUM))
+				visibilitymap_truncate(rel, xlrec->blkno);

-		FreeFakeRelcacheEntry(rel);
+			FreeFakeRelcacheEntry(rel);
+		} else
+		{
+			XLogFlush
Re: [HACKERS] [POC] hash partitioning
On Fri, Oct 13, 2017 at 3:00 AM, Andres Freund wrote: > On 2017-10-12 17:27:52 -0400, Robert Haas wrote: >> On Thu, Oct 12, 2017 at 4:20 PM, Andres Freund wrote: >> >> In other words, it's not utterly fixed in stone --- we invented >> >> --load-via-partition-root primarily to cope with circumstances that >> >> could change hash values --- but we sure don't want to be changing it >> >> with any regularity, or for a less-than-excellent reason. >> > >> > Yea, that's what I expected. It'd probably good for somebody to run >> > smhasher or such on the output of the combine function (or even better, >> > on both the 32 and 64 bit variants) in that case. >> >> Not sure how that test suite works exactly, but presumably the >> characteristics in practice will depend on the behavior of the hash >> functions used as input to the combine function - so the behavior could >> be good for an (int, int) key but bad for a (text, date) key, or >> whatever. > > I don't think that's true, unless you have really bad hash functions on > the component hashes. A hash combine function can't really do > anything about badly hashed input, what you want is that it doesn't > *reduce* the quality of the hash by combining. > I ran the suggested SMHasher[1] tests on hash_combine, for both the 32-bit and 64-bit versions. SMHasher works on hash keys of the form {0}, {0,1}, {0,1,2}... up to N=255, using 256-N as the seed. To test hash_combine we need two hash values to combine, so for each SMHasher key I generated 64-bit and 128-bit hashes using the CityHash functions[2] and split them in two parts to test the 32-bit and 64-bit hash_combine functions respectively. Attached is a patch with the SMHasher code changes and the output of the 32-bit and 64-bit hash_combine testing. Note that I have skipped the speed test, which is irrelevant here. Comparing against other hash function results[3], the hash_combine test results are not bad either.
Do let me know if the current testing is not good enough or if you want me to do more testing, thanks.

[1] https://github.com/aappleby/smhasher
[2] https://github.com/aappleby/smhasher/blob/master/src/CityTest.cpp
[3] https://github.com/rurban/smhasher/tree/master/doc

Regards,
Amul

[amul.sul@power2 src]$ ./SMHasher hash_combine64
---
--- Testing hash_combine64 (test hash combine 64 pg function.)

[[[ Sanity Tests ]]]

Verification value 0x3B439A64 : Passed!
Running sanity check 1..PASS
Running sanity check 2..PASS

[[[ Differential Tests ]]]

Testing 8303632 up-to-5-bit differentials in 64-bit keys -> 64 bit hashes.
1000 reps, 8303632000 total tests, expecting 0.00 random collisions..
0 total collisions, of which 0 single collisions were ignored

Testing 11017632 up-to-4-bit differentials in 128-bit keys -> 64 bit hashes.
1000 reps, 11017632000 total tests, expecting 0.00 random collisions..
0 total collisions, of which 0 single collisions were ignored

Testing 2796416 up-to-3-bit differentials in 256-bit keys -> 64 bit hashes.
1000 reps, 2796416000 total tests, expecting 0.00 random collisions..
0 total collisions, of which 0 single collisions were ignored

[[[ Avalanche Tests ]]]

Testing 32-bit keys -> 64-bit hashes, 30 reps.. worst bias is 0.742667%
Testing 40-bit keys -> 64-bit hashes, 30 reps.. worst bias is 0.684667%
Testing 48-bit keys -> 64-bit hashes, 30 reps.. worst bias is 0.570667%
Testing 56-bit keys -> 64-bit hashes, 30 reps.. worst bias is 0.726667%
Testing 64-bit keys -> 64-bit hashes, 30 reps.. worst bias is 0.67%
Testing 72-bit keys -> 64-bit hashes, 30 reps.. worst bias is 0.659333%
Testing 80-bit keys -> 64-bit hashes, 30 reps.. worst bias is 0.666000%
Testing 88-bit keys -> 64-bit hashes, 30 reps.. worst bias is 0.712000%
Testing 96-bit keys -> 64-bit hashes, 30 reps.. worst bias is 0.632667%
Testing 104-bit keys -> 64-bit hashes, 30 reps.. worst bias is 0.646000%
Testing 112-bit keys -> 64-bit hashes, 30 reps.. worst bias is 0.746000%
Testing 120-bit keys -> 64-bit hashes, 30 reps.. worst bias is 0.692667%
Testing 128-bit keys -> 64-bit hashes, 30 reps.. worst bias is 0.684000%
Testing 136-bit keys -> 64-bit hashes, 30 reps.. worst bias is 0.758667%
Testing 144-bit keys -> 64-bit hashes, 30 reps.. worst bias is 0.725333%
Testing 152-bit keys -> 64-bit hashes, 30 reps.. worst bias is 0.776000%

[[[ Keyset 'Cyclic' Tests ]]]

Keyset 'Cyclic' - 8 cycles of 8 bytes - 1000 keys
Testing collisions   - Expected 0.00, actual 0.00 ( 0.00x)
Testing distribution - Worst bias is the 20-bit window at bit 6 - 0.039%

Keyset 'Cyclic' - 8 cycles of 9 bytes - 1000 keys
Testing collisions   - Ex