I upgrade another instance to PG10 yesterday and this AM found unique key violations.
Our application is SELECTing FROM sites WHERE site_location=$1, and if it doesn't find one, INSERTs one (I know that's racy and not ideal). We ended up with duplicate sites, despite a unique index. We removed the duplicate rows and reindexed fine. This is just a heads up with all the detail I can fit in a mail (but there's more if needed). ts=# \d sites site_id | integer | | not null | nextval('sites_site_id_seq'::regclass) site_office | text | | | site_location | text | | | [...] Indexes: "sites_pkey" PRIMARY KEY, btree (site_id) "sites_idx" UNIQUE, btree (site_office, site_location) ts=# SELECT site_office, site_location, count(*), min(site_id), max(site_id) FROM sites GROUP BY 1,2 HAVING count(*)>1 ORDER BY 1,2; site_office | site_location | count | min | max ----------------+---------------+-------+-----+----- CRCLMT-DOEMS0 | | 2 | 165 | 351 CRCLMT-DOEMS0 | 1101 | 2 | 123 | 343 CRCLMT-DOEMS0 | 1102 | 2 | 134 | 318 CRCLMT-DOEMS0 | 1103 | 2 | 145 | 322 CRCLMT-DOEMS0 | 1104 | 2 | 156 | 329 The duplicate site_ids mean this isn't an issue with row version/visibility due to XIDs (right?). ts=# SELECT 1 FROM sites WHERE site_office='CRCLMT-CEMS0' AND site_location=''; (0 rows) ts=# SET enable_bitmapscan=off; SET enable_indexscan=off; SELECT 1 FROM sites WHERE site_office='CRCLMT-CEMS0' AND site_location=''; -[ RECORD 1 ] ?column? | 1 So there's an issue with indices failing to return matching rows (and thereby allowing inserting duplicate violating rows). That's the only table/index affected (and the only PG instance thus far affected). Note regarding my pg_upgrade: 3 years ago, this was our first and smallest customer who I upgraded off PG8.4 (to PG9.2 if I recall), and I did it using pg_dump |pg_restore. I believe, as a consequence, its postgres database was in "C" locale and ASCII encoding. So the last few upgrades (9.3, .4 .5 and .6), I believe I've manually used initdb --encoding followed by pg_upgrade (else it fails due to new postgres/template DBs with different locale/encoding from old). This upgrade, I finally renamed postgres DB (which has imported CSV logs and one or two other things) and pg_dump|pg_restore into a new DB with UTF8 encoding, which allowed pg_upgrade to run without special initdb invocation. I have an LVM snapshot and full CSV logs imported into a table. I also have a backup from 22:00 which doesn't have duplicate sites. Those seem to have been inserted by our application around 00:30: These IDs which inserted duplicate rows: postgres=# SELECT session_id, max(session_line) FROM postgres_log_2017_10_24_0000 WHERE message LIKE 'statement: SELECT site_id FROM sites WHERE%' GROUP BY 1 ; session_id | max ---------------+------ 59eedfb1.5cea | 714 59eedfb5.5cf1 | 1741 (2 rows) postgres=# SELECT log_time, session_id, session_line, left(message,333) FROM postgres_log WHERE (session_id='59eedfb1.5cea' OR session_id='59eedfb5.5cf1') AND (session_line<6 OR message LIKE '%INSERT INTO site%') ORDER BY 1,2,3; -[ RECORD 4 ]+---------------------------------------------------------------------------------------------------------------------------------------------- log_time | 2017-10-24 00:37:37.888-06 session_id | 59eedfb1.5cea session_line | 4 left | statement: SELECT site_id FROM sites WHERE + | site_office = 'CRCLMT-DOEMS0' AND site_location = '1203' -[ RECORD 5 ]+---------------------------------------------------------------------------------------------------------------------------------------------- log_time | 2017-10-24 00:37:37.89-06 session_id | 59eedfb1.5cea session_line | 5 left | statement: INSERT INTO sites (site_office,site_location,site_alias) + | VALUES ('CRCLMT-DOEMS0', '1203', (SELECT site_id FROM sites + | WHERE site_office = 'CRCLMT-CEMS0' AND site_location = '1203')) Note: I run a script which does various combinations of ANALYZE/VACUUM (FULL/ANALYZE) following the upgrade, and a script runs nightly with REINDEX and pg_repack (and a couple of CLUSTER), so you should assume that any combination of those maintenance commands have been run. In our reindex/repack log I found the first error due to duplicates: Tue Oct 24 01:27:53 MDT 2017: sites: sites_idx(repack non-partitioned)... WARNING: Error creating index "public"."index_61764": ERROR: could not create unique index "index_61764" DETAIL: Key (site_office, site_location)=(CRCLMT-DOEMS0, 1120) is duplicated. WARNING: Skipping index swapping for "sites", since no new indexes built WARNING: repack failed for "sites_idx" reindex: warning, dropping invalid/unswapped index: index_61764 postgres=# SELECT * FROM postgres_log WHERE session_id='59eeeb79.4bd5' AND error_severity='ERROR' ORDER BY 1 DESC LIMIT 33; log_time | 2017-10-24 01:27:53.545-06 user_name | postgres database | ts pid | 19413 connection_from | [local] session_id | 59eeeb79.4bd5 session_line | 10 command_tag | CREATE INDEX session_start_time | 2017-10-24 01:27:53-06 error_severity | ERROR message | could not create unique index "index_61764" detail | Key (site_office, site_location)=(CRCLMT-DOEMS0, 1120) is duplicated. query | CREATE UNIQUE INDEX CONCURRENTLY index_61764 ON sites USING btree (site_office, site_location) TABLESPACE pg_default application_name | pg_repack Disclosure: I see a storage error from 2 days ago (pre-upgrade). This is a 4-drive RAID5 with spare. I can't see that we're missing a spare, so I don't know what else to do besides consider this message to be spurious.. Oct 22 21:28:16 midrivers-ubuntu kernel: [1429369.825577] hpsa 0000:05:00.0: scsi 4:0:0:0 Aborting command ffff8f84f2ba7c00Tag:0x00000000:000001f0 CDBLen: 6 CDB: 0x1201... SN: 0x0 BEING SENT Oct 22 21:28:16 midrivers-ubuntu kernel: [1429369.825583] hpsa 0000:05:00.0: scsi 4:0:0:0: Aborting command RAID HP P410i controller SSDSmartPathCap- En- Exp=1 Oct 22 21:28:16 midrivers-ubuntu kernel: [1429369.898930] hpsa 0000:05:00.0: CDB 12018000600000000000000000000000 was aborted with status 0x0 Oct 22 21:28:17 midrivers-ubuntu kernel: [1429371.269963] hpsa 0000:05:00.0: invalid command: LUN:0000000000000000 CDB:00000000f00100000000000000000000 Oct 22 21:28:17 midrivers-ubuntu kernel: [1429371.269966] hpsa 0000:05:00.0: probably means device no longer present Oct 22 21:28:17 midrivers-ubuntu kernel: [1429371.269968] hpsa 0000:05:00.0: scsi 4:0:0:0 Aborting command ffff8f84f2ba7c00Tag:0x00000000:000001f0 CDBLen: 6 CDB: 0x1201... SN: 0x0 SENT, FAILED Oct 22 21:28:17 midrivers-ubuntu kernel: [1429371.269974] hpsa 0000:05:00.0: scsi 4:0:0:0: FAILED to abort command RAID HP P410i controller SSDSmartPathCap- En- Exp=1 sites is perhaps our most central table and I would expect issues there to be quickly apparent (however this was at least initially a silent failure). public | sites | table | telsasoft | 80 kB | Justin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers