[HACKERS] unique index violation after pg_upgrade to PG10

Justin Pryzby Tue, 24 Oct 2017 11:15:44 -0700

I upgrade another instance to PG10 yesterday and this AM found unique key
violations.


Our application is SELECTing FROM sites WHERE site_location=$1, and if it
doesn't find one, INSERTs one (I know that's racy and not ideal).  We ended up
with duplicate sites, despite a unique index.  We removed the duplicate rows
and reindexed fine.  This is just a heads up with all the detail I can fit in a
mail (but there's more if needed).

ts=# \d sites
 site_id          | integer  |           | not null | 
nextval('sites_site_id_seq'::regclass)
 site_office      | text     |           |          | 
 site_location    | text     |           |          | 
[...]
Indexes:
    "sites_pkey" PRIMARY KEY, btree (site_id)
    "sites_idx" UNIQUE, btree (site_office, site_location)

ts=# SELECT site_office, site_location, count(*), min(site_id), max(site_id) 
FROM sites GROUP BY 1,2 HAVING count(*)>1 ORDER BY 1,2;  
  site_office   | site_location | count | min | max 
----------------+---------------+-------+-----+-----
 CRCLMT-DOEMS0  |               |     2 | 165 | 351
 CRCLMT-DOEMS0  | 1101          |     2 | 123 | 343
 CRCLMT-DOEMS0  | 1102          |     2 | 134 | 318
 CRCLMT-DOEMS0  | 1103          |     2 | 145 | 322
 CRCLMT-DOEMS0  | 1104          |     2 | 156 | 329

The duplicate site_ids mean this isn't an issue with row version/visibility due
to XIDs (right?).

ts=# SELECT 1 FROM sites WHERE site_office='CRCLMT-CEMS0' AND site_location='';
(0 rows)
ts=# SET enable_bitmapscan=off; SET enable_indexscan=off; SELECT 1 FROM sites 
WHERE site_office='CRCLMT-CEMS0' AND site_location='';
-[ RECORD 1 ]
?column? | 1

So there's an issue with indices failing to return matching rows (and thereby
allowing inserting duplicate violating rows).

That's the only table/index affected (and the only PG instance thus far
affected).

Note regarding my pg_upgrade: 3 years ago, this was our first and smallest
customer who I upgraded off PG8.4 (to PG9.2 if I recall), and I did it using
pg_dump |pg_restore.  I believe, as a consequence, its postgres database was in
"C" locale and ASCII encoding.  So the last few upgrades (9.3, .4 .5 and .6), I
believe I've manually used initdb --encoding followed by pg_upgrade (else it
fails due to new postgres/template DBs with different locale/encoding from
old).  This upgrade, I finally renamed postgres DB (which has imported CSV logs
and one or two other things) and pg_dump|pg_restore into a new DB with UTF8
encoding, which allowed pg_upgrade to run without special initdb invocation.

I have an LVM snapshot and full CSV logs imported into a table.  I also have a
backup from 22:00 which doesn't have duplicate sites.  Those seem to have been
inserted by our application around 00:30:

These IDs which inserted duplicate rows:
postgres=# SELECT session_id, max(session_line) FROM 
postgres_log_2017_10_24_0000 WHERE message LIKE 'statement: SELECT site_id FROM 
sites WHERE%' GROUP BY 1 ;
  session_id   | max  
---------------+------
 59eedfb1.5cea |  714
 59eedfb5.5cf1 | 1741
(2 rows)

postgres=# SELECT log_time, session_id, session_line, left(message,333) FROM 
postgres_log WHERE (session_id='59eedfb1.5cea' OR session_id='59eedfb5.5cf1') 
AND (session_line<6 OR message LIKE '%INSERT INTO site%') ORDER BY 1,2,3;
-[ RECORD 4 
]+----------------------------------------------------------------------------------------------------------------------------------------------
log_time     | 2017-10-24 00:37:37.888-06
session_id   | 59eedfb1.5cea
session_line | 4
left         | statement: SELECT site_id FROM sites WHERE                       
                                                                            +
             |                 site_office = 'CRCLMT-DOEMS0' AND site_location 
= '1203'
-[ RECORD 5 
]+----------------------------------------------------------------------------------------------------------------------------------------------
log_time     | 2017-10-24 00:37:37.89-06
session_id   | 59eedfb1.5cea
session_line | 5
left         | statement: INSERT INTO sites 
(site_office,site_location,site_alias)                                          
                                +
             |                         VALUES ('CRCLMT-DOEMS0', '1203', (SELECT 
site_id FROM sites                                                          +
             |                         WHERE site_office = 'CRCLMT-CEMS0' AND 
site_location = '1203'))

Note:
I run a script which does various combinations of ANALYZE/VACUUM (FULL/ANALYZE)
following the upgrade, and a script runs nightly with REINDEX and pg_repack
(and a couple of CLUSTER), so you should assume that any combination of those
maintenance commands have been run.

In our reindex/repack log I found the first error due to duplicates:
Tue Oct 24 01:27:53 MDT 2017: sites: sites_idx(repack non-partitioned)...
WARNING: Error creating index "public"."index_61764": ERROR:  could not create 
unique index "index_61764"
DETAIL:  Key (site_office, site_location)=(CRCLMT-DOEMS0, 1120) is duplicated.
WARNING: Skipping index swapping for "sites", since no new indexes built
WARNING: repack failed for "sites_idx"
reindex: warning, dropping invalid/unswapped index: index_61764

postgres=# SELECT * FROM postgres_log WHERE session_id='59eeeb79.4bd5' AND 
error_severity='ERROR' ORDER BY 1 DESC LIMIT 33;
log_time               | 2017-10-24 01:27:53.545-06
user_name              | postgres
database               | ts
pid                    | 19413
connection_from        | [local]
session_id             | 59eeeb79.4bd5
session_line           | 10
command_tag            | CREATE INDEX
session_start_time     | 2017-10-24 01:27:53-06
error_severity         | ERROR
message                | could not create unique index "index_61764"
detail                 | Key (site_office, site_location)=(CRCLMT-DOEMS0, 1120) 
is duplicated.
query                  | CREATE UNIQUE INDEX CONCURRENTLY index_61764 ON sites 
USING btree (site_office, site_location) TABLESPACE pg_default
application_name       | pg_repack

Disclosure: I see a storage error from 2 days ago (pre-upgrade).  This is a
4-drive RAID5 with spare.  I can't see that we're missing a spare, so I don't
know what else to do besides consider this message to be spurious..

Oct 22 21:28:16 midrivers-ubuntu kernel: [1429369.825577] hpsa 0000:05:00.0: 
scsi 4:0:0:0 Aborting command ffff8f84f2ba7c00Tag:0x00000000:000001f0 CDBLen: 6 
CDB: 0x1201... SN: 0x0  BEING SENT
Oct 22 21:28:16 midrivers-ubuntu kernel: [1429369.825583] hpsa 0000:05:00.0: 
scsi 4:0:0:0: Aborting command RAID              HP       P410i            
controller SSDSmartPathCap- En- Exp=1
Oct 22 21:28:16 midrivers-ubuntu kernel: [1429369.898930] hpsa 0000:05:00.0: 
CDB 12018000600000000000000000000000 was aborted with status 0x0
Oct 22 21:28:17 midrivers-ubuntu kernel: [1429371.269963] hpsa 0000:05:00.0: 
invalid command: LUN:0000000000000000 CDB:00000000f00100000000000000000000
Oct 22 21:28:17 midrivers-ubuntu kernel: [1429371.269966] hpsa 0000:05:00.0: 
probably means device no longer present
Oct 22 21:28:17 midrivers-ubuntu kernel: [1429371.269968] hpsa 0000:05:00.0: 
scsi 4:0:0:0 Aborting command ffff8f84f2ba7c00Tag:0x00000000:000001f0 CDBLen: 6 
CDB: 0x1201... SN: 0x0  SENT, FAILED
Oct 22 21:28:17 midrivers-ubuntu kernel: [1429371.269974] hpsa 0000:05:00.0: 
scsi 4:0:0:0: FAILED to abort command RAID              HP       P410i          
  controller SSDSmartPathCap- En- Exp=1

sites is perhaps our most central table and I would expect issues there to be
quickly apparent (however this was at least initially a silent failure).
 public | sites | table | telsasoft | 80 kB | 

Justin


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] unique index violation after pg_upgrade to PG10

Reply via email to