postgres 9.5 DB corruption: invalid byte sequence for encoding "UTF8"

Thomas Tignor Mon, 25 Mar 2019 11:47:33 -0700

Hoping someone may be able to offer some guidance on this recurring problem. I 
am providing this "problem report" to the general list as I understand the bugs 
list requires a set of reproduction steps we do not yet have. Please advise if 
I have the process wrong. I have tried to provide all known relevant info here. 
Thanks in advance for any insights.--> A description of what you are trying to 
achieve and what results you expect.:
We are experiencing intermittent DB corruption in postgres 9.5.14. We are 
trying to identify and eliminate all sources. We are using two independent 
services for data replication, Slony-I v2.2.6 and a custom service developed 
in-house. Both are based on COPY operations. DB corruption is observed when 
COPY operations fail with an error of the form: 'invalid byte sequence for 
encoding "UTF8"'. This occurs with a frequency ranging between a few days and 
several weeks. Each incident is followed by a race to find and repair or remove 
corrupted data, which we are getting good at. With well over a dozen incidents, 
the great majority originally showed corruption in a single VARCHAR(2000) 
column (value) in a single table (alert_attribute). In this time, we read about 
suspected and real problems with TOAST functionality and so made the decision 
to change alert_attribute.value to PLAIN storage. Since that change was made, 
most new incidents show corruption in the alert_attribute.name column instead 
(VARCHAR(200)). Another table (alert_instance) has been impacted as well. See 
below for their schemas.


We have looked high and low through system logs and device reporting utility 
output for any sign of hardware failures. We haven't turned up anything yet. We 
also tried rebuilding an entire DB from scratch. That did not seem to help.

We have not been performing routine reindexing. This is a problem we are 
working to correct. Normally our master DB serves for an 8-12 week period 
without reindexing before we failover to a peer. Before assuming the master 
role, the peer always begins by truncating the alert_instance and 
alert_attribute tables and loading all data from the current master.

Hardware specs are listed below. For storage, we have 8 INTEL SSDSA2BW12 
direct-attached disks. We can provide additional info as needed.

ams=# \d ams.alert_attribute
               Table "ams.alert_attribute"
      Column       |          Type           | Modifiers 
-------------------+-------------------------+-----------
 alert_instance_id | integer                 | not null
 name              | character varying(200)  | not null
 data_type         | smallint                | not null
 value             | character varying(2000) | 
Indexes:
    "pk_alert_attributes" PRIMARY KEY, btree (alert_instance_id, name), 
tablespace "tbls5"
    "idx_aa_aval" btree (name, value)
Foreign-key constraints:
    "fk_alert_attr_instance_id" FOREIGN KEY (alert_instance_id) REFERENCES 
ams.alert_instance(alert_instance_id) ON DELETE CASCADE
Triggers:
    _ams_cluster_logtrigger AFTER INSERT OR DELETE OR UPDATE ON 
ams.alert_attribute FOR EACH ROW EXECUTE PROCEDURE 
_ams_cluster.logtrigger('_ams_cluster', '2', 'kk')
    _ams_cluster_truncatetrigger BEFORE TRUNCATE ON ams.alert_attribute FOR 
EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.log_truncate('2')
Disabled user triggers:
    _ams_cluster_denyaccess BEFORE INSERT OR DELETE OR UPDATE ON 
ams.alert_attribute FOR EACH ROW EXECUTE PROCEDURE 
_ams_cluster.denyaccess('_ams_cluster')
    _ams_cluster_truncatedeny BEFORE TRUNCATE ON ams.alert_attribute FOR EACH 
STATEMENT EXECUTE PROCEDURE _ams_cluster.deny_truncate()

ams=# 
ams=# \d ams.alert_instance 
                    Table "ams.alert_instance"
       Column        |              Type              | Modifiers 
---------------------+--------------------------------+-----------
 alert_instance_id   | integer                        | not null
 alert_definition_id | integer                        | not null
 alert_instance_key  | character varying(500)         | not null
 start_active_date   | timestamp(0) without time zone | not null
 stop_active_date    | timestamp(0) without time zone | 
 active              | smallint                       | not null
 acknowledged        | smallint                       | not null
 ack_clear_time      | timestamp(0) without time zone | 
 user_set_clear_time | smallint                       | 
 category_id         | integer                        | not null
 condition_start     | timestamp(0) without time zone | not null
 unack_reason        | character varying(1)           | 
 viewer_visible      | smallint                       | not null
Indexes:
    "pk_alert_instance" PRIMARY KEY, btree (alert_instance_id), tablespace 
"tbls5"
    "idx_alert_inst_1" btree (alert_instance_key, alert_definition_id, 
alert_instance_id, active, acknowledged, ack_clear_time), tablespace "tbls5"
    "idx_alert_inst_cat_id" btree (category_id), tablespace "tbls5"
    "idx_alert_inst_def_id" btree (alert_definition_id), tablespace "tbls5"
Check constraints:
    "ck_alert_inst_acked" CHECK (acknowledged = 0 OR acknowledged = 1)
    "ck_alert_inst_active" CHECK (active = 0 OR active = 1)
    "ck_alert_inst_set_cl_tm" CHECK (user_set_clear_time = 0 OR 
user_set_clear_time = 1)
    "ck_alert_inst_viewer_vis" CHECK (viewer_visible = 0 OR viewer_visible = 1)
Foreign-key constraints:
    "fk_alert_inst_cat_id" FOREIGN KEY (category_id) REFERENCES 
ams.category(category_id)
    "fk_alert_inst_def_id" FOREIGN KEY (alert_definition_id) REFERENCES 
ams.alert_definition(alert_definition_id)
    "fk_alert_inst_unack_reason" FOREIGN KEY (unack_reason) REFERENCES 
ams.unack_reason(unack_reason)
Referenced by:
    TABLE "ams.alert_attribute" CONSTRAINT "fk_alert_attr_instance_id" FOREIGN 
KEY (alert_instance_id) REFERENCES ams.alert_instance(alert_instance_id) ON 
DELETE CASCADE
Triggers:
    _ams_cluster_logtrigger AFTER INSERT OR DELETE OR UPDATE ON 
ams.alert_instance FOR EACH ROW EXECUTE PROCEDURE 
_ams_cluster.logtrigger('_ams_cluster', '1', 'k')
    _ams_cluster_truncatetrigger BEFORE TRUNCATE ON ams.alert_instance FOR EACH 
STATEMENT EXECUTE PROCEDURE _ams_cluster.log_truncate('1')
Disabled user triggers:
    _ams_cluster_denyaccess BEFORE INSERT OR DELETE OR UPDATE ON 
ams.alert_instance FOR EACH ROW EXECUTE PROCEDURE 
_ams_cluster.denyaccess('_ams_cluster')
    _ams_cluster_truncatedeny BEFORE TRUNCATE ON ams.alert_instance FOR EACH 
STATEMENT EXECUTE PROCEDURE _ams_cluster.deny_truncate()

ams=# 

--> PostgreSQL version number you are running:
9.5.14

--> How you installed PostgreSQL:
We use a customized installer. It copies a “make install”-based distribution 
into place on a target install host.
Our compiler is gcc 5.4.0.

CFLAGS: -m32 -march=opteron -mno-3dnow -ggdb -O2 -Wall

$P/CONFIGURE_OPTIONS := \
        --with-perl \
        --with-openssl \
        --with-libxml \
        --with-libxslt \
        --with-includes="$(COMMON)/include/perl $(COMMON)/include" \
        --with-libs="$(COMMON)/lib" --prefix=$($P/PREFIX) \
        --enable-debug \
        --disable-nls

--> Changes made to the settings in the postgresql.conf file:  see Server 
Configuration for a quick way to list them all.
ams=# SELECT version();
                                                        version                 
                                        
------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.5.14 on i686-pc-linux-gnu, compiled by gcc (Ubuntu 
5.4.0-6ubuntu1~16.04.9aka9.0.3) 5.4.0 20160609, 32-bit
(1 row)

ams=# SELECT name, current_setting(name), SOURCE
ams-#   FROM pg_settings
ams-#   WHERE SOURCE NOT IN ('default', 'override');
              name              |  current_setting   |        source        
--------------------------------+--------------------+----------------------
 application_name               | psql               | client
 autovacuum_vacuum_cost_delay   | 100ms              | configuration file
 checkpoint_completion_target   | 0.9                | configuration file
 client_encoding                | SQL_ASCII          | client
 cpu_index_tuple_cost           | 0.001              | configuration file
 cpu_operator_cost              | 0.0005             | configuration file
 cpu_tuple_cost                 | 0.002              | configuration file
 DateStyle                      | ISO, MDY           | configuration file
 default_statistics_target      | 150                | configuration file
 default_text_search_config     | pg_catalog.english | configuration file
 effective_cache_size           | 8007MB             | configuration file
 lc_messages                    | C                  | configuration file
 lc_monetary                    | C                  | configuration file
 lc_numeric                     | C                  | configuration file
 lc_time                        | C                  | configuration file
 listen_addresses               | 0.0.0.0            | configuration file
 log_autovacuum_min_duration    | 0                  | configuration file
 log_checkpoints                | on                 | configuration file
 log_connections                | on                 | configuration file
 log_disconnections             | on                 | configuration file
 log_line_prefix                | %m %p %u %h %d     | configuration file
 log_min_duration_statement     | 1min               | configuration file
 log_min_messages               | notice             | configuration file
 log_temp_files                 | 0                  | configuration file
 maintenance_work_mem           | 512MB              | configuration file
 max_connections                | 600                | configuration file
 max_stack_depth                | 2MB                | environment variable
 max_wal_size                   | 15360TB            | configuration file
 random_page_cost               | 2.5                | configuration file
 restart_after_crash            | off                | configuration file
 shared_buffers                 | 2GB                | configuration file
 ssl                            | on                 | configuration file
 ssl_ca_file                    | root.crt           | configuration file
 standard_conforming_strings    | on                 | configuration file
 statement_timeout              | 1min               | configuration file
 superuser_reserved_connections | 10                 | configuration file
 syslog_ident                   | postgresql9.5-1.12 | configuration file
 tcp_keepalives_idle            | 0                  | configuration file
 unix_socket_permissions        | 0770               | configuration file
 work_mem                       | 30MB               | configuration file
(40 rows)

ams=# 

--> Operating system and version:
Ubuntu 16.04.5 LTS

--> What program you're using to connect to PostgreSQL:
Most DML is performed by a standalone JVM using the PostgreSQL JDBC Driver 
v42.0.0.
 
--> Is there anything relevant or unusual in the PostgreSQL server logs?:
We have sometimes but not always seen "invalid memory alloc request size" 
errors around the time of these events. Our experience suggests that "invalid 
memory alloc request size" errors are a symptom specifically of index 
corruption and are remediated by reindexing.
 
--> For questions about any kind of error:
--> What you were doing when the error happened / how to cause the error:
Periodic DML (inserts and updates) focused primarily on two DB tables 
(ams.alert_instance and ams.alert_attribute) with hourly batched transfers of 
inactive alerts to history tables (deletes from ams.alert_instance, 
ams.alert_attribute and corresponding inserts to ams.alert_instance_hist, 
ams.alert_attribute_hist within the same transaction).

--> The EXACT TEXT of the error message you're getting, if there is one: (Copy 
and paste the message to the email, do not send a screenshot)
2018-12-27 07:25:24.527 GMT 30462 appuser 127.0.0.1 ams ERROR:  invalid byte 
sequence for encoding "UTF8": 0x86
The specific identified byte varies with each incident.

--> CPU Info
2*4-core Intel(R) Xeon(R) CPU E31270 @ 3.40GHz

--> Mem Info
16GB RAM

--> Storage Info
r...@ams4.dfw.netmgmt:~# lsscsi -l

[0:0:0:0]    disk    ATA      INTEL SSDSA2BW12 0362  /dev/sda 
  state=running queue_depth=32 scsi_level=7 type=0 device_blocked=0 timeout=30
[0:0:1:0]    disk    ATA      INTEL SSDSA2BW12 0362  /dev/sdb 
  state=running queue_depth=32 scsi_level=7 type=0 device_blocked=0 timeout=30
[0:0:2:0]    disk    ATA      INTEL SSDSA2BW12 0362  /dev/sdc 
  state=running queue_depth=32 scsi_level=7 type=0 device_blocked=0 timeout=30
[0:0:3:0]    disk    ATA      INTEL SSDSA2BW12 0362  /dev/sdd 
  state=running queue_depth=32 scsi_level=7 type=0 device_blocked=0 timeout=30
[0:0:4:0]    disk    ATA      INTEL SSDSA2BW12 0362  /dev/sde 
  state=running queue_depth=32 scsi_level=7 type=0 device_blocked=0 timeout=30
[0:0:5:0]    disk    ATA      INTEL SSDSA2BW12 0362  /dev/sdf 
  state=running queue_depth=32 scsi_level=7 type=0 device_blocked=0 timeout=30
[0:0:6:0]    disk    ATA      INTEL SSDSA2BW12 0362  /dev/sdg 
  state=running queue_depth=32 scsi_level=7 type=0 device_blocked=0 timeout=30
[0:0:7:0]    disk    ATA      INTEL SSDSA2BW12 0362  /dev/sdh 
  state=running queue_depth=32 scsi_level=7 type=0 device_blocked=0 timeout=30
r...@ams4.dfw.netmgmt:~# 

r...@ams4.dfw.netmgmt:~# modinfo md_mod
filename:       
/lib/modules/4.14.68-4.14.3-amd64-3adf3675665129fa/kernel/drivers/md/md-mod.ko
alias:          block-major-9-*
alias:          md
description:    MD RAID framework
license:        GPL
depends:        
retpoline:      Y
intree:         Y
name:           md_mod
vermagic:       4.14.68-4.14.3-amd64-3adf3675665129fa SMP mod_unload 
modversions 
parm:           start_dirty_degraded:int
parm:           create_on_open:bool
r...@ams4.dfw.netmgmt:~# 
Tom    :-)

postgres 9.5 DB corruption: invalid byte sequence for encoding "UTF8"

Reply via email to