On Mon, Aug 8, 2016 at 12:19 PM, Adrian Klaver <adrian.kla...@aklaver.com> wrote:
> On 08/08/2016 09:11 AM, Ioana Danes wrote: > >> Hi, >> >> I suspect I am having a case of data corruption. Here are the details: >> >> I am running postgres 9.4.8: >> >> postgresql94-9.4.8-1PGDG.rhel7.x86_64 >> postgresql94-contrib-9.4.8-1PGDG.rhel7.x86_64 >> postgresql94-libs-9.4.8-1PGDG.rhel7.x86_64 >> postgresql94-server-9.4.8-1PGDG.rhel7.x86_64 >> >> on CentOS Linux release 7.2.1511 (Core) >> >> This is happening in a production environment but luckily on the >> reporting database. >> I have a cluster of 3 databases, db1 and db2 are masters and replicate >> between each other and also replicate to db3 (db1 <-> db2, db1 -> db3, >> db2 -> db3). >> For replication I am using Bucardo. >> > > I would say this is more a question for the Burcardo list: > > https://mail.endcrypt.com/mailman/listinfo/bucardo-general > > I am just not seeing that replicating two masters on to a single database > is going to end well. > Only one master is active at one time the other one is in stand by that is a topic for another discussion but in our case that works well. That was my first assumption, that it is a kind of a race condition or a bug on replication but I quickly ruled that out because that does not explain why when I filtered the table by transactionid = 75315815 it shows one record with transactionid 75315811... select gameplayid, transactionid, encodedplay from abrazo.matchgameplay where transactionid in (75315815) order by transactionid;; gameplayid | transactionid | encodedplay ------------+---------------+-------------- 160019271 | 75315815 | mix:9,0,9 160019269 | 75315815 | mix:9,8,9 160019267 | 75315815 | mix:9,2,2 160019265 | 75315815 | mix:2,2,8 160019263 | *75315811 *| backup:1,9,1 160019261 | 75315815 | backup:2,0,9 So I don't think it is a replication issue... > > >> The problem I am having is that one record in a table it shows a wrong >> value for one single field: >> >> select gameplayid, transactionid, encodedplay from mytable where >> transactionid in (75315811, 75315815) order by transactionid; >> >> gameplayid | transactionid | encodedplay >> ------------+---------------+-------------- >> 160019239 | 75315811 | mix:5,2,7 >> 160019237 | 75315811 | mix:5,4,8 >> 160019235 | 75315811 | mix:6,2,9 >> 160019233 | 75315811 | mix:1,9,8 >> 160019271 | 75315815 | mix:9,0,9 >> 160019269 | 75315815 | mix:9,8,9 >> 160019267 | 75315815 | mix:9,2,2 >> 160019265 | 75315815 | mix:2,2,8 >> 160019263 | *75315811* | backup:1,9,1 >> 160019261 | 75315815 | backup:2,0,9 >> >> select gameplayid, transactionid, encodedplay from mytable where >> transactionid in (75315815) order by transactionid; >> >> gameplayid | transactionid | encodedplay >> ------------+---------------+-------------- >> 160019271 | 75315815 | mix:9,0,9 >> 160019269 | 75315815 | mix:9,8,9 >> 160019267 | 75315815 | mix:9,2,2 >> 160019265 | 75315815 | mix:2,2,8 >> 160019263 | *75315811* | backup:1,9,1 >> 160019261 | 75315815 | backup:2,0,9 >> >> select gameplayid, transactionid, encodedplay from mytable where >> transactionid in (75315811) order by transactionid; >> >> gameplayid | transactionid | encodedplay >> ------------+---------------+-------------- >> 160019239 | 75315811 | mix:5,2,7 >> 160019237 | 75315811 | mix:5,4,8 >> 160019235 | 75315811 | mix:6,2,9 >> 160019233 | 75315811 | mix:1,9,8 >> >> So the record with gameplayid = 160019263 have a wrong transactionid, >> 75315811 instead of 75315815. >> The correct value is 75315815 and that I know because of the following >> facts: >> - on db1 and db2 transactionid = 75315815 for gameplayid = 160019263, >> - this table gets mostly inserts, very rare updates and only on other 2 >> fields not this one. >> - there is another parent table that shows the number of records in this >> table which is 4 for transactionid =75315811 and 6 for transactionid = >> 7531581. >> >> This table has an index by transactionid and that index seem correct >> because the filtering and the ordering are fine (like the filed has the >> correct value)... >> >> What puzzles me is that the value that shows in this field is a real >> value from another record... >> >> I only caught this issue because I have a script that runs in the night >> that compares the databases ... >> >> By now I updated the field with the correct value and everything seem >> stable. >> >> Postgres logs don't have any information about file corruption or any >> other kind of error. I also checked other logs on the system and I could >> not find any traces of corruption. >> >> select name, setting from pg_settings order by 1; >> name >> | setting >> -------------------------------------+---------------------- >> ------------------------------------------------------------ >> ------------------------------- >> allow_system_table_mods | off >> application_name | psql >> archive_command | test -f >> /cbnDBscripts/tmp/PITR_primarydb_stop_backup || rsync --timeout=60 -atz >> %p stldrdb:/data01/wal_files/%f >> archive_mode | on >> archive_timeout | 60 >> array_nulls | on >> authentication_timeout | 60 >> autovacuum | on >> autovacuum_analyze_scale_factor | 0.1 >> autovacuum_analyze_threshold | 50 >> autovacuum_freeze_max_age | 200000000 >> autovacuum_max_workers | 3 >> autovacuum_multixact_freeze_max_age | 400000000 >> autovacuum_naptime | 60 >> autovacuum_vacuum_cost_delay | 20 >> autovacuum_vacuum_cost_limit | -1 >> autovacuum_vacuum_scale_factor | 0.2 >> autovacuum_vacuum_threshold | 50 >> autovacuum_work_mem | -1 >> backslash_quote | safe_encoding >> bgwriter_delay | 200 >> bgwriter_lru_maxpages | 100 >> bgwriter_lru_multiplier | 2 >> block_size | 8192 >> bonjour | off >> bonjour_name | >> bytea_output | hex >> check_function_bodies | on >> checkpoint_completion_target | 0.5 >> checkpoint_segments | 16 >> checkpoint_timeout | 300 >> checkpoint_warning | 30 >> client_encoding | UTF8 >> client_min_messages | error >> commit_delay | 0 >> commit_siblings | 5 >> config_file | /data01/postgres/postgresql.conf >> constraint_exclusion | partition >> cpu_index_tuple_cost | 0.005 >> cpu_operator_cost | 0.0025 >> cpu_tuple_cost | 0.01 >> cursor_tuple_fraction | 0.1 >> data_checksums | off >> data_directory | /data01/postgres >> DateStyle | ISO, MDY >> db_user_namespace | off >> deadlock_timeout | 1000 >> debug_assertions | off >> debug_pretty_print | on >> debug_print_parse | off >> debug_print_plan | off >> debug_print_rewritten | off >> default_statistics_target | 100 >> default_tablespace | >> default_text_search_config | pg_catalog.english >> default_transaction_deferrable | off >> default_transaction_isolation | read committed >> default_transaction_read_only | off >> default_with_oids | off >> dynamic_library_path | $libdir >> dynamic_shared_memory_type | posix >> effective_cache_size | 1048576 >> effective_io_concurrency | 1 >> enable_bitmapscan | on >> enable_hashagg | on >> enable_hashjoin | on >> enable_indexonlyscan | on >> enable_indexscan | on >> enable_material | on >> enable_mergejoin | on >> enable_nestloop | on >> enable_seqscan | off >> enable_sort | on >> enable_tidscan | on >> escape_string_warning | on >> event_source | PostgreSQL >> exit_on_error | off >> external_pid_file | >> extra_float_digits | 0 >> from_collapse_limit | 8 >> fsync | on >> full_page_writes | on >> geqo | on >> geqo_effort | 5 >> geqo_generations | 0 >> geqo_pool_size | 0 >> geqo_seed | 0 >> geqo_selection_bias | 2 >> geqo_threshold | 12 >> gin_fuzzy_search_limit | 0 >> hba_file | /data01/postgres/pg_hba.conf >> hot_standby | off >> hot_standby_feedback | off >> huge_pages | try >> ident_file | /data01/postgres/pg_ident.conf >> ignore_checksum_failure | off >> ignore_system_indexes | off >> integer_datetimes | on >> IntervalStyle | postgres >> join_collapse_limit | 8 >> krb_caseins_users | off >> krb_server_keyfile | FILE:/etc/sysconfig/pgsql/krb5 >> .keytab >> lc_collate | en_US.UTF-8 >> lc_ctype | en_US.UTF-8 >> lc_messages | en_US.UTF-8 >> lc_monetary | en_US.UTF-8 >> lc_numeric | en_US.UTF-8 >> lc_time | en_US.UTF-8 >> listen_addresses | * >> local_preload_libraries | >> lock_timeout | 0 >> lo_compat_privileges | off >> log_autovacuum_min_duration | -1 >> log_checkpoints | off >> log_connections | on >> log_destination | csvlog >> log_directory | pg_log >> log_disconnections | on >> log_duration | off >> log_error_verbosity | default >> log_executor_stats | off >> log_file_mode | 0600 >> log_filename | postgresql-%d.log >> logging_collector | on >> log_hostname | off >> log_line_prefix | %t %d %u >> log_lock_waits | off >> log_min_duration_statement | 2000 >> log_min_error_statement | error >> log_min_messages | error >> log_parser_stats | off >> log_planner_stats | off >> log_rotation_age | 1440 >> log_rotation_size | 0 >> log_statement | none >> log_statement_stats | off >> log_temp_files | -1 >> log_timezone | America/St_Lucia >> log_truncate_on_rotation | on >> maintenance_work_mem | 131072 >> max_connections | 300 >> max_files_per_process | 1000 >> max_function_args | 100 >> max_identifier_length | 63 >> max_index_keys | 32 >> max_locks_per_transaction | 64 >> max_pred_locks_per_transaction | 64 >> max_prepared_transactions | 0 >> max_replication_slots | 0 >> max_stack_depth | 2048 >> max_standby_archive_delay | 30000 >> max_standby_streaming_delay | 30000 >> max_wal_senders | 0 >> max_worker_processes | 8 >> password_encryption | on >> port | 5432 >> post_auth_delay | 0 >> pre_auth_delay | 0 >> quote_all_identifiers | off >> random_page_cost | 4 >> restart_after_crash | on >> search_path | "$user",public,hstore,dblink,w >> ww_fdw >> segment_size | 131072 >> seq_page_cost | 1 >> server_encoding | UTF8 >> server_version | 9.4.8 >> server_version_num | 90408 >> session_preload_libraries | >> session_replication_role | origin >> shared_buffers | 262144 >> shared_preload_libraries | >> sql_inheritance | on >> ssl | off >> ssl_ca_file | >> ssl_cert_file | server.crt >> ssl_ciphers | HIGH:MEDIUM:+3DES:!aNULL >> ssl_crl_file | >> ssl_ecdh_curve | prime256v1 >> ssl_key_file | server.key >> ssl_prefer_server_ciphers | on >> ssl_renegotiation_limit | 0 >> standard_conforming_strings | on >> statement_timeout | 0 >> stats_temp_directory | pg_stat_tmp >> superuser_reserved_connections | 3 >> synchronize_seqscans | on >> synchronous_commit | on >> synchronous_standby_names | >> syslog_facility | local0 >> syslog_ident | postgres >> tcp_keepalives_count | 0 >> tcp_keepalives_idle | 0 >> tcp_keepalives_interval | 0 >> temp_buffers | 1024 >> temp_file_limit | -1 >> temp_tablespaces | >> TimeZone | America/St_Lucia >> timezone_abbreviations | Default >> trace_notify | off >> trace_recovery_messages | log >> trace_sort | off >> track_activities | on >> track_activity_query_size | 1024 >> track_counts | on >> track_functions | none >> track_io_timing | off >> transaction_deferrable | off >> transaction_isolation | read committed >> transaction_read_only | off >> transform_null_equals | off >> unix_socket_directories | /var/run/postgresql, /tmp >> unix_socket_group | >> unix_socket_permissions | 0777 >> update_process_title | on >> vacuum_cost_delay | 0 >> vacuum_cost_limit | 200 >> vacuum_cost_page_dirty | 20 >> vacuum_cost_page_hit | 1 >> vacuum_cost_page_miss | 10 >> vacuum_defer_cleanup_age | 0 >> vacuum_freeze_min_age | 50000000 >> vacuum_freeze_table_age | 150000000 >> vacuum_multixact_freeze_min_age | 5000000 >> vacuum_multixact_freeze_table_age | 150000000 >> wal_block_size | 8192 >> wal_buffers | 128 >> wal_keep_segments | 0 >> wal_level | hot_standby >> wal_log_hints | off >> wal_receiver_status_interval | 10 >> wal_receiver_timeout | 60000 >> wal_segment_size | 2048 >> wal_sender_timeout | 60000 >> wal_sync_method | fdatasync >> wal_writer_delay | 200 >> work_mem | 2048 >> xmlbinary | base64 >> xmloption | content >> zero_damaged_pages | off >> (239 rows) >> >> >> Any suggestions, thoughts? >> >> >> > > -- > Adrian Klaver > adrian.kla...@aklaver.com >