Logical replication slots stuck in catchup on a very large table

Mathieu Poussin Fri, 09 Jun 2023 02:05:49 -0700

Hello everyone.

I am working on a database migration from old post PostgreSQL 10 hosted in a 
Stolon in Kubernetes (yes I know, it's looking for troubles) to a managed AWS 
RDS Postgresql 15 database.


The only potential solution I found to do this migration is to use the logical 
replication, as we can't have much downtime and the database is quite big 
(around 2TB).

I made many tests on multiple non production environments and had quite strange 
results, on our staging platform for example I could not, so far, achieve a 
complete migration, I had many issues of EOF during wal streaming, I then 
tweaked the wal_sender_timeout and wal_received timeout when I tested the 
migration on the production database and it looks like this part of the issue 
is solved. (I will have to retest it on staging, I tested on production as 
maybe only staging had this issue)
2 others environment were able to do the migration correctly (same schema, 
without the WAL timeout tweaks), one being around 800GB, like staging and the 
other much smaller, around 80GB).

Apart from the timeout issue that looks solved so far, I have a problem with 
our biggest table (which is a link table for a many 2 many relation), it's the 
one that were usually failing on staging and now also failing on the production 
test migration.

There you have the information about the table in question :

back=> \d+ diagnostics_episodes
                               Table "public.diagnostics_episodes"
    Column     |  Type   | Collation | Nullable | Default | Storage | Stats 
target | Description
---------------+---------+-----------+----------+---------+---------+--------------+-------------
 diagnostic_id | bigint  |           |          |         | plain   |           
   |
 episode_id    | bigint  |           |          |         | plain   |           
   |
 is_deleted    | boolean |           | not null | false   | plain   |           
   |
Indexes:
    "diagnostics_episodes_unique" UNIQUE, btree (diagnostic_id, episode_id)
    "index_diagnostics_episodes_on_episode_id" btree (episode_id)
Check constraints:
    "diagnostics_episodes_diagnostic_id_null" CHECK (diagnostic_id IS NOT NULL)
    "diagnostics_episodes_episode_id_null" CHECK (episode_id IS NOT NULL)
Foreign-key constraints:
    "fk_rails_11c88d28cf" FOREIGN KEY (episode_id) REFERENCES episodes(id) ON 
UPDATE CASCADE ON DELETE CASCADE
    "fk_rails_5857fe47a7" FOREIGN KEY (diagnostic_id) REFERENCES 
diagnostics(id) ON UPDATE CASCADE ON DELETE CASCADE
Replica Identity: FULL
Access method: heap

The table size is around 279 GB (from \d)

We had to use replica identity full on this table as setting the replica 
identity to the unique index would require downtime to add the NOT NULL 
constraints on the 2 FK columns.

And this is the current state of the replication, all the tables are in R 
state, except this single table still stuck in F state, it looks like it is 
also blocking the main replication slot of the logical replication ?
(I restarted the postgresql not long ago to check if it would unblock it, no 
luck)

SUBSCRIBER: 

back=> SELECT
  pss.relid, pss.relid::regclass AS obj,
  pss.pid, pss.latest_end_time, pg_size_pretty(pspc.bytes_processed) as 
data_copied,
  pspc.tuples_processed,
  NOW() - pss.last_msg_receipt_time as age
FROM
  pg_stat_subscription AS pss
LEFT JOIN
  pg_stat_progress_copy AS pspc ON pss.pid = pspc.pid;
 relid |         obj          | pid  |        latest_end_time        | 
data_copied | tuples_processed |       age
-------+----------------------+------+-------------------------------+-------------+------------------+-----------------
       |                      | 6346 | 2023-06-09 08:39:23.523942+00 |          
   |                  | 00:09:57.941278
 16923 | diagnostics_episodes | 6347 | 2023-06-09 08:39:23.87963+00  |          
   |                  | 00:08:31.553164
(2 rows)


PUBLISHER:

production=# SELECT
  prs.slot_name, prs.slot_type, prs.database, psa.wait_event,
  psr.state, prs.temporary, prs.active, prs.active_pid, prs.restart_lsn,
  prs.confirmed_flush_lsn
FROM
  pg_replication_slots AS prs
LEFT JOIN
  pg_stat_activity AS psa ON prs.active_pid = psa.pid
LEFT JOIN
  pg_stat_replication AS psr ON psa.pid = psr.pid
WHERE
  prs.slot_type = 'logical';
                slot_name                | slot_type |  database  |     
wait_event     |  state  | temporary | active | active_pid | restart_lsn  | 
confirmed_flush_lsn
-----------------------------------------+-----------+------------+--------------------+---------+-----------+--------+------------+--------------+---------------------
 azure_to_aws                            | logical   | production | 
WalSenderWriteData | catchup | f         | t      |      21534 | A6F/1EF5C210 | 
A6F/1F48F728
 pg_17845_sync_16923_7225826535892257639 | logical   | production | 
WalSenderWriteData | catchup | f         | t      |      21537 | A6D/22E49068 | 
A6D/22EDF188
(2 rows)

(the restart_lsn and confirmed_flush_lsn are not changing over time)

production=# SELECT
  prs.slot_name, prs.active, psr.state, prs.active_pid,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),
  prs.confirmed_flush_lsn)) AS diff_size
FROM
  pg_replication_slots AS prs
LEFT JOIN pg_stat_replication AS psr ON prs.slot_name = psr.application_name
WHERE
  prs.slot_type = 'logical';
                slot_name                | active |  state  | active_pid | 
diff_size
-----------------------------------------+--------+---------+------------+-----------
 azure_to_aws                            | t      | catchup |      21534 | 2855 
MB
 pg_17845_sync_16923_7225826535892257639 | t      | catchup |      21537 | 11 GB
(2 rows)


Something I can see on the monitoring of the subscriber is that it's doing a 
lot of read activity (I can't get more info as it's on RDS so no strace, 
etc..), I can see it's reading at around 400MBps non stop (and it stops if I 
disable the subscription and terminate the process related to the replication), 
but I don't see any write going thru.
On either side I don't see any error on the logs, on the subscriber I have the 
debug5 level log and I don't see any transaction being applied either by the 
replication worker.
On pg_stat_activity I see the logical replication worker of this table always 
active, sometime doing IO:DateFileRead, the main replication worker however is 
always "idle" waiting for IPC:LogicalFileSyncChange

I am not sure what do to next ? I have been trying to troubleshoot this for a 
week but no luck, I also got some help on the PostgreSQL slack server but we 
are not 100% sure of what could be the issue either ( 
https://postgresteam.slack.com/archives/C0FS3UTAP/p1686041198195869 )
On the loadtest environment I first migrated, I saw it had the NOT NULL 
constraint set on the columns of the table (but still with the REPLICA IDENTITY 
FULL) and it worked, I tried to remove the NOT NULL constraint and do again the 
migration and it still worked, so I am not sure it is related to that on the 
other environments?
What could be the other causes of this ?

Thank you.

Logical replication slots stuck in catchup on a very large table

Reply via email to