Hello everyone.
I am working on a database migration from old post PostgreSQL 10 hosted in a
Stolon in Kubernetes (yes I know, it's looking for troubles) to a managed AWS
RDS Postgresql 15 database.
The only potential solution I found to do this migration is to use the logical
replication, as we can't have much downtime and the database is quite big
(around 2TB).
I made many tests on multiple non production environments and had quite strange
results, on our staging platform for example I could not, so far, achieve a
complete migration, I had many issues of EOF during wal streaming, I then
tweaked the wal_sender_timeout and wal_received timeout when I tested the
migration on the production database and it looks like this part of the issue
is solved. (I will have to retest it on staging, I tested on production as
maybe only staging had this issue)
2 others environment were able to do the migration correctly (same schema,
without the WAL timeout tweaks), one being around 800GB, like staging and the
other much smaller, around 80GB).
Apart from the timeout issue that looks solved so far, I have a problem with
our biggest table (which is a link table for a many 2 many relation), it's the
one that were usually failing on staging and now also failing on the production
test migration.
There you have the information about the table in question :
back=> \d+ diagnostics_episodes
Table "public.diagnostics_episodes"
Column | Type | Collation | Nullable | Default | Storage | Stats
target | Description
---------------+---------+-----------+----------+---------+---------+--------------+-------------
diagnostic_id | bigint | | | | plain |
|
episode_id | bigint | | | | plain |
|
is_deleted | boolean | | not null | false | plain |
|
Indexes:
"diagnostics_episodes_unique" UNIQUE, btree (diagnostic_id, episode_id)
"index_diagnostics_episodes_on_episode_id" btree (episode_id)
Check constraints:
"diagnostics_episodes_diagnostic_id_null" CHECK (diagnostic_id IS NOT NULL)
"diagnostics_episodes_episode_id_null" CHECK (episode_id IS NOT NULL)
Foreign-key constraints:
"fk_rails_11c88d28cf" FOREIGN KEY (episode_id) REFERENCES episodes(id) ON
UPDATE CASCADE ON DELETE CASCADE
"fk_rails_5857fe47a7" FOREIGN KEY (diagnostic_id) REFERENCES
diagnostics(id) ON UPDATE CASCADE ON DELETE CASCADE
Replica Identity: FULL
Access method: heap
The table size is around 279 GB (from \d)
We had to use replica identity full on this table as setting the replica
identity to the unique index would require downtime to add the NOT NULL
constraints on the 2 FK columns.
And this is the current state of the replication, all the tables are in R
state, except this single table still stuck in F state, it looks like it is
also blocking the main replication slot of the logical replication ?
(I restarted the postgresql not long ago to check if it would unblock it, no
luck)
SUBSCRIBER:
back=> SELECT
pss.relid, pss.relid::regclass AS obj,
pss.pid, pss.latest_end_time, pg_size_pretty(pspc.bytes_processed) as
data_copied,
pspc.tuples_processed,
NOW() - pss.last_msg_receipt_time as age
FROM
pg_stat_subscription AS pss
LEFT JOIN
pg_stat_progress_copy AS pspc ON pss.pid = pspc.pid;
relid | obj | pid | latest_end_time |
data_copied | tuples_processed | age
-------+----------------------+------+-------------------------------+-------------+------------------+-----------------
| | 6346 | 2023-06-09 08:39:23.523942+00 |
| | 00:09:57.941278
16923 | diagnostics_episodes | 6347 | 2023-06-09 08:39:23.87963+00 |
| | 00:08:31.553164
(2 rows)
PUBLISHER:
production=# SELECT
prs.slot_name, prs.slot_type, prs.database, psa.wait_event,
psr.state, prs.temporary, prs.active, prs.active_pid, prs.restart_lsn,
prs.confirmed_flush_lsn
FROM
pg_replication_slots AS prs
LEFT JOIN
pg_stat_activity AS psa ON prs.active_pid = psa.pid
LEFT JOIN
pg_stat_replication AS psr ON psa.pid = psr.pid
WHERE
prs.slot_type = 'logical';
slot_name | slot_type | database |
wait_event | state | temporary | active | active_pid | restart_lsn |
confirmed_flush_lsn
-----------------------------------------+-----------+------------+--------------------+---------+-----------+--------+------------+--------------+---------------------
azure_to_aws | logical | production |
WalSenderWriteData | catchup | f | t | 21534 | A6F/1EF5C210 |
A6F/1F48F728
pg_17845_sync_16923_7225826535892257639 | logical | production |
WalSenderWriteData | catchup | f | t | 21537 | A6D/22E49068 |
A6D/22EDF188
(2 rows)
(the restart_lsn and confirmed_flush_lsn are not changing over time)
production=# SELECT
prs.slot_name, prs.active, psr.state, prs.active_pid,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),
prs.confirmed_flush_lsn)) AS diff_size
FROM
pg_replication_slots AS prs
LEFT JOIN pg_stat_replication AS psr ON prs.slot_name = psr.application_name
WHERE
prs.slot_type = 'logical';
slot_name | active | state | active_pid |
diff_size
-----------------------------------------+--------+---------+------------+-----------
azure_to_aws | t | catchup | 21534 | 2855
MB
pg_17845_sync_16923_7225826535892257639 | t | catchup | 21537 | 11 GB
(2 rows)
Something I can see on the monitoring of the subscriber is that it's doing a
lot of read activity (I can't get more info as it's on RDS so no strace,
etc..), I can see it's reading at around 400MBps non stop (and it stops if I
disable the subscription and terminate the process related to the replication),
but I don't see any write going thru.
On either side I don't see any error on the logs, on the subscriber I have the
debug5 level log and I don't see any transaction being applied either by the
replication worker.
On pg_stat_activity I see the logical replication worker of this table always
active, sometime doing IO:DateFileRead, the main replication worker however is
always "idle" waiting for IPC:LogicalFileSyncChange
I am not sure what do to next ? I have been trying to troubleshoot this for a
week but no luck, I also got some help on the PostgreSQL slack server but we
are not 100% sure of what could be the issue either (
https://postgresteam.slack.com/archives/C0FS3UTAP/p1686041198195869 )
On the loadtest environment I first migrated, I saw it had the NOT NULL
constraint set on the columns of the table (but still with the REPLICA IDENTITY
FULL) and it worked, I tried to remove the NOT NULL constraint and do again the
migration and it still worked, so I am not sure it is related to that on the
other environments?
What could be the other causes of this ?
Thank you.