my-ship-it opened a new issue, #1654:
URL: https://github.com/apache/cloudberry/issues/1654

   ## Summary
   
   `gprecoverseg -F` (and `gpaddmirrors`) runs `pg_basebackup` twice per 
segment when `internal_wal_replication_slot` does not exist on the primary. The 
first attempt completes the full data copy but fails at the WAL streaming 
phase, causing `pg_basebackup` to remove the entire data directory. A second 
attempt with `--create-slot` then starts the full copy from scratch.
   
   For large segments (e.g. ~1TB), this effectively doubles the recovery time 
and I/O.
   
   Reported in #1648.
   
   ## Root Cause
   
   In `gpMgmt/sbin/gpsegrecovery.py`, `FullRecovery.run()` uses a two-attempt 
strategy:
   
   1. First attempt: `pg_basebackup --slot internal_wal_replication_slot` 
(without `--create-slot`), assuming the slot exists.
   2. If it fails, second attempt: `pg_basebackup --create-slot --slot 
internal_wal_replication_slot`.
   
   The assumption was that the first attempt would "fail quickly" if the slot 
doesn't exist. However, the slot check only happens during `START_REPLICATION` 
(WAL streaming phase) — **after** the full data copy is already complete. The 
same issue exists in `gpMgmt/bin/lib/gpconfigurenewsegment`.
   
   There is an existing `GPDB_12_MERGE_FIXME` comment in the code acknowledging 
this:
   
   ```python
   #  GPDB_12_MERGE_FIXME could we check it before? or let
   #  pg_basebackup create slot if not exists.
   ```
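
To make the cost concrete, here is a toy model of the two-attempt flow described above. It is not the code in `gpsegrecovery.py`; `fake_basebackup`, `two_attempt_recovery` and `SlotMissing` are hypothetical names, and the "copy" is just a counter of gigabytes transferred.

```python
class SlotMissing(Exception):
    """Raised when WAL streaming is requested on a slot that does not exist."""


def fake_basebackup(primary_slots, slot, size_gb, create_slot=False):
    """Toy stand-in for pg_basebackup: returns GB copied or raises SlotMissing."""
    if create_slot:
        primary_slots.add(slot)    # --create-slot creates the slot before the backup
    copied = size_gb               # the full data copy happens first
    if slot not in primary_slots:  # the slot is only looked up when WAL streaming starts
        raise SlotMissing(copied)  # the copied data is discarded at this point
    return copied


def two_attempt_recovery(primary_slots, slot, size_gb):
    """Mimic the try-without-slot, then retry-with---create-slot strategy."""
    total_gb = 0
    try:
        total_gb += fake_basebackup(primary_slots, slot, size_gb)
    except SlotMissing as exc:
        total_gb += exc.args[0]    # work wasted by the failed first attempt
        total_gb += fake_basebackup(primary_slots, slot, size_gb, create_slot=True)
    return total_gb
```

In this model a 1000 GB segment with no slot on the primary transfers 2000 GB in total, while the same segment with the slot already present transfers 1000 GB.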
   
   ## Proposed Fix
   
   Before running `pg_basebackup`, check whether 
`internal_wal_replication_slot` exists on the primary by querying 
`pg_replication_slots` over a replication or utility-mode connection, and 
create the slot if it is missing. With the slot guaranteed to exist, the first 
`pg_basebackup` attempt no longer fails at the WAL streaming phase, which 
avoids the costly retry.
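
A minimal sketch of such a pre-check, assuming a DB-API 2.0 style connection to the primary whose driver accepts `%s` placeholders (as psycopg2 does). The helper name `ensure_physical_slot` is hypothetical and this is not the actual patch; the real tools may use a different connection layer.

```python
SLOT_NAME = "internal_wal_replication_slot"


def ensure_physical_slot(conn, slot_name=SLOT_NAME):
    """Create slot_name on the primary if it is missing.

    conn is assumed to be a DB-API 2.0 connection to the primary
    (hypothetical helper, not part of gpMgmt). Returns True if the
    slot was created, False if it already existed.
    """
    cur = conn.cursor()
    try:
        # pg_replication_slots lists every slot on this instance.
        cur.execute(
            "SELECT 1 FROM pg_replication_slots WHERE slot_name = %s",
            (slot_name,),
        )
        if cur.fetchone() is not None:
            return False
        cur.execute(
            "SELECT pg_create_physical_replication_slot(%s)",
            (slot_name,),
        )
        conn.commit()  # end the transaction the driver opened
        return True
    finally:
        cur.close()
```

With the slot in place, `pg_basebackup --slot internal_wal_replication_slot` could be run once without `--create-slot`, so the fallback attempt would no longer be needed for the missing-slot case.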
   
   ## Affected Files
   
   - `gpMgmt/sbin/gpsegrecovery.py` — `FullRecovery.run()`
   - `gpMgmt/bin/lib/gpconfigurenewsegment` — `ConfExpSegCmd.run()`
   
   ## Workaround
   
   Manually create the replication slot on each primary before running 
`gprecoverseg`:
   
   ```bash
   PGOPTIONS='-c gp_role=utility' psql -h <primary_host> -p <primary_port> \
     -d postgres -c \
     "SELECT pg_create_physical_replication_slot('internal_wal_replication_slot');"
   ```
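
Since the workaround has to be repeated on each primary, a small helper can generate the command lines. This is a sketch only: `slot_commands` is a hypothetical name, and the host/port pairs in the usage note are placeholders to be replaced with the cluster's real primaries.

```python
import shlex

SLOT_NAME = "internal_wal_replication_slot"


def slot_commands(primaries, slot_name=SLOT_NAME):
    """Build one shell command per (host, port) primary.

    Each command is the psql invocation from the workaround above,
    quoted so it can be pasted into a POSIX shell.
    """
    sql = f"SELECT pg_create_physical_replication_slot('{slot_name}');"
    commands = []
    for host, port in primaries:
        argv = ["psql", "-h", host, "-p", str(port), "-d", "postgres", "-c", sql]
        env = "PGOPTIONS=" + shlex.quote("-c gp_role=utility")
        commands.append(env + " " + shlex.join(argv))
    return commands
```

For example, `print("\n".join(slot_commands([("sdw1", 6000), ("sdw2", 6000)])))` prints one command per placeholder primary. Note that `pg_create_physical_replication_slot` raises an error if the slot already exists, so the generated command is only appropriate for primaries where the slot is missing.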


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

