my-ship-it opened a new issue, #1654:
URL: https://github.com/apache/cloudberry/issues/1654
## Summary
`gprecoverseg -F` (and `gpaddmirrors`) runs `pg_basebackup` twice per
segment when `internal_wal_replication_slot` does not exist on the primary. The
first attempt completes the full data copy but fails at the WAL streaming
phase, causing `pg_basebackup` to remove the entire data directory. A second
attempt with `--create-slot` then starts the full copy from scratch.
For large segments (e.g. ~1TB), this effectively doubles the recovery time
and I/O.
Reported in #1648.
## Root Cause
In `gpMgmt/sbin/gpsegrecovery.py`, `FullRecovery.run()` uses a two-attempt
strategy:
1. First attempt: `pg_basebackup --slot internal_wal_replication_slot`
(without `--create-slot`), assuming the slot exists.
2. If it fails, second attempt: `pg_basebackup --create-slot --slot
internal_wal_replication_slot`.
The assumption was that the first attempt would "fail quickly" if the slot
doesn't exist. However, the slot check only happens during `START_REPLICATION`
(WAL streaming phase) — **after** the full data copy is already complete. The
same issue exists in `gpMgmt/bin/lib/gpconfigurenewsegment`.
There is an existing `GPDB_12_MERGE_FIXME` comment in the code acknowledging
this:
```python
# GPDB_12_MERGE_FIXME could we check it before? or let
# pg_basebackup create slot if not exists.
```
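To make the cost concrete, here is a toy model of the control flow described above. It is a simplified sketch, not the actual code in `gpsegrecovery.py`: `FakePrimary` and `two_attempt_recovery` are hypothetical names, and the model only captures the ordering (full copy first, slot check at streaming time).

```python
class FakePrimary:
    """Toy stand-in for a primary segment; counts full data copies."""

    def __init__(self, slot_exists=False):
        self.slot_exists = slot_exists
        self.full_copies = 0

    def basebackup(self, create_slot):
        """Mimic the ordering described above: copy first, slot check last."""
        if create_slot:
            # Models --create-slot: the slot exists once this attempt runs.
            self.slot_exists = True
        # The full data copy happens before the slot is ever looked at.
        self.full_copies += 1
        # The attempt only succeeds if the slot exists at WAL streaming time.
        return self.slot_exists


def two_attempt_recovery(primary):
    """Try without --create-slot, then retry with it on failure."""
    if primary.basebackup(create_slot=False):
        return True
    return primary.basebackup(create_slot=True)


missing = FakePrimary(slot_exists=False)
two_attempt_recovery(missing)
print(missing.full_copies)  # 2: the data is copied twice

present = FakePrimary(slot_exists=True)
two_attempt_recovery(present)
print(present.full_copies)  # 1
```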
## Proposed Fix
Before running `pg_basebackup`, check whether
`internal_wal_replication_slot` exists on the primary (by querying
`pg_replication_slots` over a replication or utility-mode connection), and
create it if it is missing. The first `pg_basebackup` attempt then no longer
fails for lack of the slot, which avoids the costly retry.
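A minimal sketch of such a pre-check, for discussion only. It assumes `conn` is a DB-API 2.0 connection to the primary that uses the `%s` parameter style (as psycopg2 does); `ensure_replication_slot` is a hypothetical helper name, not an existing function in `gpMgmt`. How the connection is opened, and how commit/autocommit is handled, is left to the caller.

```python
SLOT_NAME = "internal_wal_replication_slot"


def ensure_replication_slot(conn, slot_name=SLOT_NAME):
    """Create the physical slot if it is missing.

    Returns True if the slot was created, False if it already existed.
    `conn` is assumed to be a DB-API 2.0 connection to the primary.
    """
    cur = conn.cursor()
    try:
        # Look the slot up in the pg_replication_slots system view.
        cur.execute(
            "SELECT 1 FROM pg_replication_slots WHERE slot_name = %s",
            (slot_name,),
        )
        if cur.fetchone() is not None:
            return False
        # Slot is missing: create it before pg_basebackup is started.
        cur.execute(
            "SELECT pg_create_physical_replication_slot(%s)",
            (slot_name,),
        )
        return True
    finally:
        cur.close()
```

With a helper like this called first, `pg_basebackup` could presumably be invoked once with `--slot` only. The check and the create are two separate statements, so a concurrent creator could still race with it; a real fix would need to decide how to handle that error.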
## Affected Files
- `gpMgmt/sbin/gpsegrecovery.py` — `FullRecovery.run()`
- `gpMgmt/bin/lib/gpconfigurenewsegment` — `ConfExpSegCmd.run()`
## Workaround
Manually create the replication slot on each primary before running
`gprecoverseg`:
```bash
PGOPTIONS='-c gp_role=utility' psql -h <primary_host> -p <primary_port> -d postgres \
  -c "SELECT pg_create_physical_replication_slot('internal_wal_replication_slot');"
```
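For clusters with many primaries, the same workaround can be scripted. The sketch below wraps the `psql` invocation above; `build_psql_command` and `create_slots` are hypothetical helper names, and the `(host, port)` list is assumed to be supplied by the operator. Since `pg_create_physical_replication_slot` raises an error when the slot already exists, such primaries would likely show up in the returned failure list and can be ignored.

```python
import os
import subprocess

SLOT_SQL = (
    "SELECT pg_create_physical_replication_slot"
    "('internal_wal_replication_slot');"
)


def build_psql_command(host, port):
    """Build the psql argv for one primary (same flags as the workaround)."""
    return ["psql", "-h", host, "-p", str(port), "-d", "postgres", "-c", SLOT_SQL]


def create_slots(primaries, runner=subprocess.run):
    """Run the slot-creation command on each (host, port) primary.

    Returns the list of primaries for which psql exited non-zero.
    `runner` is injectable so the loop can be exercised without psql.
    """
    # Utility mode, as in the manual workaround.
    env = dict(os.environ, PGOPTIONS="-c gp_role=utility")
    failed = []
    for host, port in primaries:
        result = runner(build_psql_command(host, port), env=env)
        if result.returncode != 0:
            failed.append((host, port))
    return failed
```

Example call, with placeholder hosts: `create_slots([("sdw1", 6000), ("sdw2", 6000)])`.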
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]