Hi Rahila,

On Thu, Sep 25, 2025 at 12:02 PM Rahila Syed <[email protected]> wrote:
> Hi,
>
> Please find attached a POC patch that introduces changes to the WAL sender
> and receiver, allowing WAL records to be sent to standbys before they are
> flushed to disk on the primary during physical replication. This is
> intended to improve replication latency by reducing the amount of WAL read
> from disk.
> For large transactions, this approach ensures that the bulk of the
> transaction’s WAL records are already sent to the standby before the flush
> occurs on the primary.
> As a result, the flush on the primary and standby happen closer together,
> reducing replication lag.

At a high level, the idea LGTM.

> Observations from the benchmark:
> 1. The patch improves TPS by ~13% in the sync replication setup. In
> repeated runs, I see that the TPS increase is anywhere between 5% to 13%.
> 2. WAL sender reads significantly less WAL from disk, indicating more
> efficient use of WAL buffers and reduced disk I/O

Could you please measure the improvement in transaction commit latency as
well?

Commit latency = Primary_Disk_Flush_time + Standby_disk_flush_time +
network_roundtrip_time

> Following are some of the details of the implementation:
>
> 1. Primary does not wait for flush before starting to send data, so it is
> likely to send smaller chunks of data. To prevent network overload, changes
> are made to avoid sending excessively small packets.
> 2. The sender includes the current flush pointer in the replication
> protocol messages, so the standby knows up to which point WAL has been
> safely flushed on the primary.
> 3. The logic ensures that standbys do not apply transactions that have not
> been flushed on the primary, by updating the flushedUpto position on the
> standby only up to the flushPtr received from the primary.
> 4. WAL records received from the primary are written and can be flushed to
> disk on the standby, but are only marked as flushed up to the flushPtr
> reported by the primary.
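
To check my reading of points 2-4 above, here is a minimal sketch of the
standby-side rule. It is Python rather than the patch's C, LSNs are plain
integers, and StandbyWalState, on_wal_message, and primary_flush_ptr are
hypothetical names for illustration, not identifiers from the patch.

```python
from dataclasses import dataclass


@dataclass
class StandbyWalState:
    """Hypothetical standby-side bookkeeping; LSNs are plain integers."""

    written_upto: int = 0  # WAL received from the primary and written locally
    flushed_upto: int = 0  # position the standby treats as flushed/replayable

    def on_wal_message(self, end_lsn: int, primary_flush_ptr: int) -> None:
        # WAL may arrive ahead of the primary's flush, so track all of it
        # as written on the standby.
        self.written_upto = max(self.written_upto, end_lsn)
        # Assumed rule from points 3 and 4: mark as flushed only the part
        # that the primary reports as flushed, and never move backwards.
        candidate = min(self.written_upto, primary_flush_ptr)
        self.flushed_upto = max(self.flushed_upto, candidate)

    def ahead_of_primary_flush(self) -> int:
        # Bytes present on the standby that the primary has not yet
        # reported as flushed.
        return self.written_upto - self.flushed_upto


state = StandbyWalState()
state.on_wal_message(end_lsn=1000, primary_flush_ptr=600)
print(state.flushed_upto, state.ahead_of_primary_flush())  # prints: 600 400
```

In this model written_upto can sit ahead of flushed_upto, which means the
standby holds WAL locally that the primary has not yet flushed.
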

What happens in crash recovery scenarios? For example, when a standby
crashes and restarts, it replays until the end of WAL. In this case, it may
end up replaying WAL that was never flushed on the primary (if the primary
itself goes through crash recovery).

Shouldn't archiving on the standby also avoid uploading WAL before that WAL
has been flushed on the primary? The same applies to pg_receivewal.

Thanks,
Satya
