pg_basebackup -F t fails when fsync spends more time than tcp_user_timeout

[email protected] Sun, 01 Sep 2019 21:43:53 -0700

Hi


pg_basebackup -F t fails when fsync() takes longer than tcp_user_timeout in
the following environment.

[Environment]
Postgres 13dev (master branch)
Red Hat Enterprise Linux 7.4

[Error]
$ pg_basebackup -F t --progress --verbose -h <hostname> -D <directory>
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/5A000060 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_15647"
pg_basebackup: error: could not read COPY data: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

[Analysis]
- pg_basebackup -F t creates a tar file for each tablespace and calls fsync()
  on each one.
  (In contrast, -F p calls fsync() only once, at the end.)
- While pg_basebackup is in fsync() for the tar file of one tablespace, the
  WAL sender is already sending the contents of the next tablespace.
  pg_basebackup cannot receive data during fsync(), so when fsync() takes a
  long time, its TCP receive buffer fills up and its socket returns
  "zero window" packets to the WAL sender.
- The WAL sender's socket keeps retrying the send, but resets the connection
  once tcp_user_timeout has elapsed.
  After the WAL sender resets the connection, pg_basebackup cannot receive
  any more data and fails with the error above.
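
To illustrate the timeout at the socket level: on Linux, tcp_user_timeout
corresponds to the TCP_USER_TIMEOUT socket option, which limits how long
sent data may stay unacknowledged (or stay unsent because of a zero window)
before the kernel closes the connection. The following is only a minimal
Python sketch of that option, not PostgreSQL code; set_tcp_user_timeout is a
hypothetical helper name, and the 5000 ms value is an arbitrary example.

```python
import socket


def set_tcp_user_timeout(sock, timeout_ms):
    """Set TCP_USER_TIMEOUT on sock and return the value read back.

    Returns None on platforms where the option is not exposed
    (it is Linux-specific).
    """
    opt = getattr(socket, "TCP_USER_TIMEOUT", None)
    if opt is None:
        return None
    # Timeout is given in milliseconds; 0 means "use the system default".
    sock.setsockopt(socket.IPPROTO_TCP, opt, timeout_ms)
    return sock.getsockopt(socket.IPPROTO_TCP, opt)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        print(set_tcp_user_timeout(s, 5000))
    finally:
        s.close()
```

With such a timeout on the sending side, a receiver that stops reading for
longer than the timeout (as pg_basebackup does during a long fsync()) can
cause the sender to drop the connection.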

[Solution]
I think calling fsync() for each tablespace is not necessary.
As with pg_basebackup -F p, I think fsync() is needed only once, at the end.
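
The difference between the two strategies can be sketched as follows. This
is a hedged Python illustration of the idea only, not how pg_basebackup is
implemented; write_archives, fsync_path, and the defer_sync flag are
hypothetical names, and opening a directory for fsync() assumes a POSIX
system.

```python
import os
import tempfile


def fsync_path(path):
    """Flush one file or directory to stable storage."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def write_archives(directory, archives, defer_sync):
    """Write every (name, payload) pair into directory.

    defer_sync=False: fsync after each file, so the writer stalls
                      between files while the sender keeps sending.
    defer_sync=True:  write everything first, then fsync all files
                      once at the end.
    Returns the paths in the order they were fsynced.
    """
    written, synced = [], []
    for name, payload in archives:
        path = os.path.join(directory, name)
        with open(path, "wb") as f:
            f.write(payload)
        written.append(path)
        if not defer_sync:
            fsync_path(path)
            synced.append(path)
    if defer_sync:
        for path in written:
            fsync_path(path)
            synced.append(path)
    fsync_path(directory)  # persist the directory entries as well
    return synced


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        files = [("base.tar", b"a" * 1024), ("16384.tar", b"b" * 1024)]
        print(write_archives(d, files, defer_sync=True))
```

With defer_sync=True, no fsync() happens while data for later files is
still arriving, so the receiver never pauses in the middle of the stream.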


Do you have any comments?


Regards,
Ryohei Takahashi
