Re: Postgres 10, slave not catching up with master

Boris Sagadin Tue, 23 Oct 2018 09:29:52 -0700

Nothing special, just:

standby_mode = 'on'
primary_conninfo = 'host=...  user=repmgr application_name=nodex'
recovery_target_timeline = 'latest'



Boris

On Tue, Oct 23, 2018 at 3:10 PM, Hellmuth Vargas <[email protected]> wrote:

> Hi
>
> can share recovery.conf file settings??
>
> El mar., 23 de oct. de 2018 a la(s) 00:28, Boris Sagadin (
> [email protected]) escribió:
>
>> Yes, turning wal_compression off improves things. Slave that was
>> mentioned unfortunately lagged too much before this setting was applied and
>> was turned off. However the remaining slave lags less now, although still
>> occasionally up to a few minutes. I think single threadedness of recovery
>> is a big slowdown for write heavy databases. Maybe an option to increase
>> wal_size beyond 16MB in v11 will help.
>>
>> In the meantime we'll solve this by splitting the DB to 2 or 3 clusters
>> or maybe trying out some sharding solution like Citus.
>>
>>
>> Boris
>>
>> On Sun, Oct 21, 2018 at 9:06 AM, Boris Sagadin <[email protected]>
>> wrote:
>>
>>> Hello,
>>>
>>> I have a database running on i3.8xlarge (256GB RAM, 32 CPU cores, 4x
>>> 1.9TB NVMe drive) AWS instance with about 5TB of disk space occupied, ext4,
>>> Ubuntu 16.04.
>>>
>>> Multi-tenant DB with about 40000 tables, insert heavy.
>>>
>>> I started a new slave with identical HW specs, SR. DB started syncing
>>> from master, which took about 4 hours, then it started applying the WALs.
>>> However, it seems it can't catch up. Delay is still around 3 hours
>>> (measured with now() - pg_last_xact_replay_timestamp()), even a day
>>> later. It goes a few 100s up and down, but it seems to float around 3h mark.
>>>
>>> Disk IO is low at about 10%, measured with iostat, no connected clients,
>>> recovery process is at around 90% CPU single core usage.
>>>
>>> Tried tuning the various parameters, but with no avail. Only thing I
>>> found suspicious is stracing the recovery process constantly produces many
>>> errors such as:
>>>
>>> lseek(428, 0, SEEK_END)                 = 780124160
>>> lseek(30, 0, SEEK_END)                  = 212992
>>> read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource
>>> temporarily unavailable)
>>> lseek(680, 0, SEEK_END)                 = 493117440
>>> read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource
>>> temporarily unavailable)
>>> lseek(774, 0, SEEK_END)                 = 583368704
>>>
>>> ...[snip]...
>>>
>>> read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource
>>> temporarily unavailable)
>>> lseek(774, 0, SEEK_END)                 = 583368704
>>> read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource
>>> temporarily unavailable)
>>> lseek(277, 0, SEEK_END)                 = 502882304
>>> lseek(6, 516096, SEEK_SET)              = 516096
>>> read(6, 
>>> "\227\320\5\0\1\0\0\0\0\340\7\246\26\274\0\0\315\0\0\0\0\0\0\0}\0178\5&/\260\r"...,
>>> 8192) = 8192
>>> read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource
>>> temporarily unavailable)
>>> lseek(735, 0, SEEK_END)                 = 272809984
>>> read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource
>>> temporarily unavailable)
>>> lseek(277, 0, SEEK_END)                 = 502882304
>>>
>>> ls -l fd/9
>>> lr-x------ 1 postgres postgres 64 Oct 21 06:21 fd/9 -> pipe:[46358]
>>>
>>>
>>> Perf top on recovery produces:
>>>
>>>  27.76%  postgres            [.] pglz_decompress
>>>    9.90%  [kernel]            [k] entry_SYSCALL_64_after_swapgs
>>>    7.09%  postgres            [.] hash_search_with_hash_value
>>>    4.26%  libpthread-2.23.so  [.] llseek
>>>    3.64%  libpthread-2.23.so  [.] __read_nocancel
>>>    2.80%  [kernel]            [k] __fget_light
>>>    2.67%  postgres            [.] 0x000000000034d3ba
>>>    1.85%  [kernel]            [k] ext4_llseek
>>>    1.84%  postgres            [.] pg_comp_crc32c_sse42
>>>    1.44%  postgres            [.] hash_any
>>>    1.35%  postgres            [.] 0x000000000036afad
>>>    1.29%  postgres            [.] MarkBufferDirty
>>>    1.21%  postgres            [.] XLogReadRecord
>>> [...]
>>>
>>> Tried changing the process limits with prlimit to unlimited, but no
>>> change.
>>>
>>> I can turn off the WAL compression but I doubt this is the main culprit.
>>> Any ideas appreciated.
>>>
>>> Regards,
>>> Boris
>>>
>>>
>>
>
> --
> Cordialmente,
>
> Ing. Hellmuth I. Vargas S.
> Esp. Telemática y Negocios por Internet
> Oracle Database 10g Administrator Certified Associate
> EnterpriseDB Certified PostgreSQL 9.3 Associate
>
>

Re: Postgres 10, slave not catching up with master

Reply via email to