Yes, the times are identical on both servers, set to UTC, and ntpd is used.

 log_delay
-----------
 15.788175

This is the delay at this moment, but we graph replication delay and it's
fluctuating between 0 and 30 s. Before I turned off wal_compression, the lag
was much bigger (anywhere from 0 up to 8 minutes). We have a lot of tables
(40k) and many upserts.
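
For reference, the same lag can also be tracked in bytes instead of seconds;
a minimal sketch, assuming PostgreSQL 10 or newer (the replay_lag_bytes alias
is only illustrative):

SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                       pg_last_wal_replay_lsn()) AS replay_lag_bytes;

Unlike the timestamp-based measure, the byte difference stays at zero when
the primary is idle, so the two numbers complement each other.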
Boris

On Tue, Oct 23, 2018 at 8:24 PM, Hellmuth Vargas <[email protected]> wrote:

> Hi
>
> Are both servers configured with the same date, time and time zone
> settings?
>
> On Tue, Oct 23, 2018 at 1:16 PM, Hellmuth Vargas <[email protected]>
> wrote:
>
>> Hi
>>
>> What result do you get from the following query?
>>
>> SELECT CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
>>             THEN 0
>>             ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
>>        END AS log_delay;
>>
>> Source:
>>
>> https://severalnines.com/blog/postgresql-streaming-replication-deep-dive
>>
>> On Tue, Oct 23, 2018 at 11:28 AM, Boris Sagadin <[email protected]>
>> wrote:
>>
>>> Nothing special, just:
>>>
>>> standby_mode = 'on'
>>> primary_conninfo = 'host=... user=repmgr application_name=nodex'
>>> recovery_target_timeline = 'latest'
>>>
>>> Boris
>>>
>>> On Tue, Oct 23, 2018 at 3:10 PM, Hellmuth Vargas <[email protected]>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> Can you share your recovery.conf file settings?
>>>>
>>>> On Tue, Oct 23, 2018 at 12:28 AM, Boris Sagadin <[email protected]>
>>>> wrote:
>>>>
>>>>> Yes, turning wal_compression off improves things. The slave mentioned
>>>>> earlier unfortunately lagged too much before this setting was applied
>>>>> and was turned off. However, the remaining slave lags less now,
>>>>> although still occasionally up to a few minutes. I think the
>>>>> single-threadedness of recovery is a big slowdown for write-heavy
>>>>> databases. Maybe the option in v11 to increase the WAL segment size
>>>>> beyond 16MB will help.
>>>>>
>>>>> In the meantime we'll solve this by splitting the DB into 2 or 3
>>>>> clusters, or maybe by trying out a sharding solution like Citus.
>>>>>
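>>>>> For reference, wal_compression is a reload-only setting, so flipping
>>>>> it needs no restart. A minimal sketch of how it can be turned off on
>>>>> the primary, assuming superuser access:
>>>>>
>>>>> ALTER SYSTEM SET wal_compression = off;
>>>>> SELECT pg_reload_conf();
>>>>>
>>>>> The compression is applied where the WAL is generated, so the setting
>>>>> belongs on the primary; the decompression cost then shows up on the
>>>>> standby during replay.
>>>>>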
>>>>> Boris
>>>>>
>>>>> On Sun, Oct 21, 2018 at 9:06 AM, Boris Sagadin <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have a database running on an i3.8xlarge (256GB RAM, 32 CPU cores,
>>>>>> 4x 1.9TB NVMe drives) AWS instance with about 5TB of disk space
>>>>>> occupied, ext4, Ubuntu 16.04.
>>>>>>
>>>>>> It is a multi-tenant DB with about 40,000 tables, insert heavy.
>>>>>>
>>>>>> I started a new slave with identical HW specs, using streaming
>>>>>> replication. The DB started syncing from the master, which took about
>>>>>> 4 hours, then it started applying the WALs. However, it seems it
>>>>>> can't catch up. The delay is still around 3 hours (measured with
>>>>>> now() - pg_last_xact_replay_timestamp()), even a day later. It goes a
>>>>>> few hundred seconds up and down, but seems to float around the 3h
>>>>>> mark.
>>>>>>
>>>>>> Disk IO is low at about 10% (measured with iostat), there are no
>>>>>> connected clients, and the recovery process sits at around 90% CPU
>>>>>> usage on a single core.
>>>>>>
>>>>>> I tried tuning various parameters, but to no avail. The only thing I
>>>>>> found suspicious is that stracing the recovery process constantly
>>>>>> produces many errors such as:
>>>>>>
>>>>>> lseek(428, 0, SEEK_END)    = 780124160
>>>>>> lseek(30, 0, SEEK_END)     = 212992
>>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>> lseek(680, 0, SEEK_END)    = 493117440
>>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>> lseek(774, 0, SEEK_END)    = 583368704
>>>>>>
>>>>>> ...[snip]...
>>>>>>
>>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>> lseek(774, 0, SEEK_END)    = 583368704
>>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>> lseek(277, 0, SEEK_END)    = 502882304
>>>>>> lseek(6, 516096, SEEK_SET) = 516096
>>>>>> read(6, "\227\320\5\0\1\0\0\0\0\340\7\246\26\274\0\0\315\0\0\0\0\0\0\0}\0178\5&/\260\r"..., 8192) = 8192
>>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>> lseek(735, 0, SEEK_END)    = 272809984
>>>>>> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>> lseek(277, 0, SEEK_END)    = 502882304
>>>>>>
>>>>>> ls -l fd/9
>>>>>> lr-x------ 1 postgres postgres 64 Oct 21 06:21 fd/9 -> pipe:[46358]
>>>>>>
>>>>>> perf top on the recovery process produces:
>>>>>>
>>>>>>   27.76%  postgres            [.] pglz_decompress
>>>>>>    9.90%  [kernel]            [k] entry_SYSCALL_64_after_swapgs
>>>>>>    7.09%  postgres            [.] hash_search_with_hash_value
>>>>>>    4.26%  libpthread-2.23.so  [.] llseek
>>>>>>    3.64%  libpthread-2.23.so  [.] __read_nocancel
>>>>>>    2.80%  [kernel]            [k] __fget_light
>>>>>>    2.67%  postgres            [.] 0x000000000034d3ba
>>>>>>    1.85%  [kernel]            [k] ext4_llseek
>>>>>>    1.84%  postgres            [.] pg_comp_crc32c_sse42
>>>>>>    1.44%  postgres            [.] hash_any
>>>>>>    1.35%  postgres            [.] 0x000000000036afad
>>>>>>    1.29%  postgres            [.] MarkBufferDirty
>>>>>>    1.21%  postgres            [.] XLogReadRecord
>>>>>>    [...]
>>>>>>
>>>>>> I tried changing the process limits with prlimit to unlimited, but no
>>>>>> change.
>>>>>>
>>>>>> I can turn off WAL compression, but I doubt this is the main culprit.
>>>>>> Any ideas appreciated.
>>>>>>
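>>>>>> A quick way to sanity-check the settings that determine how many
>>>>>> compressed full-page images land in the WAL (which is what
>>>>>> pglz_decompress is working through during replay) could look
>>>>>> something like this:
>>>>>>
>>>>>> SELECT name, setting
>>>>>>   FROM pg_settings
>>>>>>  WHERE name IN ('wal_compression', 'full_page_writes',
>>>>>>                 'checkpoint_timeout', 'max_wal_size');
>>>>>>
>>>>>> More frequent checkpoints produce more full-page images, so the
>>>>>> checkpoint-related values are worth reviewing alongside
>>>>>> wal_compression.
>>>>>>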
>>>>>> Regards,
>>>>>> Boris
>>>>>
>>>>
>>>> --
>>>> Regards,
>>>>
>>>> Ing. Hellmuth I. Vargas S.
>>>> Specialist in Telematics and Internet Business
>>>> Oracle Database 10g Administrator Certified Associate
>>>> EnterpriseDB Certified PostgreSQL 9.3 Associate
>>>
>>
>> --
>> Regards,
>>
>> Ing. Hellmuth I. Vargas S.
>
> --
> Regards,
>
> Ing. Hellmuth I. Vargas S.