Re: [11.6] WALs recovery ordering with Restore Command - timelines

2019-12-06 Thread Pavel Suderevsky
One more log example:
>
> 2019-12-06 07:11:16 CST  LOG:  database system was shut down in recovery
> at 2019-12-06 07:11:08 CST
> Searching WAL: 00000022.history, location: pg_wal/RECOVERYHISTORY
> 2019-12-06 07:11:16 CST  LOG:  restored log file "00000022.history" from
> archive
> Searching WAL: 00000023.history, location: pg_wal/RECOVERYHISTORY
> 2019-12-06 07:11:16 CST  LOG:  entering standby mode
> Searching WAL: 00000022.history, location: pg_wal/RECOVERYHISTORY
> 2019-12-06 07:11:16 CST  LOG:  restored log file "00000022.history" from
> archive
> Searching WAL: 00000022000018C60000003F, location: pg_wal/RECOVERYXLOG
> Searching WAL: 00000021000018C60000003F, location: pg_wal/RECOVERYXLOG
> 2019-12-06 07:11:20 CST  LOG:  restored log file
> "00000021000018C60000003F" from archive
> Searching WAL: 00000021.history, location: pg_wal/RECOVERYHISTORY
> 2019-12-06 07:11:20 CST  LOG:  restored log file "00000021.history" from
> archive
> Searching WAL: 00000022000018BF0000001B, location: pg_wal/RECOVERYXLOG
> Searching WAL: 00000021000018BF0000001B, location: pg_wal/RECOVERYXLOG
> 2019-12-06 07:11:27 CST  LOG:  restored log file
> "00000021000018BF0000001B" from archive
> 2019-12-06 07:11:27 CST  LOG:  redo starts at 18BF/1B311260
> Searching WAL: 00000022000018BF0000001C, location: pg_wal/RECOVERYXLOG
> Searching WAL: 00000021000018BF0000001C, location: pg_wal/RECOVERYXLOG
> 2019-12-06 07:11:34 CST  LOG:  restored log file
> "00000021000018BF0000001C" from archive
> Searching WAL: 00000022000018BF0000001D, location: pg_wal/RECOVERYXLOG
> Searching WAL: 00000021000018BF0000001D, location: pg_wal/RECOVERYXLOG
> 2019-12-06 07:11:40 CST  LOG:  restored log file
> "00000021000018BF0000001D" from archive
> Searching WAL: 00000022000018BF0000001E, location: pg_wal/RECOVERYXLOG
> Searching WAL: 00000021000018BF0000001E, location: pg_wal/RECOVERYXLOG
> 2019-12-06 07:11:46 CST  LOG:  restored log file
> "00000021000018BF0000001E" from archive
> Searching WAL: 00000022000018BF0000001F, location: pg_wal/RECOVERYXLOG
> Searching WAL: 00000021000018BF0000001F, location: pg_wal/RECOVERYXLOG
> 2019-12-06 07:11:53 CST  LOG:  restored log file
> "00000021000018BF0000001F" from archive
>
Command:
restore_command = 'echo -e "Searching WAL: %f, location: %p";
/usr/bin/pgbackrest --stanza=platform archive-get %f "%p"'

As you can see, Postgres tries to restore WAL segments for timeline 00000022
before the timeline switch LSN is reached, while it is still restoring
timeline 00000021.
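
For illustration, here is a minimal sketch (my own Python illustration, not
PostgreSQL source code) of the probing order visible in the log above: for
every segment, the restore_command is first asked for the file on the newer
timeline (0x22) and only then for the current one (0x21), regardless of
whether the switch LSN has been reached.

# Illustration only: reproduces the request order seen in the log above.
# Timeline IDs and segment numbers are taken from that log (0x21/0x22,
# log file 0x18BF, segments 0x1B..0x1F).

def wal_name(tli: int, log: int, seg: int) -> str:
    """Build a 24-character WAL segment file name: TLI + log + segment."""
    return f"{tli:08X}{log:08X}{seg:08X}"

def probe_order(candidate_tlis, log, first_seg, count):
    """For each segment, request the newest candidate timeline first."""
    for seg in range(first_seg, first_seg + count):
        for tli in sorted(candidate_tlis, reverse=True):
            yield wal_name(tli, log, seg)

if __name__ == "__main__":
    for name in probe_order([0x21, 0x22], 0x18BF, 0x1B, 5):
        print("restore_command is asked for:", name)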

Should I resend this issue to pgsql-bugs
<https://www.postgresql.org/list/pgsql-bugs/>?

On Mon, Nov 25, 2019 at 23:26, Pavel Suderevsky wrote:

> Hi,
>
> PostgreSQL 11.6
> Centos 7.6
> pgBackrest 2.19
>
>
> Case:
> The master is on timeline 15, while the standby that should be synchronized
> with it as a hot standby is on timeline 13. WALs are to be obtained from the
> archive.
>
> recovery.conf:
> >primary_conninfo = 'user=usename host=10.10.10.3 port=5432 sslmode=prefer
> application_name=pg2'
> >recovery_target_timeline = 'latest'
> >restore_command = '/usr/bin/pgbackrest --stanza=platform archive-get %f
> "%p" --log-level-console info'
> >standby_mode = 'on'
>
> Standby recovery proceeds like this:
> 1. try to restore 0000000F00000A27000000E5
> 2. unable to find 0000000F00000A27000000E5 in the archive
> 3. try to restore 0000000E00000A27000000E5
> 4. found 0000000E00000A27000000E5 in the archive
> -- then the next WAL is tried for BOTH timelines:
> 1. try to restore 0000000F00000A27000000E6
> 2. unable to find 0000000F00000A27000000E6 in the archive
> 3. try to restore 0000000E00000A27000000E6
> 4. found 0000000E00000A27000000E6 in the archive
>
> Why does Postgres not restore WALs in timeline order: first 13, then 14,
> then 15, each up to its timeline switch position? The WAL segments it tries
> to restore for the latest timeline are known in advance not to exist yet.
> This leads to terrible recovery performance because of the long "unable to
> find" operations.
>


[11.6] WALs recovery ordering with Restore Command - timelines

2019-11-25 Thread Pavel Suderevsky
Hi,

PostgreSQL 11.6
Centos 7.6
pgBackrest 2.19


Case:
The master is on timeline 15, while the standby that should be synchronized
with it as a hot standby is on timeline 13. WALs are to be obtained from the
archive.

recovery.conf:
>primary_conninfo = 'user=usename host=10.10.10.3 port=5432 sslmode=prefer
application_name=pg2'
>recovery_target_timeline = 'latest'
>restore_command = '/usr/bin/pgbackrest --stanza=platform archive-get %f
"%p" --log-level-console info'
>standby_mode = 'on'

Standby recovery proceeds like this:
1. try to restore 0000000F00000A27000000E5
2. unable to find 0000000F00000A27000000E5 in the archive
3. try to restore 0000000E00000A27000000E5
4. found 0000000E00000A27000000E5 in the archive
-- then the next WAL is tried for BOTH timelines:
1. try to restore 0000000F00000A27000000E6
2. unable to find 0000000F00000A27000000E6 in the archive
3. try to restore 0000000E00000A27000000E6
4. found 0000000E00000A27000000E6 in the archive

Why does Postgres not restore WALs in timeline order: first 13, then 14, then
15, each up to its timeline switch position? The WAL segments it tries to
restore for the latest timeline are known in advance not to exist yet. This
leads to terrible recovery performance because of the long "unable to find"
operations.
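
For context, the restore_command contract itself is simple: copy the requested
file (%f) to the requested path (%p) and exit 0, or exit non-zero when the file
is not available, in which case standby recovery treats the segment as missing
and moves on. A minimal sketch for a plain file-based archive (illustration
only; the /mnt/archive path and script name are hypothetical, and this is not
how pgBackRest works internally) shows why every probe for a not-yet-existing
latest-timeline segment costs a full archive lookup:

#!/usr/bin/env python3
# Minimal restore_command sketch for a plain file archive (illustration only).
# Hypothetical usage in recovery.conf:
#     restore_command = '/usr/local/bin/restore_wal.py %f "%p"'
import shutil
import sys
from pathlib import Path

ARCHIVE_DIR = Path("/mnt/archive")  # hypothetical archive location

def main() -> int:
    wal_name, dest_path = sys.argv[1], sys.argv[2]
    src = ARCHIVE_DIR / wal_name
    if not src.is_file():
        # A non-zero exit tells recovery the file is not (yet) available;
        # in standby mode it will retry or try another candidate timeline.
        print(f"unable to find {wal_name} in the archive", file=sys.stderr)
        return 1
    shutil.copy(src, dest_path)
    return 0

if __name__ == "__main__":
    sys.exit(main())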


Re: 9.6.11- could not truncate directory "pg_serial": apparent wraparound

2019-06-27 Thread Pavel Suderevsky
Hi,

Got this issue again.
Settings on the platform (PG 9.6.11):
max_pred_locks_per_transaction = 3000
max_connections = 800

Despite the fact that the documentation says:
> with the exception of fast-path locks, each lock manager will deliver a
consistent set of results
I've noticed the following:
1. pg_locks showed 2 million SIReadLocks for a pid that had already been in
the "idle" state for a dozen seconds.
2. pg_locks showed about 5 million SIReadLocks granted, although I expected a
limit of 2.4 million with the settings provided above (see the rough
calculation just below).
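
For reference, a rough back-of-the-envelope for the 2.4 million figure; the
assumption here, which I am not fully sure about, is that the shared
predicate-lock table is sized for roughly max_pred_locks_per_transaction
times (max_connections + max_prepared_transactions) entries.

# Back-of-the-envelope for the expected SIReadLock ceiling (sketch only).
max_pred_locks_per_transaction = 3000
max_connections = 800
max_prepared_transactions = 0  # assumed default

expected_ceiling = max_pred_locks_per_transaction * (
    max_connections + max_prepared_transactions
)
print(expected_ceiling)  # 2400000, yet pg_locks showed about 5 million rows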

And still there are no signs of any serializable transaction that could have
lived for 1 billion xids.
--
Pavel Suderevsky
E: psuderevsky(at)gmail(dot)com

On Tue, Apr 9, 2019 at 16:00, Pavel Suderevsky wrote:

> On Sun, Apr 7, 2019 at 2:31 AM Pavel Suderevsky 
>> wrote:
>> > Probably if you advise me what could cause "pg_serial": apparent
>> wraparound messages I would have more chances to handle all the performance
>> issues.
>>
>> Did you see that warning at some point before the later error?
>>
> Thomas,
>
> Thank you for your reply!
>
> No, there have never been such warnings.
>
> I wonder if this condition required you to have a serializable
>> transaction running (or prepared) while you consume 2^30 AKA ~1
>> billion xids.  I think it is unreachable in v11+ because commit
>> e5eb4fa8 allowed for more SLRU pages to avoid this artificially early
>> wrap.
>
>
> Do I understand correctly that this is about virtual txids? I have no idea
> how anything close to a billion transaction IDs could have been consumed on
> this system.
>
> --
> Pavel Suderevsky
> E: psuderev...@gmail.com
>


Re: 9.6.11- could not truncate directory "pg_serial": apparent wraparound

2019-04-09 Thread Pavel Suderevsky
>
> On Sun, Apr 7, 2019 at 2:31 AM Pavel Suderevsky 
> wrote:
> > Probably if you advise me what could cause "pg_serial": apparent
> wraparound messages I would have more chances to handle all the performance
> issues.
>
> Did you see that warning at some point before the later error?
>
Thomas,

Thank you for your reply!

No, there have never been such warnings.

I wonder if this condition required you to have a serializable
> transaction running (or prepared) while you consume 2^30 AKA ~1
> billion xids.  I think it is unreachable in v11+ because commit
> e5eb4fa8 allowed for more SLRU pages to avoid this artificially early
> wrap.


Do I understand correctly that this is about virtual txids? I have no idea how
anything close to a billion transaction IDs could have been consumed on this
system.

--
Pavel Suderevsky
E: psuderev...@gmail.com


Re: 9.6.11- could not truncate directory "pg_serial": apparent wraparound

2019-04-06 Thread Pavel Suderevsky
Guys, still need your help.

Previous night:
2019-04-05 00:35:04 UTC  LOG:  could not truncate directory "pg_serial":
apparent wraparound
2019-04-05 00:40:04 UTC  LOG:  could not truncate directory "pg_serial":
apparent wraparound
(2 checkpoints)

It turned out that I have a performance problem related to predicate locking
on this platform. A lot of long prepared statements with the
SerializableXactHashLock and predicate_lock_manager wait_events, followed by
high CPU usage, occurred between 00:30 and 00:45. During this period there
were 55k predicate locks granted at peak and 30k on average. Probably because
of the high CPU usage, some statements were spending a lot of time in the
bind/parse steps.
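
(For completeness, the 55k/30k figures come from sampling pg_locks; a query of
roughly the following shape gives the granted SIReadLock count at a point in
time. This is only a sketch, assuming psycopg2 as the client library and a
placeholder connection string, not the exact tooling used.)

# One-shot count of granted SIReadLocks (placeholder connection string).
import psycopg2

with psycopg2.connect("dbname=postgres user=postgres") as conn:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT count(*) FROM pg_locks "
            "WHERE mode = 'SIReadLock' AND granted"
        )
        print("granted SIReadLocks:", cur.fetchone()[0])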

If you could advise me on what could cause the "pg_serial": apparent
wraparound messages, I would have a better chance of handling all the
performance issues.

Thank you!
--
Pavel Suderevsky
E: psuderev...@gmail.com

On Mon, Mar 11, 2019 at 19:09, Pavel Suderevsky wrote:

> Hi,
>
> PG: 9.6.11
> OS: CentOS
> Env: AWS EC2
>
> I've faced the following exceptions in the PostgreSQL server log:
> > could not truncate directory "pg_serial": apparent wraparound
> Sometimes it repeats every 5 minutes; the longest period was 40 minutes.
>
> In fact, I can't find any suspicious events happening during those periods.
> pg_wait_sampling didn't catch any events, and no long queries (more than 60s),
> autovacuum workers, or transactions in the "idle in transaction" state were
> active at that time.
>
> The only related discussion I could find in the archives is:
> https://www.postgresql.org/message-id/CACjxUsON4Vya3a6r%3DubwmN-4qTDDfZjuwSzjnL1QjdUc8_gzLw%40mail.gmail.com
> >You should not see the errors you are reporting nor
> >the warning I mentioned unless a serializable transaction remains
> >active long enough for about 1 billion transaction IDs to be
> >consumed.
>
> The database age is now just 18.5 million transactions.
>
> The server has two standbys (sync and async); hot_standby_feedback is off.
>
> Please advise what I can do to find the reason for these exceptions.
>


9.6.11- could not truncate directory "pg_serial": apparent wraparound

2019-03-11 Thread Pavel Suderevsky
Hi,

PG: 9.6.11
OS: CentOS
Env: AWS EC2

I've faced the following exceptions in the PostgreSQL server log:
> could not truncate directory "pg_serial": apparent wraparound
Sometimes it repeats every 5 minutes; the longest period was 40 minutes.

In fact, I can't find any suspicious events happening during those periods.
pg_wait_sampling didn't catch any events, and no long queries (more than 60s),
autovacuum workers, or transactions in the "idle in transaction" state were
active at that time.

The only related discussion I could find in the archives is:
https://www.postgresql.org/message-id/CACjxUsON4Vya3a6r%3DubwmN-4qTDDfZjuwSzjnL1QjdUc8_gzLw%40mail.gmail.com
>You should not see the errors you are reporting nor
>the warning I mentioned unless a serializable transaction remains
>active long enough for about 1 billion transaction IDs to be
>consumed.

The database age is now just 18.5 million transactions.

The server has two standbys (sync and async); hot_standby_feedback is off.

Please advise what I can do to find the reason for these exceptions.


Long statement execution. Data fetch by client

2019-02-20 Thread Pavel Suderevsky
Hi,

OS: CentOS 7.5.1804
PG version: 9.6.12
Client: Java application (JDBC driver 42.0.0)

I have an issue with unexpectedly long statements:

LOG:  duration: 9284.887 ms  parse <unnamed>: SET SESSION CHARACTERISTICS
AS TRANSACTION ISOLATION LEVEL READ COMMITTED
LOG:  duration: 12739.364 ms  parse <unnamed>: SHOW TRANSACTION ISOLATION
LEVEL
LOG:  duration: 10500.315 ms  parse <unnamed>: SET SESSION CHARACTERISTICS
AS TRANSACTION ISOLATION LEVEL READ COMMITTED
LOG:  duration: 10467.578 ms  parse <unnamed>: SHOW TRANSACTION ISOLATION
LEVEL
LOG:  duration: 25901.488 ms  execute S_3: COMMIT

It's not an fsync issue. Other statements executed without issues during the
execution of the statements mentioned above. pg_stat_statements didn't catch
any such long statements. No wait_events were caught for these statements
(checked by taking pg_stat_activity snapshots every 100 ms). The processes
were in the R state according to atop for the moments that were captured.
My assumption is that the issue happens while the data is being fetched by the
Java client.
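
(For reference, the 100 ms pg_stat_activity sampling mentioned above was along
these lines; a minimal sketch, assuming psycopg2 as the client library and
placeholder connection parameters, rather than the exact tooling used.)

# Sample pg_stat_activity every 100 ms for about one minute (sketch only).
import time
import psycopg2

conn = psycopg2.connect("dbname=postgres user=postgres")
conn.autocommit = True
try:
    with conn.cursor() as cur:
        for _ in range(600):  # ~1 minute of snapshots
            cur.execute("""
                SELECT pid, state, wait_event_type, wait_event,
                       now() - query_start AS running_for, query
                FROM pg_stat_activity
                WHERE state <> 'idle'
                ORDER BY query_start
            """)
            for row in cur.fetchall():
                print(row)
            time.sleep(0.1)  # 100 ms between snapshots
finally:
    conn.close()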
Could you please assist with the following questions?
1. Does a query stay in the 'active' state while data is being fetched by the
client?
2. Do I understand correctly that the logged statement duration includes the
fetch time, while pg_stat_statements includes only the server-side computation
time?
3. Is there any way to catch a client's long fetch on the PostgreSQL server
side?
4. Is this all about fetching, or what else can cause such long execution of
statements like 'SHOW TRANSACTION ISOLATION LEVEL'?

Thank you very much in advance.
--
Pavel Suderevsky