Re: LOG: invalid record length at : wanted 24, got 0

2023-03-01 Thread Harinath Kanchu
Thanks Bharath for your response,

> You mentioned that
> the connection to primary was lost, so you need to dive deep as to why
> it got lost. If the connection was lost half-way through fetching the
> WAL record, the standby may emit such a LOG message.

The connection was lost due to a bad network. We are currently okay with the bad 
network itself, because applications in general have to be designed to expect 
network failures.

To simulate a bad network, we manually killed the walsender process on the 
primary multiple times; this should be equivalent to the primary being unable to 
send messages to the standby due to a bad network. In these experiments, the 
standby was able to rejoin the primary within a few seconds, after checking the 
WAL files in the archive.
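
(For reference, this is roughly how a walsender can be terminated from SQL on 
the primary; in our tests we simply killed the OS process, so the queries below 
are only illustrative and the application_name is a placeholder.)

-- On the primary: identify the walsender(s) serving the standby.
SELECT pid, application_name, state FROM pg_stat_replication;

-- Terminate the walsender for one standby (placeholder application_name).
SELECT pg_terminate_backend(pid)
  FROM pg_stat_replication
 WHERE application_name = 'standby1';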

But when the standby gets disconnected by an actually bad network, it is unable 
to rejoin, and we want to understand why this is happening.


> I understand this problem and there's a proposed patch to help with
> this - 
> https://www.postgresql.org/message-id/CALj2ACVryN_PdFmQkbhga1VeW10VgQ4Lv9JXO=3njkvzt8q...@mail.gmail.com.
> 
> It basically allows one to set a timeout as to how much duration the
> standby can restore from archive before switching to stream.
> Therefore, in your case, the standby doesn't have to wait for 1hr to
> connect to primary, but it can connect before that.

Good to know about this patch, because we tried to do something similar ourselves.

But this patch will not solve our problem, because here the hot standby is NOT 
wasting a long time downloading WAL files from the archive; it is simply 
switching sources between STREAM and ARCHIVE in a tight loop, doing nothing.

In fact, the standby does try to reconnect to the primary for streaming after it 
fails to pull the latest WAL file from the archive, but streaming from the 
primary fails again immediately.
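
For what it is worth, the roughly five-second gap between retries visible in the 
logs further below appears to match the default wal_retrieve_retry_interval. A 
sketch of how that interval can be inspected and adjusted on the standby (the 
value below is only illustrative):

-- On the standby: the pause before retrying WAL retrieval after a failure.
SHOW wal_retrieve_retry_interval;

-- Illustrative change; takes effect on configuration reload.
ALTER SYSTEM SET wal_retrieve_retry_interval = '10s';
SELECT pg_reload_conf();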


To explain this situation better, I am adding more logs.

Note: the logs below include the custom logging we enabled for debugging 
purposes.



2023-02-09 19:52:30.909 GMT [  ] () [0 | 0 | 63e4244f.54]: LOG:  Starting 
the state machine again from starting, changing current source to 
XLOG_FROM_ARCHIVE
2023-02-09 19:52:30.909 GMT [  ] () [0 | 0 | 63e4244f.54]: LOG:  switched 
WAL source from stream to archive after failure
ERROR: 2023/02/09 19:52:31.616383 Archive '00000006000000020000001C' does not 
exist.
2023-02-09 19:52:31.618 GMT [  ] () [0 | 0 | 63e4244f.54]: LOG:  
Successfully read the WAL file using XLogFileReadAnyTLI from archive or 
existing pg_wal, returning true and exit.
2023-02-09 19:52:31.619 GMT [  ] () [0 | 0 | 63e4244f.54]: LOG:  last 
source archive failed, and now switching to new source.
2023-02-09 19:52:31.619 GMT [  ] () [0 | 0 | 63e4244f.54]: LOG:  Moving to 
XLOG_FROM_STREAM state and start walreceiver if necessary
2023-02-09 19:52:31.619 GMT [  ] () [0 | 0 | 63e4244f.54]: LOG:  switched 
WAL source from archive to stream after failure with WalStreamingPreferred false
2023-02-09 19:52:31.619 GMT [  ] () [0 | 0 | 63e4244f.54]: LOG:  changing 
the current file timeline to new tli.
2023-02-09 19:52:31.619 GMT [  ] () [0 | 0 | 63e4244f.54]: LOG:  we have 
data from wal receiver.
2023-02-09 19:52:31.619 GMT [  ] () [0 | 0 | 63e4244f.54]: LOG:  after 
streaming file, we dont have WAL file opened, so read now.
2023-02-09 19:52:31.619 GMT [  ] () [0 | 0 | 63e4244f.54]: LOG:  Due to WAL 
receiver, we have WAL file present and opened and hence returning true
2023-02-09 19:52:31.619 GMT [  ] () [0 | 0 | 63e4244f.54]: LOG:  invalid 
record length at 2/1C0324C8: wanted 24, got 0
2023-02-09 19:52:31.619 GMT [  ] () [0 | 0 | 63e4244f.54]: LOG:  last 
source stream failed, and now switching to new source.
2023-02-09 19:52:31.619 GMT [  ] () [0 | 0 | 63e4244f.54]: LOG:  Failure 
while streaming, we might have found invalid record in WAL streamed from master
2023-02-09 19:52:31.619 GMT [  ] () [0 | 0 | 63e4244f.54]: LOG:  Stopping 
WAL receiver as we saw serious failure while streaming
INFO: 2023/02/09 19:52:31.796489 WAL: 00000007.history will be downloaded from 
archive
ERROR: 2023/02/09 19:52:32.330491 Archive '00000007.history' does not exist.
2023-02-09 19:52:32.333 GMT [  ] () [0 | 0 | 63e4244f.54]: LOG:  We were 
requested to recover to latest timeline, but rescan is NOT needed.
2023-02-09 19:52:32.333 GMT [  ] () [0 | 0 | 63e4244f.54]: LOG:  We decided 
to sleep before retry the state-machine
2023-02-09 19:52:35.910 GMT [  ] () [0 | 0 | 63e4244f.54]: LOG:  Starting 
the state machine again from starting, changing current source to 
XLOG_FROM_ARCHIVE
2023-02-09 19:52:35.910 GMT [  ] () [0 | 0 | 63e4244f.54]: LOG:  switched 
WAL source from stream to archive after failure
ERROR: 2023/02/09 19:52:36.607613 Archive '00000006000000020000001C' does not 
exist.
2023-02-09 19:52:36.610 GMT [  ] () [0 | 0 | 63e4244f.54]: LOG:  
Successfully read the WAL file using XLogFileReadAnyTLI from archive or 
existing pg_wal, returning true and exit.

LOG: invalid record length at : wanted 24, got 0

2023-02-28 Thread Harinath Kanchu
Hello,

We are seeing an interesting STANDBY behavior that happens once every 3-4 days.

The standby suddenly disconnects from the primary and throws the error 
“LOG: invalid record length at : wanted 24, got 0”.

It then tries to restore the WAL file from the archive. Due to low write 
activity on the primary, the WAL file is switched and archived only after 1 hour.

So it gets stuck in a loop, switching the WAL source between STREAM and ARCHIVE 
without replicating from the primary.

This causes a write outage, as the standby is a synchronous standby.
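
One mitigation on the archive side would be to force a WAL segment switch and 
archive more often than once an hour, so the standby can at least catch up from 
the archive sooner. A sketch, assuming it is acceptable to archive partially 
filled segments (the value below is only illustrative):

-- On the primary: force a segment switch/archive at least every 5 minutes,
-- even under low write activity.
ALTER SYSTEM SET archive_timeout = '5min';
SELECT pg_reload_conf();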

We set “wal_sync_method” to “fsync”, assuming the WAL file was not getting 
flushed correctly.

But this keeps happening even after switching it from “fdatasync” to “fsync”.
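
(The setting change was along these lines; shown only for illustration.)

-- wal_sync_method is set in postgresql.conf / via ALTER SYSTEM and is
-- picked up on a configuration reload.
ALTER SYSTEM SET wal_sync_method = 'fsync';
SELECT pg_reload_conf();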

Restarting the STANDBY sometimes fixes this problem, but detecting the condition 
automatically is a big problem: the postgres standby process keeps running fine, 
while the WAL receiver process goes up and down continuously due to the 
switching of WAL sources.
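
For detection, a sketch of what could be polled on the standby (assuming SQL 
access): pg_stat_wal_receiver returns no row while the WAL receiver is down, and 
the replay position stops advancing while the standby is stuck.

-- On the standby: walreceiver status (no row means it is not running).
SELECT status, latest_end_lsn, latest_end_time FROM pg_stat_wal_receiver;

-- Receive/replay positions and the timestamp of the last replayed commit.
SELECT pg_last_wal_receive_lsn(),
       pg_last_wal_replay_lsn(),
       pg_last_xact_replay_timestamp();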


How can we fix this? Any suggestions will be appreciated.


Postgres Version: 13.6
OS: RHEL Linux


Thank you,


Best,
Harinath.



Any way to get timestamp from LSN value ?

2022-07-28 Thread Harinath Kanchu
Hello,

Is there any way to get the timestamp of a transaction using an LSN value?

For example: can we use the minimum recovery ending location in the pg_control 
file to get the minimum recovery timestamp?

Minimum recovery ending location: 28/28000B68
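
For illustration, a sketch of the closest SQL-level information we know of, 
under stated assumptions (track_commit_timestamp was enabled before the 
transactions of interest; the xid below is only a placeholder, since mapping an 
LSN to an xid would still need something like pg_waldump):

-- On a standby: LSN and commit time of the last replayed transaction.
SELECT pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();

-- With track_commit_timestamp = on: commit time of a given transaction id.
SELECT pg_xact_commit_timestamp('12345'::xid);  -- placeholder xid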


Thanks in advance,

Best,
Harinath.

How to get accurate backup end time when it is taken from synchronous standby ?

2022-07-27 Thread Harinath Kanchu
Greetings,


When we take backups from a synchronous standby replica, how can we get an 
accurate timestamp of the backup end time? (Backup history files are not 
generated on standbys.)
For example, this is part of the control file after a backup (created using 
wal-g, which calls pg_start_backup and pg_stop_backup):
Fake LSN counter for unlogged rels:   0/3E8
Minimum recovery ending location:     28/28000B68
Min recovery ending loc's timeline:   2
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
Here I can see the minimum recovery ending location as an LSN value; how can we 
get its timestamp?
The backup label file looks like this:
INFO: 2022/07/26 23:25:23.850621  LABLE FILE START --
INFO: 2022/07/26 23:25:23.850628 START WAL LOCATION: 1D/F94C7320 (file 
000000020000001D000000F9)
INFO: 2022/07/26 23:25:23.850633 CHECKPOINT LOCATION: 1E/EDA8700
INFO: 2022/07/26 23:25:23.850639 BACKUP METHOD: streamed
INFO: 2022/07/26 23:25:23.850645 BACKUP FROM: standby
INFO: 2022/07/26 23:25:23.850653 START TIME: 2022-07-26 23:10:27 GMT
INFO: 2022/07/26 23:25:23.850659 LABEL: 2022-07-26 23:10:27.545378 +0000 UTC 
m=+0.167723956
INFO: 2022/07/26 23:25:23.850665 START TIMELINE: 2
INFO: 2022/07/26 23:25:23.850669 
INFO: 2022/07/26 23:25:23.850676  LABLE FILE END --

How can we do PITR using a timestamp if we don’t know the accurate timestamp of 
the minimum recovery point?
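
For illustration, a minimal sketch of what we mean by recording an approximate 
end timestamp ourselves, assuming the backup were driven by hand with the 
non-exclusive API (PostgreSQL 13) instead of by wal-g; the last-replayed-commit 
timestamp is only an approximation of the true end time:

-- On the standby, in the same session that called pg_start_backup(..., false),
-- immediately after the data copy finishes:
SELECT lsn                             AS backup_end_lsn,
       pg_last_xact_replay_timestamp() AS last_replayed_commit_time,
       now()                           AS wall_clock_end_time
  FROM pg_stop_backup(false, false);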







Thanks.
Best,
Harinath