Re: Failing streaming replication on PostgreSQL 14

2024-04-22 Thread Nicolas Seinlet
Hi,

facing the same situation again, but this time, analyzing the wal with xxd 
shows a different pattern. I had no blocks of .

The output of pg_waldump is:
pg_waldump: fatal: error in WAL record at 11C/93F9FF70: invalid magic number 
 in log segment 0001011C0093, offset 16384000

The output of xxd -C16 is

00f9ff60: b364 0079 6e61 6d69 6320 6c80 0300   .d.ynamic l.
00f9ff70: 4000  6659 a406 60f7 f993 1c01   @...fY..`...
00f9ff80: 000b  82b3 8d9b 0020 1000 7f06   . ..

I'm still unable to determine the cause of the issue, nor if the issue is on 
the primary server sending a corrupted wal segment, or on the secondary 
receiving a corrupted wal segment, or the openzfs filesystem on the primary 
allowing wal_sender to read still-not-written wal segment, or ...

Is there any log option I can add on the two clusters to help me locate the 
issue's origin?

thanks,

Nicolas.

On Tuesday, April 16th, 2024 at 09:56, Nicolas Seinlet  
wrote:

> 

> 

> Hello,
> 

> > What exactly is "cyphered ZFS"? Can you reproduce the problem with some
> > other filesystem? If it's something very unusual, it might well be a
> > bug in the filesystem.
> 

> 

> The filesystem is openzfs with native aes-256-gcm encryption:
> https://openzfs.github.io/openzfs-docs/man/master/7/zfsprops.7.html#encryption
> 

> I've not tested if we get the same issue on another filesystem.
> 

> I don't face the issue on Ubuntu 20.04/openzfs 0.8/PostgreSQL 12, but I have 
> fewer systems with this deployment.
> On Ubuntu 22.04/openzfs 2.1.5/PostgreSQL 14, I face the issue from time to 
> time, without knowing what triggers the error.
> 

> thanks for helping,
> 

> Nicolas.



signature.asc
Description: OpenPGP digital signature


Re: Failing streaming replication on PostgreSQL 14

2024-04-16 Thread Nicolas Seinlet
Hello,

> What exactly is "cyphered ZFS"? Can you reproduce the problem with some
> other filesystem? If it's something very unusual, it might well be a
> bug in the filesystem.

The filesystem is openzfs with native aes-256-gcm encryption:
https://openzfs.github.io/openzfs-docs/man/master/7/zfsprops.7.html#encryption

I've not tested if we get the same issue on another filesystem.

I don't face the issue on Ubuntu 20.04/openzfs 0.8/PostgreSQL 12, but I have 
fewer systems with this deployment.
On Ubuntu 22.04/openzfs 2.1.5/PostgreSQL 14, I face the issue from time to 
time, without knowing what triggers the error.

thanks for helping,

Nicolas.



publickey - nicolas@seinlet.com - 0xCAEB7FAF.asc
Description: application/pgp-keys


signature.asc
Description: OpenPGP digital signature


Re: Failing streaming replication on PostgreSQL 14

2024-04-15 Thread Alvaro Herrera
On 2024-Apr-15, Nicolas Seinlet wrote:

> I'm using the Ubuntu/cyphered ZFS/PostgreSQL combination. I'm using
> Ubuntu LTS (20.04 22.04) and provided ZFS/PostgreSQL with LTS
> (PostgreSQL 12 on Ubuntu 20.04 and 14 on 22.04).

What exactly is "cyphered ZFS"?  Can you reproduce the problem with some
other filesystem?  If it's something very unusual, it might well be a
bug in the filesystem.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Find a bug in a program, and fix it, and the program will work today.
Show the program how to find and fix a bug, and the program
will work forever" (Oliver Silfridge)




Re: Failing streaming replication on PostgreSQL 14

2024-04-15 Thread Nicolas Seinlet

On Monday, April 15th, 2024 at 14:36, Ron Johnson  
wrote:

> On Mon, Apr 15, 2024 at 2:53 AM Nicolas Seinlet  wrote:
> 

> > Hello everyone,
> > 

> > Since I moved some clusters from PostgreSQL 12 to 14, I noticed random 
> > failures in streaming replication. I say "random" mostly because I haven't 
> > got the source of the issue.
> > 

> > I'm using the Ubuntu/cyphered ZFS/PostgreSQL combination. I'm using Ubuntu 
> > LTS (20.04 22.04) and provided ZFS/PostgreSQL with LTS (PostgreSQL 12 on 
> > Ubuntu 20.04 and 14 on 22.04).
> > 

> > The streaming replication of PostgreSQL is configured with 
> > `primary_conninfo 'host=main_server port=5432 user=replicant 
> > password=a_very_secure_password sslmode=require 
> > application_name=replication_postgresql_app' ` , no replication slot nor 
> > restore command, and the wal is configured with `full_page_writes = off 
> > wal_init_zero = off wal_recycle = off`
> > 

> > If this works like a charm on PostgreSQL 12, it's sometimes failing with 
> > PostgreSQL 14. As we also changed the OS, maybe the issue relies somewhere 
> > else.
> > 

> > When the issue is detected, the WAL on the primary is correct. A piece of 
> > the WAL is wrong on the secondary. Only some bytes. Some bytes later, the 
> > wal is again correct. Stopping PostgreSQL on the secondary, removing the 
> > wrong WAL file, and restarting PostgreSQL solves the issue.
> > 

> > We've added another secondary and noticed the issue can appear on one of 
> > the secondaries, not both at the same time.
> > 

> > What can I do to detect the origin of this issue?
> 

> 

> 1. Minor version number?
> 2. Using replication_slots?
> 3. Error message(s)?
> 

> 


Hi,


1.  PostgreSQL 14.11
2.  No. no replication slot nor restore command. As we've understood the 
replication slot, it's a mechanism to keep on the primary side everything 
needed for the secondary to recover. Will this make the primary acknowledge 
that the secondary received the good wal file?
3.  incorrect resource manager data checksum

Looking at the WAL files with xxd gives the following diff:

The bad one:
006c9160: 0a6e 7514 5030 2e31 0e35 016c 0f07 0009 2f62 6568 6100 7669 6f72 3a6e 
6f72 be6d  .nu.P0.1.5.l/beha.vior:nor.m
006c9180:               
   
006c91a0:               
   
006c91c0: 437a 4263 7500 7273 6f72 3a70 6f69 0302 4503 9023 3237 3665 3720 323b 
223e 5527  CzBcu.rsor:poi..E..#276e7 2;">U'

The good one contains the same 1st and 4th lines, but the 2nd and 3rd lines 
contain the correct values, as if a packet was missed.

Thanks for helping,

Nicolas.


signature.asc
Description: OpenPGP digital signature


Re: Failing streaming replication on PostgreSQL 14

2024-04-15 Thread Ron Johnson
On Mon, Apr 15, 2024 at 2:53 AM Nicolas Seinlet  wrote:

> Hello everyone,
>
> Since I moved some clusters from PostgreSQL 12 to 14, I noticed random
> failures in streaming replication. I say "random" mostly because I haven't
> got the source of the issue.
>
> I'm using the Ubuntu/cyphered ZFS/PostgreSQL combination. I'm using Ubuntu
> LTS (20.04 22.04) and provided ZFS/PostgreSQL with LTS (PostgreSQL 12 on
> Ubuntu 20.04 and 14 on 22.04).
>
> The streaming replication of PostgreSQL is configured with
> `primary_conninfo 'host=main_server port=5432 user=replicant
> password=a_very_secure_password sslmode=require
> application_name=replication_postgresql_app' ` , no replication slot nor
> restore command, and the wal is configured with `full_page_writes = off
> wal_init_zero = off wal_recycle = off`
>
> If this works like a charm on PostgreSQL 12, it's sometimes failing with
> PostgreSQL 14. As we also changed the OS, maybe the issue relies somewhere
> else.
>
> When the issue is detected, the WAL on the primary is correct. A piece of
> the WAL is wrong on the secondary. Only some bytes. Some bytes later, the
> wal is again correct. Stopping PostgreSQL on the secondary, removing the
> wrong WAL file, and restarting PostgreSQL solves the issue.
>
> We've added another secondary and noticed the issue can appear on one of
> the secondaries, not both at the same time.
>
> What can I do to detect the origin of this issue?
>

1. Minor version number?
2. Using replication_slots?
3. Error message(s)?


Failing streaming replication on PostgreSQL 14

2024-04-14 Thread Nicolas Seinlet
Hello everyone,

Since I moved some clusters from PostgreSQL 12 to 14, I noticed random failures 
in streaming replication. I say "random" mostly because I haven't got the 
source of the issue.

I'm using the Ubuntu/cyphered ZFS/PostgreSQL combination. I'm using Ubuntu LTS 
(20.04 22.04) and provided ZFS/PostgreSQL with LTS (PostgreSQL 12 on Ubuntu 
20.04 and 14 on 22.04).

The streaming replication of PostgreSQL is configured with `primary_conninfo 
'host=main_server port=5432 user=replicant password=a_very_secure_password 
sslmode=require application_name=replication_postgresql_app' ` , no replication 
slot nor restore command, and the wal is configured with `full_page_writes = 
off wal_init_zero = off wal_recycle = off`

If this works like a charm on PostgreSQL 12, it's sometimes failing with 
PostgreSQL 14. As we also changed the OS, maybe the issue relies somewhere else.

When the issue is detected, the WAL on the primary is correct. A piece of the 
WAL is wrong on the secondary. Only some bytes. Some bytes later, the wal is 
again correct. Stopping PostgreSQL on the secondary, removing the wrong WAL 
file, and restarting PostgreSQL solves the issue.

We've added another secondary and noticed the issue can appear on one of the 
secondaries, not both at the same time.

What can I do to detect the origin of this issue?

Have a nice week,

Nicolas.


signature.asc
Description: OpenPGP digital signature