Re: Are ZFS snapshots unsafe when PGSQL is spreading through multiple zpools?

2023-01-18 Thread Ron

On 1/18/23 10:54, Rob Sargent wrote:

On 1/18/23 09:38, HECTOR INGERTO wrote:


I wanted to understand the underlying issue.

I use ZFS snapshots instead of a “correct” backup because, with only two 
machines, they give me backups on the main machine and on the secondary, 
which also acts as a hot spare.


To accomplish the same, I would need three nodes: the main, the hot-spare 
replica, and the proper backup.




Isn't "proper backup" typically thought of as offsite reloadable file 
capture, hopefully relatively recent?  Disaster recovery, and all that.


That is in fact what "proper backup" means.  "Hot spare" means high 
availability, and *HA is not DR*.




--
Born in Arizona, moved to Babylonia.

Re: Are ZFS snapshots unsafe when PGSQL is spreading through multiple zpools?

2023-01-18 Thread Rob Sargent

On 1/18/23 09:38, HECTOR INGERTO wrote:


I wanted to understand the underlying issue.

I use ZFS snapshots instead of a “correct” backup because, with only two 
machines, they give me backups on the main machine and on the secondary, 
which also acts as a hot spare.


To accomplish the same, I would need three nodes: the main, the hot-spare 
replica, and the proper backup.




Isn't "proper backup" typically thought of as offsite reloadable file 
capture, hopefully relatively recent?  Disaster recovery, and all that.




RE: Are ZFS snapshots unsafe when PGSQL is spreading through multiple zpools?

2023-01-18 Thread HECTOR INGERTO
I wanted to understand the underlying issue.

I use ZFS snapshots instead of a “correct” backup because, with only two machines, 
they give me backups on the main machine and on the secondary, which also acts as 
a hot spare.

To accomplish the same, I would need three nodes: the main, the hot-spare replica, 
and the proper backup.



From: Laurenz Albe<mailto:laurenz.a...@cybertec.at>
Sent: Wednesday, January 18, 2023 11:02
To: HECTOR INGERTO<mailto:hector_...@hotmail.com>; Magnus 
Hagander<mailto:mag...@hagander.net>
CC: pgsql-gene...@postgresql.org 
<mailto:pgsql-gene...@postgresql.org>
Subject: Re: Are ZFS snapshots unsafe when PGSQL is spreading through multiple 
zpools?

On Tue, 2023-01-17 at 15:22 +, HECTOR INGERTO wrote:
> > Another case: a transaction COMMITs, and a slightly later transaction reads 
> > the data
> > and sets a hint bit.  If the snapshot of the file system with the data 
> > directory in it
> > is slightly later than the snapshot of the file system with "pg_wal", the 
> > COMMIT might
> > not be part of the snapshot, but the hint bit could be.
> >
> > Then these uncommitted data could be visible if you recover from the 
> > snapshot.
>
> Thank you all. I have it clearer now.
>
> As a last point: would making the snapshot of the WAL dataset first or last
> make any difference?

Imagine you run DROP TABLE.  During the implicit COMMIT at the end of the 
statement,
the files behind the table are deleted.  If the snapshot of "pg_wal" is earlier 
than
the snapshot of the data files, you end up with a table that is not yet dropped,
but the files are gone.

I won't try to find an example for the case where no checkpoint ends between the
statements, the snapshot of "pg_wal" is earlier and we don't run DROP TABLE.

Why do you go to all this effort rather than performing a correct backup?

Yours,
Laurenz Albe



Re: Are ZFS snapshots unsafe when PGSQL is spreading through multiple zpools?

2023-01-18 Thread Laurenz Albe
On Tue, 2023-01-17 at 15:22 +, HECTOR INGERTO wrote:
> > Another case: a transaction COMMITs, and a slightly later transaction reads 
> > the data
> > and sets a hint bit.  If the snapshot of the file system with the data 
> > directory in it
> > is slightly later than the snapshot of the file system with "pg_wal", the 
> > COMMIT might
> > not be part of the snapshot, but the hint bit could be.
> > 
> > Then these uncommitted data could be visible if you recover from the 
> > snapshot.
>  
> Thank you all. I have it clearer now.
>  
> As a last point: would making the snapshot of the WAL dataset first or last
> make any difference?

Imagine you run DROP TABLE.  During the implicit COMMIT at the end of the 
statement,
the files behind the table are deleted.  If the snapshot of "pg_wal" is earlier 
than
the snapshot of the data files, you end up with a table that is not yet dropped,
but the files are gone.
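
For anyone who wants to see this commit-time deletion on a throwaway instance,
here is a minimal sketch (the database and table names are made up);
pg_relation_filepath() shows which file under the data directory backs the table:

psql -d postgres <<'SQL'
CREATE TABLE droptest (i int);
SELECT pg_relation_filepath('droptest');  -- the file that backs the table
BEGIN;
DROP TABLE droptest;                      -- the file is still on disk here
COMMIT;                                   -- it is removed as part of commit processing
SQL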

I won't try to find an example for the case where no checkpoint ends between the
statements, the snapshot of "pg_wal" is earlier and we don't run DROP TABLE.

Why do you go to all this effort rather than performing a correct backup?

Yours,
Laurenz Albe




Re: Are ZFS snapshots unsafe when PGSQL is spreading through multiple zpools?

2023-01-17 Thread Bruce Momjian
On Tue, Jan 17, 2023 at 03:22:02PM +, HECTOR INGERTO wrote:
> > Another case: a transaction COMMITs, and a slightly later transaction reads
> the data
> > and sets a hint bit.  If the snapshot of the file system with the data
> directory in it
> > is slightly later than the snapshot of the file system with "pg_wal", the
> COMMIT might
> > not be part of the snapshot, but the hint bit could be.
> >
> > Then these uncommitted data could be visible if you recover from the
> snapshot.
> >
> > Yours,
> > Laurenz Albe
> 
>  
> 
> Thank you all. I have it clearer now.
> 
>  
> 
> As a last point: would making the snapshot of the WAL dataset first or last
> make any difference?

How would you know which WAL snapshot to use, and if there is any
missing data between them?

I have often wondered if we could document rules where multiple
asynchronous snapshots would be safe but have never gotten very far.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.




RE: Are ZFS snapshots unsafe when PGSQL is spreading through multiple zpools?

2023-01-17 Thread HECTOR INGERTO
> Another case: a transaction COMMITs, and a slightly later transaction reads 
> the data
> and sets a hint bit.  If the snapshot of the file system with the data 
> directory in it
> is slightly later than the snapshot of the file system with "pg_wal", the 
> COMMIT might
> not be part of the snapshot, but the hint bit could be.
>
> Then these uncommitted data could be visible if you recover from the snapshot.
>
> Yours,
> Laurenz Albe

Thank you all. I have it clearer now.

As a last point: would making the snapshot of the WAL dataset first or last
make any difference?




Re: Are ZFS snapshots unsafe when PGSQL is spreading through multiple zpools?

2023-01-17 Thread Laurenz Albe
On Mon, 2023-01-16 at 14:37 +, HECTOR INGERTO wrote:
> > The database relies on the data being consistent when it performs crash 
> > recovery.
> > Imagine that a checkpoint is running while you take your snapshot.  The 
> > checkpoint
> > syncs a data file with a new row to disk.  Then it writes a WAL record and 
> > updates
> > the control file.  Now imagine that the table with the new row is on a 
> > different
> > file system, and your snapshot captures the WAL and the control file, but 
> > not
> > the new row (it was still sitting in the kernel page cache when the 
> > snapshot was taken).
> > You end up with a lost row.
> > 
> > That is only one scenario.  Many other ways of corruption can happen.
>  
> Can we say then that the risk comes only from the possibility of a checkpoint 
> running
> inside the time gap between the non-simultaneous snapshots?

Another case: a transaction COMMITs, and a slightly later transaction reads the 
data
and sets a hint bit.  If the snapshot of the file system with the data 
directory in it
is slightly later than the snapshot of the file system with "pg_wal", the 
COMMIT might
not be part of the snapshot, but the hint bit could be.

Then these uncommitted data could be visible if you recover from the snapshot.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com




Re: Are ZFS snapshots unsafe when PGSQL is spreading through multiple zpools?

2023-01-17 Thread Magnus Hagander
On Mon, Jan 16, 2023 at 3:37 PM HECTOR INGERTO 
wrote:

> > The database relies on the data being consistent when it performs crash
> recovery.
> > Imagine that a checkpoint is running while you take your snapshot.  The
> checkpoint
> > syncs a data file with a new row to disk.  Then it writes a WAL record
> and updates
> > the control file.  Now imagine that the table with the new row is on a
> different
> > file system, and your snapshot captures the WAL and the control file,
> but not
> > the new row (it was still sitting in the kernel page cache when the
> snapshot was taken).
> > You end up with a lost row.
> >
> > That is only one scenario.  Many other ways of corruption can happen.
>
>
>
> Can we say then that the risk comes only from the possibility of a
> checkpoint running inside the time gap between the non-simultaneous
> snapshots?
>

No. The risk comes from any write at all, foreground or background,
occurring inside this gap. Checkpoints are only one of many types of writes
that can cause the problems.

Data integrity depends on writes happening in the order they are issued
(and, when they are synchronous, confirmed). Hence terms like WAL = Write
*Ahead* Log.

--
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Are ZFS snapshots unsafe when PGSQL is spreading through multiple zpools?

2023-01-17 Thread Alban Hertroys


> On 16 Jan 2023, at 15:37, HECTOR INGERTO  wrote:
> 
> > The database relies on the data being consistent when it performs crash 
> > recovery.
> > Imagine that a checkpoint is running while you take your snapshot.  The 
> > checkpoint
> > syncs a data file with a new row to disk.  Then it writes a WAL record and 
> > updates
> > the control file.  Now imagine that the table with the new row is on a 
> > different
> > file system, and your snapshot captures the WAL and the control file, but 
> > not
> > the new row (it was still sitting in the kernel page cache when the 
> > snapshot was taken).
> > You end up with a lost row.
> > 
> > That is only one scenario.  Many other ways of corruption can happen.
>  
> Can we say then that the risk comes only from the possibility of a checkpoint 
> running inside the time gap between the non-simultaneous snapshots?

I recently followed a course on distributed algorithms and recognised one of 
the patterns here.

The problem boils down to distributed snapshotting, where each ZFS file system 
initiates its own snapshot independently.

Without the file systems communicating with each other and with the database 
about which messages (in this case, the traffic between the database and each 
FS) are part of their snapshots (sent or received), messages can be lost: one 
snapshot may record that a 'message' was sent while no snapshot records that it 
was received, or the other way around.

Algorithms like Tarry, Lai-Yang or the Echo algorithm solve this by adding 
communication between those processes about messages in transit.

Alban Hertroys
--
There is always an exception to always.








RE: Are ZFS snapshots unsafe when PGSQL is spreading through multiple zpools?

2023-01-16 Thread HECTOR INGERTO
> The database relies on the data being consistent when it performs crash 
> recovery.
> Imagine that a checkpoint is running while you take your snapshot.  The 
> checkpoint
> syncs a data file with a new row to disk.  Then it writes a WAL record and 
> updates
> the control file.  Now imagine that the table with the new row is on a 
> different
> file system, and your snapshot captures the WAL and the control file, but not
> the new row (it was still sitting in the kernel page cache when the snapshot 
> was taken).
> You end up with a lost row.
>
> That is only one scenario.  Many other ways of corruption can happen.

Can we say then that the risk comes only from the possibility of a checkpoint 
running inside the time gap between the non-simultaneous snapshots?


Re: Are ZFS snapshots unsafe when PGSQL is spreading through multiple zpools?

2023-01-16 Thread Laurenz Albe
On Mon, 2023-01-16 at 08:41 +, HECTOR INGERTO wrote:
> I understand that I shall not do it, but could we discuss the technical details
> of why silent DB corruption can occur with non-atomic snapshots?

The database relies on the data being consistent when it performs crash 
recovery.
Imagine that a checkpoint is running while you take your snapshot.  The 
checkpoint
syncs a data file with a new row to disk.  Then it writes a WAL record and 
updates
the control file.  Now imagine that the table with the new row is on a different
file system, and your snapshot captures the WAL and the control file, but not
the new row (it was still sitting in the kernel page cache when the snapshot 
was taken).
You end up with a lost row.

That is only one scenario.  Many other ways of corruption can happen.
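
As an illustration of the window in which such a write can land, here is a rough
sketch (the pool and dataset names are made up) that measures the gap between two
per-pool snapshots; anything PostgreSQL writes inside that window can make the
pair of snapshots mutually inconsistent:

zfs snapshot ssdpool/pgdata@demo
zfs snapshot hddpool/pgbig@demo
# print each snapshot's creation time as Unix seconds; the two commands run
# one after the other, so there is always a window between them, even when
# the printed seconds happen to match
zfs get -Hp -o name,value creation ssdpool/pgdata@demo hddpool/pgbig@demo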

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com




RE: Are ZFS snapshots unsafe when PGSQL is spreading through multiple zpools?

2023-01-16 Thread HECTOR INGERTO
I understand that I shall not do it, but could we discuss the technical details 
of why silent DB corruption can occur with non-atomic snapshots?




Re: Are ZFS snapshots unsafe when PGSQL is spreading through multiple zpools?

2023-01-15 Thread Ron
We regularly use "history" databases.  Put them on slow media, and only take 
a backup when data is added to them (monthly, quarterly, etc).


On 1/15/23 15:57, HECTOR INGERTO wrote:


> But you cannot and should not rely on snapshots alone

That’s only for non-atomic (multiple-pool) snapshots, isn’t it?

If I need to rely only on ZFS (automated) snapshots, would the best option 
be to have two DBs, each in its own pool: one HDD DB and one SSD DB? Then 
the backend code would have to know which DB holds the requested data.


*From: *Magnus Hagander <mailto:mag...@hagander.net>
*Sent: *Sunday, January 15, 2023 20:36
*To: *HECTOR INGERTO <mailto:hector_...@hotmail.com>
*CC: *pgsql-gene...@postgresql.org
*Subject: *Re: Are ZFS snapshots unsafe when PGSQL is spreading through 
multiple zpools?


On Sun, Jan 15, 2023 at 8:18 PM HECTOR INGERTO  wrote:

Hello everybody,

I’m using PostgreSQL on openZFS. I use ZFS snapshots as a backup +
hotspare method.

From man zfs-snapshot: “Snapshots are taken atomically, so that all
snapshots correspond to the same moment in time.” So if a PSQL
instance is started from a zfs snapshot, it will start to replay the
WAL from the last checkpoint, in the same way it would do in a crash
or power loss scenario. So from my knowledge, ZFS snapshots can be
used to rollback to a previous point in time. Also, sending those
snapshots to other computers will allow you to have hotspares and
remote backups. If I’m wrong here, I would appreciate being told about
it because I’m basing the whole question on this premise.

On the other hand, we have the tablespace PGSQL feature, which is
great because it allows “unimportant” big data to be written into
cheap HDD and frequently used data into fast NVMe.

So far, so good. The problem is when both ideas are merged. Then,
snapshots from different pools are NOT atomic: the snapshot on the HDD 
pool isn’t going to be taken at exactly the same time as the one on the 
SSD pool, and I don’t know enough about PGSQL internals to know how
dangerous this is. So here is where I would like to ask for your help
with the following questions:

First of all, what kind of problem can this lead to? Are we talking
about potential whole DB corruption or only the loss of a few of the
latest transactions?

Silent data corruption. *not* just losing your latest transaction.

In second place, if I’m initializing a corrupted PGSQL instance
because ZFS snapshots are from different pools and slightly different
times, am I going to notice it somehow or is it going to fail silently?

Silent. You might notice at the application level. Might.

In third and last place, is there some way to quantify the amount of
risk taken when snapshotting a PGSQL instance spread across two (or
more) different pools?

"Don't do it".

If you can't get atomic snapshots, don't do it, period.

You can use them together with a regular online backup. That is 
pg_start_backup() // take the snapshots // pg_stop_backup() 
together with log archiving. That's a perfectly valid method. But you 
cannot and should not rely on snapshots alone.


--

 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




--
Born in Arizona, moved to Babylonia.

RE: Are ZFS snapshots unsafe when PGSQL is spreading through multiple zpools?

2023-01-15 Thread HECTOR INGERTO

> But you cannot and should not rely on snapshots alone

That’s only for non-atomic (multiple-pool) snapshots, isn’t it?

If I need to rely only on ZFS (automated) snapshots, would the best option be to 
have two DBs, each in its own pool: one HDD DB and one SSD DB? Then the backend 
code would have to know which DB holds the requested data.

From: Magnus Hagander<mailto:mag...@hagander.net>
Sent: Sunday, January 15, 2023 20:36
To: HECTOR INGERTO<mailto:hector_...@hotmail.com>
CC: pgsql-gene...@postgresql.org<mailto:pgsql-gene...@postgresql.org>
Subject: Re: Are ZFS snapshots unsafe when PGSQL is spreading through multiple 
zpools?



On Sun, Jan 15, 2023 at 8:18 PM HECTOR INGERTO 
<mailto:hector_...@hotmail.com> wrote:
Hello everybody,

I’m using PostgreSQL on openZFS. I use ZFS snapshots as a backup + hotspare 
method.

From man zfs-snapshot: “Snapshots are taken atomically, so that all snapshots 
correspond to the same moment in time.” So if a PSQL instance is started from 
a zfs snapshot, it will start to replay the WAL from the last checkpoint, in 
the same way it would do in a crash or power loss scenario. So from my 
knowledge, ZFS snapshots can be used to rollback to a previous point in time. 
Also, sending those snapshots to other computers will allow you to have 
hotspares and remote backups. If I’m wrong here, I would appreciate being told 
about it because I’m basing the whole question on this premise.

On the other hand, we have the tablespace PGSQL feature, which is great because 
it allows “unimportant” big data to be written into cheap HDD and frequently 
used data into fast NVMe.

So far, so good. The problem is when both ideas are merged. Then, snapshots 
from different pools are NOT atomic: the snapshot on the HDD pool isn’t going to 
be taken at exactly the same time as the one on the SSD pool, and I don’t know 
enough about PGSQL internals to know how dangerous this is. So here is where I 
would like to ask for your help with the following questions:

First of all, what kind of problem can this lead to? Are we talking about 
potential whole DB corruption or only the loss of a few of the latest 
transactions?

Silent data corruption. *not* just losing your latest transaction.


In second place, if I’m initializing a corrupted PGSQL instance because ZFS 
snapshots are from different pools and slightly different times, am I going to 
notice it somehow or is it going to fail silently?

Silent. You might notice at the application level. Might.


In third and last place, is there some way to quantify the amount of risk taken 
when snapshotting a PGSQL instance spread across two (or more) different pools?


"Don't do it".

If you can't get atomic snapshots, don't do it, period.

You can use them together with a regular online backup. That is 
pg_start_backup() // take the snapshots // pg_stop_backup() together 
with log archiving. That's a perfectly valid method. But you cannot and should 
not rely on snapshots alone.

--
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: Are ZFS snapshots unsafe when PGSQL is spreading through multiple zpools?

2023-01-15 Thread Magnus Hagander
On Sun, Jan 15, 2023 at 10:57 PM HECTOR INGERTO 
wrote:

>
>
> > But you cannot and should not rely on snapshots alone
>
>
>
> That’s only for non-atomic (multiple-pool) snapshots, isn’t it?
>

Right. For single-filesystem installs it should be fine. Just make sure it
has both the data and the WAL directories in the same one.
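
One way to check that, sketched below with assumed paths (the data directory
location is an assumption; adjust it for your install): confirm that the data
directory, pg_wal and every tablespace sit on the same ZFS dataset, so a single
snapshot covers everything.

PGDATA=/var/lib/postgresql/data
zfs list -H -o name,mountpoint            # which datasets exist and where they mount
df -h "$PGDATA" "$PGDATA/pg_wal"          # both should report the same filesystem
ls -ld "$PGDATA/pg_wal"                   # watch for a symlink pointing to another filesystem
psql -Atc "SELECT spcname, pg_tablespace_location(oid) FROM pg_tablespace;"
# any non-empty location above is a tablespace outside the data directory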




> If I need to rely only on ZFS (automated) snapshots, would the best option
> be to have two DBs, each in its own pool: one HDD DB and one SSD DB? Then
> the backend code would have to know which DB holds the requested data.
>

 You could. I wouldn't -- I would set it up to use proper backups instead,
maybe using snapshots as the infrastructure. That gives you other
advantages as well, like being able to do PITR. It's a little more to set
up, but I'd say it's worth it.
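
The "proper backup" side of such a setup needs WAL archiving in place. A minimal
sketch of that piece, with an assumed archive directory and $PGDATA path (these
are standard postgresql.conf parameters; the cp-based archive_command is only an
illustration, and changing archive_mode requires a server restart):

cat >> "$PGDATA/postgresql.conf" <<'EOF'
archive_mode = on
archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'
EOF
# With archived WAL plus a base backup (for example the snapshot-based
# procedure sketched elsewhere in this thread), restore_command and
# recovery_target_time give point-in-time recovery.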

//Magnus


Re: Are ZFS snapshots unsafe when PGSQL is spreading through multiple zpools?

2023-01-15 Thread Magnus Hagander
On Sun, Jan 15, 2023 at 8:18 PM HECTOR INGERTO 
wrote:

> Hello everybody,
>
>
>
> I’m using PostgreSQL on openZFS. I use ZFS snapshots as a backup +
> hotspare method.
>
>
>
> From man zfs-snapshot: “Snapshots are taken atomically, so that all
> snapshots correspond to the same moment in time.” So if a PSQL instance is
> started from a zfs snapshot, it will start to replay the WAL from the last
> checkpoint, in the same way it would do in a crash or power loss scenario. So
> from my knowledge, ZFS snapshots can be used to rollback to a previous
> point in time. Also, sending those snapshots to other computers will allow
> you to have hotspares and remote backups. If I’m wrong here, I would
> appreciate being told about it because I’m basing the whole question on
> this premise.
>
>
>
> On the other hand, we have the tablespace PGSQL feature, which is great
> because it allows “unimportant” big data to be written into cheap HDD and
> frequently used data into fast NVMe.
>
>
>
> So far, so good. The problem is when both ideas are merged. Then,
> snapshots from different pools are NOT atomic: the snapshot on the HDD pool
> isn’t going to be taken at exactly the same time as the one on the SSD pool,
> and I don’t know enough about PGSQL internals to know how dangerous this
> is. So here is where I would like to ask for your help with the following
> questions:
>
>
>
> First of all, what kind of problem can this lead to? Are we talking about
> potential whole DB corruption or only the loss of a few of the latest
> transactions?
>

Silent data corruption. *not* just losing your latest transaction.



> In second place, if I’m initializing a corrupted PGSQL instance because
> ZFS snapshots are from different pools and slightly different times, am I
> going to notice it somehow or is it going to fail silently?
>

Silent. You might notice at the application level. Might.



> In third and last place, is there some way to quantify the amount of risk
> taken when snapshotting a PGSQL instance spread across two (or more)
> different pools?
>
>
>
"Don't do it".

If you can't get atomic snapshots, don't do it, period.

You can use them together with a regular online backup. That is
pg_start_backup() // take the snapshots // pg_stop_backup()
together with log archiving. That's a perfectly valid method. But you
cannot and should not rely on snapshots alone.
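
For reference, a minimal sketch of that procedure around cross-pool ZFS snapshots
(the dataset names and the backup label are made up). On PostgreSQL 15 and later
the functions are pg_backup_start()/pg_backup_stop(); older releases use
pg_start_backup()/pg_stop_backup(). With the non-exclusive API both calls have to
run in the same session, hence a single psql invocation with \! shell escapes:

psql -U postgres <<'SQL'
SELECT pg_backup_start('zfs-snapshots', true);
\! zfs snapshot ssdpool/pgdata@backup
\! zfs snapshot hddpool/pgbig@backup
SELECT labelfile FROM pg_backup_stop(true);
SQL
# Save the labelfile output as backup_label inside the restored data directory,
# and keep the archived WAL; restoring the snapshots plus replaying that WAL
# then behaves like any other online backup restore.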

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/