Re: block-level incremental backup

2019-04-09 Thread Arthur Zakirov

Hello,

On 09.04.2019 18:48, Robert Haas wrote:

> - It would also be nice if pg_basebackup could write backups to places
> other than the local disk, like an object store, a tape drive, etc.
> But that also sounds like a separate effort.
>
> Thoughts?


(Just thinking out loud) It might also be useful to have a remote restore
facility (i.e. if pg_combinebackup could write to non-local storage), so
you don't need to restore the instance into a local place and then copy/move
it to the remote machine. But it seems to me that this is the most nontrivial
feature and would require much more effort than the other points.


In pg_probackup we have remote restore via SSH in a beta state. But
SSH isn't an option for an in-core approach, I think.


--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: block-level incremental backup

2019-04-09 Thread Andres Freund
Hi,

On 2019-04-09 11:48:38 -0400, Robert Haas wrote:
> 2. When you use pg_basebackup in this way, each relation file that is
> not sent in its entirety is replaced by a file with a different name.
> For example, instead of base/16384/16417, you might get
> base/16384/partial.16417 or however we decide to name them.

Hm. But that means that files that are shipped nearly in their entirety
need to be fully rewritten. I wonder if it's better to ship them as files
with holes, and have the metadata in a separate file. That'd then allow
just filling in the holes with data from the older version.  I'd assume
that there are a lot of workloads where some significantly sized relations
will get updated in nearly their entirety between backups.
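
To make that concrete, the receiving side could do something like the
following untested sketch, assuming 8kB blocks and a made-up metadata format
that is just an array of changed block numbers (with the data file holding
the corresponding block images in the same order); everything not written
simply stays a hole:

/*
 * Untested sketch only: recreate a relation segment as a file with holes
 * from a made-up "metadata file" (an array of uint32 block numbers) and a
 * data file containing the corresponding 8kB block images in the same
 * order.  Blocks that were not shipped are simply left as holes.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ 8192

int
main(int argc, char **argv)
{
	FILE	   *meta;
	FILE	   *data;
	int			out;
	uint32_t	blkno;
	char		image[BLCKSZ];

	if (argc != 4)
	{
		fprintf(stderr, "usage: %s metadata_file data_file target_file\n", argv[0]);
		return 1;
	}
	meta = fopen(argv[1], "rb");
	data = fopen(argv[2], "rb");
	out = open(argv[3], O_WRONLY | O_CREAT, 0600);
	if (meta == NULL || data == NULL || out < 0)
	{
		perror("open");
		return 1;
	}

	while (fread(&blkno, sizeof(blkno), 1, meta) == 1)
	{
		if (fread(image, BLCKSZ, 1, data) != 1)
		{
			fprintf(stderr, "data file shorter than metadata\n");
			return 1;
		}
		/* write at the block's natural offset; unwritten ranges stay holes */
		if (pwrite(out, image, BLCKSZ, (off_t) blkno * BLCKSZ) != BLCKSZ)
		{
			perror("pwrite");
			return 1;
		}
	}

	close(out);
	fclose(data);
	fclose(meta);
	return 0;
}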


> Each such file will store near the beginning of the file a list of all the
> blocks contained in that file, and the blocks themselves will follow
> at offsets that can be predicted from the metadata at the beginning of
> the file.  The idea is that you shouldn't have to read the whole file
> to figure out which blocks it contains, and if you know specifically
> what blocks you want, you should be able to reasonably efficiently
> read just those blocks.  A backup taken in this manner should also
> probably create some kind of metadata file in the root directory that
> stops the server from starting and lists other salient details of the
> backup.  In particular, you need the threshold LSN for the backup
> (i.e. contains blocks newer than this) and the start LSN for the
> backup (i.e. the LSN that would have been returned from
> pg_start_backup).

I wonder if we shouldn't just integrate that into pg_control or such. So
that:

> 3. There should be a new tool that knows how to merge a full backup
> with any number of incremental backups and produce a complete data
> directory with no remaining partial files.

Could just be part of server startup?


> - I imagine that the server would offer this functionality through a
> new replication command or a syntax extension to an existing command,
> so it could also be used by tools other than pg_basebackup if they
> wished.

Would this logic somehow be usable from tools that don't want to copy
the data directory via pg_basebackup (e.g. for parallelism, to directly
send to some backup service / SAN / whatnot)?


> - It would also be nice if pg_basebackup could write backups to places
> other than the local disk, like an object store, a tape drive, etc.
> But that also sounds like a separate effort.

Indeed seems separate. But worthwhile.

Greetings,

Andres Freund




Re: block-level incremental backup

2019-04-09 Thread Gary M
Having worked in the data storage industry since the '80s, I think backup
is an important capability. Having said that, the ideas should be expanded
to an overall data management strategy combining local and remote storage
including cloud.

From my experience, record and transaction consistency is critical to any
replication action, including backup.  The approach commonly includes a
starting baseline (a snapshot, if you prefer) and a set of incremental changes
to that snapshot.  I always used the transaction logs for both backup and
remote replication to other DBMSs. In the ECMA-208 standard ('94), you will
note a file object with a transaction property. Although the language
specifies files, a file may be any set of records.

SAN-based snapshots usually occur on the SAN storage device, meaning that
cached data (not yet written to disk) will not be snapshotted or will be
inconsistently referenced, likely resulting in a corrupted database on restore.

Snapshots are point-in-time states of storage objects. Between snapshot
periods, any number of changes may occur.  If a record of "all changes"
is required, snapshot methods must be augmented with a historical record:
the transaction log.

 Delta block methods for backups have been in practice for many years; ZFS
has adopted the practice for block management. The usefulness of incremental
backups, whether based on blocks, transactions or other methods, depends on
prior data. Like primary storage, backup media can fail, become lost, or be
inadvertently corrupted. The result of losing incremental backup data is that
the restored data after the point of loss is likely corrupted.

cheers,
garym

On Tue, Apr 9, 2019 at 10:35 AM Andres Freund  wrote:

> Hi,
>
> On 2019-04-09 11:48:38 -0400, Robert Haas wrote:
> > 2. When you use pg_basebackup in this way, each relation file that is
> > not sent in its entirety is replaced by a file with a different name.
> > For example, instead of base/16384/16417, you might get
> > base/16384/partial.16417 or however we decide to name them.
>
> Hm. But that means that files that are shipped nearly in their entirety,
> need to be fully rewritten. Wonder if it's better to ship them as files
> with holes, and have the metadata in a separate file. That'd then allow
> to just fill in the holes with data from the older version.  I'd assume
> that there's a lot of workloads where some significantly sized relations
> will get updated in nearly their entirety between backups.
>
>
> > Each such file will store near the beginning of the file a list of all
> the
> > blocks contained in that file, and the blocks themselves will follow
> > at offsets that can be predicted from the metadata at the beginning of
> > the file.  The idea is that you shouldn't have to read the whole file
> > to figure out which blocks it contains, and if you know specifically
> > what blocks you want, you should be able to reasonably efficiently
> > read just those blocks.  A backup taken in this manner should also
> > probably create some kind of metadata file in the root directory that
> > stops the server from starting and lists other salient details of the
> > backup.  In particular, you need the threshold LSN for the backup
> > (i.e. contains blocks newer than this) and the start LSN for the
> > backup (i.e. the LSN that would have been returned from
> > pg_start_backup).
>
> I wonder if we shouldn't just integrate that into pg_control or such. So
> that:
>
> > 3. There should be a new tool that knows how to merge a full backup
> > with any number of incremental backups and produce a complete data
> > directory with no remaining partial files.
>
> Could just be part of server startup?
>
>
> > - I imagine that the server would offer this functionality through a
> > new replication command or a syntax extension to an existing command,
> > so it could also be used by tools other than pg_basebackup if they
> > wished.
>
> Would this logic somehow be usable from tools that don't want to copy
> the data directory via pg_basebackup (e.g. for parallelism, to directly
> send to some backup service / SAN / whatnot)?
>
>
> > - It would also be nice if pg_basebackup could write backups to places
> > other than the local disk, like an object store, a tape drive, etc.
> > But that also sounds like a separate effort.
>
> Indeed seems separate. But worthwhile.
>
> Greetings,
>
> Andres Freund
>
>
>


Re: block-level incremental backup

2019-04-09 Thread Robert Haas
On Tue, Apr 9, 2019 at 12:35 PM Andres Freund  wrote:
> Hm. But that means that files that are shipped nearly in their entirety,
> need to be fully rewritten. Wonder if it's better to ship them as files
> with holes, and have the metadata in a separate file. That'd then allow
> to just fill in the holes with data from the older version.  I'd assume
> that there's a lot of workloads where some significantly sized relations
> will get updated in nearly their entirety between backups.

I don't want to rely on holes at the FS level.  I don't want to have
to worry about what Windows does and what every Linux filesystem does
and what NetBSD and FreeBSD and Dragonfly BSD and MacOS do.  And I
don't want to have to write documentation for the fine manual
explaining to people that they need to use a hole-preserving tool when
they copy an incremental backup around.  And I don't want to have to
listen to complaints from $USER that their backup tool, $THING, is not
hole-aware.  Just - no.

But what we could do is have some threshold (as git does), beyond
which you just send the whole file.  For example if >90% of the blocks
have changed, or >80% or whatever, then you just send everything.
That way, if you have a database where you have lots and lots of 1GB
segments with low churn (so that you can't just use full backups) and
lots and lots of 1GB segments with high churn (to create the problem
you're describing) you'll still be OK.
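
(The server-side check itself would be trivial -- something like this
sketch, where the 90% threshold and the names are of course made up:)

/*
 * Sketch of the whole-file-vs-incremental decision; the threshold and the
 * function name are made up for illustration.
 */
#include <stdbool.h>

#define WHOLE_FILE_THRESHOLD 0.90

static bool
send_whole_file(unsigned changed_blocks, unsigned total_blocks)
{
	if (total_blocks == 0)
		return true;			/* empty or new file: just send it */
	return (double) changed_blocks / total_blocks > WHOLE_FILE_THRESHOLD;
}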

> > 3. There should be a new tool that knows how to merge a full backup
> > with any number of incremental backups and produce a complete data
> > directory with no remaining partial files.
>
> Could just be part of server startup?

Yes, but I think that sucks.  You might not want to start the server
but rather just create a new synthetic backup.  And realistically,
it's hard to imagine the server doing anything but synthesizing the
backup first and then proceeding as normal.  In theory there's no
reason why it couldn't be smart enough to construct the files it needs
"on demand" in the background, but that sounds really hard and I don't
think there's enough value to justify that level of effort.  YMMV, of
course.

> > - I imagine that the server would offer this functionality through a
> > new replication command or a syntax extension to an existing command,
> > so it could also be used by tools other than pg_basebackup if they
> > wished.
>
> Would this logic somehow be usable from tools that don't want to copy
> the data directory via pg_basebackup (e.g. for parallelism, to directly
> send to some backup service / SAN / whatnot)?

Well, I'm imagining it as a piece of server-side functionality that
can figure out what has changed using one of several possible methods,
and then send that stuff to you.  So I think if you don't have a
server connection you are out of luck.  If you have a server
connection but just want to be told what has changed rather than
actually being given that data, that might be something that could be
worked into the design.  I'm not sure whether that's a real need,
though, or just extra work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-04-09 Thread Robert Haas
On Tue, Apr 9, 2019 at 12:32 PM Arthur Zakirov  wrote:
> In pg_probackup we have remote restore via SSH in the beta state. But
> SSH isn't an option for in-core approach I think.

That's a little off-topic for this thread, but I think we should have
some kind of extensible mechanism for pg_basebackup and maybe other
tools, so that you can teach it to send backups to AWS or your
teletype or etch them on stone tablets or whatever without having to
modify core code.  But let's not design that mechanism on this thread,
'cuz that will distract from what I want to talk about here.  Feel
free to start a new thread for it, though, and I'll jump in.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-04-09 Thread Peter Eisentraut
On 2019-04-09 17:48, Robert Haas wrote:
> It will
> probably be more efficient in many cases to instead scan all the WAL
> generated since that LSN and extract block references from it, but
> that is only possible if the server has all of that WAL available or
> can somehow get it from the archive.

This could be a variant of a replication slot that preserves WAL between
incremental backup runs.

> 3. There should be a new tool that knows how to merge a full backup
> with any number of incremental backups and produce a complete data
> directory with no remaining partial files.

Are there by any chance standard file formats and tools that describe a
binary difference between directories?  That would be really useful here.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: block-level incremental backup

2019-04-09 Thread Alvaro Herrera
On 2019-Apr-09, Peter Eisentraut wrote:

> On 2019-04-09 17:48, Robert Haas wrote:

> > 3. There should be a new tool that knows how to merge a full backup
> > with any number of incremental backups and produce a complete data
> > directory with no remaining partial files.
> 
> Are there by any chance standard file formats and tools that describe a
> binary difference between directories?  That would be really useful here.

VCDIFF? https://tools.ietf.org/html/rfc3284

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: block-level incremental backup

2019-04-10 Thread Andrey Borodin
Hi!

> 9 апр. 2019 г., в 20:48, Robert Haas  написал(а):
> 
> Thoughts?
Thanks for this long and thoughtful post!

At Yandex, we have been using incremental backups for some years now. Initially
we used a patched pgbarman, then we implemented this functionality in WAL-G, and
there are still many things to be done. We have more than 1PB of clusters
backed up with this technology.
Most of the time we use this technology as part of the HA setup in our managed
PostgreSQL service, so for us the main goals are to operate backups cheaply and
to restore a new node quickly. Here's what I see from our perspective.

1. Yes, this feature is important.

2. This importance comes not from reduced disk storage: magnetic disks and
object storage are very cheap.

3. Incremental backups save a lot of network bandwidth. It is non-trivial for
the storage system to ingest hundreds of TB daily.

4. Incremental backups are a redundancy of WAL, intended for parallel
application. An incremental backup applied sequentially is not very useful; in
many cases it will not be much faster than simple WAL replay.

5. As long as increments duplicate WAL functionality, it is not worth pursuing
further reductions in storage utilization. We scan WAL during archiving,
extract the numbers of changed blocks, and store a change map for a group of
WAL segments in the archive.

6. These change maps can be used to build an increment of the visibility map
(if I recall correctly). But you cannot compare LSNs on a visibility map page:
some operations do not bump them.

7. We use the change maps during backups and during WAL replay - we know which
blocks will change far in advance and prefetch them into the page cache, much
as pg_prefaulter does (see the sketch after this list).

8. There is similar functionality in RMAN for one well-known database; it
used to store 8 sets of change maps. That database also has a cool "increment
for catchup" feature.

9. We call an incremental backup a "delta backup". This wording describes the
purpose more precisely: it is not the "next version of the DB", it is the
"difference between two DB states". But the choice of wording does not matter
much.


Here are slides from my talk at PgConf.APAC[0]. I've proposed a talk on this 
matter to PgCon, but it was not accepted. I will try next year :)

> 9 апр. 2019 г., в 20:48, Robert Haas  написал(а):
> - This is just a design proposal at this point; there is no code.  If
> this proposal, or some modified version of it, seems likely to be
> acceptable, I and/or my colleagues might try to implement it.

I'll be happy to help with code, discussion and patch review.

Best regards, Andrey Borodin.

[0] https://yadi.sk/i/Y_S1iqNN5WxS6A



Re: block-level incremental backup

2019-04-10 Thread Konstantin Knizhnik



On 09.04.2019 18:48, Robert Haas wrote:

> 1. There should be a way to tell pg_basebackup to request from the
> server only those blocks where LSN >= threshold_value.


Some time ago I implemented an alternative version of the ptrack utility
(not the one used in pg_probackup)
which detects updated blocks at the file level. It is very simple and maybe
it could at some point be integrated into master.

I have attached a patch against vanilla PostgreSQL to this mail.
Right now it contains just two GUCs:

ptrack_map_size: size of the ptrack map (number of elements) used for
incremental backup; 0 means disabled.

ptrack_block_log: logarithm of the ptrack block size (amount of pages)

and one function:

pg_ptrack_get_changeset(startlsn pg_lsn) returns 
{relid,relfilenode,reltablespace,forknum,blocknum,segsize,updlsn,path}


The idea is very simple: it creates a hash map of fixed size (ptrack_map_size)
and stores the LSNs of written pages in this map.
Since the default Postgres page size seems to be too small for a ptrack
block (requiring either too large a hash map or an increased number of
collisions, as well as more random reads), a ptrack block can be configured
to consist of multiple pages (a power of 2).


This patch uses a memory-mapping mechanism. Unfortunately there is no
portable wrapper for it in Postgres, so I had to provide my own
implementations for Unix/Windows. Certainly that is not good and should be
rewritten.
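
To illustrate the general mechanism, here is a simplified standalone sketch
(not the code from the attached patch; the hash is deliberately primitive):

/*
 * Simplified standalone sketch of the mechanism, not the attached patch:
 * a file-backed array of LSN slots is mmap'ed, and every written block
 * bumps the slot chosen by hashing its (relfilenode, block group).
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define PTRACK_MAP_SIZE		1000003	/* number of slots, a prime */
#define PTRACK_BLOCK_LOG	7		/* 2^7 pages of 8kB = 1MB per ptrack block */

static uint64_t *ptrack_map;

static void
ptrack_init(const char *path)
{
	size_t		size = PTRACK_MAP_SIZE * sizeof(uint64_t);
	int			fd = open(path, O_RDWR | O_CREAT, 0600);

	if (fd < 0 || ftruncate(fd, size) != 0)
	{
		perror("ptrack map");
		exit(1);
	}
	ptrack_map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (ptrack_map == MAP_FAILED)
	{
		perror("mmap");
		exit(1);
	}
	close(fd);					/* the mapping remains valid after close */
}

static void
ptrack_mark_block(uint32_t relfilenode, uint32_t blkno, uint64_t lsn)
{
	/* hash the 1MB block group this page belongs to (primitive hash) */
	uint64_t	key = ((uint64_t) relfilenode << 32) | (blkno >> PTRACK_BLOCK_LOG);
	uint64_t   *slot = &ptrack_map[key % PTRACK_MAP_SIZE];

	if (*slot < lsn)
		*slot = lsn;			/* the real patch uses atomics here */
}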


How to use?

1. Define ptrack_map_size in postgresql.conf, for example (use a prime
number for more uniform hashing):

ptrack_map_size = 1000003

2.  Remember current lsn.

psql postgres -c "select pg_current_wal_lsn()"
 pg_current_wal_lsn
--------------------
 0/224A268
(1 row)

3. Do some updates.

$ pgbench -T 10 postgres

4. Select changed blocks.

 select * from pg_ptrack_get_changeset('0/224A268');
 relid | relfilenode | reltablespace | forknum | blocknum | segsize |  updlsn   |       path
-------+-------------+---------------+---------+----------+---------+-----------+------------------
 16390 |       16396 |          1663 |       0 |     1640 |       1 | 0/224FD88 | base/12710/16396
 16390 |       16396 |          1663 |       0 |     1641 |       1 | 0/2258680 | base/12710/16396
 16390 |       16396 |          1663 |       0 |     1642 |       1 | 0/22615A0 | base/12710/16396

...

Certainly ptrack should be used as part of some backup tool (such as
pg_basebackup or pg_probackup).



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 61a8f11..f4b8506 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -25,9 +25,20 @@
#include <unistd.h>
#include <fcntl.h>
 
+#include 
+#ifndef WIN32
+#include "sys/mman.h"
+#endif
+
 #include "miscadmin.h"
+#include "funcapi.h"
+#include "access/hash.h"
+#include "access/table.h"
 #include "access/xlogutils.h"
 #include "access/xlog.h"
+#include "access/htup_details.h"
+#include "catalog/pg_class.h"
+#include "catalog/pg_tablespace.h"
 #include "pgstat.h"
 #include "postmaster/bgwriter.h"
 #include "storage/fd.h"
@@ -36,6 +47,7 @@
 #include "storage/relfilenode.h"
 #include "storage/smgr.h"
 #include "storage/sync.h"
+#include "utils/builtins.h"
 #include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "pg_trace.h"
@@ -116,6 +128,18 @@ static MemoryContext MdCxt;		/* context for all MdfdVec objects */
  */
 #define EXTENSION_DONT_CHECK_SIZE	(1 << 4)
 
+/*
+ * Size of ptrack map (number of entries)
+ */
+int ptrack_map_size;
+
+/*
+ * Logarithm of ptrack block size (amount of pages)
+ */
+int  ptrack_block_log;
+
+static int ptrack_fd;
+static pg_atomic_uint64* ptrack_map;
 
 /* local routines */
 static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
@@ -138,7 +162,7 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
 			 BlockNumber blkno, bool skipFsync, int behavior);
 static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
 		   MdfdVec *seg);
-
+static void  ptrack_mark_block(SMgrRelation reln, ForkNumber forkno, BlockNumber blkno);
 
 /*
  *	mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -422,6 +446,8 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		register_dirty_segment(reln, forknum, v);
 
 	Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
+
+	ptrack_mark_block(reln, forknum, blocknum);
 }
 
 /*
@@ -575,6 +601,8 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum,
 		nblocks -= nflush;
 		blocknum += nflush;
 	}
+
+	ptrack_sync();
 }
 
 /*
@@ -700,6 +728,8 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 	if (!skipFsync && !SmgrIsTemp(reln))
 		register_dirty_segment(reln, forknum, v);
+
+	ptrack_mark_block(reln, forknum, blocknum);
 }
 
 /*
@@ -886,6 +916,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 			FilePathName(v->mdfd_vfd))));
 		segno--;
 	}
+	ptrack_sync();

Re: block-level incremental backup

2019-04-10 Thread Jehan-Guillaume de Rorthais
Hi,

On Tue, 9 Apr 2019 11:48:38 -0400
Robert Haas  wrote:

> Several companies, including EnterpriseDB, NTT, and Postgres Pro, have
> developed technology that permits a block-level incremental backup to
> be taken from a PostgreSQL server.  I believe the idea in all of those
> cases is that non-relation files should be backed up in their
> entirety, but for relation files, only those blocks that have been
> changed need to be backed up.  I would like to propose that we should
> have a solution for this problem in core, rather than leaving it to
> each individual PostgreSQL company to develop and maintain their own
> solution. Generally my idea is:
> 
> 1. There should be a way to tell pg_basebackup to request from the
> server only those blocks where LSN >= threshold_value.  There are
> several possible ways for the server to implement this, the simplest
> of which is to just scan all the blocks and send only the ones that
> satisfy that criterion.  That might sound dumb, but it does still save
> network bandwidth, and it works even without any prior setup.

+1. This is a simple design and probably an easy first step that already
brings a lot of benefits.
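
For reference, the naive scan is essentially just this (standalone sketch;
it assumes the standard 8kB page size, takes the threshold LSN as a plain
integer, and relies on pd_lsn being the first 8 bytes of the page header):

/*
 * Standalone sketch of the naive approach: scan one relation segment and
 * print the blocks whose page-header LSN (pd_lsn, the first 8 bytes of
 * the page, stored as two 32-bit halves in native byte order) is >= the
 * given threshold.  Assumes the standard 8kB page size.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLCKSZ 8192

int
main(int argc, char **argv)
{
	FILE	   *f;
	char		page[BLCKSZ];
	uint32_t	blkno = 0;
	uint64_t	threshold;

	if (argc != 3)
	{
		fprintf(stderr, "usage: %s segment_file threshold_lsn_as_integer\n", argv[0]);
		return 1;
	}
	threshold = strtoull(argv[2], NULL, 0);

	f = fopen(argv[1], "rb");
	if (f == NULL)
	{
		perror(argv[1]);
		return 1;
	}

	while (fread(page, BLCKSZ, 1, f) == 1)
	{
		uint32_t	xlogid;
		uint32_t	xrecoff;
		uint64_t	page_lsn;

		memcpy(&xlogid, page, sizeof(uint32_t));
		memcpy(&xrecoff, page + sizeof(uint32_t), sizeof(uint32_t));
		page_lsn = ((uint64_t) xlogid << 32) | xrecoff;

		if (page_lsn >= threshold)
			printf("block %u changed (lsn %X/%X)\n",
				   (unsigned) blkno, (unsigned) xlogid, (unsigned) xrecoff);
		blkno++;
	}
	fclose(f);
	return 0;
}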

> It will probably be more efficient in many cases to instead scan all the WAL
> generated since that LSN and extract block references from it, but
> that is only possible if the server has all of that WAL available or
> can somehow get it from the archive.

I seize the opportunity to discuss this on the fly.

I've been playing with the idea of producing incremental backups from
archives for many years, but I've only started PoC'ing it this year.

My idea would be to create a new tool working on archived WAL, with no burden
on the server side. The basic concept is:

* parse the archives
* record the latest relevant FPW for the incremental backup
* write new WAL with the recorded FPWs, removing/rewriting duplicated WAL records.

It's just a PoC and I haven't finished the WAL writing part...not even talking
about the replay part. I'm not even sure this project is a good idea, but it is
a good educational exercise for me in the meantime.

Anyway, using real life OLTP production archives, my stats were:

  # WAL   xlogrec kept   Size WAL kept
    127        39%            50%
    383        22%            38%
    639        20%            29%

Based on these stats, I expect this would save a lot of time during recovery as
a first step. If it matures, it might even save a lot of archive space or
extend the retention period with degraded granularity. It would even help
take full backups at a lower frequency.

Any thoughts about this design would be much appreciated. I suppose this should
be off-list or in a new thread to avoid polluting this thread, as this is a
slightly different subject.

Regards,


PS: I was surprised to still find some existing pieces of code related to
pglesslog in core. That project has been discontinued and the WAL format has
changed in the meantime.




Re: block-level incremental backup

2019-04-10 Thread Robert Haas
On Tue, Apr 9, 2019 at 5:28 PM Alvaro Herrera  wrote:
> On 2019-Apr-09, Peter Eisentraut wrote:
> > On 2019-04-09 17:48, Robert Haas wrote:
> > > 3. There should be a new tool that knows how to merge a full backup
> > > with any number of incremental backups and produce a complete data
> > > directory with no remaining partial files.
> >
> > Are there by any chance standard file formats and tools that describe a
> > binary difference between directories?  That would be really useful here.
>
> VCDIFF? https://tools.ietf.org/html/rfc3284

I don't understand VCDIFF very well, but I see some potential problems
with going in this direction.

First, suppose we take a full backup on Monday.  Then, on Tuesday, we
want to take an incremental backup.  In my proposal, the backup server
only needs to provide the database with one piece of information: the
start-LSN of the previous backup.  The server determines which blocks
are recently modified and sends them to the client, which stores them.
The end.  On the other hand, storing a maximally compact VCDIFF seems
to require that, for each block modified in the Tuesday backup, we go
read the corresponding block as it existed on Monday.  Assuming that
the server is using some efficient method of locating modified blocks,
this will approximately double the amount of read I/O required to
complete the backup: either the server or the client must now read not
only the current version of the block but the previous versions.  If
the previous backup is an incremental backup that does not contain
full block images but only VCDIFF content, whoever is performing the
VCDIFF calculation will need to walk the entire backup chain and
reconstruct the previous contents of the previous block so that it can
compute the newest VCDIFF.  A customer who does an incremental backup
every day and maintains a synthetic full backup from 1 week prior will
see a roughly eightfold increase in read I/O compared to the design I
proposed.

The same problem exists at restore time.  In my design, the total read
I/O required is equal to the size of the database, plus however much
metadata needs to be read from older delta files -- and that should be
fairly small compared to the actual data being read, at least in
normal, non-extreme cases.  But if we are going to proceed by applying
a series of delta files, we're going to need to read every older
backup in its entirety.  If the turnover percentage is significant,
say 20%/day, and if the backup chain is say 7 backups long to get back
to a full backup, this is a huge difference.  Instead of having to
read ~100% of the database size, as in my proposal, we'll need to read
100% + (6 * 20%) = 220% of the database size.

Since VCDIFF uses an add-copy-run language to described differences,
we could try to work around the problem that I just described by
describing each changed data block as an 8192-byte add, and unchanged
blocks as an 8192-byte copy.  If we did that, then I think that the
problem at backup time goes away: we can write out a VCDIFF-format
file for the changed blocks based just on knowing that those are the
blocks that have changed, without needing to access the older file. Of
course, if we do it this way, the file will be larger than it would be
if we actually compared the old and new block contents and wrote out a
minimal VCDIFF, but it does make taking a backup a lot simpler.  Even
with this proposal, though, I think we still have trouble with restore
time.  I proposed putting the metadata about which blocks are included
in a delta file at the beginning of the file, which allows a restore
of a new incremental backup to relatively efficiently flip through
older backups to find just the blocks that it needs, without having to
read the whole file.  But I think (although I am not quite sure) that
in the VCDIFF format, the payload for an ADD instruction is stored
near the instruction itself.  The result would be that you'd have to basically
read the whole file at restore time to figure out which blocks were
available from that file and which ones needed to be retrieved from an
older backup.  So while this approach would fix the backup-time
problem, I believe that it would still require significantly more read
I/O at restore time than my proposal.
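
(To be concrete about "metadata at the beginning", I'm imagining a layout
roughly like the sketch below -- purely illustrative, the real format, names,
and magic number would still need to be designed:)

/*
 * Purely illustrative layout for an incremental ("partial") relation file;
 * nothing here is designed yet, the names and magic number are made up.
 */
#include <stdint.h>

#define INCR_FILE_MAGIC 0x494E4352	/* "INCR", made up */

typedef struct IncrFileHeader
{
	uint32_t	magic;			/* identifies the file format */
	uint32_t	nblocks;		/* number of block images that follow */
	uint64_t	threshold_lsn;	/* only blocks newer than this are included */
	/* followed by: uint32_t blocknos[nblocks]               */
	/* followed by: nblocks 8kB block images, in that order  */
} IncrFileHeader;

/* The offset of the i-th block image is predictable from the header alone. */
static inline uint64_t
incr_block_offset(const IncrFileHeader *hdr, uint32_t i)
{
	return sizeof(IncrFileHeader)
		+ (uint64_t) hdr->nblocks * sizeof(uint32_t)	/* block number list */
		+ (uint64_t) i * 8192;							/* preceding images */
}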

Furthermore, if, at backup time, we have to do anything that requires
access to the old data, either the client or the server needs to have
access to that data.  Notwithstanding the costs of reading it, that
doesn't seem very desirable.  The server is quite unlikely to have
access to the backups, because most users want to back up to a
different server in order to guard against a hardware failure.  The
client is more likely to be running on a machine where it has access
to the data, because many users back up to the same machine every day,
so the machine that is taking the current backup probably has the
older one.  However, accessing that old backup might not be cheap.  It
could be located in an object store in the cl

Re: block-level incremental backup

2019-04-10 Thread Robert Haas
On Wed, Apr 10, 2019 at 10:57 AM Jehan-Guillaume de Rorthais
 wrote:
> My idea would be create a new tool working on archived WAL. No burden
> server side. Basic concept is:
>
> * parse archives
> * record latest relevant FPW for the incr backup
> * write new WALs with recorded FPW and removing/rewriting duplicated 
> walrecords.
>
> It's just a PoC and I hadn't finished the WAL writing part...not even talking
> about the replay part. I'm not even sure this project is a good idea, but it 
> is
> a good educational exercice to me in the meantime.
>
> Anyway, using real life OLTP production archives, my stats were:
>
>   # WAL   xlogrec kept   Size WAL kept
>     127        39%            50%
>     383        22%            38%
>     639        20%            29%
>
> Based on this stats, I expect this would save a lot of time during recovery in
> a first step. If it get mature, it might even save a lot of archives space or
> extend the retention period with degraded granularity. It would even help
> taking full backups with a lower frequency.
>
> Any thoughts about this design would be much appreciated. I suppose this 
> should
> be offlist or in a new thread to avoid polluting this thread as this is a
> slightly different subject.

Interesting idea, but I don't see how it can work if you only deal
with the FPWs and not the other records.  For instance, suppose that
you take a full backup at time T0, and then at time T1 there are two
modifications to a certain block in quick succession.  That block is
then never touched again.  Since no checkpoint intervenes between the
modifications, the first one emits an FPI and the second does not.
Capturing the FPI is fine as far as it goes, but unless you also do
something with the non-FPI change, you lose that second modification.
You could fix that by having your tool replicate the effects of WAL
apply outside the server, but that sounds like a ton of work and a ton
of possible bugs.

I have a related idea, though.  Suppose that, as Peter says upthread,
you have a replication slot that prevents old WAL from being removed.
You also have a background worker that is connected to that slot.  It
decodes WAL and produces summary files containing all block-references
extracted from those WAL records and the associated LSN (or maybe some
approximation of the LSN instead of the exact value, to allow for
compression and combining of nearby references).  Then you hold onto
those summary files after the actual WAL is removed.  Now, when
somebody asks the server for all blocks changed since a certain LSN,
it can use those summary files to figure out which blocks to send
without having to read all the pages in the database.  Although I
believe that a simple system that finds modified blocks by reading
them all is good enough for a first version of this feature and useful
in its own right, a more efficient system will be a lot more useful,
and something like this seems to me to be probably the best way to
implement it.
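
(To sketch what a summary file might contain -- purely illustrative, none of
this is designed yet -- think of one fixed-size record per block reference,
which the server can filter by LSN when a client asks for the blocks changed
since some point:)

/*
 * Purely illustrative sketch of a WAL-summary record and of answering
 * "which blocks changed since LSN X" from an array of such records.
 * None of this format is designed; the names are made up.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct BlockRef
{
	uint32_t	tablespace_oid;
	uint32_t	database_oid;
	uint32_t	relfilenode;
	uint32_t	forknum;
	uint32_t	blkno;
	uint64_t	lsn;			/* possibly approximate, to allow compression */
} BlockRef;

static void
report_changed_blocks(const BlockRef *refs, size_t nrefs, uint64_t since_lsn)
{
	size_t		i;

	for (i = 0; i < nrefs; i++)
	{
		if (refs[i].lsn >= since_lsn)
			printf("rel %u/%u/%u fork %u block %u\n",
				   (unsigned) refs[i].tablespace_oid,
				   (unsigned) refs[i].database_oid,
				   (unsigned) refs[i].relfilenode,
				   (unsigned) refs[i].forknum,
				   (unsigned) refs[i].blkno);
	}
}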

The reason why I think this is likely to be superior to other possible
approaches, such as the ptrack approach Konstantin suggests elsewhere
on this thread, is because it pushes the work of figuring out which
blocks have been modified into the background.  With a ptrack-type
approach, the server has to do some non-zero amount of extra work in
the foreground every time it modifies a block.  With an approach based
on WAL-scanning, the work is done in the background and nobody has to
wait for it.  It's possible that there are other considerations which
aren't occurring to me right now, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-04-10 Thread Robert Haas
On Wed, Apr 10, 2019 at 10:22 AM Konstantin Knizhnik
 wrote:
> Some times ago I have implemented alternative version of ptrack utility
> (not one used in pg_probackup)
> which detects updated block at file level. It is very simple and may be
> it can be sometimes integrated in master.

I don't think this is completely crash-safe.  It looks like it
arranges to msync() the ptrack file at appropriate times (although I
haven't exhaustively verified the logic), but it uses MS_ASYNC, so
it's possible that the ptrack file could get updated on disk either
before or after the relation file itself.  I think before is probably
OK -- it just risks having some blocks look modified when they aren't
really -- but after seems like it is very much not OK.  And changing
this to use MS_SYNC would probably be really expensive.  Likely a
better approach would be to hook into the new fsync queue machinery
that Thomas Munro added to PostgreSQL 12.

It looks like your system maps all the blocks in the system into a
fixed-size map using hashing.  If the number of modified blocks
between the full backup and the incremental backup is large compared
to the size of the ptrack map, you'll start to get a lot of
false-positives.  It will look as if much of the database needs to be
backed up.  For example, in your sample configuration, you have
ptrack_map_size = 1000003. If you've got a 100GB database with 20%
daily turnover, that's about 2.6 million blocks.  If you bump a
random entry ~2.6 million times in a map with 1000003 entries, on the
average ~92% of the entries end up getting bumped, so you will get
very little benefit from incremental backup.  This problem drops off
pretty fast if you raise the size of the map, but it's pretty critical
that your map is large enough for the database you've got, or you may
as well not bother.
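
(To check that arithmetic: the expected fraction of entries left untouched
after k uniformly random bumps into an N-entry map is about e^(-k/N).  A
quick standalone sketch, using the sample numbers from above:)

/*
 * Quick check of the collision arithmetic: expected fraction of map
 * entries bumped at least once after k uniformly random bumps into an
 * N-entry map.  Compile with -lm.
 */
#include <math.h>
#include <stdio.h>

int
main(void)
{
	double		N = 1000003.0;	/* sample ptrack_map_size */
	double		k = 2.6e6;		/* ~2.6 million modified blocks */

	printf("fraction of entries bumped: %.1f%%\n",
		   (1.0 - exp(-k / N)) * 100.0);	/* prints about 92.6% */
	return 0;
}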

It also appears that your system can't really handle resizing of the
map in any friendly way.  So if your data size grows, you may be faced
with either letting the map become progressively less effective, or
throwing it out and losing all the data you have.

None of that is to say that what you're presenting here has no value,
but I think it's possible to do better (and I think we should try).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-04-10 Thread Ashwin Agrawal
On Wed, Apr 10, 2019 at 9:21 AM Robert Haas  wrote:

> I have a related idea, though.  Suppose that, as Peter says upthread,
> you have a replication slot that prevents old WAL from being removed.
> You also have a background worker that is connected to that slot.  It
> decodes WAL and produces summary files containing all block-references
> extracted from those WAL records and the associated LSN (or maybe some
> approximation of the LSN instead of the exact value, to allow for
> compression and combining of nearby references).  Then you hold onto
> those summary files after the actual WAL is removed.  Now, when
> somebody asks the server for all blocks changed since a certain LSN,
> it can use those summary files to figure out which blocks to send
> without having to read all the pages in the database.  Although I
> believe that a simple system that finds modified blocks by reading
> them all is good enough for a first version of this feature and useful
> in its own right, a more efficient system will be a lot more useful,
> and something like this seems to me to be probably the best way to
> implement it.
>

Not to fork the conversation from incremental backups, but a similar approach
is what we have been thinking about for pg_rewind. Currently, pg_rewind requires
all the WAL to be present on the source side from the point of divergence in
order to rewind. Instead, we could just parse the WAL and keep the changed
blocks around on the source. Then we don't need to retain the WAL but can still
rewind using the changed-block map. So rewind becomes quite similar to the
incremental backup proposed here, after performing the rewind activity using
target-side WAL only.


Re: block-level incremental backup

2019-04-10 Thread Robert Haas
On Wed, Apr 10, 2019 at 7:51 AM Andrey Borodin  wrote:
> > 9 апр. 2019 г., в 20:48, Robert Haas  написал(а):
> > - This is just a design proposal at this point; there is no code.  If
> > this proposal, or some modified version of it, seems likely to be
> > acceptable, I and/or my colleagues might try to implement it.
>
> I'll be happy to help with code, discussion and patch review.

That would be great!

We should probably give this discussion some more time before we
plunge into the implementation phase, but I'd love to have some help
with that, whether it's with coding or review or whatever.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-04-10 Thread Robert Haas
On Wed, Apr 10, 2019 at 12:56 PM Ashwin Agrawal  wrote:
> Not to fork the conversation from incremental backups, but similar approach 
> is what we have been thinking for pg_rewind. Currently, pg_rewind requires 
> all the WAL logs to be present on source side from point of divergence to 
> rewind. Instead just parse the wal and keep the changed blocks around on 
> sourece. Then don't need to retain the WAL but can still rewind using the 
> changed block map. So, rewind becomes much similar to incremental backup 
> proposed here after performing rewind activity using target side WAL only.

Interesting.  So if we build a system like this for incremental
backup, or for pg_rewind, the other one can use the same
infrastructure.  That sounds excellent.  I'll start a new thread to
talk about that, and hopefully you and Heikki and others will chime in
with thoughts.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-04-10 Thread Jehan-Guillaume de Rorthais
Hi,

First thank you for your answer!

On Wed, 10 Apr 2019 12:21:03 -0400
Robert Haas  wrote:

> On Wed, Apr 10, 2019 at 10:57 AM Jehan-Guillaume de Rorthais
>  wrote:
> > My idea would be create a new tool working on archived WAL. No burden
> > server side. Basic concept is:
> >
> > * parse archives
> > * record latest relevant FPW for the incr backup
> > * write new WALs with recorded FPW and removing/rewriting duplicated
> > walrecords.
> >
> > It's just a PoC and I hadn't finished the WAL writing part...not even
> > talking about the replay part. I'm not even sure this project is a good
> > idea, but it is a good educational exercice to me in the meantime.
> >
> > Anyway, using real life OLTP production archives, my stats were:
> >
> >   # WAL   xlogrec kept   Size WAL kept
> >     127        39%            50%
> >     383        22%            38%
> >     639        20%            29%
> >
> > Based on this stats, I expect this would save a lot of time during recovery
> > in a first step. If it get mature, it might even save a lot of archives
> > space or extend the retention period with degraded granularity. It would
> > even help taking full backups with a lower frequency.
> >
> > Any thoughts about this design would be much appreciated. I suppose this
> > should be offlist or in a new thread to avoid polluting this thread as this
> > is a slightly different subject.  
> 
> Interesting idea, but I don't see how it can work if you only deal
> with the FPWs and not the other records.  For instance, suppose that
> you take a full backup at time T0, and then at time T1 there are two
> modifications to a certain block in quick succession.  That block is
> then never touched again.  Since no checkpoint intervenes between the
> modifications, the first one emits an FPI and the second does not.
> Capturing the FPI is fine as far as it goes, but unless you also do
> something with the non-FPI change, you lose that second modification.
> You could fix that by having your tool replicate the effects of WAL
> apply outside the server, but that sounds like a ton of work and a ton
> of possible bugs.

In my current design, the scan is done backward from end to start and I keep all
the records appearing after the last occurrence of their respective FPI.

The next challenge I have to tackle is dealing with multi-block records
where some blocks need to be removed and others are FPIs to keep (e.g. UPDATE).

> I have a related idea, though.  Suppose that, as Peter says upthread,
> you have a replication slot that prevents old WAL from being removed.
> You also have a background worker that is connected to that slot.  It
> decodes WAL and produces summary files containing all block-references
> extracted from those WAL records and the associated LSN (or maybe some
> approximation of the LSN instead of the exact value, to allow for
> compression and combining of nearby references).  Then you hold onto
> those summary files after the actual WAL is removed.  Now, when
> somebody asks the server for all blocks changed since a certain LSN,
> it can use those summary files to figure out which blocks to send
> without having to read all the pages in the database.  Although I
> believe that a simple system that finds modified blocks by reading
> them all is good enough for a first version of this feature and useful
> in its own right, a more efficient system will be a lot more useful,
> and something like this seems to me to be probably the best way to
> implement it.

Summary files look like what Andrey Borodin described as delta-files and
change maps.

> With an approach based
> on WAL-scanning, the work is done in the background and nobody has to
> wait for it.

Agree with this.




Re: block-level incremental backup

2019-04-10 Thread Robert Haas
On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais
 wrote:
> In my current design, the scan is done backward from end to start and I keep 
> all
> the records appearing after the last occurrence of their respective FPI.

Oh, interesting.  That seems like it would require pretty major
surgery on the WAL stream.

> Summary files looks like what Andrey Borodin described as delta-files and
> change maps.

Yep.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-04-10 Thread Andres Freund
Hi,

On 2019-04-10 14:38:43 -0400, Robert Haas wrote:
> On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais
>  wrote:
> > In my current design, the scan is done backward from end to start and I 
> > keep all
> > the records appearing after the last occurrence of their respective FPI.
> 
> Oh, interesting.  That seems like it would require pretty major
> surgery on the WAL stream.

Can't you just read each segment forward, and then reverse? That's not
that much memory? And sure, there are some inefficient cases where records
span many segments, but that's rare enough that reading a few segments
several times doesn't strike me as particularly bad?

Greetings,

Andres Freund




Re: block-level incremental backup

2019-04-10 Thread Peter Eisentraut
On 2019-04-10 17:31, Robert Haas wrote:
> I think the way to think about this problem, or at least the way I
> think about this problem, is that we need to decide whether want
> file-level incremental backup, block-level incremental backup, or
> byte-level incremental backup.

That is a great analysis.  Seems like block-level is the preferred way
forward.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: block-level incremental backup

2019-04-10 Thread Konstantin Knizhnik




On 10.04.2019 19:51, Robert Haas wrote:

> On Wed, Apr 10, 2019 at 10:22 AM Konstantin Knizhnik
>  wrote:
>
> > Some times ago I have implemented alternative version of ptrack utility
> > (not one used in pg_probackup)
> > which detects updated block at file level. It is very simple and may be
> > it can be sometimes integrated in master.
>
> I don't think this is completely crash-safe.  It looks like it
> arranges to msync() the ptrack file at appropriate times (although I
> haven't exhaustively verified the logic), but it uses MS_ASYNC, so
> it's possible that the ptrack file could get updated on disk either
> before or after the relation file itself.  I think before is probably
> OK -- it just risks having some blocks look modified when they aren't
> really -- but after seems like it is very much not OK.  And changing
> this to use MS_SYNC would probably be really expensive.  Likely a
> better approach would be to hook into the new fsync queue machinery
> that Thomas Munro added to PostgreSQL 12.


I do not think that MS_SYNC or an fsync queue is needed here.
If a power failure or OS crash causes the loss of some writes to the ptrack
map, then in any case Postgres will perform recovery, and updating pages from
WAL will mark them in the ptrack map once again. So, as with CLOG and many
other Postgres files, it is not critical to lose some writes, because they
will be restored from WAL. And before truncating WAL, Postgres performs a
checkpoint which flushes all changes to disk, including ptrack map updates.




> It looks like your system maps all the blocks in the system into a
> fixed-size map using hashing.  If the number of modified blocks
> between the full backup and the incremental backup is large compared
> to the size of the ptrack map, you'll start to get a lot of
> false-positives.  It will look as if much of the database needs to be
> backed up.  For example, in your sample configuration, you have
> ptrack_map_size = 1000003. If you've got a 100GB database with 20%
> daily turnover, that's about 2.6 million blocks.  If you bump a
> random entry ~2.6 million times in a map with 1000003 entries, on the
> average ~92% of the entries end up getting bumped, so you will get
> very little benefit from incremental backup.  This problem drops off
> pretty fast if you raise the size of the map, but it's pretty critical
> that your map is large enough for the database you've got, or you may
> as well not bother.

This is why the ptrack block size should be larger than the page size.
Assume that it is 1MB. 1MB is considered to be the optimal amount of disk
IO, at which frequent seeks no longer degrade read speed (this is most
critical for HDDs). In other words, reading 10 random pages (20%) from
this 1MB block will take almost the same amount of time (or even
longer) than reading the whole 1MB in one operation.


There will be just 10 used entries in the ptrack map, with a very small
probability of collision.
Actually I have chosen this size (1000003) for the ptrack map because with
a 1MB block size it allows mapping a 1TB database without a noticeable
number of collisions, which seems to be enough for most Postgres
installations. But increasing the ptrack map size 10 or even 100 times
should not cause problems with modern RAM sizes either.




> It also appears that your system can't really handle resizing of the
> map in any friendly way.  So if your data size grows, you may be faced
> with either letting the map become progressively less effective, or
> throwing it out and losing all the data you have.
>
> None of that is to say that what you're presenting here has no value,
> but I think it's possible to do better (and I think we should try).

Definitely I didn't consider the proposed patch a perfect solution, and
certainly it requires improvements (and maybe a complete redesign).
I just wanted to present this approach (maintaining a hash of block LSNs in
mapped memory) of keeping track of modified blocks at the file level
(unlike the current ptrack implementation, which logs changes in all the
places in the Postgres code where data is updated).


Also, despite the fact that this patch may be considered a raw
prototype, I have spent some time thinking about all aspects of this
approach, including fault tolerance and false positives.






Re: block-level incremental backup

2019-04-10 Thread Jehan-Guillaume de Rorthais
On Wed, 10 Apr 2019 14:38:43 -0400
Robert Haas  wrote:

> On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais
>  wrote:
> > In my current design, the scan is done backward from end to start and I
> > keep all the records appearing after the last occurrence of their
> > respective FPI.  
> 
> Oh, interesting.  That seems like it would require pretty major
> surgery on the WAL stream.

Indeed.

Presently, the surgery in my code consists of replacing redundant xlogrecords
with noops.

I now have to deal with multi-block records. So far, I have tried to mark
non-needed blocks with !BKPBLOCK_HAS_DATA and made a simple patch in core to
ignore such marked blocks, but it doesn't play well with dependencies between
xlogrecords, e.g. during UPDATE. So my plan is to rewrite them to remove
non-needed blocks using e.g. XLOG_FPI.

As I wrote, this is mainly a hobby project right now for my own education. Not
sure where it will lead me, but I am learning a lot while working on it.




Re: block-level incremental backup

2019-04-10 Thread Jehan-Guillaume de Rorthais
On Wed, 10 Apr 2019 11:55:51 -0700
Andres Freund  wrote:

> Hi,
> 
> On 2019-04-10 14:38:43 -0400, Robert Haas wrote:
> > On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais
> >  wrote:  
> > > In my current design, the scan is done backward from end to start and I
> > > keep all the records appearing after the last occurrence of their
> > > respective FPI.  
> > 
> > Oh, interesting.  That seems like it would require pretty major
> > surgery on the WAL stream.  
> 
> Can't you just read each segment forward, and then reverse?

Not sure what you mean.

I first look for the very last XLOG record by jumping to the last WAL segment
and scanning it forward.

Then I do a backward scan from there to record the LSNs of the xlogrecords to
keep.

Finally, I clone each WAL segment and edit it as needed (as described in my
previous email). This is my current WIP though.

> That's not that much memory?

I don't know yet. I did not measure it.




Re: block-level incremental backup

2019-04-10 Thread Michael Paquier
On Wed, Apr 10, 2019 at 09:42:47PM +0200, Peter Eisentraut wrote:
> That is a great analysis.  Seems like block-level is the preferred way
> forward.

In any solution related to incremental backups I have seen from the
community, all of them tend to prefer block-level backups because of the
filtering which is possible based on the LSN of the page header.  The
holes in the middle of the page are also easier to handle, so the size
of an incremental page is reduced in the actual backup.  My preference
tends toward a block-level approach if we were to do something in this
area, though I fear that performance will be bad if we begin to scan
all the relation files to fetch a set of blocks since a past LSN.
Hence we need some kind of LSN map so that it is possible to skip
one block or a group of blocks (say one LSN every 8/16 blocks for
example) at once for a given relation if the relation is mostly
read-only.
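
A minimal sketch of the kind of map I mean, with one LSN per group of 8
blocks (group size, names and persistence are all open questions):

/*
 * Minimal sketch of an LSN map: one "highest page LSN" per group of 8
 * blocks, letting a backup skip whole groups of a mostly read-only
 * relation.  Group size, names and persistence are open questions.
 */
#include <stdbool.h>
#include <stdint.h>

#define BLOCKS_PER_GROUP 8

typedef struct LsnMap
{
	uint32_t	ngroups;
	uint64_t   *group_max_lsn;	/* highest page LSN seen in each group */
} LsnMap;

/* Can the group containing 'blkno' be skipped for a backup since 'since_lsn'? */
static inline bool
lsnmap_group_unchanged(const LsnMap *map, uint32_t blkno, uint64_t since_lsn)
{
	return map->group_max_lsn[blkno / BLOCKS_PER_GROUP] < since_lsn;
}

/* Called whenever a block is written, to keep the map up to date. */
static inline void
lsnmap_update(LsnMap *map, uint32_t blkno, uint64_t page_lsn)
{
	uint64_t   *slot = &map->group_max_lsn[blkno / BLOCKS_PER_GROUP];

	if (*slot < page_lsn)
		*slot = page_lsn;
}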
--
Michael




Re: block-level incremental backup

2019-04-11 Thread Robert Haas
On Thu, Apr 11, 2019 at 12:22 AM Michael Paquier  wrote:
> incremental page size is reduced in the actual backup.  My preference
> tends toward a block-level approach if we were to do something in this
> area, though I fear that performance will be bad if we begin to scan
> all the relation files to fetch a set of blocks since a past LSN.
> Hence we need some kind of LSN map so as it is possible to skip a
> one block or a group of blocks (say one LSN every 8/16 blocks for
> example) at once for a given relation if the relation is mostly
> read-only.

So, in this thread, I want to focus on the UI and how the incremental
backup is stored on disk.  Making the process of identifying modified
blocks efficient is the subject of
http://postgr.es/m/CA+TgmoahOeuuR4pmDP1W=jnryp4fwhyntosa68bfxjq-qb_...@mail.gmail.com

Over there, the merits of what you are describing here and the
competing approaches are under discussion.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-04-11 Thread Anastasia Lubennikova

09.04.2019 18:48, Robert Haas writes:

> Thoughts?

Hi,
Thank you for bringing that up.
In-core support of incremental backups is a long-awaited feature.
Hopefully, this take will end up committed in PG13.

Speaking of UI:
1) I agree that it should be implemented as a new replication command.

2) There should be a command to get only a map of changes without actual 
data.


Most backup tools establish a server connection, so they can use this
protocol to get the list of changed blocks.
They can then use this information for any purpose: for example, to
distribute files between parallel workers to copy the data,
to estimate backup size before data is sent, or to store metadata
separately from the data itself.
Most methods (except straightforward LSN comparison) consist of two
steps: get a map of changes and read blocks.

So it won't add much extra work.

example commands:
GET_FILELIST [lsn]
returning json (or whatever) with filenames and maps of changed blocks
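
For what it's worth, a backup tool could drive such a command over an
ordinary replication connection with libpq, roughly like the sketch below;
GET_FILELIST itself is of course hypothetical, only the libpq calls are real:

/*
 * Rough illustration of issuing the hypothetical GET_FILELIST command over
 * a replication connection using libpq.  The command and its result format
 * do not exist; only the libpq calls are real.
 */
#include <stdio.h>
#include "libpq-fe.h"

int
main(void)
{
	PGconn	   *conn;
	PGresult   *res;
	int			i;

	conn = PQconnectdb("host=localhost user=backup replication=true");
	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	/* hypothetical command: list files changed since the given LSN */
	res = PQexec(conn, "GET_FILELIST '0/224A268'");
	if (PQresultStatus(res) == PGRES_TUPLES_OK)
	{
		for (i = 0; i < PQntuples(res); i++)
			printf("%s\n", PQgetvalue(res, i, 0));
	}
	else
		fprintf(stderr, "command failed: %s", PQerrorMessage(conn));

	PQclear(res);
	PQfinish(conn);
	return 0;
}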

The map format is also a subject for discussion.
Currently in pg_probackup we reuse code from pg_rewind/datapagemap;
I'm not sure whether this format is good for sending data via the protocol, though.

3) The API should provide functions to request data with a granularity 
of file and block.

It will be useful for parallelism and for various future projects.

example commands:
GET_DATAFILE [filename [map of blocks] ]
GET_DATABLOCK [filename] [blkno]
returning data in some format

4) The algorithm for collecting changed blocks is another topic.
Though its API should be discussed here:

Do we want to have multiple implementations?
Personally, I think that it's good to provide several strategies,
since they have different requirements and fit different workloads.

Maybe we can add a hook to allow custom implementations.

Do we want to allow the backup client to tell the server which block collection
method to use?

example commands:
GET_FILELIST [lsn] [METHOD lsn | page | ptrack | etc]
Or should it be a server-side cost-based decision?

5) The method based on LSN comparison stands out - it can be done in one 
pass.

So it probably requires special protocol commands.
for example:
GET_DATAFILES [lsn]
GET_DATAFILE [filename] [lsn]

This is pretty simple to implement and pg_basebackup can use this method,
at least until we have something more advanced in-core.

I'll be happy to help with design, code, review, and testing.
Hope that my experience with pg_probackup will be useful.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: block-level incremental backup

2019-04-15 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> Several companies, including EnterpriseDB, NTT, and Postgres Pro, have
> developed technology that permits a block-level incremental backup to
> be taken from a PostgreSQL server.  I believe the idea in all of those
> cases is that non-relation files should be backed up in their
> entirety, but for relation files, only those blocks that have been
> changed need to be backed up.

I love the general idea of having additional facilities in core to
support block-level incremental backups.  I've long been unhappy that
any such approach ends up being limited to a subset of the files which
need to be included in the backup, meaning the rest of the files have to
be backed up in their entirety.  I don't think we have to solve for that
as part of this, but I'd like to see a discussion for how to deal with
the other files which are being backed up to avoid needing to just
wholesale copy them.

> I would like to propose that we should
> have a solution for this problem in core, rather than leaving it to
> each individual PostgreSQL company to develop and maintain their own
> solution. 

I'm certainly a fan of improving our in-core backup solutions.

I'm quite concerned that trying to graft this on to pg_basebackup
(which, as you note later, is missing an awful lot of what users expect
from a real backup solution already- retention handling, parallel
capabilities, WAL archive management, and many more... but also is just
not nearly as developed a tool as the external solutions) is going to
make things unnecessarily difficult when what we really want here is
better support from core for block-level incremental backup for the
existing external tools to leverage.

Perhaps there's something here which can be done with pg_basebackup to
have it work with the block-level approach, but I certainly don't see
it as a natural next step for it, and it really does seem like limiting the
way this is implemented to something that pg_basebackup can easily
digest might make it less useful for the more developed tools.

As an example, I believe all of the other tools mentioned (at least,
those that are open source; I'm pretty sure all of them do) support
parallel backup, and therefore having a way to get the block-level
changes in a parallel fashion would be a pretty big thing that those
tools will want.  pg_basebackup is single-threaded today and this
proposal doesn't seem to be contemplating changing that, implying that
a serial block-level protocol would be fine, but that'd be a pretty
awful restriction for the other tools.

> Generally my idea is:
> 
> 1. There should be a way to tell pg_basebackup to request from the
> server only those blocks where LSN >= threshold_value.  There are
> several possible ways for the server to implement this, the simplest
> of which is to just scan all the blocks and send only the ones that
> satisfy that criterion.  That might sound dumb, but it does still save
> network bandwidth, and it works even without any prior setup. It will
> probably be more efficient in many cases to instead scan all the WAL
> generated since that LSN and extract block references from it, but
> that is only possible if the server has all of that WAL available or
> can somehow get it from the archive.  We could also, as several people
> have proposed previously, have some kind of additional relation for
> that stores either a single is-modified bit -- which only helps if the
> reference LSN for the is-modified bit is older than the requested LSN
> but not too much older -- or the highest LSN for each range of K
> blocks, or something like that.  I am at the moment not too concerned
> with the exact strategy we use here. I believe we may want to
> eventually support more than one, since they have different
> trade-offs.

This part of the discussion is another example of how we're limiting 
ourselves in this implementation to the "pg_basebackup can work with 
this" case- by only considering the options of "scan all the files" or
"use the WAL- if the request is for WAL we have available on the
server."  The other backup solutions mentioned in your initial email,
and others that weren't, have a WAL archive which includes a lot more
WAL than just what the primary currently has.  When I've thought about
how WAL could be used to build a differential or incremental backup, the
question of "do we have all the WAL we need" hasn't ever been a
consideration- because the backup tool manages the WAL archive and has
WAL going back across, most likely, weeks or even months.  Having a tool
which can essentially "compress" WAL would be fantastic and would be
able to be leveraged by all of the different backup solutions.

> 2. When you use pg_basebackup in this way, each relation file that is
> not sent in its entirety is replaced by a file with a different name.
> For example, instead of base/16384/16417, you might get
> base/16384/partial.16417 or however we decide to name them.  Each su

Re: block-level incremental backup

2019-04-15 Thread Bruce Momjian
On Mon, Apr 15, 2019 at 09:01:11AM -0400, Stephen Frost wrote:
> Greetings,
> 
> * Robert Haas (robertmh...@gmail.com) wrote:
> > Several companies, including EnterpriseDB, NTT, and Postgres Pro, have
> > developed technology that permits a block-level incremental backup to
> > be taken from a PostgreSQL server.  I believe the idea in all of those
> > cases is that non-relation files should be backed up in their
> > entirety, but for relation files, only those blocks that have been
> > changed need to be backed up.
> 
> I love the general idea of having additional facilities in core to
> support block-level incremental backups.  I've long been unhappy that
> any such approach ends up being limited to a subset of the files which
> need to be included in the backup, meaning the rest of the files have to
> be backed up in their entirety.  I don't think we have to solve for that
> as part of this, but I'd like to see a discussion for how to deal with
> the other files which are being backed up to avoid needing to just
> wholesale copy them.

I assume you are talking about non-heap/index files.  Which of those are
large enough to benefit from incremental backup?

> > I would like to propose that we should
> > have a solution for this problem in core, rather than leaving it to
> > each individual PostgreSQL company to develop and maintain their own
> > solution. 
> 
> I'm certainly a fan of improving our in-core backup solutions.
> 
> I'm quite concerned that trying to graft this on to pg_basebackup
> (which, as you note later, is missing an awful lot of what users expect
> from a real backup solution already- retention handling, parallel
> capabilities, WAL archive management, and many more... but also is just
> not nearly as developed a tool as the external solutions) is going to
> make things unnecessarily difficult when what we really want here is
> better support from core for block-level incremental backup for the
> existing external tools to leverage.

I think there is some interesting complexity brought up in this thread: 
which options are going to minimize storage I/O and network I/O, have only
background overhead, allow parallel operation, and integrate with
pg_basebackup?  Eventually we will need to evaluate the incremental
backup options against these criteria.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: block-level incremental backup

2019-04-15 Thread Robert Haas
On Thu, Apr 11, 2019 at 1:29 PM Anastasia Lubennikova wrote:
> 2) There should be a command to get only a map of changes without actual
> data.

Good idea.

> 4) The algorithm for collecting changed blocks is another topic.
> Still, its API should be discussed here:
>
> Do we want to have multiple implementations?
> Personally, I think that it's good to provide several strategies,
> since they have different requirements and fit different workloads.
>
> Maybe we can add a hook to allow custom implementations.

I'm not sure a hook is going to be practical, but I do think we want
more than one strategy.

> I'll be happy to help with design, code, review, and testing.
> Hope that my experience with pg_probackup will be useful.

Great, thanks!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-04-15 Thread Robert Haas
On Mon, Apr 15, 2019 at 9:01 AM Stephen Frost  wrote:
> I love the general idea of having additional facilities in core to
> support block-level incremental backups.  I've long been unhappy that
> any such approach ends up being limited to a subset of the files which
> need to be included in the backup, meaning the rest of the files have to
> be backed up in their entirety.  I don't think we have to solve for that
> as part of this, but I'd like to see a discussion for how to deal with
> the other files which are being backed up to avoid needing to just
> wholesale copy them.

Ideas?  Generally, I don't think that anything other than the main
forks of relations is worth worrying about, because the files are too
small to really matter.  Even if they're big, the main forks of
relations will be much bigger.  I think.

> I'm quite concerned that trying to graft this on to pg_basebackup
> (which, as you note later, is missing an awful lot of what users expect
> from a real backup solution already- retention handling, parallel
> capabilities, WAL archive management, and many more... but also is just
> not nearly as developed a tool as the external solutions) is going to
> make things unnecessarily difficult when what we really want here is
> better support from core for block-level incremental backup for the
> existing external tools to leverage.
>
> Perhaps there's something here which can be done with pg_basebackup to
> have it work with the block-level approach, but I certainly don't see
> it as a natural next step for it, and it really does seem like limiting the
> way this is implemented to something that pg_basebackup can easily
> digest might make it less useful for the more developed tools.

I agree that there are a bunch of things that pg_basebackup does not
do, such as backup management.  I think a lot of users do not want
PostgreSQL to do backup management for them.  They have an existing
solution that they use to manage backups, and they want PostgreSQL to
interoperate with it. I think it makes sense for pg_basebackup to be
in charge of taking the backup, and then other tools can either use it
as a building block or use the streaming replication protocol to send
approximately the same commands to the server.  I certainly would not
want to expose server capabilities that let you take an incremental
backup and NOT teach pg_basebackup to use them -- then we'd be in a
situation of saying that PostgreSQL has incremental backup, but you
have to get external tool XYZ to use it.  That will be perceived as
PostgreSQL does NOT have incremental backup and this external tool
adds it.

> As an example, I believe all of the other tools mentioned (at least,
> those that are open source; I'm pretty sure all of them do) support
> parallel backup, and therefore having a way to get the block-level
> changes in a parallel fashion would be a pretty big thing that those
> tools will want.  pg_basebackup is single-threaded today and this
> proposal doesn't seem to be contemplating changing that, implying that
> a serial block-level protocol would be fine, but that'd be a pretty
> awful restriction for the other tools.

I mentioned this exact issue in my original email.  I spoke positively
of it.  But I think it is different from what is being proposed here.
We could have parallel backup without incremental backup, and that
would be a good feature.  We could have incremental backup without parallel
backup, and that would also be a good feature.  We could also have
both, which would be best of all.  I don't see that my proposal throws
up any architectural obstacle to parallelism.  I assume parallel
backup, whether full or incremental, would be implemented by dividing
up the files that need to be sent across the available connections; if
incremental backup exists, each connection then has to decide whether
to send the whole file or only part of it.

> This part of the discussion is another example of how we're limiting
> ourselves in this implementation to the "pg_basebackup can work with
> this" case- by only considering the options of "scan all the files" or
> "use the WAL- if the request is for WAL we have available on the
> server."  The other backup solutions mentioned in your initial email,
> and others that weren't, have a WAL archive which includes a lot more
> WAL than just what the primary currently has.  When I've thought about
> how WAL could be used to build a differential or incremental backup, the
> question of "do we have all the WAL we need" hasn't ever been a
> consideration- because the backup tool manages the WAL archive and has
> WAL going back across, most likely, weeks or even months.  Having a tool
> which can essentially "compress" WAL would be fantastic and would be
> able to be leveraged by all of the different backup solutions.

I don't think this is a case of limiting ourselves; I think it's a
case of keeping separate considerations properly separate.  As I said
in my original email, the client doesn't rea

Re: block-level incremental backup

2019-04-16 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Mon, Apr 15, 2019 at 09:01:11AM -0400, Stephen Frost wrote:
> > * Robert Haas (robertmh...@gmail.com) wrote:
> > > Several companies, including EnterpriseDB, NTT, and Postgres Pro, have
> > > developed technology that permits a block-level incremental backup to
> > > be taken from a PostgreSQL server.  I believe the idea in all of those
> > > cases is that non-relation files should be backed up in their
> > > entirety, but for relation files, only those blocks that have been
> > > changed need to be backed up.
> > 
> > I love the general idea of having additional facilities in core to
> > support block-level incremental backups.  I've long been unhappy that
> > any such approach ends up being limited to a subset of the files which
> > need to be included in the backup, meaning the rest of the files have to
> > be backed up in their entirety.  I don't think we have to solve for that
> > as part of this, but I'd like to see a discussion for how to deal with
> > the other files which are being backed up to avoid needing to just
> > wholesale copy them.
> 
> I assume you are talking about non-heap/index files.  Which of those are
> large enough to benefit from incremental backup?

Based on discussions I had with Andrey, specifically the visibility map
is an issue for them with WAL-G.  I haven't spent a lot of time thinking
about it, but I can understand how that could be an issue.

> > I'm quite concerned that trying to graft this on to pg_basebackup
> > (which, as you note later, is missing an awful lot of what users expect
> > from a real backup solution already- retention handling, parallel
> > capabilities, WAL archive management, and many more... but also is just
> > not nearly as developed a tool as the external solutions) is going to
> > make things unnecessarily difficult when what we really want here is
> > better support from core for block-level incremental backup for the
> > existing external tools to leverage.
> 
> I think there is some interesting complexity brought up in this thread: 
> which options are going to minimize storage I/O and network I/O, have only
> background overhead, allow parallel operation, and integrate with
> pg_basebackup?  Eventually we will need to evaluate the incremental
> backup options against these criteria.

This presumes that we're going to have multiple competing incremental
backup options presented, doesn't it?  Are you aware of another effort
going on which aims for inclusion in core?  There have been past attempts
made, but I don't believe there's anyone else currently planning to or
working on something for inclusion in core.

Just to be clear- we're not currently working on one, but I'd really
like to see core provide good support for incremental block-level backup
so that we can leverage it when it is there.

Thanks!

Stephen




Re: block-level incremental backup

2019-04-16 Thread Robert Haas
On Tue, Apr 16, 2019 at 5:44 PM Stephen Frost  wrote:
> > > I love the general idea of having additional facilities in core to
> > > support block-level incremental backups.  I've long been unhappy that
> > > any such approach ends up being limited to a subset of the files which
> > > need to be included in the backup, meaning the rest of the files have to
> > > be backed up in their entirety.  I don't think we have to solve for that
> > > as part of this, but I'd like to see a discussion for how to deal with
> > > the other files which are being backed up to avoid needing to just
> > > wholesale copy them.
> >
> > I assume you are talking about non-heap/index files.  Which of those are
> > large enough to benefit from incremental backup?
>
> Based on discussions I had with Andrey, specifically the visibility map
> is an issue for them with WAL-G.  I haven't spent a lot of time thinking
> about it, but I can understand how that could be an issue.

If I understand correctly, the VM contains 1 byte per 4 heap pages and
the FSM contains 1 byte per heap page (plus some overhead for higher
levels of the tree).  Since the FSM is not WAL-logged, I'm not sure
there's a whole lot we can do to avoid having to back it up, although
maybe there's some clever idea I'm not quite seeing.  The VM is
WAL-logged, albeit with some strange warts that I have the honor of
inventing, so there's more possibilities there.

Before worrying about it too much, it would be useful to hear more
about the concerns related to these forks, so that we make sure we're
solving the right problem.  It seems difficult for a single relation
to be big enough for these to be much of an issue.  For example, on a
1TB relation, we have 2^40 bytes = 2^27 pages = ~2^25 bytes of VM fork
= 32MB.  Not nothing, but 32MB of useless overhead every time you back
up a 1TB database probably isn't going to break the bank.  It might be
more of a concern for users with many small tables.  For example, if
somebody has got a million tables with 1 page in each one, they'll
have a million data pages, a million VM pages, and 3 million FSM pages
(unless the new don't-create-the-FSM-for-small-tables stuff in v12
kicks in).  I don't know if it's worth going to a lot of trouble to
optimize that case.  Creating a million tables with 100 tuples (or
whatever) in each one sounds like terrible database design to me.
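
(To put rough numbers on that worst case: a million one-page tables comes to
about 1M data pages + 1M VM pages + 3M FSM pages = 5M pages * 8kB = ~40GB on
disk, of which ~32GB is VM/FSM overhead, assuming 8kB blocks and the pre-v12
FSM behavior.)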

> > > I'm quite concerned that trying to graft this on to pg_basebackup
> > > (which, as you note later, is missing an awful lot of what users expect
> > > from a real backup solution already- retention handling, parallel
> > > capabilities, WAL archive management, and many more... but also is just
> > > not nearly as developed a tool as the external solutions) is going to
> > > make things unnecessairly difficult when what we really want here is
> > > better support from core for block-level incremental backup for the
> > > existing external tools to leverage.
> >
> > I think there is some interesting complexity brought up in this thread:
> > which options are going to minimize storage I/O and network I/O, have only
> > background overhead, allow parallel operation, and integrate with
> > pg_basebackup?  Eventually we will need to evaluate the incremental
> > backup options against these criteria.
>
> This presumes that we're going to have multiple competing incremental
> backup options presented, doesn't it?  Are you aware of another effort
> going on which aims for inclusion in core?  There have been past attempts
> made, but I don't believe there's anyone else currently planning to or
> working on something for inclusion in core.

Yeah, I really hope we don't end up with dueling patches.  I want to
come up with an approach that can be widely-endorsed and then have
everybody rowing in the same direction.  On the other hand, I do think
that we may support multiple options in certain places which may have
the kinds of trade-offs that Bruce mentions.  For instance,
identifying changed blocks by scanning the whole cluster and checking
the LSN of each block has an advantage in that it requires no prior
setup or extra configuration.  Like a sequential scan, it always
works, and that is an advantage.  Of course, for many people, the
competing advantage of a WAL-scanning approach that can save a lot of
I/O will appear compelling, but maybe not for everyone.  I think
there's room for two or three approaches there -- not in the sense of
competing patches, but in the sense of giving users a choice based on
their needs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-04-17 Thread Bruce Momjian
On Tue, Apr 16, 2019 at 06:40:44PM -0400, Robert Haas wrote:
> Yeah, I really hope we don't end up with dueling patches.  I want to
> come up with an approach that can be widely-endorsed and then have
> everybody rowing in the same direction.  On the other hand, I do think
> that we may support multiple options in certain places which may have
> the kinds of trade-offs that Bruce mentions.  For instance,
> identifying changed blocks by scanning the whole cluster and checking
> the LSN of each block has an advantage in that it requires no prior
> setup or extra configuration.  Like a sequential scan, it always
> works, and that is an advantage.  Of course, for many people, the
> competing advantage of a WAL-scanning approach that can save a lot of
> I/O will appear compelling, but maybe not for everyone.  I think
> there's room for two or three approaches there -- not in the sense of
> competing patches, but in the sense of giving users a choice based on
> their needs.

Well, by having a separate modblock file for each WAL file, you can keep
both WAL and modblock files and use the modblock list to pull pages from
each WAL file, or from the heap/index files, and it can be done in
parallel.  Having WAL and modblock files in the same directory makes
retention simpler.

In fact, you can do an incremental backup just using the modblock files
and the heap/index files, so you don't even need the WAL.

Also, instead of storing the file name and block number in the modblock
file, using the database oid, relfilenode, and block number (3 int32
values) should be sufficient.
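
Something like this, say (a hypothetical on-disk record, just to show the
size, 12 bytes per modified block; Oid and BlockNumber are the usual
PostgreSQL typedefs):

    typedef struct ModBlockEntry
    {
        Oid         dboid;          /* database OID */
        Oid         relfilenode;    /* relfilenode of the relation */
        BlockNumber blkno;          /* modified block number */
    } ModBlockEntry;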

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: block-level incremental backup

2019-04-17 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Tue, Apr 16, 2019 at 5:44 PM Stephen Frost  wrote:
> > > > I love the general idea of having additional facilities in core to
> > > > support block-level incremental backups.  I've long been unhappy that
> > > > any such approach ends up being limited to a subset of the files which
> > > > need to be included in the backup, meaning the rest of the files have to
> > > > be backed up in their entirety.  I don't think we have to solve for that
> > > > as part of this, but I'd like to see a discussion for how to deal with
> > > > the other files which are being backed up to avoid needing to just
> > > > wholesale copy them.
> > >
> > > I assume you are talking about non-heap/index files.  Which of those are
> > > large enough to benefit from incremental backup?
> >
> > Based on discussions I had with Andrey, specifically the visibility map
> > is an issue for them with WAL-G.  I haven't spent a lot of time thinking
> > about it, but I can understand how that could be an issue.
> 
> If I understand correctly, the VM contains 1 byte per 4 heap pages and
> the FSM contains 1 byte per heap page (plus some overhead for higher
> levels of the tree).  Since the FSM is not WAL-logged, I'm not sure
> there's a whole lot we can do to avoid having to back it up, although
> maybe there's some clever idea I'm not quite seeing.  The VM is
> WAL-logged, albeit with some strange warts that I have the honor of
> inventing, so there's more possibilities there.
> 
> Before worrying about it too much, it would be useful to hear more
> about the concerns related to these forks, so that we make sure we're
> solving the right problem.  It seems difficult for a single relation
> to be big enough for these to be much of an issue.  For example, on a
> 1TB relation, we have 2^40 bytes = 2^27 pages = ~2^25 bytes of VM fork
> = 32MB.  Not nothing, but 32MB of useless overhead every time you back
> up a 1TB database probably isn't going to break the bank.  It might be
> more of a concern for users with many small tables.  For example, if
> somebody has got a million tables with 1 page in each one, they'll
> have a million data pages, a million VM pages, and 3 million FSM pages
> (unless the new don't-create-the-FSM-for-small-tables stuff in v12
> kicks in).  I don't know if it's worth going to a lot of trouble to
> optimize that case.  Creating a million tables with 100 tuples (or
> whatever) in each one sounds like terrible database design to me.

As I understand it, the problem is not with backing up an individual
database or cluster, but rather dealing with backing up thousands of
individual clusters with thousands of tables in each, leading to an
awful lot of tables with lots of FSMs/VMs, all of which end up having to
get copied and stored wholesale.  I'll point this thread out to him and
hopefully he'll have a chance to share more specific information.

> > > > I'm quite concerned that trying to graft this on to pg_basebackup
> > > > (which, as you note later, is missing an awful lot of what users expect
> > > > from a real backup solution already- retention handling, parallel
> > > > capabilities, WAL archive management, and many more... but also is just
> > > > not nearly as developed a tool as the external solutions) is going to
> > > > make things unnecessarily difficult when what we really want here is
> > > > better support from core for block-level incremental backup for the
> > > > existing external tools to leverage.
> > >
> > > I think there is some interesting complexity brought up in this thread:
> > > which options are going to minimize storage I/O and network I/O, have only
> > > background overhead, allow parallel operation, and integrate with
> > > pg_basebackup?  Eventually we will need to evaluate the incremental
> > > backup options against these criteria.
> >
> > This presumes that we're going to have multiple competing incremental
> > backup options presented, doesn't it?  Are you aware of another effort
> > going on which aims for inclusion in core?  There have been past attempts
> > made, but I don't believe there's anyone else currently planning to or
> > working on something for inclusion in core.
> 
> Yeah, I really hope we don't end up with dueling patches.  I want to
> come up with an approach that can be widely-endorsed and then have
> everybody rowing in the same direction.  On the other hand, I do think
> that we may support multiple options in certain places which may have
> the kinds of trade-offs that Bruce mentions.  For instance,
> identifying changed blocks by scanning the whole cluster and checking
> the LSN of each block has an advantage in that it requires no prior
> setup or extra configuration.  Like a sequential scan, it always
> works, and that is an advantage.  Of course, for many people, the
> competing advantage of a WAL-scanning approach that can save a lot of
> I/O will appear compelling, but maybe not for everyone.  I think
> there's 

Re: block-level incremental backup

2019-04-17 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Mon, Apr 15, 2019 at 9:01 AM Stephen Frost  wrote:
> > I love the general idea of having additional facilities in core to
> > support block-level incremental backups.  I've long been unhappy that
> > any such approach ends up being limited to a subset of the files which
> > need to be included in the backup, meaning the rest of the files have to
> > be backed up in their entirety.  I don't think we have to solve for that
> > as part of this, but I'd like to see a discussion for how to deal with
> > the other files which are being backed up to avoid needing to just
> > wholesale copy them.
> 
> Ideas?  Generally, I don't think that anything other than the main
> forks of relations is worth worrying about, because the files are too
> small to really matter.  Even if they're big, the main forks of
> relations will be much bigger.  I think.

Sadly, I haven't got any great ideas today.  I do know that the WAL-G
folks have specifically mentioned issues with the visibility map being
large enough across enough of their systems that it kinda sucks to deal
with.  Perhaps we could do something like the rsync binary-diff protocol
for non-relation files?  This is clearly just hand-waving but maybe
there's something reasonable in that idea.

> > I'm quite concerned that trying to graft this on to pg_basebackup
> > (which, as you note later, is missing an awful lot of what users expect
> > from a real backup solution already- retention handling, parallel
> > capabilities, WAL archive management, and many more... but also is just
> > not nearly as developed a tool as the external solutions) is going to
> > make things unnecessarily difficult when what we really want here is
> > better support from core for block-level incremental backup for the
> > existing external tools to leverage.
> >
> > Perhaps there's something here which can be done with pg_basebackup to
> > have it work with the block-level approach, but I certainly don't see
> > it as a natural next step for it, and it really does seem like limiting the
> > way this is implemented to something that pg_basebackup can easily
> > digest might make it less useful for the more developed tools.
> 
> I agree that there are a bunch of things that pg_basebackup does not
> do, such as backup management.  I think a lot of users do not want
> PostgreSQL to do backup management for them.  They have an existing
> solution that they use to manage backups, and they want PostgreSQL to
> interoperate with it. I think it makes sense for pg_basebackup to be
> in charge of taking the backup, and then other tools can either use it
> as a building block or use the streaming replication protocol to send
> approximately the same commands to the server.  

There's something like 6 different backup tools, at least, for
PostgreSQL that provide backup management, so I have a really hard time
agreeing with this idea that users don't want a PG backup management
system.  Maybe that's not what you're suggesting here, but that's what
came across to me.

Yes, there are some users who have an existing backup solution and
they'd like a better way to integrate PostgreSQL into that solution,
but that's usually something like filesystem snapshots or an enterprise
backup tool which has a PostgreSQL agent or similar to do the start/stop
and collect up the WAL, not something that's just calling pg_basebackup.

Those are typically not things we have any visibility into though and
aren't open source either (and, at least as often as not, they don't
seem to be very well thought through, based on my experience with those
tools...).

Unless maybe I'm misunderstanding and what you're suggesting here is
that the "existing solution" is something like the external PG-specific
backup tools?  But then the rest doesn't seem to make sense, as only
maybe one or two of those tools use pg_basebackup internally.

> I certainly would not
> want to expose server capabilities that let you take an incremental
> backup and NOT teach pg_basebackup to use them -- then we'd be in a
> situation of saying that PostgreSQL has incremental backup, but you
> have to get external tool XYZ to use it.  That will be perceived as
> PostgreSQL does NOT have incremental backup and this external tool
> adds it.

... but this is exactly the situation we're in already with all of the
*other* features around backup (parallel backup, backup management, WAL
management, etc).  Users want those features, pg_basebackup/PG core
doesn't provide it, and therefore there's a bunch of other tools which
have been written that do.  In addition, saying that PG has incremental
backup but no built-in management of those full-vs-incremental backups
and telling users that they basically have to build that themselves
really feels a lot like we're trying to address a check-box requirement
rather than making something that our users are going to be happy with.

> > As an example, I believe all of the other t

Re: block-level incremental backup

2019-04-18 Thread David Fetter
On Wed, Apr 17, 2019 at 11:57:35AM -0400, Bruce Momjian wrote:
> On Tue, Apr 16, 2019 at 06:40:44PM -0400, Robert Haas wrote:
> > Yeah, I really hope we don't end up with dueling patches.  I want to
> > come up with an approach that can be widely-endorsed and then have
> > everybody rowing in the same direction.  On the other hand, I do think
> > that we may support multiple options in certain places which may have
> > the kinds of trade-offs that Bruce mentions.  For instance,
> > identifying changed blocks by scanning the whole cluster and checking
> > the LSN of each block has an advantage in that it requires no prior
> > setup or extra configuration.  Like a sequential scan, it always
> > works, and that is an advantage.  Of course, for many people, the
> > competing advantage of a WAL-scanning approach that can save a lot of
> > I/O will appear compelling, but maybe not for everyone.  I think
> > there's room for two or three approaches there -- not in the sense of
> > competing patches, but in the sense of giving users a choice based on
> > their needs.
> 
> Well, by having a separate modblock file for each WAL file, you can keep
> both WAL and modblock files and use the modblock list to pull pages from
> each WAL file, or from the heap/index files, and it can be done in
> parallel.  Having WAL and modblock files in the same directory makes
> retention simpler.
> 
> In fact, you can do an incremental backup just using the modblock files
> and the heap/index files, so you don't even need the WAL.
> 
> Also, instead of storing the file name and block number in the modblock
> file, using the database oid, relfilenode, and block number (3 int32
> values) should be sufficient.

Would doing it that way constrain the design of new table access
methods in some meaningful way?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: block-level incremental backup

2019-04-18 Thread Bruce Momjian
On Thu, Apr 18, 2019 at 05:32:57PM +0200, David Fetter wrote:
> On Wed, Apr 17, 2019 at 11:57:35AM -0400, Bruce Momjian wrote:
> > Also, instead of storing the file name and block number in the modblock
> > file, using the database oid, relfilenode, and block number (3 int32
> > values) should be sufficient.
> 
> Would doing it that way constrain the design of new table access
> methods in some meaningful way?

I think these are the values used in WAL, so I assume table access
methods already have to map to those, unless they use their own.
I actually don't know.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: block-level incremental backup

2019-04-18 Thread Robert Haas
On Wed, Apr 17, 2019 at 5:20 PM Stephen Frost  wrote:
> As I understand it, the problem is not with backing up an individual
> database or cluster, but rather dealing with backing up thousands of
> individual clusters with thousands of tables in each, leading to an
> awful lot of tables with lots of FSMs/VMs, all of which end up having to
> get copied and stored wholesale.  I'll point this thread out to him and
> hopefully he'll have a chance to share more specific information.

Sounds good.

> I can agree with the idea of having multiple options for how to collect
> up the set of changed blocks, though I continue to feel that a
> WAL-scanning approach isn't something that we'd have implemented in the
> backend at all since it doesn't require the backend and a given backend
> might not even have all of the WAL that is relevant.  I certainly don't
> think it makes sense to have a backend go get WAL from the archive to
> then merge the WAL to provide the result to a client asking for it-
> that's adding entirely unnecessary load to the database server.

My motivation for wanting to include it in the database server was twofold:

1. I was hoping to leverage the background worker machinery.  The
WAL-scanner would just run all the time in the background, and start
up and shut down along with the server.  If it's a standalone tool,
then it can run on a different server or when the server is down, both
of which are nice.  The downside though is that now you probably have
to put it in crontab or under systemd or something, instead of just
setting a couple of GUCs and letting the server handle the rest.  For
me that downside seems rather significant, but YMMV.

2. In order for the information produced by the WAL-scanner to be
useful, it's got to be available to the server when the server is
asked for an incremental backup.  If the information is constructed by
a standalone frontend tool, and stored someplace other than under
$PGDATA, then the server won't have convenient access to it.  I guess
we could make it the client's job to provide that information to the
server, but I kind of liked the simplicity of not needing to give the
server anything more than an LSN.
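
On point 1, the amount of glue needed really is small.  For illustration,
registering such a scanner as a loadable-module background worker is roughly
the following (the module name and the walsummarizer_main() entry point are
made up; in core it would be wired up differently):

    #include "postgres.h"
    #include "fmgr.h"
    #include "postmaster/bgworker.h"

    PG_MODULE_MAGIC;

    void
    _PG_init(void)
    {
        BackgroundWorker worker;

        memset(&worker, 0, sizeof(worker));
        worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
        worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
        worker.bgw_restart_time = 60;   /* restart after 60s on crash */
        snprintf(worker.bgw_library_name, BGW_MAXLEN, "wal_summarizer");
        snprintf(worker.bgw_function_name, BGW_MAXLEN, "walsummarizer_main");
        snprintf(worker.bgw_name, BGW_MAXLEN, "WAL block-reference scanner");

        /* Starts and stops along with the server. */
        RegisterBackgroundWorker(&worker);
    }

The worker's main loop would then just follow the WAL and write out block
references, which is exactly the part that would be nice not to have to put
in crontab.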

> A thought that occurs to me is to have the functions for supporting the
> WAL merging be included in libcommon and available to both the
> independent executable that's available for doing WAL merging, and to
> the backend to be able to do WAL merging itself-

Yeah, that might be possible.

> but for a specific
> purpose: having a way to reduce the amount of WAL that needs to be sent
> to a replica which has a replication slot but that's been disconnected
> for a while.  Of course, there'd have to be some way to handle the other
> files for that to work to update a long out-of-date replica.  Now, if we
> taught the backup tool about having a replication slot then perhaps we
> could have the backend effectively have the same capability proposed
> above, but without the need to go get the WAL from the archive
> repository.

Hmm, but you can't just skip over WAL records or segments because
there are checksums and previous-record pointers and things

> I'm still not entirely sure that this makes sense to do in the backend
> due to the additional load, this is really just some brainstorming.

Would it really be that much load?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-04-18 Thread Andres Freund
Hi,

On 2019-04-18 11:34:32 -0400, Bruce Momjian wrote:
> On Thu, Apr 18, 2019 at 05:32:57PM +0200, David Fetter wrote:
> > On Wed, Apr 17, 2019 at 11:57:35AM -0400, Bruce Momjian wrote:
> > > Also, instead of storing the file name and block number in the modblock
> > > file, using the database oid, relfilenode, and block number (3 int32
> > > values) should be sufficient.
> > 
> > Would doing it that way constrain the design of new table access
> > methods in some meaningful way?
> 
> I think these are the values used in WAL, so I assume table access
> methods already have to map to those, unless they use their own.
> I actually don't know.

I don't think it'd be a meaningful restriction, given that we use those
for shared buffer descriptors, WAL, etc.

Greetings,

Andres Freund




Re: block-level incremental backup

2019-04-18 Thread Robert Haas
On Wed, Apr 17, 2019 at 6:43 PM Stephen Frost  wrote:
> Sadly, I haven't got any great ideas today.  I do know that the WAL-G
> folks have specifically mentioned issues with the visibility map being
> large enough across enough of their systems that it kinda sucks to deal
> with.  Perhaps we could do something like the rsync binary-diff protocol
> for non-relation files?  This is clearly just hand-waving but maybe
> there's something reasonable in that idea.

I guess it all comes down to how complicated you're willing to make
the client-server protocol.  With the very simple protocol that I
proposed -- client provides a threshold LSN and server sends blocks
modified since then -- the client need not have access to the old
incremental backup to take a new one.  Of course, if it happens to
have access to the old backup then it can delta-compress however it
likes after-the-fact, but that doesn't help with the amount of network
transfer.  That problem could be solved by doing something like what
you're talking about (with some probably-negligible false match rate)
but I have no intention of trying to implement anything that
complicated, and I don't really think it's necessary, at least not for
a first version.  What I proposed would already allow, for most users,
a large reduction in transfer and storage costs; what you are talking
about here would help more, but also be a lot more work and impose
some additional requirements on the system.  I don't object to you
implementing the more complex system, but I'll pass.

> There's something like 6 different backup tools, at least, for
> PostgreSQL that provide backup management, so I have a really hard time
> agreeing with this idea that users don't want a PG backup management
> system.  Maybe that's not what you're suggesting here, but that's what
> came across to me.

Let me be a little more clear.  Different users want different things.
Some people want a canned PostgreSQL backup solution, while other
people just want access to a reasonable set of facilities from which
they can construct their own solution.  I believe that the proposal I
am making here could be used either by backup tool authors to enhance
their offerings, or by individuals who want to build up their own
solution using facilities provided by core.

> Unless maybe I'm misunderstanding and what you're suggesting here is
> that the "existing solution" is something like the external PG-specific
> backup tools?  But then the rest doesn't seem to make sense, as only
> maybe one or two of those tools use pg_basebackup internally.

Well, what I'm really talking about is in two pieces: providing some
new facilities via the replication protocol, and making pg_basebackup
able to use those facilities.  Nothing would stop other tools from
using those facilities directly if they wish.

> ... but this is exactly the situation we're in already with all of the
> *other* features around backup (parallel backup, backup management, WAL
> management, etc).  Users want those features, pg_basebackup/PG core
> doesn't provide it, and therefore there's a bunch of other tools which
> have been written that do.  In addition, saying that PG has incremental
> backup but no built-in management of those full-vs-incremental backups
> and telling users that they basically have to build that themselves
> really feels a lot like we're trying to address a check-box requirement
> rather than making something that our users are going to be happy with.

I disagree.  Yes, parallel backup, like incremental backup, needs to
go in core.  And pg_basebackup should be able to do a parallel backup.
I will fight tooth, nail, and claw any suggestion that the server
should know how to do a parallel backup but pg_basebackup should not
have an option to exploit that capability.  And similarly for
incremental.

> I don't think that I was very clear in what my specific concern here
> was.  I'm not asking for pg_basebackup to have parallel backup (at
> least, not in this part of the discussion), I'm asking for the
> incremental block-based protocol that's going to be built-in to core to
> be able to be used in a parallel fashion.
>
> The existing protocol that pg_basebackup uses is basically, connect to
> the server and then say "please give me a tarball of the data directory"
> and that is then streamed on that connection, making that protocol
> impossible to use for parallel backup.  That's fine as far as it goes
> because only pg_basebackup actually uses that protocol (note that nearly
> all of the other tools for doing backups of PostgreSQL don't...).  If
> we're expecting the external tools to use the block-level incremental
> protocol then that protocol really needs to have a way to be
> parallelized, otherwise we're just going to end up with all of the
> individual tools doing their own thing for block-level incremental
> (though perhaps they'd reimplement whatever is done in core but in a way
> that they could parallelize it...), if possible

Re: block-level incremental backup

2019-04-18 Thread Andres Freund
Hi,

> > Wow.  I have to admit that I feel completely opposite of that- I'd
> > *love* to have an independent tool (which ideally uses the same code
> > through the common library, or similar) that can be run to apply WAL.
> >
> > In other words, I don't agree that it's the server's problem at all to
> > solve that, or, at least, I don't believe that it needs to be.
> 
> I mean, I guess I'd love to have that if I could get it by waving a
> magic wand, but I wouldn't love it if I had to write the code or
> maintain it.  The routines for applying WAL currently all assume that
> you have a whole bunch of server infrastructure present; that code
> wouldn't run in a frontend environment, I think.  I wouldn't want to
> have a second copy of every WAL apply routine that might have its own
> set of bugs.

I'll fight tooth and nail not to have a second implementation of replay,
even if it's just portions.  The code we have is complicated and fragile
enough, having a [partial] second version would be way worse.  There's
already plenty improvements we need to make to speed up replay, and a
lot of them require multiple execution threads (be it processes or OS
threads), something not easily feasible in a standalone tool. And
without the already existing concurrent work during replay (primarily
checkpointer doing a lot of the necessary IO), it'd also be pretty
unattractive to use any separate tool.

Unless you just define the server binary as that "independent tool".
Which I think is entirely reasonable. With the 'consistent' and LSN
recovery targets one already can get most of what's needed from such a
tool, anyway.  I'd argue the biggest issue there is that there's no
equivalent to starting postgres with a private socket directory on
Windows, and perhaps an option or two making it easier to start postgres
in a "private" mode for things like this.

Greetings,

Andres Freund




Re: block-level incremental backup

2019-04-18 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Wed, Apr 17, 2019 at 5:20 PM Stephen Frost  wrote:
> > As I understand it, the problem is not with backing up an individual
> > database or cluster, but rather dealing with backing up thousands of
> > individual clusters with thousands of tables in each, leading to an
> > awful lot of tables with lots of FSMs/VMs, all of which end up having to
> > get copied and stored wholesale.  I'll point this thread out to him and
> > hopefully he'll have a chance to share more specific information.
> 
> Sounds good.

Ok, done.

> > I can agree with the idea of having multiple options for how to collect
> > up the set of changed blocks, though I continue to feel that a
> > WAL-scanning approach isn't something that we'd have implemented in the
> > backend at all since it doesn't require the backend and a given backend
> > might not even have all of the WAL that is relevant.  I certainly don't
> > think it makes sense to have a backend go get WAL from the archive to
> > then merge the WAL to provide the result to a client asking for it-
> > that's adding entirely unnecessary load to the database server.
> 
> My motivation for wanting to include it in the database server was twofold:
> 
> 1. I was hoping to leverage the background worker machinery.  The
> WAL-scanner would just run all the time in the background, and start
> up and shut down along with the server.  If it's a standalone tool,
> then it can run on a different server or when the server is down, both
> of which are nice.  The downside though is that now you probably have
> to put it in crontab or under systemd or something, instead of just
> setting a couple of GUCs and letting the server handle the rest.  For
> me that downside seems rather significant, but YMMV.

Background workers can be used to do pretty much anything.  I'm not
suggesting that's a bad thing- just that it's such a completely generic
tool that could be used to put anything/everything into the backend, so
I'm not sure how much it makes sense as an argument when it comes to
designing a new capability/feature.  Yes, there's an advantage there
when it comes to configuration since that means we don't need to set up
a cronjob and can, instead, just set a few GUCs...  but it also means
that it *must* be done on the server and there's no option to do it
elsewhere, as you say.

When it comes to "this is something that I can do on the DB server or on
some other server", the usual preference is to use another system for
it, to reduce load on the server.

If it comes down to something that needs to/should be an ongoing
process, then packagers can ship it as a daemon-type tool which
handles the systemd side of things, assuming the stand-alone tool
supports that, which it hopefully would.

> 2. In order for the information produced by the WAL-scanner to be
> useful, it's got to be available to the server when the server is
> asked for an incremental backup.  If the information is constructed by
> a standalone frontend tool, and stored someplace other than under
> $PGDATA, then the server won't have convenient access to it.  I guess
> we could make it the client's job to provide that information to the
> server, but I kind of liked the simplicity of not needing to give the
> server anything more than an LSN.

If the WAL-scanner tool is a stand-alone tool, and it handles picking
out all of the FPIs and incremental page changes for each relation, then
what does the tool to build out the "new" backup really need to tell the
backend?  I feel like it mainly needs to ask the backend for the
non-relation files, which gets into at least one approach that I've
thought about for redesigning the backup protocol:

1. Ask for a list of files and metadata about them
2. Allow asking for individual files
3. Support multiple connections asking for individual files

Quite a few of the existing backup tools for PG use a model along these
lines (or use tools underneath which do).
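
In the same spirit as the example commands upthread, that could look
something like (command names entirely made up, just to show the shape):

LIST_FILES
returning file names plus metadata (size, mtime, and optionally a per-file
map of changed blocks)
GET_FILE filename [map of blocks]
returning the whole file, or only the requested blocks

with any number of connections each issuing their own GET_FILE commands.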

> > A thought that occurs to me is to have the functions for supporting the
> > WAL merging be included in libcommon and available to both the
> > independent executable that's available for doing WAL merging, and to
> > the backend to be able to do WAL merging itself-
> 
> Yeah, that might be possible.

I feel like this would be necessary, as it's certainly delicate and
critical code and having multiple implementations of it will be
difficult to manage.

That said...  we already have independent work going on to do WAL
merging (WAL-G, at least), and if we insist that the WAL replay code
only exists in the backend, I strongly suspect we'll end up with
independent implementations of that too.  Sure, we can distance
ourselves from that and say that we don't have to deal with any bugs
from it... but it seems like the better approach would be to have a
common library that provides it.

> > but for a specific
> > purpose: having a way to reduce the amount of WAL that needs to be sent

Re: block-level incremental backup

2019-04-18 Thread Stephen Frost
Greetings,

I wanted to respond to this point specifically as I feel like it'll
really help clear things up when it comes to the point of view I'm
seeing this from.

* Robert Haas (robertmh...@gmail.com) wrote:
> > Perhaps that's what we're both saying too and just talking past each
> > other, but I feel like the approach here is "make it work just for the
> > simple pg_basebackup case and not worry too much about the other tools,
> > since what we do for pg_basebackup will work for them too" while where
> > I'm coming from is "focus on what the other tools need first, and then
> > make pg_basebackup work with that if there's a sensible way to do so."
> 
> I think perhaps the disconnect is that I just don't see how it can
> fail to work for the external tools if it works for pg_basebackup.

The existing backup protocol that pg_basebackup uses *does* *not* *work*
for the external backup tools.  If it worked, they'd use it, but they
don't and that's because you can't do things like a parallel backup,
which we *know* users want because there are a number of tools which
implement that exact capability.

I do *not* want another piece of functionality added in this space which
is limited in the same way because it does *not* help the external
backup tools at all.

> Any given piece of functionality is either available in the
> replication stream, or it's not.  I suspect that for both BART and
> pg_backrest, they won't be able to completely give up on having their
> own backup engines solely because core has incremental backup, but I
> don't know what the alternative to adding features to core one at a
> time is.

This idea that it's either "in the replication system" or "not in the
replication system" is really bad, in my view, because it can be "in the
replication system" and at the same time not at all useful to the
existing external backup tools, but users and others will see the
"checkbox" as ticked and assume that it's available in a useful fashion
by the backend and then get upset when they discover the limitations.

The existing base backup/replication protocol that's used by
pg_basebackup is *not* useful to most of the backup tools; that's quite
clear since they *don't* use it.  Building on to that an incremental
backup solution that is similarly limited isn't going to make things
easier for the external tools.

If the goal is to make things easier for the external tools by providing
capability in the backend / replication protocol then we need to be
looking at what those tools require and not at what would be minimally
sufficient for pg_basebackup.  If we don't care about the external tools
and *just* care about making it work for pg_basebackup, then let's be
clear about that, and accept that it'll have to be, most likely, ripped
out and rewritten when we go to add parallel capabilities, for example,
to pg_basebackup down the road.  That's clearly the case for the
existing "base backup" protocol, so I don't see why it'd be different
for an incremental backup system that is similarly designed and
implemented.

To be clear, I'm all for adding features to core one at a time, but
there's different ways to implement features and that's really what
we're talking about here- what's the best way to implement this
feature, ideally in a way that it's useful, practically, to both
pg_basebackup and the other external backup utilities.

Thanks!

Stephen




Re: block-level incremental backup

2019-04-18 Thread Stephen Frost
Greetings,

Ok, responding to the rest of this email.

* Robert Haas (robertmh...@gmail.com) wrote:
> On Wed, Apr 17, 2019 at 6:43 PM Stephen Frost  wrote:
> > Sadly, I haven't got any great ideas today.  I do know that the WAL-G
> > folks have specifically mentioned issues with the visibility map being
> > large enough across enough of their systems that it kinda sucks to deal
> > with.  Perhaps we could do something like the rsync binary-diff protocol
> > for non-relation files?  This is clearly just hand-waving but maybe
> > there's something reasonable in that idea.
> 
> I guess it all comes down to how complicated you're willing to make
> the client-server protocol.  With the very simple protocol that I
> proposed -- client provides a threshold LSN and server sends blocks
> modified since then -- the client need not have access to the old
> incremental backup to take a new one.

Where is the client going to get the threshold LSN from?

> Of course, if it happens to
> have access to the old backup then it can delta-compress however it
> likes after-the-fact, but that doesn't help with the amount of network
> transfer.

If it doesn't have access to the old backup, then I'm a bit confused as
to how an incremental backup would be possible?  Isn't that a requirement
here?

> That problem could be solved by doing something like what
> you're talking about (with some probably-negligible false match rate)
> but I have no intention of trying to implement anything that
> complicated, and I don't really think it's necessary, at least not for
> a first version.  What I proposed would already allow, for most users,
> a large reduction in transfer and storage costs; what you are talking
> about here would help more, but also be a lot more work and impose
> some additional requirements on the system.  I don't object to you
> implementing the more complex system, but I'll pass.

I was talking about the rsync binary-diff specifically for the files
that aren't easy to deal with in the WAL stream.  I wouldn't think we'd
use it for other files, and there is definitely a question there of whether
there's a way to do better than a binary-diff approach for those files.

> > There's something like 6 different backup tools, at least, for
> > PostgreSQL that provide backup management, so I have a really hard time
> > agreeing with this idea that users don't want a PG backup management
> > system.  Maybe that's not what you're suggesting here, but that's what
> > came across to me.
> 
> Let me be a little more clear.  Different users want different things.
> Some people want a canned PostgreSQL backup solution, while other
> people just want access to a reasonable set of facilities from which
> they can construct their own solution.  I believe that the proposal I
> am making here could be used either by backup tool authors to enhance
> their offerings, or by individuals who want to build up their own
> solution using facilities provided by core.

The last thing that I think users really want is to build up their own
solution.  There may be some organizations who would like to provide
their own tool, but that's a bit different.  Personally, I'd *really*
like PG to have a good tool in this area and I've been working, as I've
said before, to try to get to a point where we at least have the option
to add in such a tool that meets our various requirements.

Further, I'm concerned that the approach being presented here won't be
interesting to most of the external tools because it's limited and can't
be used in a parallel fashion.

> > Unless maybe I'm misunderstanding and what you're suggesting here is
> > that the "existing solution" is something like the external PG-specific
> > backup tools?  But then the rest doesn't seem to make sense, as only
> > maybe one or two of those tools use pg_basebackup internally.
> 
> Well, what I'm really talking about is in two pieces: providing some
> new facilities via the replication protocol, and making pg_basebackup
> able to use those facilities.  Nothing would stop other tools from
> using those facilities directly if they wish.

If those facilities are developed and implemented in the same way as the
protocol used by pg_basebackup works, then I strongly suspect that the
existing backup tools will treat them similarly- which is to say, they'll
largely end up ignoring them.

> > ... but this is exactly the situation we're in already with all of the
> > *other* features around backup (parallel backup, backup management, WAL
> > management, etc).  Users want those features, pg_basebackup/PG core
> > doesn't provide it, and therefore there's a bunch of other tools which
> > have been written that do.  In addition, saying that PG has incremental
> > backup but no built-in management of those full-vs-incremental backups
> > and telling users that they basically have to build that themselves
> > really feels a lot like we're trying to address a check-box requirement
> > rather than making something that our

Re: block-level incremental backup

2019-04-19 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> > > Wow.  I have to admit that I feel completely opposite of that- I'd
> > > *love* to have an independent tool (which ideally uses the same code
> > > through the common library, or similar) that can be run to apply WAL.
> > >
> > > In other words, I don't agree that it's the server's problem at all to
> > > solve that, or, at least, I don't believe that it needs to be.
> > 
> > I mean, I guess I'd love to have that if I could get it by waving a
> > magic wand, but I wouldn't love it if I had to write the code or
> > maintain it.  The routines for applying WAL currently all assume that
> > you have a whole bunch of server infrastructure present; that code
> > wouldn't run in a frontend environment, I think.  I wouldn't want to
> > have a second copy of every WAL apply routine that might have its own
> > set of bugs.
> 
> I'll fight tooth and nail not to have a second implementation of replay,
> even if it's just portions.  The code we have is complicated and fragile
> enough, having a [partial] second version would be way worse.  There's
> already plenty improvements we need to make to speed up replay, and a
> lot of them require multiple execution threads (be it processes or OS
> threads), something not easily feasible in a standalone tool. And
> without the already existing concurrent work during replay (primarily
> checkpointer doing a lot of the necessary IO), it'd also be pretty
> unattractive to use any separate tool.

I agree that we don't want another implementation and that there's a lot
that we want to do to improve replay performance.  We've already got
frontend tools which work with multiple execution threads, so I'm not
sure I get the "not easily feasible" bit, and the argument about the
checkpointer seems largely related to that (as in- if we didn't have
multiple threads/processes then things would perform quite badly...  but
we can and do have multiple threads/processes in frontend tools today,
even in pg_basebackup).

You certainly bring up some good concerns though and they make me think
of other bits that would seem like they'd possibly be larger issues for
a frontend tool- like having a large pool of memory for caching (aka
shared buffers) the changes.  If what we're talking about here is *just*
replay though, without having the system available for reads, I wonder
if we might want a different solution there.

> Unless you just define the server binary as that "independent tool".

That's certainly an interesting idea.

> Which I think is entirely reasonable. With the 'consistent' and LSN
> recovery targets one already can get most of what's needed from such a
> tool, anyway.  I'd argue the biggest issue there is that there's no
> equivalent to starting postgres with a private socket directory on
> windows, and perhaps an option or two making it easier to start postgres
> in a "private" mode for things like this.

This would mean building in a way to do parallel WAL replay into the
server binary though, as discussed above, and it seems like making that
work in a way that allows us to still be available as a read-only
standby would be quite a bit more difficult.  We could possibly support
parallel WAL replay only when we aren't a replica but from the same
binary.  The concerns mentioned about making it easier to start PG in a
private mode don't seem too bad but I am not entirely sure that the
tools which want to leverage that kind of capability would want to have
to exec out to the PG binary to use it.

A lot of this part of the discussion feels like a tangent though, unless
I'm missing something.  The "WAL compression" tool contemplated
previously would be much simpler and not the full-blown WAL replay
capability, which would be left to the server, unless you're suggesting
that even that should be exclusively the purview of the backend?  Though
that ship's already sailed, given that external projects have
implemented it.  Having a library to provide that which external
projects could leverage would be nicer than having everyone write their
own version.

Thanks!

Stephen




Re: block-level incremental backup

2019-04-19 Thread Robert Haas
On Thu, Apr 18, 2019 at 6:39 PM Stephen Frost  wrote:
> Where is the client going to get the threshold LSN from?
>
> If it doesn't have access to the old backup, then I'm a bit confused as
> to how a incremental backup would be possible?  Isn't that a requirement
> here?

I explained this in the very first email that I wrote on this thread,
and then wrote a very extensive further reply on this exact topic to
Peter Eisentraut.  It's a bit disheartening to see you arguing against
my ideas when it's not clear that you've actually read and understood
them.

> > The obvious way of extending this system to parallel backup is to have
> > N connections each streaming a separate tarfile such that when you
> > combine them all you recreate the original data directory.  That would
> > be perfectly compatible with what I'm proposing for incremental
> > backup.  Maybe you have another idea in mind, but I don't know what it
> > is exactly.
>
> So, while that's an obvious approach, it isn't the most sensible- and
> we know that from experience in actually implementing parallel backup of
> PG files.  I'm happy to discuss the approach we use in pgBackRest if
> you'd like to discuss this further, but it seems a bit far afield from
> the topic of discussion here and it seems like you're not interested or
> offering to work on supporting parallel backup in core.

If there's some way of modifying my proposal so that it makes life
better for external backup tools, I'm certainly willing to consider
that, but you're going to have to tell me what you have in mind.  If
that means describing what pgbackrest does, then do it.

My concern here is that you seem to want a lot of complicated stuff
that will require *significant* setup in order for people to be able
to use it.  From what I am able to gather from your remarks so far,
you think people should archive their WAL to a separate machine, and
then the WAL-summarizer should run there, and then data from that
should be fed back to the backup client, which should then give the
server a list of modified files (and presumably, someday, blocks) and
the server then returns that data, which the client then
cross-verifies with checksums and awesome sauce.

Which is all fine, but actually requires quite a bit of set-up and
quite a bit of buy-in to the tool.  And I have no problem with people
having that level of buy-in to the tool.  EnterpriseDB offers a number
of tools which require similar levels of setup and configuration, and
it's not inappropriate for an enterprise-grade backup tool to have all
that stuff.  However, for those who may not want to do all that, my
original proposal lets you take an incremental backup by doing the
following list of steps:

1. Take an incremental backup.

If you'd like, you can also:

0. Enable the WAL-scanning background worker to make incremental
backups much faster.

You do not need a WAL archive, and you do not need EITHER the backup
tool or the server to have access to previous backups, and you do not
need the client to have any access to archived WAL or the summary
files produced from it.  The only thing you need to know is the
start-of-backup LSN for the previous backup.
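
(For what it's worth, that LSN is already recorded in the backup_label
file that the previous backup contains, so a client could dig it out with
something as small as this rough sketch - the path is of course just an
example:)

    import re

    def start_of_backup_lsn(backup_label_path):
        # backup_label's first line looks like:
        #   START WAL LOCATION: 0/9000028 (file 000000010000000000000009)
        with open(backup_label_path) as f:
            for line in f:
                m = re.match(r'START WAL LOCATION: ([0-9A-Fa-f]+/[0-9A-Fa-f]+)',
                             line)
                if m:
                    return m.group(1)
        raise ValueError('no START WAL LOCATION line found')

    # e.g. start_of_backup_lsn('/backups/full/backup_label') -> '0/9000028'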

I expect you to reply with a long complaint about how my proposal is
totally inadequate, but actually I think for most people, most of the
time, it would not only be adequate, but extremely convenient.  And
despite your protestations to the contrary, it does not block
parallelism, checksum verification, or any other cool features that
somebody may want to add later.  It'll work just fine with those
things.

And for the record, I am willing to put some effort into parallelism.
I just think that it makes more sense to do the incremental part
first.  I think that incremental backup is likely to have less effect
on parallel backup than the other way around.  What I'm NOT willing to
do is build a whole bunch of infrastructure that will help pgbackrest
do amazing things but will not provide a simple and convenient way of
taking incremental backups using only core tools.  I do care about
having something that's good for pgbackrest and other out-of-core
tools.  I just care about it MUCH LESS than I care about making
PostgreSQL core awesome.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-04-19 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> What I'm NOT willing to
> do is build a whole bunch of infrastructure that will help pgbackrest
> do amazing things but will not provide a simple and convenient way of
> taking incremental backups using only core tools.  I do care about
> having something that's good for pgbackrest and other out-of-core
> tools.  I just care about it MUCH LESS than I care about making
> PostgreSQL core awesome.

Then I misunderstood your original proposal where you talked about
providing something that the various external tools could use.  If you'd
like to *just* provide a mechanism for pg_basebackup to be able to do a
trivial incremental backup, great, but it's not going to be useful or
used by the external tools, just like the existing base backup protocol
isn't used by the external tools because it can't be used in a parallel
fashion.

As such, and with all the other missing bits from pg_basebackup, it
looks likely to me that such a feature is going to be lackluster, at
best, and end up being only marginally interesting, when it could have
been much more and leveraged by all of the existing tools.  I agree that
making a parallel-supporting protocol work is harder but I actually
don't think it would be *that* much more difficult to do.

That's frankly discouraging, but I'm not going to tell you where to
spend your time.

Making PG core awesome when it comes to backup is going to involve so
much more than just marginal improvements to pg_basebackup, but it's
also something that I'm very much supportive of and have invested a
great deal in, by spending time and resources working to build a tool
that gets closer to what an in-core solution would look like than
anything that exists today.

Thanks,

Stephen




Re: block-level incremental backup

2019-04-20 Thread Andrey Borodin
Hi!

Sorry for the delay.

> On 18 Apr 2019, at 21:56, Robert Haas wrote:
> 
> On Wed, Apr 17, 2019 at 5:20 PM Stephen Frost  wrote:
>> As I understand it, the problem is not with backing up an individual
>> database or cluster, but rather dealing with backing up thousands of
>> individual clusters with thousands of tables in each, leading to an
>> awful lot of tables with lots of FSMs/VMs, all of which end up having to
>> get copied and stored wholesale.  I'll point this thread out to him and
>> hopefully he'll have a chance to share more specific information.
> 
> Sounds good.

When we introduced WAL-delta backups, we ran into two things:
1. A heavy spike in network load.  We shift the beginning of each backup 
randomly, but the variation is not very big: the night is short and we want to 
make big backups during low-rps hours.  This low variation in the start times 
of the small backups creates a big network spike.
2. Incremental backups became very cheap when measured in the resources used 
on a single cluster.

The 1st is not a big problem, actually, but we realized that we can do 
incremental backups not just at night but, for example, 4 times a day.  Or 
every hour.  Or every minute.  Why not, if they are cheap enough?

An incremental backup of a 1Tb DB made only a few minutes after the previous 
one (a small change set) is a few Gbs.  All of that size is made up of FSM (no 
LSN) and VM (hard to use LSN).
Sure, this overhead is fine if we make a daily backup.  But at some frequency 
of backups it will be too much.

I think that the problem of incrementing FSM and VM is too distant now.
But if I had to implement it right now I'd choose the following way: do not 
back up FSM and VM, and recreate them during restore.  It looks possible, but 
too AM-specific.
It is hard when you write a backup tool in Go and cannot simply link with PG.

> On 15 Apr 2019, at 18:01, Stephen Frost wrote:
> ...the goal here
> isn't actually to make pg_basebackup into an enterprise backup tool,
> ...

BTW, I'm all for extensibility and "hackability".  But, personally, I'd be 
happy if pg_basebackup were ubiquitous and sufficient, and tools like WAL-G and 
others became part of history.  There is no fundamental reason why an external 
backup tool can be better than a backup tool in core.  (Unlike many PLs, data 
types, hooks, tuners etc.)


There are 53 mentions of "parallel backup" here.  I want to note that there 
may be parallel reads from disk and parallel network transmission.  The things 
between these two are negligible and can be single-threaded.  From my POV, 
it's not about threads, it's about saturated IO controllers.
Also I think parallel restore matters more than parallel backup.  Backups 
themselves can be slow, on many clusters we even throttle disk IO.  But users 
may want parallel backup to catch up a standby.

Thanks.

Best regards, Andrey Borodin.



Re: block-level incremental backup

2019-04-20 Thread Robert Haas
On Sat, Apr 20, 2019 at 12:19 AM Stephen Frost  wrote:
> * Robert Haas (robertmh...@gmail.com) wrote:
> > What I'm NOT willing to
> > do is build a whole bunch of infrastructure that will help pgbackrest
> > do amazing things but will not provide a simple and convenient way of
> > taking incremental backups using only core tools.  I do care about
> > having something that's good for pgbackrest and other out-of-core
> > tools.  I just care about it MUCH LESS than I care about making
> > PostgreSQL core awesome.
>
> Then I misunderstood your original proposal where you talked about
> providing something that the various external tools could use.  If you'd
> like to *just* provide a mechanism for pg_basebackup to be able to do a
> trivial incremental backup, great, but it's not going to be useful or
> used by the external tools, just like the existing base backup protocol
> isn't used by the external tools because it can't be used in a parallel
> fashion.

Well, what I meant - and perhaps I wasn't clear enough about this - is
that it could be used by an external solution for *managing* backups,
not so much an external engine for *taking* backups.  But actually, I
really don't see any reason why the latter wouldn't also be possible.
It was already suggested upthread by Anastasia that there should be a
way to ask the server to give only the identity of the modified blocks
without the contents of those blocks; if we provide that, then a tool
can get those and do whatever it likes with them, including fetching
them in parallel by some other means.  Another obvious extension would
be to add a command that says 'give me this file' or 'give me this
file but only this list of blocks' which would give clients lots of
options: they could provide their own lists of blocks to fetch
computed by whatever internal magic they have, or they could request
the server's modified-block map information first and then schedule
fetching those blocks in parallel using this new command.  So it seems
like with some pretty straightforward extensions this can be made
usable by and valuable to people wanting to build external backup
engines, too.  I do not necessarily feel obliged to implement every
feature that might help with that kind of thing just because I've
expressed an interest in this general area, but I might do some of
them, and maybe people like you or Anastasia who want to make these
facilities available to external tools can help with some of the work,
too.
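
(Purely as an illustration of the shape such extensions could take - none
of these command names or this syntax is settled, and the LSN is made up:

    IDENTIFY_MODIFIED_BLOCKS SINCE '0/16B3748';        -- identities only
    SEND_FILE 'base/16384/16417';                      -- whole file
    SEND_FILE 'base/16384/16417' BLOCKS 0, 17, 4123;   -- just these blocks

A client could fan the SEND_FILE requests out over several connections if
it wanted parallelism.)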

That being said, as long as there is significant demand for
value-added backup features over and above what is in core, there are
probably going to be non-core backup tools that do things their own
way instead of just leaning on whatever the server provides natively.
In a certain sense that's regrettable, because it means that somebody
- or perhaps multiple somebodys - goes to the trouble of doing
something outside core and then somebody else puts something in core
that obsoletes it and therein lies duplication of effort.  On the
other hand, it also allows people to innovate way faster than can be
done in core, it allows competition among different possible designs,
and it's just kinda the way we roll around here.  I can't get very
worked up about it.

One thing I'm definitely not going to do here is abandon my goal of
producing a *simple* incremental backup solution that can be deployed
*easily* by users. I understand from your remarks that such a solution
will not suit everybody.  However, unlike you, I do not believe that
pg_basebackup was a failure.  I certainly agree that it has some
limitations that mean that it is hard to use in large deployments, but
it's also *extremely* convenient for people with a fairly small
database when they just need a quick and easy backup.  Adding some
more features to it - such as incremental backup - will make it useful
to more people in more cases.  There will doubtless still be people
who need more, and that's OK: those people can use a third-party tool.
I will not get anywhere trying to solve every problem at once.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-04-20 Thread Robert Haas
On Sat, Apr 20, 2019 at 12:44 PM Andrey Borodin  wrote:
> Incremental backup of 1Tb DB made with distance of few minutes (small change 
> set) is few Gbs. All of this size is made of FSM (no LSN) and VM (hard to use 
> LSN).
> Sure, this overhead size is fine if we make daily backup. But at some 
> frequency of backups it will be too much.

It seems like if the backups are only a few minutes apart, PITR might
be a better choice than super-frequent incremental backups.  What do
you think about that?

> I think that problem of incrementing FSM and VM is too distant now.
> But if I had to implement it right now I'd choose following way: do not 
> backup FSM and VM, recreate it during restore. Looks like it is possible, but 
> too much AM-specific.

Interesting idea - that's worth some more thought.

> BTW, I'm all hands for extensibility and "hackability". But, personally, I'd 
> be happy if pg_basebackup would be ubiquitous and sufficient. And tools like 
> WAL-G and others became part of a history. There is not fundamental reason 
> why external backup tool can be better than backup tool in core. (Unlike many 
> PLs, data types, hooks, tuners etc)

+1

> Here's 53 mentions of "parallel backup". I want to note that there may be 
> parallel read from disk and parallel network transmission. Things between 
> these two are neglectable and can be single-threaded. From my POV, it's not 
> about threads, it's about saturated IO controllers.
> Also I think parallel restore matters more than parallel backup. Backups 
> themself can be slow, on many clusters we even throttle disk IO. But users 
> may want parallel backup to catch-up standby.

I'm not sure I entirely understand your point here -- are you saying
that parallel backup is important, or that it's not important, or
something in between?  Do you think it's more or less important than
incremental backup?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-04-20 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Sat, Apr 20, 2019 at 12:19 AM Stephen Frost  wrote:
> > * Robert Haas (robertmh...@gmail.com) wrote:
> > > What I'm NOT willing to
> > > do is build a whole bunch of infrastructure that will help pgbackrest
> > > do amazing things but will not provide a simple and convenient way of
> > > taking incremental backups using only core tools.  I do care about
> > > having something that's good for pgbackrest and other out-of-core
> > > tools.  I just care about it MUCH LESS than I care about making
> > > PostgreSQL core awesome.
> >
> > Then I misunderstood your original proposal where you talked about
> > providing something that the various external tools could use.  If you'd
> > like to *just* provide a mechanism for pg_basebackup to be able to do a
> > trivial incremental backup, great, but it's not going to be useful or
> > used by the external tools, just like the existing base backup protocol
> > isn't used by the external tools because it can't be used in a parallel
> > fashion.
> 
> Well, what I meant - and perhaps I wasn't clear enough about this - is
> that it could be used by an external solution for *managing* backups,
> not so much an external engine for *taking* backups.  But actually, I
> really don't see any reason why the latter wouldn't also be possible.
> It was already suggested upthread by Anastasia that there should be a
> way to ask the server to give only the identity of the modified blocks
> without the contents of those blocks; if we provide that, then a tool
> can get those and do whatever it likes with them, including fetching
> them in parallel by some other means.  Another obvious extension would
> be to add a command that says 'give me this file' or 'give me this
> file but only this list of blocks' which would give clients lots of
> options: they could provide their own lists of blocks to fetch
> computed by whatever internal magic they have, or they could request
> the server's modified-block map information first and then schedule
> fetching those blocks in parallel using this new command.  So it seems
> like with some pretty straightforward extensions this can be made
> usable by and valuable to people wanting to build external backup
> engines, too.  I do not necessarily feel obliged to implement every
> feature that might help with that kind of thing just because I've
> expressed an interest in this general area, but I might do some of
> them, and maybe people like you or Anastasia who want to make these
> facilities available to external tools can help with some of the work,
> too.

Yes, if we spend a bit of time thinking about how this could be
implemented in a way that could be used by multiple connections
concurrently then we could provide something that both pg_basebackup and
the external tools could use.  Getting a list first and then supporting
a 'give me this file' API, or 'give me these blocks from this file'
would be very similar to what many of the external tools do today.  I agree
that I don't think it'd be hard to do.  I'm suggesting that we do that
instead of, at a protocol level, something similar to what was done with
pg_basebackup which prevents that.

I don't really agree that implementing "give me a list of files" and
"give me this file" is really somehow an 'extension' to the tar-based
approach that pg_basebackup uses today, it's really a rather different
thing, and I mention that as a parallel (hah!) to what we're discussing
here regarding the incremental backup approach.

Having been around for a while working on backup-related things, if I
was to implement the protocol for pg_basebackup today, I'd definitely
implement "give me a list" and "give me this file" rather than the
tar-based approach, because I've learned that people want to be
able to do parallel backups and that's a decent way to do that.  I
wouldn't set out and implement something new that there's just no hope
of making parallel.  Maybe the first write of pg_basebackup would still
be simple and serial since it's certainly more work to make a frontend
tool like that work in parallel, but at least the protocol would be
ready to support a parallel option being added later without being
rewritten.

And that's really what I was trying to get at here- if we've got the
choice now to decide what this is going to look like from a protocol
level, it'd be great if we could make it able to support being used in a
parallel fashion, even if pg_basebackup is still single-threaded.

> That being said, as long as there is significant demand for
> value-added backup features over and above what is in core, there are
> probably going to be non-core backup tools that do things their own
> way instead of just leaning on whatever the server provides natively.
> In a certain sense that's regrettable, because it means that somebody
> - or perhaps multiple somebodys - goes to the trouble of doing
> something outside core and then somebody else puts someth

Re: block-level incremental backup

2019-04-21 Thread Andrey Borodin



> On 21 Apr 2019, at 1:13, Robert Haas wrote:
> 
> On Sat, Apr 20, 2019 at 12:44 PM Andrey Borodin  wrote:
>> Incremental backup of 1Tb DB made with distance of few minutes (small change 
>> set) is few Gbs. All of this size is made of FSM (no LSN) and VM (hard to 
>> use LSN).
>> Sure, this overhead size is fine if we make daily backup. But at some 
>> frequency of backups it will be too much.
> 
> It seems like if the backups are only a few minutes apart, PITR might
> be a better choice than super-frequent incremental backups.  What do
> you think about that?
PITR is painfully slow on heavily loaded clusters.  I have observed restores 
where replaying 5 seconds' worth of WAL took 4 seconds.  The backup was only a 
few hours behind the primary node, but it could catch up only at night.
And during this process only one of 56 CPU cores was used, and the SSD RAID 
throughput was not 100% utilized.

Block-level delta backups can be restored very efficiently: if we restore from 
the newest step back to the oldest, we write no more than the cluster size as 
of the last backup.
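
(A minimal sketch of what I mean, with a made-up layout: each backup step
is a dict of block number to 8kB page for one relation file, and we
rebuild the final state writing every block at most once:)

    def restore_blocks(full, incrementals_newest_first):
        result = {}
        for delta in incrementals_newest_first:   # newest change wins
            for blkno, page in delta.items():
                result.setdefault(blkno, page)    # skip blocks already restored
        for blkno, page in full.items():          # the rest comes from the full backup
            result.setdefault(blkno, page)
        return result                             # block number -> final page contents

    # Block 0 changed in both steps; only the newest copy is written.
    full = {0: b'old0', 1: b'old1', 2: b'old2'}
    older = {0: b'mid0'}
    newest = {0: b'new0', 2: b'new2'}
    assert restore_blocks(full, [newest, older]) == {0: b'new0', 1: b'old1', 2: b'new2'}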

>> I think that problem of incrementing FSM and VM is too distant now.
>> But if I had to implement it right now I'd choose following way: do not 
>> backup FSM and VM, recreate it during restore. Looks like it is possible, 
>> but too much AM-specific.
> 
> Interesting idea - that's worth some more thought.

Core routines to recreate VM and FSM would be cool :) But this needs to be 
done without extra IO, which is not an easy trick.

>> Here's 53 mentions of "parallel backup". I want to note that there may be 
>> parallel read from disk and parallel network transmission. Things between 
>> these two are neglectable and can be single-threaded. From my POV, it's not 
>> about threads, it's about saturated IO controllers.
>> Also I think parallel restore matters more than parallel backup. Backups 
>> themself can be slow, on many clusters we even throttle disk IO. But users 
>> may want parallel backup to catch-up standby.
> 
> I'm not sure I entirely understand your point here -- are you saying
> that parallel backup is important, or that it's not important, or
> something in between?  Do you think it's more or less important than
> incremental backup?
I think that there is no such thing as parallel backup.  Backup creation is a 
composite process made up of many subprocesses.

In my experience, parallel network transmission is cool and very important: it 
makes uploads 3 times faster.  But my experience is limited to cloud storage.  
Would this hold if the storage backend is a local FS?  I have no idea.
Parallel reading from disk has the same effect.  Compression and encryption 
can be single-threaded; I think they will not be the bottleneck (unless one 
uses lzma's neighborhood on the Pareto frontier).

For me, the most important thing is incremental backups (with parallel merging 
of steps) and then parallel backup.
But there is a huge fraction of users who can benefit from parallel backup and 
do not need incremental backup at all.


Best regards, Andrey Borodin.



Re: block-level incremental backup

2019-04-21 Thread Robert Haas
On Sat, Apr 20, 2019 at 4:32 PM Stephen Frost  wrote:
> Having been around for a while working on backup-related things, if I
> was to implement the protocol for pg_basebackup today, I'd definitely
> implement "give me a list" and "give me this file" rather than the
> tar-based approach, because I've learned that people want to be
> able to do parallel backups and that's a decent way to do that.  I
> wouldn't set out and implement something new that there's just no hope
> of making parallel.  Maybe the first write of pg_basebackup would still
> be simple and serial since it's certainly more work to make a frontend
> tool like that work in parallel, but at least the protocol would be
> ready to support a parallel option being added later without being
> rewritten.
>
> And that's really what I was trying to get at here- if we've got the
> choice now to decide what this is going to look like from a protocol
> level, it'd be great if we could make it able to support being used in a
> parallel fashion, even if pg_basebackup is still single-threaded.

I think we're getting closer to a meeting of the minds here, but I
don't think it's intrinsically necessary to rewrite the whole method
of operation of pg_basebackup to implement incremental backup in a
sensible way.  One could instead just do a straightforward extension
to the existing BASE_BACKUP command to enable incremental backup.
Then, to enable parallel full backup and all sorts of out-of-core
hacking, one could expand the command language to allow tools to
access individual steps: START_BACKUP, SEND_FILE_LIST,
SEND_FILE_CONTENTS, STOP_BACKUP, or whatever.  The second thing makes
for an appealing project, but I do not think there is a technical
reason why it has to be done first.  Or for that matter why it has to
be done second.  As I keep saying, incremental backup and full backup
are separate projects and I believe it's completely reasonable for
whoever is doing the work to decide on the order in which they would
like to do the work.
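
(To sketch the difference, with the caveat that none of this syntax exists
or is settled: the first option might look like

    BASE_BACKUP LABEL 'incr1' INCREMENTAL SINCE '0/16B3748';

while the second might decompose the same work into something like

    START_BACKUP;
    SEND_FILE_LIST SINCE '0/16B3748';
    SEND_FILE_CONTENTS 'base/16384/16417' SINCE '0/16B3748';
    STOP_BACKUP;

where a client is free to issue the SEND_FILE_CONTENTS calls over several
connections.)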

Having said that, I'm curious what people other than Stephen (and
other pgbackrest hackers) think about the relative value of parallel
backup vs. incremental backup.  Stephen appears quite convinced that
parallel backup is full of win and incremental backup is a bit of a
yawn by comparison, and while I certainly would not want to discount
the value of his experience in this area, it sometimes happens on this
mailing list that [ drum roll please ] not everybody agrees about
everything.  So, what do other people think?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-04-22 Thread Konstantin Knizhnik




On 22.04.2019 2:02, Robert Haas wrote:

On Sat, Apr 20, 2019 at 4:32 PM Stephen Frost  wrote:

Having been around for a while working on backup-related things, if I
was to implement the protocol for pg_basebackup today, I'd definitely
implement "give me a list" and "give me this file" rather than the
tar-based approach, because I've learned that people want to be
able to do parallel backups and that's a decent way to do that.  I
wouldn't set out and implement something new that there's just no hope
of making parallel.  Maybe the first write of pg_basebackup would still
be simple and serial since it's certainly more work to make a frontend
tool like that work in parallel, but at least the protocol would be
ready to support a parallel option being added later without being
rewritten.

And that's really what I was trying to get at here- if we've got the
choice now to decide what this is going to look like from a protocol
level, it'd be great if we could make it able to support being used in a
parallel fashion, even if pg_basebackup is still single-threaded.

I think we're getting closer to a meeting of the minds here, but I
don't think it's intrinsically necessary to rewrite the whole method
of operation of pg_basebackup to implement incremental backup in a
sensible way.  One could instead just do a straightforward extension
to the existing BASE_BACKUP command to enable incremental backup.
Then, to enable parallel full backup and all sorts of out-of-core
hacking, one could expand the command language to allow tools to
access individual steps: START_BACKUP, SEND_FILE_LIST,
SEND_FILE_CONTENTS, STOP_BACKUP, or whatever.  The second thing makes
for an appealing project, but I do not think there is a technical
reason why it has to be done first.  Or for that matter why it has to
be done second.  As I keep saying, incremental backup and full backup
are separate projects and I believe it's completely reasonable for
whoever is doing the work to decide on the order in which they would
like to do the work.

Having said that, I'm curious what people other than Stephen (and
other pgbackrest hackers) think about the relative value of parallel
backup vs. incremental backup.  Stephen appears quite convinced that
parallel backup is full of win and incremental backup is a bit of a
yawn by comparison, and while I certainly would not want to discount
the value of his experience in this area, it sometimes happens on this
mailing list that [ drum roll please ] not everybody agrees about
everything.  So, what do other people think?



Based on the experience of pg_probackup users I can say that there is no 100% 
winner; depending on the use case, either parallel or incremental backups are 
preferable.
- If the size of the database is not so large and the intensity of updates is 
high enough, then parallel backup within one data center is definitely the 
more efficient solution.
- If the size of the database is very large and the data is rarely updated, or 
the database is mostly append-only, then incremental backup is preferable.
- Some customers need to collect, at a central server, backups of databases 
installed at many nodes with slow and unreliable connections (imagine a DBMS 
installed on locomotives).  Parallelism definitely cannot help here, unlike 
support for incremental backup.
- Parallel backup consumes system resources more aggressively, interfering 
with the normal work of the application, so performing a parallel backup may 
cause significant degradation of application speed.


pg_probackup supports both features, parallel and incremental backups, and it 
is up to the user how to use them in the most efficient way for a particular 
configuration.




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: block-level incremental backup

2019-04-22 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Sat, Apr 20, 2019 at 4:32 PM Stephen Frost  wrote:
> > Having been around for a while working on backup-related things, if I
> > was to implement the protocol for pg_basebackup today, I'd definitely
> > implement "give me a list" and "give me this file" rather than the
> > tar-based approach, because I've learned that people want to be
> > able to do parallel backups and that's a decent way to do that.  I
> > wouldn't set out and implement something new that there's just no hope
> > of making parallel.  Maybe the first write of pg_basebackup would still
> > be simple and serial since it's certainly more work to make a frontend
> > tool like that work in parallel, but at least the protocol would be
> > ready to support a parallel option being added later without being
> > rewritten.
> >
> > And that's really what I was trying to get at here- if we've got the
> > choice now to decide what this is going to look like from a protocol
> > level, it'd be great if we could make it able to support being used in a
> > parallel fashion, even if pg_basebackup is still single-threaded.
> 
> I think we're getting closer to a meeting of the minds here, but I
> don't think it's intrinsically necessary to rewrite the whole method
> of operation of pg_basebackup to implement incremental backup in a
> sensible way.  

It wasn't my intent to imply that the whole method of operation of
pg_basebackup would have to change for this.

> One could instead just do a straightforward extension
> to the existing BASE_BACKUP command to enable incremental backup.

Ok, how do you envision that?  As I mentioned up-thread, I am concerned
that we're talking too high-level here and it's making the discussion
more difficult than it would be if we were to put together specific
ideas and then discuss them.

One way I can imagine to extend BASE_BACKUP is by adding LSN as an
optional parameter and then having the database server scan the entire
cluster and send a tarball which contains essentially a 'diff' file of
some kind for each file where we can construct a diff based on the LSN,
and then the complete contents of the file for everything else that
needs to be in the backup.

So, sure, that would work, but it wouldn't be able to be parallelized
and I don't think it'd end up being very exciting for the external tools
because of that, but it would be fine for pg_basebackup.

On the other hand, if you added new commands for 'list of files changed
since this LSN' and 'give me this file' and 'give me this file with the
changes in it since this LSN', then pg_basebackup could work with that
pretty easily in a single-threaded model (maybe with two connections to
the backend, but still in a single process, or maybe just by slurping up
the file list and then asking for each one) and the external tools could
leverage those new capabilities too for their backups, both full backups
and incremental ones.  This also wouldn't have to change how
pg_basebackup does full backups today one bit, so what we're really
talking about here is the direction to take the new code that's being
written, not about rewriting existing code.  I agree that it'd be a bit
more work...  but hopefully not *that* much more, and it would mean we
could later add parallel backup to pg_basebackup more easily too, if we
wanted to.

> Then, to enable parallel full backup and all sorts of out-of-core
> hacking, one could expand the command language to allow tools to
> access individual steps: START_BACKUP, SEND_FILE_LIST,
> SEND_FILE_CONTENTS, STOP_BACKUP, or whatever.  The second thing makes
> for an appealing project, but I do not think there is a technical
> reason why it has to be done first.  Or for that matter why it has to
> be done second.  As I keep saying, incremental backup and full backup
> are separate projects and I believe it's completely reasonable for
> whoever is doing the work to decide on the order in which they would
> like to do the work.

I didn't mean to imply that one had to be done before the other from a
technical standpoint.  I agree that they don't depend on each other.

You're certainly welcome to do what you would like, I simply wanted to
share my experiences and try to help move this in a direction that would
involve less code rewrite in the future and to have a feature that would
be more appealing to the external tools.

> Having said that, I'm curious what people other than Stephen (and
> other pgbackrest hackers) 

While David and I do talk, we haven't really discussed this proposal all
that much, so please don't assume that he shares my thoughts here.  I'd
also like to hear what others think, particularly those who have been
working in this area.

> think about the relative value of parallel
> backup vs. incremental backup.  Stephen appears quite convinced that
> parallel backup is full of win and incremental backup is a bit of a
> yawn by comparison, and while I certainly would not want to di

Re: block-level incremental backup

2019-04-22 Thread Andres Freund
Hi,

On 2019-04-19 20:04:41 -0400, Stephen Frost wrote:
> I agree that we don't want another implementation and that there's a lot
> that we want to do to improve replay performance.  We've already got
> frontend tools which work with multiple execution threads, so I'm not
> sure I get the "not easily feasible" bit, and the argument about the
> checkpointer seems largely related to that (as in- if we didn't have
> multiple threads/processes then things would perform quite badly...  but
> we can and do have multiple threads/processes in frontend tools today,
> even in pg_basebackup).

You need not just multiple execution threads, but basically a new
implementation of shared buffers, locking, process monitoring, with most
of the related infrastructure. You're literally talking about
reimplementing a very substantial portion of the backend.  I'm not sure
I can transport in written words - via a public medium - how bad an idea
it would be to go there.


> You certainly bring up some good concerns though and they make me think
> of other bits that would seem like they'd possibly be larger issues for
> > a frontend tool- like having a large pool of memory for caching (aka
> shared buffers) the changes.  If what we're talking about here is *just*
> replay though, without having the system available for reads, I wonder
> if we might want a different solution there.

No.


> > Which I think is entirely reasonable. With the 'consistent' and LSN
> > recovery targets one already can get most of what's needed from such a
> > tool, anyway.  I'd argue the biggest issue there is that there's no
> > equivalent to starting postgres with a private socket directory on
> > windows, and perhaps an option or two making it easier to start postgres
> > in a "private" mode for things like this.
> 
> This would mean building in a way to do parallel WAL replay into the
> server binary though, as discussed above, and it seems like making that
> work in a way that allows us to still be available as a read-only
> standby would be quite a bit more difficult.  We could possibly support
> parallel WAL replay only when we aren't a replica but from the same
> binary.

I'm doubtful that we should try to implement parallel WAL apply that
can't support HS - a substantial portion of the logic to avoid
issues around relfilenode reuse, consistency, etc. is going to be
necessary for non-HS-aware apply anyway.  But if somebody had a concrete
proposal for something that's fundamentally only doable without HS, I
could be convinced.


> The concerns mentioned about making it easier to start PG in a
> private mode don't seem too bad but I am not entirely sure that the
> tools which want to leverage that kind of capability would want to have
> to exec out to the PG binary to use it.

Tough luck.  But even leaving infeasibility aside, it seems like a quite
bad idea to do this in-process inside a tool that manages backup &
recovery. Creating threads / sub-processes with complicated needs (like
any pared down version of pg to do just recovery would have) from within
a library has substantial complications. So you'd not want to do this
in-process anyway.


> A lot of this part of the discussion feels like a tangent though, unless
> I'm missing something.

I'm replying to:

On 2019-04-17 18:43:10 -0400, Stephen Frost wrote:
> Wow.  I have to admit that I feel completely opposite of that- I'd
> *love* to have an independent tool (which ideally uses the same code
> through the common library, or similar) that can be run to apply WAL.

And I'm basically saying that anything that starts from this premise is
fatally flawed (in the ex falso quodlibet kind of sense ;)).


> The "WAL compression" tool contemplated
> previously would be much simpler and not the full-blown WAL replay
> capability, which would be left to the server, unless you're suggesting
> that even that should be exclusively the purview of the backend?  Though
> that ship's already sailed, given that external projects have
> implemented it.

I'm extremely doubtful of such tools (but it's not what I was responding
to, see above). I'd be extremely surprised if even one of them came
close to being correct. The old FPI removal tool had data corrupting
bugs left and right.


> Having a library to provide that which external
> projects could leverage would be nicer than having everyone write their
> own version.

No, I don't think that's necessarily true.  Something complicated that's
hard to get right doesn't have to be provided by core, even if other
projects decide that their risk/reward assessment is different than core
postgres'.  We don't have to take on all kinds of work and complexity for
external tools.

Greetings,

Andres Freund




Re: block-level incremental backup

2019-04-22 Thread Robert Haas
On Mon, Apr 22, 2019 at 1:08 PM Stephen Frost  wrote:
> > I think we're getting closer to a meeting of the minds here, but I
> > don't think it's intrinsically necessary to rewrite the whole method
> > of operation of pg_basebackup to implement incremental backup in a
> > sensible way.
>
> It wasn't my intent to imply that the whole method of operation of
> pg_basebackup would have to change for this.

Cool.

> > One could instead just do a straightforward extension
> > to the existing BASE_BACKUP command to enable incremental backup.
>
> Ok, how do you envision that?  As I mentioned up-thread, I am concerned
> that we're talking too high-level here and it's making the discussion
> more difficult than it would be if we were to put together specific
> ideas and then discuss them.
>
> One way I can imagine to extend BASE_BACKUP is by adding LSN as an
> optional parameter and then having the database server scan the entire
> cluster and send a tarball which contains essentially a 'diff' file of
> some kind for each file where we can construct a diff based on the LSN,
> and then the complete contents of the file for everything else that
> needs to be in the backup.

/me scratches head.  Isn't that pretty much what I described in my
original post?  I even described what that "'diff' file of some kind"
would look like in some detail in the paragraph of that email
numbered "2.", and I described the reasons for that choice at length
in 
http://postgr.es/m/ca+tgmozrqdv-tb8ny9p+1pqlqkxp5f1afghuohh5qt6ewdk...@mail.gmail.com

I can't figure out how I'm managing to be so unclear about things
about which I thought I'd been rather explicit.

> So, sure, that would work, but it wouldn't be able to be parallelized
> and I don't think it'd end up being very exciting for the external tools
> because of that, but it would be fine for pg_basebackup.

Stop being such a pessimist.  Yes, if we only add the option to the
BASE_BACKUP command, it won't directly be very exciting for external
tools, but a lot of the work that is needed to do things that ARE
exciting for external tools will have been done.  For instance, if the
work to figure out which blocks have been modified via WAL-scanning
gets done, and initially that's only exposed via BASE_BACKUP, it won't
be much work for somebody to write code for a new command that exposes
that information directly through some new replication command.
There's a difference between something that's going in the wrong
direction and something that's going in the right direction but not as
far or as fast as you'd like.  And I'm 99% sure that everything I'm
proposing here falls in the latter category rather than the former.

> On the other hand, if you added new commands for 'list of files changed
> since this LSN' and 'give me this file' and 'give me this file with the
> changes in it since this LSN', then pg_basebackup could work with that
> pretty easily in a single-threaded model (maybe with two connections to
> the backend, but still in a single process, or maybe just by slurping up
> the file list and then asking for each one) and the external tools could
> leverage those new capabilities too for their backups, both full backups
> and incremental ones.  This also wouldn't have to change how
> pg_basebackup does full backups today one bit, so what we're really
> talking about here is the direction to take the new code that's being
> written, not about rewriting existing code.  I agree that it'd be a bit
> more work...  but hopefully not *that* much more, and it would mean we
> could later add parallel backup to pg_basebackup more easily too, if we
> wanted to.

For purposes of implementing parallel pg_basebackup, it would probably
be better if the server rather than the client decided which files to
send via which connection.  If the client decides, then every time the
server finishes sending a file, the client has to request another
file, and that introduces some latency: after the server finishes
sending each file, it has to wait for the client to finish receiving
the data, and it has to wait for the client to tell it what file to
send next.  If the server decides, then it can just send data at top
speed without a break.  So the ideal interface for pg_basebackup would
really be something like:

START_PARALLEL_BACKUP blah blah PARTICIPANTS 4;

...returning a cookie that can be then be used by each participant for
an argument to a new commands:

JOIN_PARALLEL_BACKUP 'cookie';

However, that is obviously extremely inconvenient for third-party
tools.  It's possible we need both an interface like this -- for use
by parallel pg_basebackup -- and a
START_BACKUP/SEND_FILE_LIST/SEND_FILE_CONTENTS/STOP_BACKUP type
interface for use by external tools.  On the other hand, maybe the
additional overhead caused by managing the list of files to be fetched
on the client side is negligible.  It'd be interesting to see, though,
how busy the server is when running an incremental backup managed by
an external 

Re: block-level incremental backup

2019-04-22 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2019-04-19 20:04:41 -0400, Stephen Frost wrote:
> > I agree that we don't want another implementation and that there's a lot
> > that we want to do to improve replay performance.  We've already got
> > frontend tools which work with multiple execution threads, so I'm not
> > sure I get the "not easily feasible" bit, and the argument about the
> > checkpointer seems largely related to that (as in- if we didn't have
> > multiple threads/processes then things would perform quite badly...  but
> > we can and do have multiple threads/processes in frontend tools today,
> > even in pg_basebackup).
> 
> You need not just multiple execution threads, but basically a new
> implementation of shared buffers, locking, process monitoring, with most
> of the related infrastructure. You're literally talking about
> reimplementing a very substantial portion of the backend.  I'm not sure
> I can transport in written words - via a public medium - how bad an idea
> it would be to go there.

Yes, there'd be some need for locking and process monitoring, though if
we aren't supporting ongoing read queries at the same time, there's a
whole bunch of things that we don't need from the existing backend.

> > > Which I think is entirely reasonable. With the 'consistent' and LSN
> > > recovery targets one already can get most of what's needed from such a
> > > tool, anyway.  I'd argue the biggest issue there is that there's no
> > > equivalent to starting postgres with a private socket directory on
> > > windows, and perhaps an option or two making it easier to start postgres
> > > in a "private" mode for things like this.
> > 
> > This would mean building in a way to do parallel WAL replay into the
> > server binary though, as discussed above, and it seems like making that
> > work in a way that allows us to still be available as a read-only
> > standby would be quite a bit more difficult.  We could possibly support
> > parallel WAL replay only when we aren't a replica but from the same
> > binary.
> 
> I'm doubtful that we should try to implement parallel WAL apply that
> can't support HS - a substantial portion of the logic to avoid
> issues around relfilenode reuse, consistency, etc. is going to be
> necessary for non-HS-aware apply anyway.  But if somebody had a concrete
> proposal for something that's fundamentally only doable without HS, I
> could be convinced.

I'd certainly prefer that we support parallel WAL replay *with* HS, that
just seems like a much larger problem, but I'd be quite happy to be told
that it wouldn't be that much harder.

> > A lot of this part of the discussion feels like a tangent though, unless
> > I'm missing something.
> 
> I'm replying to:
> 
> On 2019-04-17 18:43:10 -0400, Stephen Frost wrote:
> > Wow.  I have to admit that I feel completely opposite of that- I'd
> > *love* to have an independent tool (which ideally uses the same code
> > through the common library, or similar) that can be run to apply WAL.
> 
> And I'm basically saying that anything that starts from this premise is
> fatally flawed (in the ex falso quodlibet kind of sense ;)).

I'd just say that it'd be... difficult. :)

> > The "WAL compression" tool contemplated
> > previously would be much simpler and not the full-blown WAL replay
> > capability, which would be left to the server, unless you're suggesting
> > that even that should be exclusively the purview of the backend?  Though
> > that ship's already sailed, given that external projects have
> > implemented it.
> 
> I'm extremely doubtful of such tools (but it's not what I was responding
> to, see above). I'd be extremely surprised if even one of them came
> close to being correct. The old FPI removal tool had data corrupting
> bugs left and right.

I have concerns about it myself, which is why I'd actually really like
to see something in core that does it, and does it the right way, that
other projects could then leverage (ideally by just linking into the
library without having to rewrite what's in core, though that might not
be an option for things like WAL-G that are in Go and possibly don't
want to link in some C library).

> > Having a library to provide that which external
> > projects could leverage would be nicer than having everyone write their
> > own version.
> 
> No, I don't think that's necessarily true. Something complicated that's
> hard to get right doesn't have to be provided by core. Even if other
> projects decide that their risk/reward assessment is different than core
> postgres'. We don't have to take on all kinds of work and complexity for
> external tools.

No, it doesn't have to be provided by core, but I sure would like it to
be and I'd be much more comfortable if it was because then we'd also
take care to not break whatever assumptions are made (or to do so in a
way that can be detected and/or handled) as new code is written.  As
discussed above, as long as it isn't provided by core

Re: block-level incremental backup

2019-04-22 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Mon, Apr 22, 2019 at 1:08 PM Stephen Frost  wrote:
> > > One could instead just do a straightforward extension
> > > to the existing BASE_BACKUP command to enable incremental backup.
> >
> > Ok, how do you envision that?  As I mentioned up-thread, I am concerned
> > that we're talking too high-level here and it's making the discussion
> > more difficult than it would be if we were to put together specific
> > ideas and then discuss them.
> >
> > One way I can imagine to extend BASE_BACKUP is by adding LSN as an
> > optional parameter and then having the database server scan the entire
> > cluster and send a tarball which contains essentially a 'diff' file of
> > some kind for each file where we can construct a diff based on the LSN,
> > and then the complete contents of the file for everything else that
> > needs to be in the backup.
> 
> /me scratches head.  Isn't that pretty much what I described in my
> original post?  I even described what that "'diff' file of some kind"
> would look like in some detail in the paragraph of that emailed
> numbered "2.", and I described the reasons for that choice at length
> in 
> http://postgr.es/m/ca+tgmozrqdv-tb8ny9p+1pqlqkxp5f1afghuohh5qt6ewdk...@mail.gmail.com
> 
> I can't figure out how I'm managing to be so unclear about things
> about which I thought I'd been rather explicit.

There was basically zero discussion about what things would look like at
a protocol level (I went back and skimmed over the thread before sending
my last email to specifically see if I was going to get this response
back..).  I get the idea behind the diff file, the contents of which I
wasn't getting into above.

> > So, sure, that would work, but it wouldn't be able to be parallelized
> > and I don't think it'd end up being very exciting for the external tools
> > because of that, but it would be fine for pg_basebackup.
> 
> Stop being such a pessimist.  Yes, if we only add the option to the
> BASE_BACKUP command, it won't directly be very exciting for external
> tools, but a lot of the work that is needed to do things that ARE
> exciting for external tools will have been done.  For instance, if the
> work to figure out which blocks have been modified via WAL-scanning
> gets done, and initially that's only exposed via BASE_BACKUP, it won't
> be much work for somebody to write code that exposes
> that information directly through some new replication command.
> There's a difference between something that's going in the wrong
> direction and something that's going in the right direction but not as
> far or as fast as you'd like.  And I'm 99% sure that everything I'm
> proposing here falls in the latter category rather than the former.

I didn't mean to imply that you're going in the wrong direction here and
I thought I said somewhere in my last email more-or-less exactly the
same, that a great deal of the work needed for block-level incremental
backup would be done, but specifically that this proposal wouldn't allow
external tools to leverage that.  It sounds like what you're suggesting
now is that you're happy to implement the backend code, expose it in a
way that works just for pg_basebackup, and that if someone else wants to
add things to the protocol to make it easier for external tools to
leverage, great.  All I can say is that that's basically how we ended up
in the situation we're in today where pg_basebackup doesn't support
parallel backup but a bunch of external tools do and they don't go
through the backend to get there, even though they'd probably prefer to.

> > On the other hand, if you added new commands for 'list of files changed
> > since this LSN' and 'give me this file' and 'give me this file with the
> > changes in it since this LSN', then pg_basebackup could work with that
> > pretty easily in a single-threaded model (maybe with two connections to
> > the backend, but still in a single process, or maybe just by slurping up
> > the file list and then asking for each one) and the external tools could
> > leverage those new capabilities too for their backups, both full backups
> > and incremental ones.  This also wouldn't have to change how
> > pg_basebackup does full backups today one bit, so what we're really
> > talking about here is the direction to take the new code that's being
> > written, not about rewriting existing code.  I agree that it'd be a bit
> > more work...  but hopefully not *that* much more, and it would mean we
> > could later add parallel backup to pg_basebackup more easily too, if we
> > wanted to.
> 
> For purposes of implementing parallel pg_basebackup, it would probably
> be better if the server rather than the client decided which files to
> send via which connection.  If the client decides, then every time the
> server finishes sending a file, the client has to request another
> file, and that introduces some latency: after the server finishes
> sending each file, it has

Re: block-level incremental backup

2019-04-22 Thread Andres Freund
Hi,

On 2019-04-22 14:26:40 -0400, Stephen Frost wrote:
> I'm disappointed that the concerns about the trouble that end users are
> likely to have with this didn't garner more discussion.

My impression is that end users are having a lot more trouble due to
important backup/restore features not being in core/pg_basebackup than
due to external tools having a harder time implementing certain
features. Focusing on external tools being able to provide all those
features, because core hasn't yet, is imo entirely the wrong thing to
concentrate upon.  And it's not like things largely haven't been
implemented in pg_basebackup for fundamental architectural reasons.
It's because we've built like 5 different external tools with randomly
differing feature sets and licenses.

Greetings,

Andres Freund




Re: block-level incremental backup

2019-04-22 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2019-04-22 14:26:40 -0400, Stephen Frost wrote:
> > I'm disappointed that the concerns about the trouble that end users are
> > likely to have with this didn't garner more discussion.
> 
> My impression is that endusers are having a lot more trouble due to
> important backup/restore features not being in core/pg_basebackup, than
> due to external tools having a harder time to implement certain
> features.

I had been referring specifically to the concern I raised about
incremental block-level backups being added to pg_basebackup and how
that'll make using pg_basebackup more complicated and therefore more
difficult for end-users to get right, particularly if the end user is
having to handle management of the association between the full backup
and the incremental backups.  I wasn't referring to anything regarding
external tools.

> Focusing on external tools being able to provide all those
> features, because core hasn't yet, is imo entirely the wrong thing to
> concentrate upon.  And it's not like things largely haven't been
> implemented in pg_basebackup for fundamental architectural reasons.
> It's because we've built like 5 different external tools with randomly
> differing featureset and licenses.

There's a few challenges when it comes to adding backup features to
core.  One of the reasons is that core naturally moves slower when it
comes to development than external projects do, as was discussed
earlier on this thread.  Another is that, when it comes to backup,
specifically, people want to back up their *existing* systems, which
means that they need a backup tool that's going to work with whatever
version of PG they've currently got deployed and that's often a few
years old already.  Certainly when I've thought about features that we'd
like to see and considered if there's something that could be
implemented in core vs. implemented outside of core, the answer often
ends up being "well, if we do it ourselves then we can make it work for
PG 9.2 and above, and have it working for existing users, but if we work
it in as part of core, it won't be available until next year and only
for version 12 and above, and users can only use it once they've
upgraded.."

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: block-level incremental backup

2019-04-22 Thread Robert Haas
On Mon, Apr 22, 2019 at 2:26 PM Stephen Frost  wrote:
> There was basically zero discussion about what things would look like at
> a protocol level (I went back and skimmed over the thread before sending
> my last email to specifically see if I was going to get this response
> back..).  I get the idea behind the diff file, the contents of which I
> wasn't getting into above.

Well, I wrote:

"There should be a way to tell pg_basebackup to request from the
server only those blocks where LSN >= threshold_value."

I guess I assumed that people interested in the details would take
that to mean "and therefore the protocol would grow an option for this
type of request in whatever way is the most straightforward possible
extension of the current functionality," which is indeed how you
eventually interpreted it when you said we could "extend BASE_BACKUP
is by adding LSN as an optional parameter."

I could have been more explicit, but sometimes people tell me that my
emails are too long.

> external tools to leverage that.  It sounds like what you're suggesting
> now is that you're happy to implement the backend code, expose it in a
> way that works just for pg_basebackup, and that if someone else wants to
> add things to the protocol to make it easier for external tools to
> leverage, great.

Yep, that's more or less it, although I am potentially willing to do
some modest amount of that other work along the way.  I just don't
want to prioritize it higher than getting the actual thing I want to
build built, which I think is a pretty fair position for me to take.

> All I can say is that that's basically how we ended up
> in the situation we're in today where pg_basebackup doesn't support
> parallel backup but a bunch of external tools do and they don't go
> through the backend to get there, even though they'd probably prefer to.

I certainly agree that core should try to do things in a way that is
useful to external tools when that can be done without undue effort,
but only if it can actually be done without undue effort.  Let's see
whether that's the case here:

- Anastasia wants a command added that dumps out whatever the server
knows about what files have changed, which I already agreed was a
reasonable extension of my initial proposal.

- You said that for this to be useful to pgbackrest, it'd have to use
a whole different mechanism that includes commands to request
individual files and blocks within those files, which would be a
significant rewrite of pg_basebackup that you agreed is more closely
related to parallel backup than to the project under discussion on
this thread.  And that even then pgbackrest probably wouldn't use it
because it also does server-side compression and encryption which are
not included in this proposal.

It seems to me that the first one falls into the category of a reasonable
additional effort and the second one falls into the category of lots
of extra and unrelated work that wouldn't even get used.

> Thanks for sharing your thoughts on that, certainly having the backend
> able to be more intelligent about streaming files to avoid latency is
> good and possibly the best approach.  Another alternative to reducing
> the latency would be to have a way for the client to request a set of
> files, but I don't know that it'd be better.

I don't know either.  This is an area that needs more thought, I
think, although as discussed, it's more related to parallel backup
than $SUBJECT.

> I'm not really sure why the above is extremely inconvenient for
> third-party tools, beyond just that they've already been written to work
> with an assumption that the server-side of things isn't as intelligent
> as PG is.

Well, one thing you might want to do is have a tool that connects to
the server, enters backup mode, requests information on what blocks
have changed, copies those blocks via direct filesystem access, and
then exits backup mode.  Such a tool would really benefit from a
START_BACKUP / SEND_FILE_LIST / SEND_FILE_CONTENTS / STOP_BACKUP
command language, because it would just skip ever issuing the
SEND_FILE_CONTENTS command in favor of doing that part of the work via
other means.  On the other hand, a START_PARALLEL_BACKUP LSN '1/234'
command is useless to such a tool.

Contrariwise, a tool that has its own magic - perhaps based on
WAL-scanning or something like ptrack - to know which files currently
exist and which blocks are modified could use SEND_FILE_CONTENTS but
not SEND_FILE_LIST.  And a filesystem-snapshot based technique might
use START_BACKUP and STOP_BACKUP but nothing else.

In short, providing granular commands like this lets the client be
really intelligent even if the server isn't, and lets the client have
fine-grained control of the process.  This is very good if you're an
out-of-core tool maintainer and your tool is trying to be smarter than
- or even just differently-designed than - core.
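
As a purely illustrative sketch of how such a mix-and-match client might look,
here is a minimal libpq program that issues the proposed - and at this point
entirely hypothetical - START_BACKUP / SEND_FILE_LIST / STOP_BACKUP commands
over a replication connection and never issues SEND_FILE_CONTENTS, on the
assumption that the file list would come back as an ordinary result set with
the path in the first column:

    /*
     * Hypothetical client for the command language sketched above.  None of
     * the START_BACKUP / SEND_FILE_LIST / STOP_BACKUP commands exist today;
     * only the libpq calls are real.
     */
    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        /* Physical replication connection; host/user options omitted here. */
        PGconn     *conn = PQconnectdb("replication=true");
        PGresult   *res;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return 1;
        }

        /* Enter backup mode (hypothetical command). */
        res = PQexec(conn, "START_BACKUP");
        PQclear(res);

        /* Ask which files changed since a known LSN (hypothetical command). */
        res = PQexec(conn, "SEND_FILE_LIST LSN '0/1234ABCD'");
        if (PQresultStatus(res) == PGRES_TUPLES_OK)
        {
            for (int i = 0; i < PQntuples(res); i++)
            {
                /*
                 * A tool with direct filesystem access would copy the listed
                 * blocks itself here, never issuing SEND_FILE_CONTENTS.
                 */
                printf("would copy modified blocks of %s\n",
                       PQgetvalue(res, i, 0));
            }
        }
        PQclear(res);

        /* Leave backup mode (hypothetical command). */
        res = PQexec(conn, "STOP_BACKUP");
        PQclear(res);

        PQfinish(conn);
        return 0;
    }

Under this scheme, a snapshot-based tool would issue only START_BACKUP and
STOP_BACKUP, and a WAL-scanning or ptrack-based tool only SEND_FILE_CONTENTS,
exactly as described above.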

But if what you really want is just a maximally-efficient parallel
backup, you don't nee

Re: block-level incremental backup

2019-04-23 Thread Anastasia Lubennikova

22.04.2019 2:02, Robert Haas wrote:

I think we're getting closer to a meeting of the minds here, but I
don't think it's intrinsically necessary to rewrite the whole method
of operation of pg_basebackup to implement incremental backup in a
sensible way.  One could instead just do a straightforward extension
to the existing BASE_BACKUP command to enable incremental backup.
Then, to enable parallel full backup and all sorts of out-of-core
hacking, one could expand the command language to allow tools to
access individual steps: START_BACKUP, SEND_FILE_LIST,
SEND_FILE_CONTENTS, STOP_BACKUP, or whatever.  The second thing makes
for an appealing project, but I do not think there is a technical
reason why it has to be done first.  Or for that matter why it has to
be done second.  As I keep saying, incremental backup and full backup
are separate projects and I believe it's completely reasonable for
whoever is doing the work to decide on the order in which they would
like to do the work.

Having said that, I'm curious what people other than Stephen (and
other pgbackrest hackers) think about the relative value of parallel
backup vs. incremental backup.  Stephen appears quite convinced that
parallel backup is full of win and incremental backup is a bit of a
yawn by comparison, and while I certainly would not want to discount
the value of his experience in this area, it sometimes happens on this
mailing list that [ drum roll please ] not everybody agrees about
everything.  So, what do other people think?


Personally, I believe that incremental backups are more useful to implement
first since they benefit both backup speed and the space taken by a backup.
Frankly speaking, I'm a bit surprised that the discussion of parallel backups
took up so much of this thread.
Of course, we must keep it in mind while designing the API, to avoid introducing
any architectural obstacles, but any further discussion of parallelism is a
subject for another topic.


I understand Stephen's concerns about the difficulties of incremental backup
management.
Even with the assumption that the user is ready to manage backup chains, retention,
and other stuff, we must consider the format of backup metadata that will allow
us to perform some primitive commands (a rough sketch of such metadata follows
the list):

1) Tell whether this backup is full or incremental.

2) Tell what backup is a parent of this incremental backup.
Probably, we can limit it to just returning "start_lsn", which later can be
compared to "stop_lsn" of parent backup.

3) Take an incremental backup based on this backup.
Here we must help a backup manager to retrieve the LSN to pass it to
pg_basebackup.

4) Restore an incremental backup into a directory (on top of an already restored
full backup).
One may use it to perform "merge" or "restore" of the incremental backup,
depending on the destination directory.
I wonder if it is possible to integrate it into any existing tool, or whether we
end up with something like pg_basebackup/pg_baserestore, as in the case of
pg_dump/pg_restore.
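
As the rough sketch promised above - purely an assumption about what such
metadata might contain, with names invented here and existing nowhere yet -
the per-backup information behind points 1-3 could be as small as:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's XLogRecPtr */

    /* Hypothetical per-backup metadata covering points 1-3 above. */
    typedef struct IncrBackupMetadata
    {
        bool        is_incremental; /* 1) full or incremental? */
        XLogRecPtr  start_lsn;      /* 2) compare with the parent's stop_lsn */
        XLogRecPtr  stop_lsn;       /* 3) threshold LSN to pass to
                                     *    pg_basebackup for the next
                                     *    incremental backup */
    } IncrBackupMetadata;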

Have you designed these? I can only recall "pg_combinebackup" from the very
first message in this thread, which looks more like a sketch to explain the
idea, rather than a thought-out feature design. I also found a page
https://wiki.postgresql.org/wiki/Incremental_backup that raises the same
questions.
I'm volunteering to write a draft patch or, more likely, a set of patches, which
will allow us to discuss the subject in more detail.
And to do that I wish we could agree on the API and data format (at least broadly).

Looking forward to hearing your thoughts.


As I see it, ideally the backup management tools should concentrate more on
managing multiple backups, while all the logic of taking a single backup (of any
kind) should be integrated into the core. It means that any out-of-core client
won't have to walk the PGDATA directory and care about all the Postgres-specific
knowledge of data files consisting of blocks with headers and LSNs and so on. It
simply requests data and gets it.
Understandably, it won't be implemented in one take and, more likely, it is not
fully reachable at all.
Still, it will be great to do our best to provide such tools (both existing and
future) with conveniently formatted data and an API to get it.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: block-level incremental backup

2019-04-23 Thread Adam Brusselback
I hope it's alright to throw in my $0.02 as a user. I've been following
this (and the other thread on reading WAL to find modified blocks,
prefaulting, whatever else) since the start with great excitement and would
love to see the built-in backup capabilities in Postgres greatly improved.
I know this is not completely on-topic for just incremental backups, so I
apologize in advance. It just seemed like the most apt place to chime in.


Just to preface where I am coming from, I have been using pgBackRest for
the past couple of years and used wal-e prior to that.  I am not a big *nix
user other than on my servers; I do all my development on Windows and use
primarily Java. The command line is not where I feel most comfortable
despite my best efforts over the last 5-6 years. Prior to Postgres, I used
SQL Server for quite a few years at previous companies but had more of a
junior / intermediate skill set back then. I just wanted to put that out
there so you can see where my biases are.




With all that said, I would not be comfortable using pg_basebackup as my
main backup tool simply because I’d have to cobble together numerous tools
to get backups stored in a safe (not on the same server) location, I’d have
to manage expiring backups and the WAL which is no longer needed, along
with the rest of the stuff that makes these backup management tools useful.


The command line scares me, and even if I were able to get all that working,
I would not feel warm and fuzzy that I hadn't messed something up horribly, and I
might hit an edge case which destroys backups, silently corrupts data, etc.

I love that there are tools that manage all of it: backups, WAL archiving,
remote storage, integration with cloud storage (S3 and the like), retention
management of these backups with all their dependencies for me, and all
the restore options necessary built in as well.


Block level incremental backup would be amazing for my use case. I have
small updates / deletes that happen to data all over some of my largest
tables. With pgBackRest, since the diff/incremental backups are at the file
level, I can have a single update / delete which touches a random spot in a
table and now requires that whole 1GB file to be backed up again. That
said, even if pg_basebackup was the only tool that did incremental block
level backup tomorrow, I still wouldn’t start using it directly. I went
into the issues I’d have to deal with if I used pg_basebackup above, and
incremental backups without a management tool make me think using it
correctly would be much harder.


I know this thread is just about incremental backup, and that pretty much
everything in core is built up from small features into larger more complex
ones. I understand that and am not trying to dump on any efforts, I am
super excited to see work being done in this area! I just wanted to share
my perspective on how crucial good backup management is to me (and I’m sure
a few others may share my sentiment considering how popular all the
external tools are).

I would never put a system in production unless I have some backup
management in place. If core builds a backup management tool which uses
pg_basebackup as building blocks for its solution…awesome! That may be
something I’d use.  If pg_basebackup can be improved so it can be used as
the basis most external backup management tools can build on top of, that’s
also great. All the external tools which practically every Postgres company
has built show that it’s obviously a need for a lot of users. Core will
never solve every single problem for all users, I know that. It would just
be great to see some of the fundamental features of backup management baked
into core in an extensible way.

With that, there could be a recommended way to set up backups
(full/incremental, parallel, compressed), point in time recovery, backup
retention, and perform restores (to a point in time, on a replica server,
etc) with just the tooling within core with a nice and simple user
interface, and great performance.

If those features core supports in the internal tooling are built in an
extensible way (as has been discussed), there could be much less
duplication of work implementing the same base features over and over for
each external tool. Those companies can focus on more value-added features
to their own products that core would never support, or on improving the
tooling/performance/features core provides.


Well, this is way longer and a lot less coherent than I was hoping, so I
apologize for that. Hopefully my stream of thoughts made a little bit of
sense to someone.


-Adam


Re: block-level incremental backup

2019-04-24 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Mon, Apr 22, 2019 at 2:26 PM Stephen Frost  wrote:
> > There was basically zero discussion about what things would look like at
> > a protocol level (I went back and skimmed over the thread before sending
> > my last email to specifically see if I was going to get this response
> > back..).  I get the idea behind the diff file, the contents of which I
> > wasn't getting into above.
> 
> Well, I wrote:
> 
> "There should be a way to tell pg_basebackup to request from the
> server only those blocks where LSN >= threshold_value."
> 
> I guess I assumed that people would interested in the details take
> that to mean "and therefore the protocol would grow an option for this
> type of request in whatever way is the most straightforward possible
> extension of the current functionality is," which is indeed how you
> eventually interpreted it when you said we could "extend BASE_BACKUP
> is by adding LSN as an optional parameter."

Looking at it from where I'm sitting, I brought up two ways that we
could extend the protocol to "request from the server only those blocks
where LSN >= threshold_value" with one being the modification to
BASE_BACKUP and the other being a new set of commands that could be
parallelized.  If I had assumed that you'd be thinking the same way I am
about extending the backup protocol, I wouldn't have said anything now
and then would have complained after you wrote a patch that just
extended the BASE_BACKUP command, at which point I likely would have
been told that it's now been done and that I should have mentioned it
earlier.

> > external tools to leverage that.  It sounds like what you're suggesting
> > now is that you're happy to implement the backend code, expose it in a
> > way that works just for pg_basebackup, and that if someone else wants to
> > add things to the protocol to make it easier for external tools to
> > leverage, great.
> 
> Yep, that's more or less it, although I am potentially willing to do
> some modest amount of that other work along the way.  I just don't
> want to prioritize it higher than getting the actual thing I want to
> build built, which I think is a pretty fair position for me to take.

At least in part then it seems like we're viewing the level of effort
around what I'm talking about quite differently, and I feel like that's
largely because every time I mention parallel anything there's this
assumption that I'm asking you to parallelize pg_basebackup or write a
whole bunch more code to provide a fully optimized server-side parallel
implementation for backups.  That really wasn't what I was going for.  I
was thinking it would be a modest amount of additional work to add
incremental backup via a few new commands, instead of through the
BASE_BACKUP protocol command, that would make parallelization possible.

Now, through this discussion, you've brought up some really good points
about how the initial thoughts I had around how we could add some
relatively simple commands, as part of this work, to make it easier for
someone to later add parallel support to pg_basebackup (either full or
incremental), or for external tools to leverage, might not be the best
solution when it comes to having parallel backup in core, and therefore
wouldn't actually end up being useful towards that end.  That's
certainly a fair point and possibly enough to justify not spending even
the modest time I was thinking it'd need, but I'm not convinced.  Now,
that said, if you are convinced that's the case, and you're doing the
work, then it's certainly your prerogative to go in the direction you're
convinced of.  I don't mean any of this discussion to imply that I'd
object to a commit that extended BASE_BACKUP in the way outlined above,
but I understood the question to be "what do people think of this idea?"
and to that I'm still of the opinion that spending a modest amount of
time to provide a way to parallelize an incremental backup is worth it,
even if it isn't optimal and isn't the direct goal of this effort.

There's a tangent on all of this that's pretty key though, which is the
question around just how the blocks are identified.  If the WAL scanning
is done to figure out the blocks, then that's quite a bit different from
the other idea of "open this relation and scan it, but only give me the
blocks after this LSN".  It's the latter case that I've been mostly
thinking about in this thread, which is part of why I was thinking it'd
be a modest amount of work to have protocol commands that accepted a
file (or perhaps a relation..) to scan and return blocks from instead of
baking this into BASE_BACKUP which by definition just serially scans the
data directory and returns things as it finds them.  For the case where
we have WAL scanning happening and modfiles which are being read and
used to figure out the blocks to send, it seems like it might be more
complicated and therefore potentially quite a bit more work to have a
parallel version of that.

Re: block-level incremental backup

2019-04-24 Thread Robert Haas
On Wed, Apr 24, 2019 at 9:28 AM Stephen Frost  wrote:
> Looking at it from what I'm sitting, I brought up two ways that we
> could extend the protocol to "request from the server only those blocks
> where LSN >= threshold_value" with one being the modification to
> BASE_BACKUP and the other being a new set of commands that could be
> parallelized.  If I had assumed that you'd be thinking the same way I am
> about extending the backup protocol, I wouldn't have said anything now
> and then would have complained after you wrote a patch that just
> extended the BASE_BACKUP command, at which point I likely would have
> been told that it's now been done and that I should have mentioned it
> earlier.

Fair enough.

> At least in part then it seems like we're viewing the level of effort
> around what I'm talking about quite differently, and I feel like that's
> largely because every time I mention parallel anything there's this
> assumption that I'm asking you to parallelize pg_basebackup or write a
> whole bunch more code to provide a fully optimized server-side parallel
> implementation for backups.  That really wasn't what I was going for.  I
> was thinking it would be a modest amount of additional work add
> incremental backup via a few new commands, instead of through the
> BASE_BACKUP protocol command, that would make parallelization possible.

I'm not sure about that.  It doesn't seem crazy difficult, but there
are a few wrinkles.  One is that if the client is requesting files one
at a time, it's got to have a list of all the files that it needs to
request, and that means that it has to ask the server to make a
preparatory pass over the whole PGDATA directory to get a list of all
the files that exist.  That overhead is not otherwise needed.  Another
is that the list of files might be really large, and that means that
the client would either use a lot of memory to hold that great big
list, or need to deal with spilling the list to a spool file
someplace, or else have a server protocol that lets the list be
fetched incrementally in chunks.  A third is that, as you mention
further on, it means that the client has to care a lot more about
exactly how the server is figuring out which blocks have been
modified.  If it just says BASE_BACKUP ..., the server can be
internally reading each block and checking the LSN, or using
WAL-scanning or ptrack or whatever and the client doesn't need to know
or care.  But if the client is asking for a list of modified files or
blocks, then that presumes the information is available, and not too
expensively, without actually reading the files.  Fourth, MAX_RATE
probably won't actually limit to the correct rate overall if the limit
is applied separately to each file.

I'd be afraid that a patch that tried to handle all that as part of
this project would get rejected on the grounds that it was trying to
solve too many unrelated problems.  Also, though not everybody has to
agree on what constitutes a "modest amount of additional work," I
would not describe solving all of those problems as a modest effort,
but rather a pretty substantial one.

> There's a tangent on all of this that's pretty key though, which is the
> question around just how the blocks are identified.  If the WAL scanning
> is done to figure out the blocks, then that's quite a bit different from
> the other idea of "open this relation and scan it, but only give me the
> blocks after this LSN".  It's the latter case that I've been mostly
> thinking about in this thread, which is part of why I was thinking it'd
> be a modest amount of work to have protocol commands that accepted a
> file (or perhaps a relation..) to scan and return blocks from instead of
> baking this into BASE_BACKUP which by definition just serially scans the
> data directory and returns things as it finds them.  For the case where
> we have WAL scanning happening and modfiles which are being read and
> used to figure out the blocks to send, it seems like it might be more
> complicated and therefore potentially quite a bit more work to have a
> parallel version of that.

Yeah.  I don't entirely agree that the first one is simple, as per the
above, but I definitely agree that the second one is more complicated
than the first one.

> > Well, one thing you might want to do is have a tool that connects to
> > the server, enters backup mode, requests information on what blocks
> > have changed, copies those blocks via direct filesystem access, and
> > then exits backup mode.  Such a tool would really benefit from a
> > START_BACKUP / SEND_FILE_LIST / SEND_FILE_CONTENTS / STOP_BACKUP
> > command language, because it would just skip ever issuing the
> > SEND_FILE_CONTENTS command in favor of doing that part of the work via
> > other means.  On the other hand, a START_PARALLEL_BACKUP LSN '1/234'
> > command is useless to such a tool.
>
> That's true, but I hardly ever hear people talking about how wonderful
> it is that pgBackRest uses SSH to grab the data

Re: block-level incremental backup

2019-04-24 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Wed, Apr 24, 2019 at 9:28 AM Stephen Frost  wrote:
> > At least in part then it seems like we're viewing the level of effort
> > around what I'm talking about quite differently, and I feel like that's
> > largely because every time I mention parallel anything there's this
> > assumption that I'm asking you to parallelize pg_basebackup or write a
> > whole bunch more code to provide a fully optimized server-side parallel
> > implementation for backups.  That really wasn't what I was going for.  I
> > was thinking it would be a modest amount of additional work add
> > incremental backup via a few new commands, instead of through the
> > BASE_BACKUP protocol command, that would make parallelization possible.
> 
> I'm not sure about that.  It doesn't seem crazy difficult, but there
> are a few wrinkles.  One is that if the client is requesting files one
> at a time, it's got to have a list of all the files that it needs to
> request, and that means that it has to ask the server to make a
> preparatory pass over the whole PGDATA directory to get a list of all
> the files that exist.  That overhead is not otherwise needed.  Another
> is that the list of files might be really large, and that means that
> the client would either use a lot of memory to hold that great big
> list, or need to deal with spilling the list to a spool file
> someplace, or else have a server protocol that lets the list be
> fetched in incrementally in chunks.

So, I had a thought about that when I was composing the last email and
while I'm still unsure about it, maybe it'd be useful to mention it
here- do we really need a list of every *file*, or could we reduce that
down to a list of relations + forks for the main data directory, and
then always include whatever other directories/files are appropriate?

When it comes to operating in chunks, well, if we're getting a list of
relations instead of files, we do have this thing called cursors..

> A third is that, as you mention
> further on, it means that the client has to care a lot more about
> exactly how the server is figuring out which blocks have been
> modified.  If it just says BASE_BACKUP ..., the server an be
> internally reading each block and checking the LSN, or using
> WAL-scanning or ptrack or whatever and the client doesn't need to know
> or care.  But if the client is asking for a list of modified files or
> blocks, then that presumes the information is available, and not too
> expensively, without actually reading the files.

I would think the client would be able to just ask for the list of
modified files, when it comes to building up the list of files to ask
for, which could potentially be done based on mtime instead of by WAL
scanning or by scanning the files themselves.  Don't get me wrong, I'd
prefer that we work based on the WAL, since I have more confidence in
that, but certainly quite a few of the tools do work off mtime these
days and while it's not perfect, the risk/reward there is pretty
palatable to a lot of people.

> Fourth, MAX_RATE
> probably won't actually limit to the correct rate overall if the limit
> is applied separately to each file.

Sure, I hadn't been thinking about MAX_RATE and that would certainly
complicate things if we're offering to provide MAX_RATE-type
capabilities as part of this new set of commands.

> I'd be afraid that a patch that tried to handle all that as part of
> this project would get rejected on the grounds that it was trying to
> solve too many unrelated problems.  Also, though not everybody has to
> agree on what constitutes a "modest amount of additional work," I
> would not describe solving all of those problems as a modest effort,
> but rather a pretty substantial one.

I suspect some of that's driven by how they get solved and if we decide
we have to solve all of them.  With things like MAX_RATE + incremental
backups, I wonder how that's going to end up working, when you have the
option to apply the limit to the network, or to the disk I/O.  You might
have addressed that elsewhere, I've not looked, and I'm not too
particular about it personally either, but a definition could be "max
rate at which we'll read the file you asked for on this connection" and
that would be pretty straight-forward, I'd think.

> > > Well, one thing you might want to do is have a tool that connects to
> > > the server, enters backup mode, requests information on what blocks
> > > have changed, copies those blocks via direct filesystem access, and
> > > then exits backup mode.  Such a tool would really benefit from a
> > > START_BACKUP / SEND_FILE_LIST / SEND_FILE_CONTENTS / STOP_BACKUP
> > > command language, because it would just skip ever issuing the
> > > SEND_FILE_CONTENTS command in favor of doing that part of the work via
> > > other means.  On the other hand, a START_PARALLEL_BACKUP LSN '1/234'
> > > command is useless to such a tool.
> >
> > That's true, but I hardly ever hear people talking about how wonderful
> > it is that pgBackRest uses SSH to grab the data

Re: block-level incremental backup

2019-04-25 Thread Robert Haas
On Wed, Apr 24, 2019 at 12:57 PM Stephen Frost  wrote:
> So, I had a thought about that when I was composing the last email and
> while I'm still unsure about it, maybe it'd be useful to mention it
> here- do we really need a list of every *file*, or could we reduce that
> down to a list of relations + forks for the main data directory, and
> then always include whatever other directories/files are appropriate?

I'm not quite sure what the difference is here.  I agree that we could
try to compact the list of file names by saying 16384 (24 segments)
instead of 16384, 16384.1, ..., 16384.23, but I doubt that saves
anything meaningful.  I don't see how we can leave anything out
altogether.  If there's a filename called boaty.mcboatface in the
server directory, I think we've got to back it up, and that won't
happen unless the client knows that it is there, and it won't know
unless we include it in a list.

> When it comes to operating in chunks, well, if we're getting a list of
> relations instead of files, we do have this thing called cursors..

Sure... but they don't work for replication commands and I am
definitely not volunteering to change that.

> I would think the client would be able to just ask for the list of
> modified files, when it comes to building up the list of files to ask
> for, which could potentially be done based on mtime instead of by WAL
> scanning or by scanning the files themselves.  Don't get me wrong, I'd
> prefer that we work based on the WAL, since I have more confidence in
> that, but certainly quite a few of the tools do work off mtime these
> days and while it's not perfect, the risk/reward there is pretty
> palatable to a lot of people.

That approach, as with a few others that have been suggested, requires
that the client have access to the previous backup, which makes me
uninterested in implementing it.  I want a version of incremental
backup where the client needs to know the LSN of the previous backup
and nothing else.  That way, if you store your actual backups on a
tape drive in an airless vault at the bottom of the Pacific Ocean, you
can still take incremental backup against them, as long as you
remember to note the LSNs before you ship the backups to the vault.
Woohoo!  It also allows for the wire protocol to be very simple and
the client to be very simple; neither of those things is essential,
but both are nice.

Also, I think using mtimes is just asking to get burned.  Yeah, almost
nobody will, but an LSN-based approach is more granular (block level)
and more reliable (can't be fooled by resetting a clock backward, or
by a filesystem being careless with file metadata), so I think it
makes sense to focus on getting that to work.  It's worth keeping in
mind that there may be somewhat different expectations for an external
tool vs. a core feature.  Stupid as it may sound, I think people using
an external tool are more likely to do things like read the directions, and
those directions can say things like "use a reasonable filesystem and
don't set your clock backward."  When stuff goes into core, people
assume that they should be able to run it on any filesystem on any
hardware where they can get it to work and it should just work.  And
you also get a lot more users, so even if the percentage of people not
reading the directions were to stay constant, the actual number of
such people will go up a lot. So picking what we seem to both agree to
be the most robust way of detecting changes seems like the way to go
from here.

> I suspect some of that's driven by how they get solved and if we decide
> we have to solve all of them.  With things like MAX_RATE + incremental
> backups, I wonder how that's going to end up working, when you have the
> option to apply the limit to the network, or to the disk I/O.  You might
> have addressed that elsewhere, I've not looked, and I'm not too
> particular about it personally either, but a definition could be "max
> rate at which we'll read the file you asked for on this connection" and
> that would be pretty straight-forward, I'd think.

I mean, it's just so people can tell pg_basebackup what rate they want
via a command-line option and have it happen like that.  They don't
care about the rates for individual files.

> > Issue #1: If you manually add files to your backup, remove files from
> > your backup, or change files in your backup, bad things will happen.
> > There is fundamentally nothing we can do to prevent this completely,
> > but it may be possible to make the system more resilient against
> > ham-handed modifications, at least to the extent of detecting them.
> > That's maybe a topic for another thread, but it's an interesting one:
> > Andres and I were brainstorming about it at some point.
>
> I'd certainly be interested in hearing about ways we can improve on
> that.  I'm alright with it being on another thread as it's a broader
> concern than just what we're talking about here.

Might be a good topic to chat about at PGCon.

> > 

Re: block-level incremental backup

2019-07-10 Thread Anastasia Lubennikova

23.04.2019 14:08, Anastasia Lubennikova wrote:
I'm volunteering to write a draft patch or, more likely, a set of patches, which
will allow us to discuss the subject in more detail.
And to do that I wish we could agree on the API and data format (at least broadly).
Looking forward to hearing your thoughts.


Though the previous discussion stalled,
I still hope that we could agree on basic points such as a map file
format and protocol extension, which are necessary to start implementing the feature.

- Proof Of Concept patch -

In the attachments, you can find a prototype of incremental pg_basebackup,
which consists of 2 features:

1) To perform an incremental backup, one should call pg_basebackup with a
new argument:

pg_basebackup -D 'basedir' --prev-backup-start-lsn 'lsn'

where lsn is the start_lsn of the parent backup (it can be found in the
"backup_label" file).

It calls the BASE_BACKUP replication command with a new argument,
PREV_BACKUP_START_LSN 'lsn'.

For data files, only pages with LSN > prev_backup_start_lsn will be included
in the backup.
They are saved into a 'filename.partial' file, and a 'filename.blockmap' file
contains an array of BlockNumbers.
For example, if we backed up blocks 1, 3, and 5, filename.partial will contain
those 3 blocks, and 'filename.blockmap' will contain the array {1,3,5}.

Non-datafiles use the same format as before.

2) To merge an incremental backup into a full backup, call

pg_basebackup -D 'basedir' --incremental-pgdata 'incremental_basedir' --merge-backups

It will move all files from 'incremental_basedir' to 'basedir', handling
'.partial' files correctly.
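
As a minimal sketch of what that handling could look like for a single relation
segment - assuming, and this is only an assumption about the PoC's exact
on-disk layout, that 'filename.blockmap' is a raw array of 4-byte block numbers
and 'filename.partial' holds the corresponding 8kB blocks in the same order -
the merge boils down to overwriting the listed blocks in the restored full copy:

    #define _XOPEN_SOURCE 700
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    static void
    apply_partial(const char *target, const char *partial, const char *blockmap)
    {
        int         tfd = open(target, O_WRONLY);   /* file from the full backup */
        int         pfd = open(partial, O_RDONLY);
        int         mfd = open(blockmap, O_RDONLY);
        uint32_t    blkno;
        char        block[BLCKSZ];

        if (tfd < 0 || pfd < 0 || mfd < 0)
        {
            perror("open");
            exit(1);
        }

        /* For each block number in the map, overwrite that block in the target. */
        while (read(mfd, &blkno, sizeof(blkno)) == (ssize_t) sizeof(blkno))
        {
            if (read(pfd, block, BLCKSZ) != BLCKSZ)
            {
                fprintf(stderr, "truncated .partial file\n");
                exit(1);
            }
            if (pwrite(tfd, block, BLCKSZ, (off_t) blkno * BLCKSZ) != BLCKSZ)
            {
                perror("pwrite");
                exit(1);
            }
        }

        close(mfd);
        close(pfd);
        close(tfd);
    }

    int
    main(int argc, char **argv)
    {
        if (argc != 4)
        {
            fprintf(stderr, "usage: %s target partial blockmap\n", argv[0]);
            return 1;
        }
        apply_partial(argv[1], argv[2], argv[3]);
        return 0;
    }

The real patch presumably also needs to handle files that grew or shrank
between the backups, which this sketch ignores.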



- Questions to discuss -

Please note that it is just a proof-of-concept patch and it can be optimized
in many ways.

Let's concentrate on issues that affect the protocol or data format.

1) Whether we collect block maps using the simple "read everything page by
page" approach, or WAL scanning, or any other page tracking algorithm, we must
choose a map format.

I implemented the simplest one, while there are more ideas:

- We can have a map not per file, but per relation or maybe per tablespace,
which will make the implementation more complex, but probably more optimal.
The only problem I see with the existing implementation is that even if only
a few blocks changed, we still must pad the map to 512 bytes per the tar
format requirements.

- We can save LSNs into the block map.

typedef struct BlockMapItem {
    BlockNumber blkno;
    XLogRecPtr lsn;
} BlockMapItem;

In my implementation, an invalid prev_backup_start_lsn means falling back to a
regular basebackup without any block maps. Alternatively, we can define another
meaning for this value and send a block map for all files.

Backup utilities can use these maps to speed up backup merge or restore.

2) We can implement a BASE_BACKUP SEND_FILELIST replication command,
which will return a list of filenames with file sizes, and block maps if an
lsn was provided.

To avoid changing the format, we can simply send tar headers for each file:
- tarHeader("filename.blockmap") followed by the blockmap, for relation files
if prev_backup_start_lsn is provided;
- tarHeader("filename") without the actual file content, for non-relation
files or for all files in a "FULL" backup.

The caller can parse these messages and use them for any purpose, for example,
to perform a parallel backup.


Thoughts?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 13e0d23..e757bba 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10459,7 +10459,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 			ti->oid = pstrdup(de->d_name);
 			ti->path = pstrdup(buflinkpath.data);
 			ti->rpath = relpath ? pstrdup(relpath) : NULL;
-			ti->size = infotbssize ? sendTablespace(fullpath, true) : -1;
+			ti->size = infotbssize ? sendTablespace(fullpath, true, InvalidXLogRecPtr) : -1;
 
 			if (tablespaces)
 *tablespaces = lappend(*tablespaces, ti);
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index c2978a9..3560da1 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -41,6 +41,7 @@
 #include "utils/ps_status.h"
 #include "utils/relcache.h"
 #include "utils/timestamp.h"
+#include "utils/pg_lsn.h"
 
 
 typedef struct
@@ -52,13 +53,22 @@ typedef struct
 	bool		includewal;
 	uint32		maxrate;
 	bool		sendtblspcmapfile;
+	XLogRecPtr	prev_backup_start_lsn;
 } basebackup_options;
 
 
 static int64 sendDir(const char *path, int basepathlen, bool sizeonly,
-	 List *tablespaces, bool sendtblspclinks);
+	 List *tablespaces, bool sendtblspclinks, XLogRecPtr prev_backup_start_lsn);
 static bool sendFile(const char *readfilename, const char *tarfilename,
-	 struct stat *statbuf, bool missing_ok, Oid dboid);
+	 struct stat *statbuf, bool missing_ok, Oid dboid

Re: block-level incremental backup

2019-07-11 Thread Jeevan Chalke
Hi Anastasia,

On Wed, Jul 10, 2019 at 11:47 PM Anastasia Lubennikova <
a.lubennik...@postgrespro.ru> wrote:

> 23.04.2019 14:08, Anastasia Lubennikova wrote:
> > I'm volunteering to write a draft patch or, more likely, set of
> > patches, which
> > will allow us to discuss the subject in more detail.
> > And to do that I wish we agree on the API and data format (at least
> > broadly).
> > Looking forward to hearing your thoughts.
>
> Though the previous discussion stalled,
> I still hope that we could agree on basic points such as a map file
> format and protocol extension,
> which is necessary to start implementing the feature.
>

It's great that you too came up with a PoC patch. I didn't look at your
changes in much detail, but we at EnterpriseDB are also working on this feature
and have started implementing it.

Attached is the series of patches I have so far... (which need further
optimization and adjustments, though)

Here is the overall design (as proposed by Robert) we are trying to
implement:

1. Extend the BASE_BACKUP command that can be used with replication
connections. Add a new [ LSN 'lsn' ] option.

2. Extend pg_basebackup with a new --lsn=LSN option that causes it to send
the option added to the server in #1.

Here are the implementation details when we have a valid LSN

sendFile() in basebackup.c is the function which mostly does the thing for
us. If the filename looks like a relation file, then we'll need to consider
sending only a partial file. The way to do that is probably:

A. Read the whole file into memory.

B. Check the LSN of each block. Build a bitmap indicating which blocks have
an LSN greater than or equal to the threshold LSN.

C. If more than 90% of the bits in the bitmap are set, send the whole file
just as if this were a full backup. This 90% is a constant now; we might
make it a GUC later.

D. Otherwise, send a file with .partial added to the name. The .partial
file contains an indication of which blocks were changed at the beginning,
followed by the data blocks. It also includes a checksum/CRC.
Currently, a .partial file format looks like:
 - start with a 4-byte magic number
 - then store a 4-byte CRC covering the header
 - then a 4-byte count of the number of blocks included in the file
 - then the block numbers, each as a 4-byte quantity
 - then the data blocks
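
As a standalone illustration of the essential part of steps A-C (reading block
by block rather than slurping the whole file, with error handling and the
actual .partial writing left out), the LSN check can be done by reading pd_lsn
from the first 8 bytes of each page header, which is the same value the
server-side code would get from PageGetLSN():

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BLCKSZ 8192

    static uint64_t
    page_lsn(const unsigned char *page)
    {
        /* pd_lsn = { uint32 xlogid; uint32 xrecoff; } in native byte order */
        uint32_t    hi, lo;

        memcpy(&hi, page, sizeof(hi));
        memcpy(&lo, page + sizeof(hi), sizeof(lo));
        return ((uint64_t) hi << 32) | lo;
    }

    int
    main(int argc, char **argv)
    {
        FILE           *f;
        uint64_t        threshold;
        unsigned char   page[BLCKSZ];
        long            total = 0, changed = 0;

        if (argc != 3)
        {
            fprintf(stderr, "usage: %s relation-segment threshold-lsn\n", argv[0]);
            return 1;
        }
        f = fopen(argv[1], "rb");
        if (f == NULL)
        {
            perror(argv[1]);
            return 1;
        }
        threshold = strtoull(argv[2], NULL, 0);

        while (fread(page, 1, BLCKSZ, f) == BLCKSZ)
        {
            total++;
            if (page_lsn(page) >= threshold)
                changed++;      /* step B: this block goes into the backup */
        }
        fclose(f);

        /* Step C: if more than 90% changed, sending the whole file is cheaper. */
        if (total > 0 && changed * 10 > total * 9)
            printf("send whole file (%ld/%ld blocks changed)\n", changed, total);
        else
            printf("send %ld of %ld blocks as a .partial file\n", changed, total);
        return 0;
    }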


We are also working on combining these incremental backups with the full
backup and, for that, we are planning to add a new utility called
pg_combinebackup. Will post the details on that later once we are on the
same page for taking the backup.

Thanks
-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation


IncrBackup.tar.gz
Description: application/gzip


Re: block-level incremental backup

2019-07-16 Thread Jeevan Chalke
On Thu, Jul 11, 2019 at 5:00 PM Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

> Hi Anastasia,
>
> On Wed, Jul 10, 2019 at 11:47 PM Anastasia Lubennikova <
> a.lubennik...@postgrespro.ru> wrote:
>
>> 23.04.2019 14:08, Anastasia Lubennikova wrote:
>> > I'm volunteering to write a draft patch or, more likely, set of
>> > patches, which
>> > will allow us to discuss the subject in more detail.
>> > And to do that I wish we agree on the API and data format (at least
>> > broadly).
>> > Looking forward to hearing your thoughts.
>>
>> Though the previous discussion stalled,
>> I still hope that we could agree on basic points such as a map file
>> format and protocol extension,
>> which is necessary to start implementing the feature.
>>
>
> It's great that you too come up with the PoC patch. I didn't look at your
> changes in much details but we at EnterpriseDB too working on this feature
> and started implementing it.
>
> Attached series of patches I had so far... (which needed further
> optimization and adjustments though)
>
> Here is the overall design (as proposed by Robert) we are trying to
> implement:
>
> 1. Extend the BASE_BACKUP command that can be used with replication
> connections. Add a new [ LSN 'lsn' ] option.
>
> 2. Extend pg_basebackup with a new --lsn=LSN option that causes it to send
> the option added to the server in #1.
>
> Here are the implementation details when we have a valid LSN
>
> sendFile() in basebackup.c is the function which mostly does the thing for
> us. If the filename looks like a relation file, then we'll need to consider
> sending only a partial file. The way to do that is probably:
>
> A. Read the whole file into memory.
>
> B. Check the LSN of each block. Build a bitmap indicating which blocks
> have an LSN greater than or equal to the threshold LSN.
>
> C. If more than 90% of the bits in the bitmap are set, send the whole file
> just as if this were a full backup. This 90% is a constant now; we might
> make it a GUC later.
>
> D. Otherwise, send a file with .partial added to the name. The .partial
> file contains an indication of which blocks were changed at the beginning,
> followed by the data blocks. It also includes a checksum/CRC.
> Currently, a .partial file format looks like:
>  - start with a 4-byte magic number
>  - then store a 4-byte CRC covering the header
>  - then a 4-byte count of the number of blocks included in the file
>  - then the block numbers, each as a 4-byte quantity
>  - then the data blocks
>
>
> We are also working on combining these incremental back-ups with the full
> backup and for that, we are planning to add a new utility called
> pg_combinebackup. Will post the details on that later once we have on the
> same page for taking backup.
>

For combining a full backup with one or more incremental backups, we are adding
a new utility called pg_combinebackup in src/bin.

Here is the overall design as proposed by Robert.

pg_combinebackup starts from the LAST backup specified and works backward. It
must NOT start with the full backup and work forward. This is important both
for reasons of efficiency and of correctness. For example, if you start by
copying over the full backup and then later apply the incremental backups on
top of it, then you'll copy data and later end up overwriting it or removing
it. Any files that are left over at the end that aren't in the final
incremental backup, even as .partial files, need to be removed, or the result
is wrong. We should aim for a system where every block in the output directory
is written exactly once and nothing ever has to be created and then removed.

To make that work, we should start by examining the final incremental backup.
We should proceed with one file at a time. For each file:

1. If the complete file is present in the incremental backup, then just copy it
to the output directory - and move on to the next file.

2. Otherwise, we have a .partial file. Work backward through the backup chain
until we find a complete version of the file. That might happen when we get
back to the full backup at the start of the chain, but it might also happen
sooner - at which point we do not need to and should not look at earlier
backups for that file. During this phase, we should read only the HEADER of
each .partial file, building a map of which blocks we're ultimately going to
need to read from each backup. We can also compute the offset within each file
where that block is stored at this stage, again using the header information.

3. Now, we can write the output file - reading each block in turn from the
correct backup and writing it to the output file, using the map we
constructed in the previous step. We should probably keep all of the input
files open over steps 2 and 3 and then close them at the end, because
repeatedly closing and opening them is going to be expensive. When that's done,
go on to the next file and start over at step 1.
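
A minimal, self-contained simulation of that ordering (the data structures here
are invented purely for illustration and have nothing to do with any real
on-disk format):

    #include <stdio.h>

    #define NBLOCKS  8
    #define CHAINLEN 3

    typedef struct FakeBackup
    {
        int     is_full;                /* complete copy of the file? */
        int     has_block[NBLOCKS];     /* for incrementals: blocks it carries */
    } FakeBackup;

    int
    main(void)
    {
        /* chain[0] = full backup, chain[2] = newest incremental */
        FakeBackup  chain[CHAINLEN] = {
            { 1, {1, 1, 1, 1, 1, 1, 1, 1} },
            { 0, {0, 1, 0, 0, 1, 0, 0, 0} },
            { 0, {0, 0, 1, 0, 1, 0, 0, 0} },
        };
        int         source[NBLOCKS];
        int         b, i;

        for (b = 0; b < NBLOCKS; b++)
            source[b] = -1;

        /* Work backward from the last backup; never look past a complete copy. */
        for (i = CHAINLEN - 1; i >= 0; i--)
        {
            for (b = 0; b < NBLOCKS; b++)
                if (source[b] == -1 && (chain[i].is_full || chain[i].has_block[b]))
                    source[b] = i;
            if (chain[i].is_full)
                break;
        }

        /* Each output block is written exactly once, from its newest source. */
        for (b = 0; b < NBLOCKS; b++)
            printf("block %d <- backup %d\n", b, source[b]);
        return 0;
    }

Block 4, which appears in both incrementals, is taken only from the newest one,
and the full backup is consulted last, matching the "every block written
exactly once" goal above.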


We have already started working on this design.

-- 
Jeevan Chalke

Re: block-level incremental backup

2019-07-17 Thread Ibrar Ahmed
On Wed, Jul 17, 2019 at 10:22 AM Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

>
>
> On Thu, Jul 11, 2019 at 5:00 PM Jeevan Chalke <
> jeevan.cha...@enterprisedb.com> wrote:
>
>> Hi Anastasia,
>>
>> On Wed, Jul 10, 2019 at 11:47 PM Anastasia Lubennikova <
>> a.lubennik...@postgrespro.ru> wrote:
>>
>>> 23.04.2019 14:08, Anastasia Lubennikova wrote:
>>> > I'm volunteering to write a draft patch or, more likely, set of
>>> > patches, which
>>> > will allow us to discuss the subject in more detail.
>>> > And to do that I wish we agree on the API and data format (at least
>>> > broadly).
>>> > Looking forward to hearing your thoughts.
>>>
>>> Though the previous discussion stalled,
>>> I still hope that we could agree on basic points such as a map file
>>> format and protocol extension,
>>> which is necessary to start implementing the feature.
>>>
>>
>> It's great that you too come up with the PoC patch. I didn't look at your
>> changes in much details but we at EnterpriseDB too working on this feature
>> and started implementing it.
>>
>> Attached series of patches I had so far... (which needed further
>> optimization and adjustments though)
>>
>> Here is the overall design (as proposed by Robert) we are trying to
>> implement:
>>
>> 1. Extend the BASE_BACKUP command that can be used with replication
>> connections. Add a new [ LSN 'lsn' ] option.
>>
>> 2. Extend pg_basebackup with a new --lsn=LSN option that causes it to
>> send the option added to the server in #1.
>>
>> Here are the implementation details when we have a valid LSN
>>
>> sendFile() in basebackup.c is the function which mostly does the thing
>> for us. If the filename looks like a relation file, then we'll need to
>> consider sending only a partial file. The way to do that is probably:
>>
>> A. Read the whole file into memory.
>>
>> B. Check the LSN of each block. Build a bitmap indicating which blocks
>> have an LSN greater than or equal to the threshold LSN.
>>
>> C. If more than 90% of the bits in the bitmap are set, send the whole
>> file just as if this were a full backup. This 90% is a constant now; we
>> might make it a GUC later.
>>
>> D. Otherwise, send a file with .partial added to the name. The .partial
>> file contains an indication of which blocks were changed at the beginning,
>> followed by the data blocks. It also includes a checksum/CRC.
>> Currently, a .partial file format looks like:
>>  - start with a 4-byte magic number
>>  - then store a 4-byte CRC covering the header
>>  - then a 4-byte count of the number of blocks included in the file
>>  - then the block numbers, each as a 4-byte quantity
>>  - then the data blocks
>>
>>
>> We are also working on combining these incremental back-ups with the full
>> backup and for that, we are planning to add a new utility called
>> pg_combinebackup. Will post the details on that later once we have on the
>> same page for taking backup.
>>
>
> For combining a full backup with one or more incremental backup, we are
> adding
> a new utility called pg_combinebackup in src/bin.
>
> Here is the overall design as proposed by Robert.
>
> pg_combinebackup starts from the LAST backup specified and work backward.
> It
> must NOT start with the full backup and work forward. This is important
> both
> for reasons of efficiency and of correctness. For example, if you start by
> copying over the full backup and then later apply the incremental backups
> on
> top of it then you'll copy data and later end up overwriting it or removing
> it. Any files that are leftover at the end that aren't in the final
> incremental backup even as .partial files need to be removed, or the
> result is
> wrong. We should aim for a system where every block in the output
> directory is
> written exactly once and nothing ever has to be created and then removed.
>
> To make that work, we should start by examining the final incremental
> backup.
> We should proceed with one file at a time. For each file:
>
> 1. If the complete file is present in the incremental backup, then just
> copy it
> to the output directory - and move on to the next file.
>
> 2. Otherwise, we have a .partial file. Work backward through the backup
> chain
> until we find a complete version of the file. That might happen when we get
> \back to the full backup at the start of the chain, but it might also
> happen
> sooner - at which point we do not need to and should not look at earlier
> backups for that file. During this phase, we should read only the HEADER of
> each .partial file, building a map of which blocks we're ultimately going
> to
> need to read from each backup. We can also compute the offset within each
> file
> where that block is stored at this stage, again using the header
> information.
>
> 3. Now, we can write the output file - reading each block in turn from the
> correct backup and writing it to the write output file, using the map we
> constructed in the previous step. We should probably keep all of the input
> files ope

Re: block-level incremental backup

2019-07-17 Thread Jeevan Chalke
On Wed, Jul 17, 2019 at 2:15 PM Ibrar Ahmed  wrote:

>
> At what stage you will apply the WAL generated in between the START/STOP
> backup.
>

In this design, we are not touching any WAL-related code. The WAL files will
get copied with each backup, either full or incremental. Thus, the last
incremental backup will have the final WAL files, which will be copied as-is
into the combined full backup and will be applied automatically when that
data directory is used to start the server.


> --
> Ibrar Ahmed
>

-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation


Re: block-level incremental backup

2019-07-17 Thread Ibrar Ahmed
On Wed, Jul 17, 2019 at 6:43 PM Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

> On Wed, Jul 17, 2019 at 2:15 PM Ibrar Ahmed  wrote:
>
>>
>> At what stage you will apply the WAL generated in between the START/STOP
>> backup.
>>
>
> In this design, we are not touching any WAL related code. The WAL files
> will
> get copied with each backup either full or incremental. And thus, the last
> incremental backup will have the final WAL files which will be copied as-is
> in the combined full-backup and they will get apply automatically if that
> the data directory is used to start the server.
>

Ok, so you keep all the WAL files since the first backup, right?

>
>
>> --
>> Ibrar Ahmed
>>
>
> --
> Jeevan Chalke
> Technical Architect, Product Development
> EnterpriseDB Corporation
>
>

-- 
Ibrar Ahmed


Re: block-level incremental backup

2019-07-17 Thread Jeevan Chalke
On Wed, Jul 17, 2019 at 7:38 PM Ibrar Ahmed  wrote:

>
>
> On Wed, Jul 17, 2019 at 6:43 PM Jeevan Chalke <
> jeevan.cha...@enterprisedb.com> wrote:
>
>> On Wed, Jul 17, 2019 at 2:15 PM Ibrar Ahmed 
>> wrote:
>>
>>>
>>> At what stage you will apply the WAL generated in between the START/STOP
>>> backup.
>>>
>>
>> In this design, we are not touching any WAL related code. The WAL files
>> will
>> get copied with each backup either full or incremental. And thus, the last
>> incremental backup will have the final WAL files which will be copied
>> as-is
>> in the combined full-backup and they will get apply automatically if that
>> the data directory is used to start the server.
>>
>
> Ok, so you keep all the WAL files since the first backup, right?
>

The WAL files will anyway be copied while taking a backup (full or
incremental), but only the last incremental backup's WAL files are copied
into the combined synthetic full backup.


>>
>>> --
>>> Ibrar Ahmed
>>>
>>
>> --
>> Jeevan Chalke
>> Technical Architect, Product Development
>> EnterpriseDB Corporation
>>
>>
>
> --
> Ibrar Ahmed
>


-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation


Re: block-level incremental backup

2019-07-20 Thread vignesh C
Hi Jeevan,

The idea is very nice.
When insert/update/delete and truncate/drop happen in various
combinations, how does the incremental backup handle the copying of the
blocks?


On Wed, Jul 17, 2019 at 8:12 PM Jeevan Chalke
 wrote:
>
>
>
> On Wed, Jul 17, 2019 at 7:38 PM Ibrar Ahmed  wrote:
>>
>>
>>
>> On Wed, Jul 17, 2019 at 6:43 PM Jeevan Chalke 
>>  wrote:
>>>
>>> On Wed, Jul 17, 2019 at 2:15 PM Ibrar Ahmed  wrote:


 At what stage you will apply the WAL generated in between the START/STOP 
 backup.
>>>
>>>
>>> In this design, we are not touching any WAL related code. The WAL files will
>>> get copied with each backup either full or incremental. And thus, the last
>>> incremental backup will have the final WAL files which will be copied as-is
>>> in the combined full-backup and they will get apply automatically if that
>>> the data directory is used to start the server.
>>
>>
>> Ok, so you keep all the WAL files since the first backup, right?
>
>
> The WAL files will anyway be copied while taking a backup (full or 
> incremental),
> but only last incremental backup's WAL files are copied to the combined
> synthetic full backup.
>
>>>

 --
 Ibrar Ahmed
>>>
>>>
>>> --
>>> Jeevan Chalke
>>> Technical Architect, Product Development
>>> EnterpriseDB Corporation
>>>
>>
>>
>> --
>> Ibrar Ahmed
>
>
>
> --
> Jeevan Chalke
> Technical Architect, Product Development
> EnterpriseDB Corporation
>


--
Regards,
vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: block-level incremental backup

2019-07-23 Thread Jeevan Ladhe
Hi Vignesh,

This backup technology extends pg_basebackup itself, which means we can
still take online backups. This is done internally using pg_start_backup and
pg_stop_backup. pg_start_backup performs a checkpoint, and this checkpoint is
used in the recovery process while starting the cluster from a backup image.
What an incremental backup modifies (as compared to traditional
pg_basebackup) is this: after the checkpoint, instead of copying the entire
relation files, it takes an input LSN, scans all the blocks in all relation
files, and stores the blocks having LSN >= InputLSN. This means it considers
all the changes that are already written into relation files, including
insert/update/delete etc., up to the checkpoint performed internally by
pg_start_backup, and as Jeevan Chalke mentioned upthread, the incremental
backup will also contain a copy of the WAL files.
Once this incremental backup is combined with the parent backup by means of
the new combine process (which will be introduced as part of this feature
itself), the result should ideally look like a full pg_basebackup. Note that
any changes done by insert/delete/update operations while the incremental
backup was being taken will still be available via the WAL files and, as in
the normal restore process, will be replayed from the checkpoint onwards up
to a consistent point.
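
For readers following along, here is a minimal sketch, in C, of the
block-selection step described above: compare each page's LSN with the
threshold LSN and decide between sending the whole file and building a
.partial file. The page LSN sits in the first 8 bytes of the page header;
everything else (the names, the bool array standing in for the bitmap, the
90% constant quoted upthread) is illustrative rather than the actual patch.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define BLCKSZ 8192

/* The page LSN is stored in the first 8 bytes of every page header. */
static uint64_t
page_lsn(const char *page)
{
	uint32_t	xlogid;
	uint32_t	xrecoff;

	memcpy(&xlogid, page, sizeof(uint32_t));
	memcpy(&xrecoff, page + sizeof(uint32_t), sizeof(uint32_t));
	return ((uint64_t) xlogid << 32) | xrecoff;
}

/*
 * Scan one relation file; mark each block whose LSN >= threshold_lsn in
 * 'modified' (sized by the caller).  Returns true if the whole file should
 * be sent because more than 90% of its blocks changed, false if only the
 * marked blocks should go into a .partial file.
 */
static bool
scan_relation_file(FILE *fp, uint64_t threshold_lsn,
				   bool *modified, uint32_t *nblocks_out)
{
	char		page[BLCKSZ];
	uint32_t	blkno = 0;
	uint32_t	nchanged = 0;

	while (fread(page, 1, BLCKSZ, fp) == BLCKSZ)
	{
		modified[blkno] = (page_lsn(page) >= threshold_lsn);
		if (modified[blkno])
			nchanged++;
		blkno++;
	}

	*nblocks_out = blkno;
	return nchanged * 10 > blkno * 9;	/* "more than 90%" cutoff */
}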

My two cents!

Regards,
Jeevan Ladhe

On Sat, Jul 20, 2019 at 11:22 PM vignesh C  wrote:

> Hi Jeevan,
>
> The idea is very nice.
> When Insert/update/delete and truncate/drop happens at various
> combinations, How the incremental backup handles the copying of the
> blocks?
>
>
> On Wed, Jul 17, 2019 at 8:12 PM Jeevan Chalke
>  wrote:
> >
> >
> >
> > On Wed, Jul 17, 2019 at 7:38 PM Ibrar Ahmed 
> wrote:
> >>
> >>
> >>
> >> On Wed, Jul 17, 2019 at 6:43 PM Jeevan Chalke <
> jeevan.cha...@enterprisedb.com> wrote:
> >>>
> >>> On Wed, Jul 17, 2019 at 2:15 PM Ibrar Ahmed 
> wrote:
> 
> 
>  At what stage you will apply the WAL generated in between the
> START/STOP backup.
> >>>
> >>>
> >>> In this design, we are not touching any WAL related code. The WAL
> files will
> >>> get copied with each backup either full or incremental. And thus, the
> last
> >>> incremental backup will have the final WAL files which will be copied
> as-is
> >>> in the combined full-backup and they will get apply automatically if
> that
> >>> the data directory is used to start the server.
> >>
> >>
> >> Ok, so you keep all the WAL files since the first backup, right?
> >
> >
> > The WAL files will anyway be copied while taking a backup (full or
> incremental),
> > but only last incremental backup's WAL files are copied to the combined
> > synthetic full backup.
> >
> >>>
> 
>  --
>  Ibrar Ahmed
> >>>
> >>>
> >>> --
> >>> Jeevan Chalke
> >>> Technical Architect, Product Development
> >>> EnterpriseDB Corporation
> >>>
> >>
> >>
> >> --
> >> Ibrar Ahmed
> >
> >
> >
> > --
> > Jeevan Chalke
> > Technical Architect, Product Development
> > EnterpriseDB Corporation
> >
>
>
> --
> Regards,
> vignesh
> EnterpriseDB: http://www.enterprisedb.com
>
>
>


Re: block-level incremental backup

2019-07-23 Thread vignesh C
Thanks Jeevan.

1) If a relation file has changed due to truncate or vacuum,
the new files will be copied during the incremental backup.
There is a chance that both the old file and the new file
will be present. I'm not sure if cleaning up of the
old file is handled.
2) Just a small thought on building the bitmap:
could the bitmap be built and maintained as
and when the changes happen in the system?
If we build the bitmap while doing the incremental backup,
scanning through each file might take more time.
This could be a configurable parameter; the system can run
without capturing this information by default, but users
who take incremental backups frequently could enable this
configuration so that the modified blocks are tracked.

What are your thoughts on this?
-- 
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


On Tue, Jul 23, 2019 at 11:19 PM Jeevan Ladhe
 wrote:
>
> Hi Vignesh,
>
> This backup technology is extending the pg_basebackup itself, which means we 
> can
> still take online backups. This is internally done using pg_start_backup and
> pg_stop_backup. pg_start_backup performs a checkpoint, and this checkpoint is
> used in the recovery process while starting the cluster from a backup image. 
> What
> incremental backup will just modify (as compared to traditional pg_basebackup)
> is - After doing the checkpoint, instead of copying the entire relation files,
> it takes an input LSN and scan all the blocks in all relation files, and store
> the blocks having LSN >= InputLSN. This means it considers all the changes
> that are already written into relation files including insert/update/delete 
> etc
> up to the checkpoint performed by pg_start_backup internally, and as Jeevan 
> Chalke
> mentioned upthread the incremental backup will also contain copy of WAL files.
> Once this incremental backup is combined with the parent backup by means of 
> new
> combine process (that will be introduced as part of this feature itself) 
> should
> ideally look like a full pg_basebackup. Note that any changes done by these
> insert/delete/update operations while the incremental backup was being taken
> will be still available via WAL files and as normal restore process, will be
> replayed from the checkpoint onwards up to a consistent point.
>
> My two cents!
>
> Regards,
> Jeevan Ladhe
>
> On Sat, Jul 20, 2019 at 11:22 PM vignesh C  wrote:
>>
>> Hi Jeevan,
>>
>> The idea is very nice.
>> When Insert/update/delete and truncate/drop happens at various
>> combinations, How the incremental backup handles the copying of the
>> blocks?
>>
>>
>> On Wed, Jul 17, 2019 at 8:12 PM Jeevan Chalke
>>  wrote:
>> >
>> >
>> >
>> > On Wed, Jul 17, 2019 at 7:38 PM Ibrar Ahmed  wrote:
>> >>
>> >>
>> >>
>> >> On Wed, Jul 17, 2019 at 6:43 PM Jeevan Chalke 
>> >>  wrote:
>> >>>
>> >>> On Wed, Jul 17, 2019 at 2:15 PM Ibrar Ahmed  
>> >>> wrote:
>> 
>> 
>>  At what stage you will apply the WAL generated in between the 
>>  START/STOP backup.
>> >>>
>> >>>
>> >>> In this design, we are not touching any WAL related code. The WAL files 
>> >>> will
>> >>> get copied with each backup either full or incremental. And thus, the 
>> >>> last
>> >>> incremental backup will have the final WAL files which will be copied 
>> >>> as-is
>> >>> in the combined full-backup and they will get apply automatically if that
>> >>> the data directory is used to start the server.
>> >>
>> >>
>> >> Ok, so you keep all the WAL files since the first backup, right?
>> >
>> >
>> > The WAL files will anyway be copied while taking a backup (full or 
>> > incremental),
>> > but only last incremental backup's WAL files are copied to the combined
>> > synthetic full backup.
>> >
>> >>>
>> 
>>  --
>>  Ibrar Ahmed
>> >>>
>> >>>
>> >>> --
>> >>> Jeevan Chalke
>> >>> Technical Architect, Product Development
>> >>> EnterpriseDB Corporation
>> >>>
>> >>
>> >>
>> >> --
>> >> Ibrar Ahmed
>> >
>> >
>> >
>> > --
>> > Jeevan Chalke
>> > Technical Architect, Product Development
>> > EnterpriseDB Corporation
>> >
>>
>>
>> --
>> Regards,
>> vignesh
>>
>>
>>




Re: block-level incremental backup

2019-07-25 Thread Jeevan Ladhe
Hi Vignesh,

Please find my comments inline below:

1) If relation file has changed due to truncate or vacuum.
> During incremental backup the new files will be copied.
> There are chances that both the old  file and new file
> will be present. I'm not sure if cleaning up of the
> old file is handled.
>

When an incremental backup is taken, it either copies a file in its entirety
if the file is changed more than 90%, or writes a .partial file containing
the changed-blocks bitmap and the actual data. For files that are unchanged,
it writes 0 bytes but still creates a .partial file. This means there is a
.partial file for every file that is to be looked up in the full backup.
While composing a synthetic backup from an incremental backup, the
pg_combinebackup tool will only look for those relation files in the full
(parent) backup that have .partial files in the incremental backup. So, if a
vacuum/truncate happened between the full and incremental backups, the
incremental backup image will not have even a 0-length .partial file for that
relation, and so the synthetic backup restored using pg_combinebackup will
not have that file either.
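
Just to restate that rule in code form, here is a tiny illustrative sketch
(assumed names and paths, not the actual tool) of how the presence or absence
of a file in the final incremental backup drives pg_combinebackup. In
practice the tool simply walks the files present in the incremental backup,
so a dropped relation is never visited at all, which is what the FILE_DROPPED
case below represents.

#include <stdio.h>
#include <sys/stat.h>

typedef enum
{
	FILE_COPY_WHOLE,	/* complete file present: copy it as-is            */
	FILE_MERGE_PARTIAL,	/* .partial present: merge with the parent backups */
	FILE_DROPPED		/* neither present: relation was dropped/vacuumed
						 * away, so it must not reappear in the output     */
} combine_action;

static int
file_exists(const char *path)
{
	struct stat st;

	return stat(path, &st) == 0;
}

static combine_action
decide_action(const char *incr_dir, const char *relpath)
{
	char		full[4096];
	char		partial[4096];

	snprintf(full, sizeof(full), "%s/%s", incr_dir, relpath);
	snprintf(partial, sizeof(partial), "%s/%s.partial", incr_dir, relpath);

	if (file_exists(full))
		return FILE_COPY_WHOLE;
	if (file_exists(partial))
		return FILE_MERGE_PARTIAL;
	return FILE_DROPPED;
}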


> 2) Just a small thought on building the bitmap,
> can the bitmap be built and maintained as
> and when the changes are happening in the system.
> If we are building the bitmap while doing the incremental backup,
> Scanning through each file might take more time.
> This can be a configurable parameter, the system can run
> without capturing this information by default, but if there are some
> of them who will be taking incremental backup frequently this
> configuration can be enabled which should track the modified blocks.


IIUC, this will need changes in the backend. Honestly, I think backup is a
maintenance task, and hampering the backend for it does not look like a good
idea. Having said that, even if we have to provide this as a switch for some
users, it will need a different infrastructure than what we are building here
for constructing the bitmap, where we scan all the files one by one. Maybe
for the initial version we can go with the current proposal that Robert has
suggested, and add this switch at a later point as an enhancement.
- My thoughts.

Regards,
Jeevan Ladhe


Re: block-level incremental backup

2019-07-26 Thread vignesh C
On Fri, Jul 26, 2019 at 11:21 AM Jeevan Ladhe 
wrote:

> Hi Vignesh,
>
> Please find my comments inline below:
>
> 1) If relation file has changed due to truncate or vacuum.
>> During incremental backup the new files will be copied.
>> There are chances that both the old  file and new file
>> will be present. I'm not sure if cleaning up of the
>> old file is handled.
>>
>
> When an incremental backup is taken it either copies the file in its
> entirety if
> a file is changed more than 90%, or writes .partial with changed blocks
> bitmap
> and actual data. For the files that are unchanged, it writes 0 bytes and
> still
> creates a .partial file for unchanged files too. This means there is a
> .partitial
> file for all the files that are to be looked up in full backup.
> While composing a synthetic backup from incremental backup the
> pg_combinebackup
> tool will only look for those relation files in full(parent) backup which
> are
> having .partial files in the incremental backup. So, if vacuum/truncate
> happened
> between full and incremental backup, then the incremental backup image
> will not
> have a 0-length .partial file for that relation, and so the synthetic
> backup
> that is restored using pg_combinebackup will not have that file as well.
>
>
Thanks Jeevan for the update, I feel this logic is good.
It will handle the case of deleting the old relation files.

>
>
>> 2) Just a small thought on building the bitmap,
>> can the bitmap be built and maintained as
>> and when the changes are happening in the system.
>> If we are building the bitmap while doing the incremental backup,
>> Scanning through each file might take more time.
>> This can be a configurable parameter, the system can run
>> without capturing this information by default, but if there are some
>> of them who will be taking incremental backup frequently this
>> configuration can be enabled which should track the modified blocks.
>
>
> IIUC, this will need changes in the backend. Honestly, I think backup is a
> maintenance task and hampering the backend for this does not look like a
> good
> idea. But, having said that even if we have to provide this as a switch
> for some
> of the users, it will need a different infrastructure than what we are
> building
> here for constructing bitmap, where we scan all the files one by one.
> Maybe for
> the initial version, we can go with the current proposal that Robert has
> suggested,
> and add this switch at a later point as an enhancement.
>
>
That sounds fair to me.


Regards,
vignesh
EnterpriseDB: http://www.enterprisedb.com


Re: block-level incremental backup

2019-07-29 Thread Robert Haas
On Wed, Jul 10, 2019 at 2:17 PM Anastasia Lubennikova
 wrote:
> In attachments, you can find a prototype of incremental pg_basebackup,
> which consists of 2 features:
>
> 1) To perform incremental backup one should call pg_basebackup with a
> new argument:
>
> pg_basebackup -D 'basedir' --prev-backup-start-lsn 'lsn'
>
> where lsn is a start_lsn of parent backup (can be found in
> "backup_label" file)
>
> It calls BASE_BACKUP replication command with a new argument
> PREV_BACKUP_START_LSN 'lsn'.
>
> For datafiles, only pages with LSN > prev_backup_start_lsn will be
> included in the backup.
> They are saved into 'filename.partial' file, 'filename.blockmap' file
> contains an array of BlockNumbers.
> For example, if we backuped blocks 1,3,5, filename.partial will contain
> 3 blocks, and 'filename.blockmap' will contain array {1,3,5}.

I think it's better to keep both the information about changed blocks
and the contents of the changed blocks in a single file.  The list of
changed blocks is probably quite short, and I don't really want to
double the number of files in the backup if there's no real need. I
suspect it's just overall a bit simpler to keep everything together.
I don't think this is a make-or-break thing, and welcome contrary
arguments, but that's my preference.
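
For what it's worth, a single-file layout along the lines already described
upthread could be as simple as the sketch below; the names and the magic
value are only placeholders, not a proposal for the actual on-disk format.

#include <stdint.h>

#define INCR_FILE_MAGIC 0x494E4352	/* "INCR"; illustrative value only */

typedef struct
{
	uint32_t	magic;		/* identifies an incremental (.partial) file     */
	uint32_t	checksum;	/* CRC covering the header and the block list    */
	uint32_t	nblocks;	/* number of changed blocks stored in this file  */
	/* uint32_t blocknos[nblocks];        -- the changed block numbers      */
	/* char     blocks[nblocks][BLCKSZ];  -- the block contents themselves  */
} incr_file_header;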

> 2) To merge incremental backup into a full backup call
>
> pg_basebackup -D 'basedir' --incremental-pgdata 'incremental_basedir'
> --merge-backups
>
> It will move all files from 'incremental_basedir' to 'basedir' handling
> '.partial' files correctly.

This, to me, looks like it's much worse than the design that I
proposed originally.  It means that:

1. You can't take an incremental backup without having the full backup
available at the time you want to take the incremental backup.

2. You're always storing a full backup, which means that you need more
disk space, and potentially much more I/O while taking the backup.
You save on transfer bandwidth, but you add a lot of disk reads and
writes, costs which have to be paid even if the backup is never
restored.

> 1) Whether we collect block maps using simple "read everything page by
> page" approach
> or WAL scanning or any other page tracking algorithm, we must choose a
> map format.
> I implemented the simplest one, while there are more ideas:

I think we should start simple.

I haven't had a chance to look at Jeevan's patch at all, or yours in
any detail, as yet, so these are just some very preliminary comments.
It will be good, however, if we can agree on who is going to do what
part of this as we try to drive this forward together.  I'm sorry that
I didn't communicate EDB's plans to work on this more clearly;
duplicated effort serves nobody well.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-07-29 Thread Jeevan Ladhe
Hi Jeevan


I reviewed the first two patches -

0001-Add-support-for-command-line-option-to-pass-LSN.patch and
0002-Add-TAP-test-to-test-LSN-option.patch

from the set of incremental backup patches, and the changes look good to me.

I had some concerns around the way we are working around the fact that
pg_lsn_in() accepts an lsn of 0 as a valid lsn, and I think that itself is
contradictory to the definition of InvalidXLogRecPtr. I have started a
separate new thread[1] for the same.

Also, I observe that commit 21f428eb has already moved the lsn decoding
logic to a separate function pg_lsn_in_internal(), so the function
decode_lsn_internal() from patch 0001 will go away and the dependent code
needs to be modified.

I shall review the rest of the patches and post comments.

Regards,

Jeevan Ladhe

[1]
https://www.postgresql.org/message-id/CAOgcT0NOM9oR0Hag_3VpyW0uF3iCU=bdufspfk9jrwxrcwq...@mail.gmail.com

On Thu, Jul 11, 2019 at 5:00 PM Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

> Hi Anastasia,
>
> On Wed, Jul 10, 2019 at 11:47 PM Anastasia Lubennikova <
> a.lubennik...@postgrespro.ru> wrote:
>
>> 23.04.2019 14:08, Anastasia Lubennikova wrote:
>> > I'm volunteering to write a draft patch or, more likely, set of
>> > patches, which
>> > will allow us to discuss the subject in more detail.
>> > And to do that I wish we agree on the API and data format (at least
>> > broadly).
>> > Looking forward to hearing your thoughts.
>>
>> Though the previous discussion stalled,
>> I still hope that we could agree on basic points such as a map file
>> format and protocol extension,
>> which is necessary to start implementing the feature.
>>
>
> It's great that you too come up with the PoC patch. I didn't look at your
> changes in much details but we at EnterpriseDB too working on this feature
> and started implementing it.
>
> Attached series of patches I had so far... (which needed further
> optimization and adjustments though)
>
> Here is the overall design (as proposed by Robert) we are trying to
> implement:
>
> 1. Extend the BASE_BACKUP command that can be used with replication
> connections. Add a new [ LSN 'lsn' ] option.
>
> 2. Extend pg_basebackup with a new --lsn=LSN option that causes it to send
> the option added to the server in #1.
>
> Here are the implementation details when we have a valid LSN
>
> sendFile() in basebackup.c is the function which mostly does the thing for
> us. If the filename looks like a relation file, then we'll need to consider
> sending only a partial file. The way to do that is probably:
>
> A. Read the whole file into memory.
>
> B. Check the LSN of each block. Build a bitmap indicating which blocks
> have an LSN greater than or equal to the threshold LSN.
>
> C. If more than 90% of the bits in the bitmap are set, send the whole file
> just as if this were a full backup. This 90% is a constant now; we might
> make it a GUC later.
>
> D. Otherwise, send a file with .partial added to the name. The .partial
> file contains an indication of which blocks were changed at the beginning,
> followed by the data blocks. It also includes a checksum/CRC.
> Currently, a .partial file format looks like:
>  - start with a 4-byte magic number
>  - then store a 4-byte CRC covering the header
>  - then a 4-byte count of the number of blocks included in the file
>  - then the block numbers, each as a 4-byte quantity
>  - then the data blocks
>
>
> We are also working on combining these incremental back-ups with the full
> backup and for that, we are planning to add a new utility called
> pg_combinebackup. Will post the details on that later once we have on the
> same page for taking backup.
>
> Thanks
> --
> Jeevan Chalke
> Technical Architect, Product Development
> EnterpriseDB Corporation
>
>


Re: block-level incremental backup

2019-07-29 Thread Jeevan Chalke
On Tue, Jul 30, 2019 at 1:58 AM Robert Haas  wrote:

>
> I haven't had a chance to look at Jeevan's patch at all, or yours in
> any detail, as yet, so these are just some very preliminary comments.
> It will be good, however, if we can agree on who is going to do what
> part of this as we try to drive this forward together.  I'm sorry that
> I didn't communicate EDB's plans to work on this more clearly;
> duplicated effort serves nobody well.
>

I had a look over Anastasia's PoC patch to understand the approach she has
taken and here are my observations.

1.
The patch first creates a .blockmap file for each relation file, containing
an array of all modified block numbers. This is done by reading all blocks
from a file (in chunks of 4 blocks, 32kB in total, in a loop) and checking
each page LSN against the given LSN. Later, to create the .partial file, the
relation file is opened again and all blocks are read again in chunks of 4 in
a loop. If a block is found to be modified, it is copied into another buffer,
and after scanning all 4 blocks, the copied blocks are written to the
.partial file.

In this approach, each file is opened and read twice, which looks more
expensive to me, whereas in my patch I do that just once. However, I read the
entire file into memory to check which blocks are modified, while in
Anastasia's design at most TAR_SEND_SIZE (32kB) is read at a time, in a loop.
I need to read the whole file because we want to know how heavily the file
got modified, so that we can send the entire file if it was modified beyond
the threshold (currently 90%).

2.
Also, while sending modified blocks, they are copied into another buffer;
instead, they could simply be sent from the already-read file contents (in
BLCKSZ-sized pieces). The .blockmap created earlier is not used here. In my
implementation, we send just a .partial file with a header containing all the
required details - the number of changed blocks and the block numbers, along
with a CRC - followed by the blocks themselves.

3.
I tried compiling Anastasia's patch, but got an error, so I could not see or
test how it goes. Also, like a normal backup, the incremental backup option
needs to verify checksums if requested.

4.
While combining full and incremental backups, files from the incremental
backup are just copied into the full backup directory. In the design I posted
earlier, we work the other way around to avoid overwriting and other issues,
as I explained earlier.

I am almost done writing the patch for pg_combinebackup and will post it soon.


>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
>
Thanks
-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: block-level incremental backup

2019-07-30 Thread Ibrar Ahmed
On Tue, Jul 30, 2019 at 1:28 AM Robert Haas  wrote:

> On Wed, Jul 10, 2019 at 2:17 PM Anastasia Lubennikova
>  wrote:
> > In attachments, you can find a prototype of incremental pg_basebackup,
> > which consists of 2 features:
> >
> > 1) To perform incremental backup one should call pg_basebackup with a
> > new argument:
> >
> > pg_basebackup -D 'basedir' --prev-backup-start-lsn 'lsn'
> >
> > where lsn is a start_lsn of parent backup (can be found in
> > "backup_label" file)
> >
> > It calls BASE_BACKUP replication command with a new argument
> > PREV_BACKUP_START_LSN 'lsn'.
> >
> > For datafiles, only pages with LSN > prev_backup_start_lsn will be
> > included in the backup.
> > They are saved into 'filename.partial' file, 'filename.blockmap' file
> > contains an array of BlockNumbers.
> > For example, if we backuped blocks 1,3,5, filename.partial will contain
> > 3 blocks, and 'filename.blockmap' will contain array {1,3,5}.
>
> I think it's better to keep both the information about changed blocks
> and the contents of the changed blocks in a single file.  The list of
> changed blocks is probably quite short, and I don't really want to
> double the number of files in the backup if there's no real need. I
> suspect it's just overall a bit simpler to keep everything together.
> I don't think this is a make-or-break thing, and welcome contrary
> arguments, but that's my preference.
>

I have experience working on a similar product, and I agree with Robert that
keeping the changed-block info and the changed blocks in a single file makes
more sense.
+1

>
> > 2) To merge incremental backup into a full backup call
> >
> > pg_basebackup -D 'basedir' --incremental-pgdata 'incremental_basedir'
> > --merge-backups
> >
> > It will move all files from 'incremental_basedir' to 'basedir' handling
> > '.partial' files correctly.
>
> This, to me, looks like it's much worse than the design that I
> proposed originally.  It means that:
>
> 1. You can't take an incremental backup without having the full backup
> available at the time you want to take the incremental backup.
>
> 2. You're always storing a full backup, which means that you need more
> disk space, and potentially much more I/O while taking the backup.
> You save on transfer bandwidth, but you add a lot of disk reads and
> writes, costs which have to be paid even if the backup is never
> restored.
>
> > 1) Whether we collect block maps using simple "read everything page by
> > page" approach
> > or WAL scanning or any other page tracking algorithm, we must choose a
> > map format.
> > I implemented the simplest one, while there are more ideas:
>
> I think we should start simple.
>
> I haven't had a chance to look at Jeevan's patch at all, or yours in
> any detail, as yet, so these are just some very preliminary comments.
> It will be good, however, if we can agree on who is going to do what
> part of this as we try to drive this forward together.  I'm sorry that
> I didn't communicate EDB's plans to work on this more clearly;
> duplicated effort serves nobody well.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
>

-- 
Ibrar Ahmed


Re: block-level incremental backup

2019-07-31 Thread vignesh C
On Tue, Jul 30, 2019 at 1:58 AM Robert Haas  wrote:
>
> On Wed, Jul 10, 2019 at 2:17 PM Anastasia Lubennikova
>  wrote:
> > In attachments, you can find a prototype of incremental pg_basebackup,
> > which consists of 2 features:
> >
> > 1) To perform incremental backup one should call pg_basebackup with a
> > new argument:
> >
> > pg_basebackup -D 'basedir' --prev-backup-start-lsn 'lsn'
> >
> > where lsn is a start_lsn of parent backup (can be found in
> > "backup_label" file)
> >
> > It calls BASE_BACKUP replication command with a new argument
> > PREV_BACKUP_START_LSN 'lsn'.
> >
> > For datafiles, only pages with LSN > prev_backup_start_lsn will be
> > included in the backup.
>>
One thought: if the file is not modified, there is no need to check the LSN.
>>
> > They are saved into 'filename.partial' file, 'filename.blockmap' file
> > contains an array of BlockNumbers.
> > For example, if we backuped blocks 1,3,5, filename.partial will contain
> > 3 blocks, and 'filename.blockmap' will contain array {1,3,5}.
>
> I think it's better to keep both the information about changed blocks
> and the contents of the changed blocks in a single file.  The list of
> changed blocks is probably quite short, and I don't really want to
> double the number of files in the backup if there's no real need. I
> suspect it's just overall a bit simpler to keep everything together.
> I don't think this is a make-or-break thing, and welcome contrary
> arguments, but that's my preference.
>
I feel Robert's suggestion is good.
We can probably keep one metadata file for each backup with some basic
information about all the files being backed up; this metadata file would be
useful in the cases below:
Table dropped before the incremental backup
Table truncated, plus insert/update/delete operations, before the incremental backup

I feel that if we have the metadata, we can add some optimization to detect
the above scenarios, identify file deletion, and avoid the write-and-delete
steps for pg_combinebackup that Jeevan described in his previous mail.

It can probably also help us decide which work each worker needs to do
if we are planning to back up in parallel.

Regards,
vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: block-level incremental backup

2019-07-31 Thread Robert Haas
On Wed, Jul 31, 2019 at 1:59 PM vignesh C  wrote:
> I feel Robert's suggestion is good.
> We can probably keep one meta file for each backup with some basic information
> of all the files being backed up, this metadata file will be useful in the
> below case:
> Table dropped before incremental backup
> Table truncated and Insert/Update/Delete operations before incremental backup

There's really no need for this with the design I proposed.  The files
that should exist when you restore an incremental backup are exactly
the set of files that exist in the final incremental backup, except
that any .partial files need to be replaced with a correct
reconstruction of the underlying file.  You don't need to know what
got dropped or truncated; you only need to know what's supposed to be
there at the end.

You may be thinking, as I once did, that restoring an incremental
backup would consist of restoring the full backup first and then
layering the incrementals over it, but if you read what I proposed, it
actually works the other way around: you restore the files that are
present in the incremental, and as needed, pull pieces of them from
earlier incremental and/or full backups.  I think this is a *much*
better design than doing it the other way; it avoids any risk of
getting the wrong answer due to truncations or drops, and it also is
faster, because you only read older backups to the extent that you
actually need their contents.

I think it's a good idea to try to keep all the information about a
single file being backed up in one place. It's just less confusing.  If,
for example, you have a metadata file that tells you which files are
dropped - that is, which files you DON'T have - then what happens if
one of those files is present in the data directory after all?  Well,
then you have inconsistent information and are confused, and maybe
your code won't even notice the inconsistency.  Similarly, if the
metadata file is separate from the block data, then what happens if
one file is missing, or isn't from the same backup as the other file?
That shouldn't happen, of course, but if it does, you'll get confused.
There's no perfect solution to these kinds of problems: if we suppose
that the backup can be corrupted by having missing or extra files, why
not also corruption within a single file? Still, on balance I tend to
think that keeping related stuff together minimizes the surface area
for bugs.  I realize that's arguable, though.

One consideration that goes the other way: if you have a manifest file
that says what files are supposed to be present in the backup, then
you can detect a disappearing file, which is impossible with the
design I've proposed (and with the current full backup machinery).
That might be worth fixing, but it's a separate feature that has
little to do with incremental backup.

> Probably it can also help us to decide which work the worker needs to do
> if we are planning to backup in parallel.

I don't think we need a manifest file for parallel backup.  One
process or thread can scan the directory tree, make a list of which
files are present, and then hand individual files off to other
processes or threads. In short, the directory listing serves as the
manifest.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-08-01 Thread Jeevan Chalke
On Tue, Jul 30, 2019 at 9:39 AM Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

>
>
>
> I am almost done writing the patch for pg_combinebackup and will post soon.
>

Attached is a patch which implements the pg_combinebackup utility, used to
combine a full basebackup with one or more incremental backups.

I have tested it manually and it works for all best cases.

Let me know if you have any inputs/suggestions/review comments?

Thanks
-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 8d91f35..f3e90b6 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -200,6 +200,7 @@ Complete list of usable sgml source files in this directory.
 
 
 
+
 
 
 
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 00782e0..92d9d13 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -415,7 +415,7 @@ PostgreSQL documentation
 which are modified after this given LSN will be backed up. The file
 which has these partial blocks has .partial as an extension. Backup
 taken in this manner has to be combined with the full backup with the
-pg_combinebackup utility.
+ utility.

   
  
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 000..ed87931
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,202 @@
+
+
+
+ 
+  pg_combinebackup
+ 
+
+ 
+  pg_combinebackup
+  1
+  Application
+ 
+
+ 
+  pg_combinebackup
+  create a synthetic backup from a full backup and one or more incremental backups
+ 
+
+ 
+  
+   pg_combinebackup
+   option
+  
+ 
+
+ 
+  Description
+  
+   pg_combinebackup combines one or more incremental
+   backups with the full base-backup to generate a synthetic backup.
+  
+ 
+
+ 
+  Options
+
+   
+The following command-line options are available:
+
+
+ 
+  -f directory
+  --full-backup=directory
+  
+   
+Specifies the directory where the full backup is stored.
+   
+  
+ 
+
+ 
+  -i directory
+  --incr-backup=directory
+  
+   
+Specifies the directory where the incremental backup is stored.
+   
+  
+ 
+
+ 
+  -o directory
+  --output-backup=directory
+  
+   
+Specifies the output directory where the combined full synthetic backup
+to be stored.
+   
+  
+ 
+
+ 
+  -n
+  --no-clean
+  
+   
+By default, when pg_combinebackup aborts with an
+error, it removes the output data directories it might have created
+before discovering that it cannot finish the job. This option inhibits
+tidying-up and is thus useful for debugging.
+   
+  
+ 
+
+ 
+  -r max-retries
+  
+   
+Max number of retries on copy command, with progressive wait.
+Default is 3.
+   
+  
+ 
+
+ 
+  -s interval
+  
+   
+Sleep interval before retry (in seconds).
+Should be in between 1 and 60, default is 5.
+   
+  
+ 
+
+ 
+  -v
+  --verbose
+  
+   
+Enable verbose output. Lists all partial files processed and its
+checksum status.
+   
+  
+ 
+
+ 
+   -V
+   --version
+   
+   
+Print the pg_combinebackup version and exit.
+   
+   
+ 
+
+ 
+  -?
+  --help
+   
+
+ Show help about pg_combinebackup command line
+ arguments, and exit.
+
+   
+  
+
+   
+ 
+
+ 
+  Environment
+  
+   
+PG_COLOR
+
+ 
+  Specifies whether to use color in diagnostics messages.  Possible values
+  are always, auto,
+  never.
+ 
+
+   
+  
+ 
+
+ 
+  Notes
+  
+   Output directory, full backup directory, and at least one incremental backup
+   directory must be specified.
+  
+
+  
+   PREVIOUS WAL LOCATION of the incremental backup must
+   match with the START WAL LOCATION of the previous full
+   or incremental backup in a given sequence.
+  
+ 
+
+ 
+  Examples
+
+  
+   To combine a full backup with two incremental backups and store it in the
+   output directory:
+
+$ pg_combinebackup -f /data/full/data -i /data/incr/data1 -i /data/incr/data2 -o /data/full/fulldata
+
+  
+
+  
+   To combine a full backup with an incremental backups and store it in the
+   output directory along with various options like verbose, no-clean, and
+   maximum 3 retries:
+
+$ pg_combinebackup -v --no-clean -r 3 -f /data/full/data -i /data/incr/data1 -o /data/full/fulldata
+
+  
+ 
+
+ 
+  See Also
+
+  
+   
+  
+ 
+
+
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index cef09dd..3513ab4 100644
--- a/do

Re: block-level incremental backup

2019-08-02 Thread vignesh C
On Thu, Aug 1, 2019 at 5:06 PM Jeevan Chalke
 wrote:
>
> On Tue, Jul 30, 2019 at 9:39 AM Jeevan Chalke 
>  wrote:
>>
>> I am almost done writing the patch for pg_combinebackup and will post soon.
>
>
> Attached patch which implements the pg_combinebackup utility used to combine
> full basebackup with one or more incremental backups.
>
> I have tested it manually and it works for all best cases.
>
> Let me know if you have any inputs/suggestions/review comments?
>
Some comments:
1) There will be some link files created for tablespaces; we might
require some special handling for them

2)
+ while (numretries <= maxretries)
+ {
+ rc = system(copycmd);
+ if (rc == 0)
+ return;
+
+ pg_log_info("could not copy, retrying after %d seconds",
+ sleeptime);
+ pg_usleep(numretries++ * sleeptime * 100L);
+ }
Retry functionality is handled only for copying of full files; should
we handle retry for copying of partial files too?

3)
+ maxretries = atoi(optarg);
+ if (maxretries < 0)
+ {
+ pg_log_error("invalid value for maxretries");
+ fprintf(stderr, _("%s: -r maxretries must be >= 0\n"), progname);
+ exit(1);
+ }
+ break;
+ case 's':
+ sleeptime = atoi(optarg);
+ if (sleeptime <= 0 || sleeptime > 60)
+ {
+ pg_log_error("invalid value for sleeptime");
+ fprintf(stderr, _("%s: -s sleeptime must be between 1 and 60\n"), progname);
+ exit(1);
+ }
+ break;
we can have some range for maxretries similar to sleeptime

4)
+ fp = fopen(filename, "r");
+ if (fp == NULL)
+ {
+ pg_log_error("could not read file \"%s\": %m", filename);
+ exit(1);
+ }
+
+ labelfile = malloc(statbuf.st_size + 1);
+ if (fread(labelfile, 1, statbuf.st_size, fp) != statbuf.st_size)
+ {
+ pg_log_error("corrupted file \"%s\": %m", filename);
+ free(labelfile);
+ exit(1);
+ }
Should we check for malloc failure

5) Should we add a display of progress, as the backup may take some time?
This can be added as an enhancement. We can get others' opinions on this.

6)
+ if (nIncrDir == MAX_INCR_BK_COUNT)
+ {
+ pg_log_error("too many incremental backups to combine");
+ fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+ exit(1);
+ }
+
+ IncrDirs[nIncrDir] = optarg;
+ nIncrDir++;
+ break;

If the backup count increases, providing the input may be difficult.
Should the user provide all the incremental backups from a parent folder,
and can we handle the ordering of the incremental backups internally?

7)
+ if (isPartialFile)
+ {
+ if (verbose)
+ pg_log_info("combining partial file \"%s.partial\"", fn);
+
+ combine_partial_files(fn, IncrDirs, nIncrDir, subdirpath, outfn);
+ }
+ else
+ copy_whole_file(infn, outfn);

Add verbose for copying whole file

8) We could also check whether approximately enough space is available on
disk before starting to combine the backup; this can be added as an
enhancement. We can get others' opinions on this.

9)
+ printf(_("  -i, --incr-backup=DIRECTORY incremental backup directory
(maximum %d)\n"), MAX_INCR_BK_COUNT);
+ printf(_("  -o, --output-dir=DIRECTORY  combine backup into directory\n"));
+ printf(_("\nGeneral options:\n"));
+ printf(_("  -n, --no-clean  do not clean up after errors\n"));

The help text "combine backup into directory" could be "combine backup directory"

10)
+/* Max number of incremental backups to be combined. */
+#define MAX_INCR_BK_COUNT 10
+
+/* magic number in incremental backup's .partial file */

MAX_INCR_BK_COUNT could be increased a little; some applications take 1
full backup at the beginning of the month and 30 incremental
backups for the rest of the days in the month.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: block-level incremental backup

2019-08-05 Thread Robert Haas
On Fri, Aug 2, 2019 at 9:13 AM vignesh C  wrote:
> + rc = system(copycmd);

I don't think this patch should be calling system() in the first place.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-08-05 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Fri, Aug 2, 2019 at 9:13 AM vignesh C  wrote:
> > + rc = system(copycmd);
> 
> I don't think this patch should be calling system() in the first place.

+1.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: block-level incremental backup

2019-08-06 Thread Ibrar Ahmed
I have not looked at the patch in detail, but just some nits from my side.

On Fri, Aug 2, 2019 at 6:13 PM vignesh C  wrote:

> On Thu, Aug 1, 2019 at 5:06 PM Jeevan Chalke
>  wrote:
> >
> > On Tue, Jul 30, 2019 at 9:39 AM Jeevan Chalke <
> jeevan.cha...@enterprisedb.com> wrote:
> >>
> >> I am almost done writing the patch for pg_combinebackup and will post
> soon.
> >
> >
> > Attached patch which implements the pg_combinebackup utility used to
> combine
> > full basebackup with one or more incremental backups.
> >
> > I have tested it manually and it works for all best cases.
> >
> > Let me know if you have any inputs/suggestions/review comments?
> >
> Some comments:
> 1) There will be some link files created for tablespace, we might
> require some special handling for it
>
> 2)
> + while (numretries <= maxretries)
> + {
> + rc = system(copycmd);
> + if (rc == 0)
> + return;
>
> Use API to copy the file instead of "system", better to use the secure
copy.


> + pg_log_info("could not copy, retrying after %d seconds",
> + sleeptime);
> + pg_usleep(numretries++ * sleeptime * 100L);
> + }
> Retry functionality is hanlded only for copying of full files, should
> we handle retry for copying of partial files
>
> The log and the sleep time do not match: you are multiplying sleeptime
by numretries++ but logging only "sleeptime".

Why are we retrying here? Capture the proper copy error and act accordingly.
Blindly retrying does not make sense.

3)
> + maxretries = atoi(optarg);
> + if (maxretries < 0)
> + {
> + pg_log_error("invalid value for maxretries");
> + fprintf(stderr, _("%s: -r maxretries must be >= 0\n"), progname);
> + exit(1);
> + }
> + break;
> + case 's':
> + sleeptime = atoi(optarg);
> + if (sleeptime <= 0 || sleeptime > 60)
> + {
> + pg_log_error("invalid value for sleeptime");
> + fprintf(stderr, _("%s: -s sleeptime must be between 1 and 60\n"),
> progname);
> + exit(1);
> + }
> + break;
> we can have some range for maxretries similar to sleeptime
>
> 4)
> + fp = fopen(filename, "r");
> + if (fp == NULL)
> + {
> + pg_log_error("could not read file \"%s\": %m", filename);
> + exit(1);
> + }
> +
> + labelfile = malloc(statbuf.st_size + 1);
> + if (fread(labelfile, 1, statbuf.st_size, fp) != statbuf.st_size)
> + {
> + pg_log_error("corrupted file \"%s\": %m", filename);
> + free(labelfile);
> + exit(1);
> + }
> Should we check for malloc failure
>
> Use pg_malloc instead of malloc


> 5) Should we add display of progress as backup may take some time,
> this can be added as enhancement. We can get other's opinion on this.
>
> Yes, we should, but this is not the right time to do that.


> 6)
> + if (nIncrDir == MAX_INCR_BK_COUNT)
> + {
> + pg_log_error("too many incremental backups to combine");
> + fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
> progname);
> + exit(1);
> + }
> +
> + IncrDirs[nIncrDir] = optarg;
> + nIncrDir++;
> + break;
>
> If the backup count increases providing the input may be difficult,
> Shall user provide all the incremental backups from a parent folder
> and can we handle the ordering of incremental backup internally
>
> Why do we have that limit in the first place?


> 7)
> + if (isPartialFile)
> + {
> + if (verbose)
> + pg_log_info("combining partial file \"%s.partial\"", fn);
> +
> + combine_partial_files(fn, IncrDirs, nIncrDir, subdirpath, outfn);
> + }
> + else
> + copy_whole_file(infn, outfn);
>
> Add verbose for copying whole file
>
> 8) We can also check if approximate space is available in disk before
> starting combine backup, this can be added as enhancement. We can get
> other's opinion on this.
>
> 9)
> + printf(_("  -i, --incr-backup=DIRECTORY incremental backup directory
> (maximum %d)\n"), MAX_INCR_BK_COUNT);
> + printf(_("  -o, --output-dir=DIRECTORY  combine backup into
> directory\n"));
> + printf(_("\nGeneral options:\n"));
> + printf(_("  -n, --no-clean  do not clean up after
> errors\n"));
>
> Combine backup into directory can be combine backup directory
>
> 10)
> +/* Max number of incremental backups to be combined. */
> +#define MAX_INCR_BK_COUNT 10
> +
> +/* magic number in incremental backup's .partial file */
>
> MAX_INCR_BK_COUNT can be increased little, some applications use 1
> full backup at the beginning of the month and use 30 incremental
> backups rest of the days in the month
>
> Regards,
> Vignesh
> EnterpriseDB: http://www.enterprisedb.com
>
>
>

-- 
Ibrar Ahmed


Re: block-level incremental backup

2019-08-06 Thread Ibrar Ahmed
On Tue, Aug 6, 2019 at 11:31 PM Ibrar Ahmed  wrote:

>
> I have not looked at the patch in detail, but just some nits from my side.
>
> On Fri, Aug 2, 2019 at 6:13 PM vignesh C  wrote:
>
>> On Thu, Aug 1, 2019 at 5:06 PM Jeevan Chalke
>>  wrote:
>> >
>> > On Tue, Jul 30, 2019 at 9:39 AM Jeevan Chalke <
>> jeevan.cha...@enterprisedb.com> wrote:
>> >>
>> >> I am almost done writing the patch for pg_combinebackup and will post
>> soon.
>> >
>> >
>> > Attached patch which implements the pg_combinebackup utility used to
>> combine
>> > full basebackup with one or more incremental backups.
>> >
>> > I have tested it manually and it works for all best cases.
>> >
>> > Let me know if you have any inputs/suggestions/review comments?
>> >
>> Some comments:
>> 1) There will be some link files created for tablespace, we might
>> require some special handling for it
>>
>> 2)
>> + while (numretries <= maxretries)
>> + {
>> + rc = system(copycmd);
>> + if (rc == 0)
>> + return;
>>
>> Use API to copy the file instead of "system", better to use the secure
> copy.
>
Ah, it is a local copy; a simple copy API is enough.

>
>
>> + pg_log_info("could not copy, retrying after %d seconds",
>> + sleeptime);
>> + pg_usleep(numretries++ * sleeptime * 100L);
>> + }
>> Retry functionality is hanlded only for copying of full files, should
>> we handle retry for copying of partial files
>>
>> The log and the sleep time does not match, you are multiplying sleeptime
> with numretries++ and logging only "sleeptime"
>
> Why we are retiring here, capture proper copy error and act accordingly.
> Blindly retiring does not make sense.
>
> 3)
>> + maxretries = atoi(optarg);
>> + if (maxretries < 0)
>> + {
>> + pg_log_error("invalid value for maxretries");
>> + fprintf(stderr, _("%s: -r maxretries must be >= 0\n"), progname);
>> + exit(1);
>> + }
>> + break;
>> + case 's':
>> + sleeptime = atoi(optarg);
>> + if (sleeptime <= 0 || sleeptime > 60)
>> + {
>> + pg_log_error("invalid value for sleeptime");
>> + fprintf(stderr, _("%s: -s sleeptime must be between 1 and 60\n"),
>> progname);
>> + exit(1);
>> + }
>> + break;
>> we can have some range for maxretries similar to sleeptime
>>
>> 4)
>> + fp = fopen(filename, "r");
>> + if (fp == NULL)
>> + {
>> + pg_log_error("could not read file \"%s\": %m", filename);
>> + exit(1);
>> + }
>> +
>> + labelfile = malloc(statbuf.st_size + 1);
>> + if (fread(labelfile, 1, statbuf.st_size, fp) != statbuf.st_size)
>> + {
>> + pg_log_error("corrupted file \"%s\": %m", filename);
>> + free(labelfile);
>> + exit(1);
>> + }
>> Should we check for malloc failure
>>
>> Use pg_malloc instead of malloc
>
>
>> 5) Should we add display of progress as backup may take some time,
>> this can be added as enhancement. We can get other's opinion on this.
>>
>> Yes, we should, but this is not the right time to do that.
>
>
>> 6)
>> + if (nIncrDir == MAX_INCR_BK_COUNT)
>> + {
>> + pg_log_error("too many incremental backups to combine");
>> + fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
>> progname);
>> + exit(1);
>> + }
>> +
>> + IncrDirs[nIncrDir] = optarg;
>> + nIncrDir++;
>> + break;
>>
>> If the backup count increases providing the input may be difficult,
>> Shall user provide all the incremental backups from a parent folder
>> and can we handle the ordering of incremental backup internally
>>
>> Why we have that limit at first place?
>
>
>> 7)
>> + if (isPartialFile)
>> + {
>> + if (verbose)
>> + pg_log_info("combining partial file \"%s.partial\"", fn);
>> +
>> + combine_partial_files(fn, IncrDirs, nIncrDir, subdirpath, outfn);
>> + }
>> + else
>> + copy_whole_file(infn, outfn);
>>
>> Add verbose for copying whole file
>>
>> 8) We can also check if approximate space is available in disk before
>> starting combine backup, this can be added as enhancement. We can get
>> other's opinion on this.
>>
>> 9)
>> + printf(_("  -i, --incr-backup=DIRECTORY incremental backup directory
>> (maximum %d)\n"), MAX_INCR_BK_COUNT);
>> + printf(_("  -o, --output-dir=DIRECTORY  combine backup into
>> directory\n"));
>> + printf(_("\nGeneral options:\n"));
>> + printf(_("  -n, --no-clean  do not clean up after
>> errors\n"));
>>
>> Combine backup into directory can be combine backup directory
>>
>> 10)
>> +/* Max number of incremental backups to be combined. */
>> +#define MAX_INCR_BK_COUNT 10
>> +
>> +/* magic number in incremental backup's .partial file */
>>
>> MAX_INCR_BK_COUNT can be increased a little; some applications use 1
>> full backup at the beginning of the month and 30 incremental backups
>> for the rest of the days in the month
>>
>> Regards,
>> Vignesh
>> EnterpriseDB: http://www.enterprisedb.com
>>
>>
>>
>
> --
> Ibrar Ahmed
>


-- 
Ibrar Ahmed


Re: block-level incremental backup

2019-08-07 Thread Jeevan Chalke
On Mon, Aug 5, 2019 at 7:13 PM Robert Haas  wrote:

> On Fri, Aug 2, 2019 at 9:13 AM vignesh C  wrote:
> > + rc = system(copycmd);
>
> I don't think this patch should be calling system() in the first place.
>

So, do you mean we should just do fread() and fwrite() for the whole file?

I thought it was better to have it done by the OS itself instead of reading
1GB into memory and writing the same to the file.


> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: block-level incremental backup

2019-08-07 Thread Ibrar Ahmed
On Wed, Aug 7, 2019 at 2:47 PM Jeevan Chalke 
wrote:

>
>
> On Mon, Aug 5, 2019 at 7:13 PM Robert Haas  wrote:
>
>> On Fri, Aug 2, 2019 at 9:13 AM vignesh C  wrote:
>> > + rc = system(copycmd);
>>
>> I don't think this patch should be calling system() in the first place.
>>
>
> So, do you mean we should just do fread() and fwrite() for the whole file?
>
> I thought it is better if it was done by the OS itself instead of reading
> 1GB
> into the memory and writing the same to the file.
>
It is not necessary to read the whole 1GB into RAM.


>
>> --
>> Robert Haas
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>>
>
>
> --
> Jeevan Chalke
> Technical Architect, Product Development
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
>
>

-- 
Ibrar Ahmed


Re: block-level incremental backup

2019-08-08 Thread Jeevan Ladhe
Hi Jeevan,

I have reviewed the backup part at the code level and am still looking into
the restore (combine) part and the functional side of it. But here are my
comments so far:

The patches need rebase.

+   if (!XLogRecPtrIsInvalid(previous_lsn))
+   appendStringInfo(labelfile, "PREVIOUS WAL LOCATION: %X/%X\n",
+(uint32) (previous_lsn >> 32), (uint32)
previous_lsn);

Maybe we should rename it to something like:
"INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP START
LOCATION"
to make it more intuitive?



+typedef struct
+{
+   uint32  magic;
+   pg_crc32c   checksum;
+   uint32  nblocks;
+   uint32  blocknumbers[FLEXIBLE_ARRAY_MEMBER];
+} partial_file_header;


The file header structure is defined in both basebackup.c and
pg_combinebackup.c. I think it is better to move it to
replication/basebackup.h.
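
Just to illustrate, the shared definitions could then live together,
something like this (the header location is only the suggestion above, and
the comments are mine):

/* proposed: src/include/replication/basebackup.h */

/* magic number in incremental backup's .partial file */
#define INCREMENTAL_BACKUP_MAGIC    0x494E4352

/* header stored at the beginning of every .partial file */
typedef struct
{
    uint32      magic;          /* INCREMENTAL_BACKUP_MAGIC */
    pg_crc32c   checksum;       /* checksum, as computed by the patch */
    uint32      nblocks;        /* number of entries in blocknumbers[] */
    uint32      blocknumbers[FLEXIBLE_ARRAY_MEMBER];
} partial_file_header;

Both basebackup.c and pg_combinebackup.c would then simply include it
instead of repeating the definition.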



+   bool        isrelfile = false;

I think we can avoid having flag isrelfile in sendFile().
Something like this:

if (startincrptr && OidIsValid(dboid) && looks_like_rel_name(filename))
{
//include the code here that is under "if (isrelfile)" block.
}
else
{
_tarWriteHeader(tarfilename, NULL, statbuf, false);
while ((cnt = fread(buf, 1, Min(sizeof(buf), statbuf->st_size - len), fp))
> 0)
{
...
}
}



Also, having isrelfile as part of following condition:
{code}
+   while (!isrelfile &&
+  (cnt = fread(buf, 1, Min(sizeof(buf), statbuf->st_size - len),
fp)) > 0)
{code}

is confusing, because even the relation files in a full backup are going to
be backed up by this loop only, yet the condition still reads
'(!isrelfile && ...)'.



verify_page_checksum()
{
while(1)
{

break;
}
}

IMHO, while labels are not advisable in general, it may be better to use a
label here rather than a while(1) loop, so that we can jump to the label in
case we want to retry once. I think the while(1) opens the door to future
bugs: if someone happens to add code here and ends up adding some condition,
the break becomes conditional, and that will leave us in an infinite loop.
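
For reference, a sketch of the two shapes being compared
(page_checksum_ok(), reread_block() and the variables here are made up,
only to show the control flow):

/* while (1), relying on an unconditional break after the retry */
retried = false;
while (1)
{
    if (page_checksum_ok(page, blkno) || retried)
        break;                      /* verified, or already retried once */
    retried = true;
    reread_block(fd, blkno, page);  /* re-read and check one more time */
}

/* same logic with a label instead of the loop */
retried = false;
recheck:
if (!page_checksum_ok(page, blkno) && !retried)
{
    retried = true;
    reread_block(fd, blkno, page);
    goto recheck;
}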



+/* magic number in incremental backup's .partial file */
+#define INCREMENTAL_BACKUP_MAGIC   0x494E4352

Similar to the partial_file_header structure, I think the above macro can
also be moved to basebackup.h instead of being defined twice.



In sendFile():

+   buf = (char *) malloc(RELSEG_SIZE * BLCKSZ);

I think this is a huge memory request (1GB) and may fail on a busy/loaded
server at times. We should check for malloc failure, and maybe throw an
error on getting ENOMEM as errno.
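
Something along these lines, as a sketch of the usual backend pattern
(whether the full 1GB allocation should stay at all is a separate question):

    buf = (char *) malloc(RELSEG_SIZE * BLCKSZ);
    if (buf == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_OUT_OF_MEMORY),
                 errmsg("out of memory")));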



+   /* Perform incremenatl backup stuff here. */
+   if ((cnt = fread(buf, 1, Min(RELSEG_SIZE * BLCKSZ,
statbuf->st_size), fp)) > 0)
+   {

Here, should we not expect statbuf->st_size < (RELSEG_SIZE * BLCKSZ)? It
should be safe to always read just statbuf->st_size, I guess. But I am OK
with having this extra guard here.



In sendFile(), I am sorry if I am missing something, but I am not able to
understand why 'cnt' and 'i' should have different values when they are
being
passed to verify_page_checksum(). I think passing only one of them should be
sufficient.



+   XLogRecPtr  pglsn;
+
+   for (i = 0; i < cnt / BLCKSZ; i++)
+   {

Maybe we should just have a variable no_of_blocks to store the number of
blocks, rather than calculating this, say, RELSEG_SIZE (i.e. 131072) times
in the worst case.
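
i.e. something like (sketch only):

    int     nblocks = cnt / BLCKSZ; /* computed once, outside the loop */

    for (i = 0; i < nblocks; i++)
    {
        /* per-block LSN check as in the patch */
    }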


+   len += cnt;
+   throttle(cnt);
+   }

Sorry if I am missing something, but should it not be just:

len = cnt;



As I said in my previous email, we no longer need decode_lsn_internal(),
as that is already taken care of by the introduction of the function
pg_lsn_in_internal().

Regards,
Jeevan Ladhe


Re: block-level incremental backup

2019-08-09 Thread Robert Haas
On Wed, Aug 7, 2019 at 5:46 AM Jeevan Chalke
 wrote:
> So, do you mean we should just do fread() and fwrite() for the whole file?
>
> I thought it is better if it was done by the OS itself instead of reading 1GB
> into the memory and writing the same to the file.

Well, 'cp' is just a C program.  If they can write code to copy a
file, so can we, and then we're not dependent on 'cp' being installed,
working properly, being in the user's path or at the hard-coded
pathname we expect, etc.  There's an existing copy_file() function in
src/backend/storage/file/copydir.c which I'd probably look into
adapting for frontend use.  I'm not sure whether it would be important
to adapt the data-flushing code that's present in that routine or
whether we could get by with just the loop to read() and write() data.
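
To make that concrete, the core of such a frontend copy might be roughly
like this. A sketch only, not the adapted copy_file(); it assumes the usual
frontend helpers (pg_malloc(), pg_log_error(), pg_file_create_mode) and
leaves the data-flushing question aside:

#define COPY_BUF_SIZE   (128 * 1024)

static void
copy_file_frontend(const char *src, const char *dst)
{
    char       *buf = pg_malloc(COPY_BUF_SIZE);
    ssize_t     nread;
    int         srcfd;
    int         dstfd;

    srcfd = open(src, O_RDONLY | PG_BINARY, 0);
    if (srcfd < 0)
    {
        pg_log_error("could not open file \"%s\": %m", src);
        exit(1);
    }

    dstfd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
                 pg_file_create_mode);
    if (dstfd < 0)
    {
        pg_log_error("could not create file \"%s\": %m", dst);
        exit(1);
    }

    while ((nread = read(srcfd, buf, COPY_BUF_SIZE)) > 0)
    {
        if (write(dstfd, buf, nread) != nread)
        {
            pg_log_error("could not write to file \"%s\": %m", dst);
            exit(1);
        }
    }
    if (nread < 0)
    {
        pg_log_error("could not read file \"%s\": %m", src);
        exit(1);
    }

    close(srcfd);
    close(dstfd);
    pg_free(buf);
}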

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-08-09 Thread Robert Haas
On Thu, Aug 8, 2019 at 8:37 PM Jeevan Ladhe
 wrote:
> +   if (!XLogRecPtrIsInvalid(previous_lsn))
> +   appendStringInfo(labelfile, "PREVIOUS WAL LOCATION: %X/%X\n",
> +(uint32) (previous_lsn >> 32), (uint32) 
> previous_lsn);
>
> May be we should rename to something like:
> "INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP START 
> LOCATION"
> to make it more intuitive?

So, I think that you are right that PREVIOUS WAL LOCATION might not be
entirely clear, but at least in my view, INCREMENTAL BACKUP START WAL
LOCATION is definitely not clear.  This backup is an incremental
backup, and it has a start WAL location, so you'd end up with START
WAL LOCATION and INCREMENTAL BACKUP START WAL LOCATION and those sound
like they ought to both be the same thing, but they're not.  Perhaps
something like REFERENCE WAL LOCATION or REFERENCE WAL LOCATION FOR
INCREMENTAL BACKUP would be clearer.

> File header structure is defined in both the files basebackup.c and
> pg_combinebackup.c. I think it is better to move this to 
> replication/basebackup.h.

Or some other header, but yeah, definitely don't duplicate the struct
definition (or any other kind of definition).

> IMHO, while labels are not advisable in general, it may be better to use a 
> label
> here rather than a while(1) loop, so that we can move to the label in case we
> want to retry once. I think here it opens doors for future bugs if someone
> happens to add code here, ending up adding some condition and then the
> break becomes conditional. That will leave us in an infinite loop.

I'm not sure which style is better here, but I don't really buy this argument.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-08-09 Thread Jeevan Ladhe
Hi Robert,

On Fri, Aug 9, 2019 at 6:40 PM Robert Haas  wrote:

> On Thu, Aug 8, 2019 at 8:37 PM Jeevan Ladhe
>  wrote:
> > +   if (!XLogRecPtrIsInvalid(previous_lsn))
> > +   appendStringInfo(labelfile, "PREVIOUS WAL LOCATION: %X/%X\n",
> > +(uint32) (previous_lsn >> 32), (uint32)
> previous_lsn);
> >
> > May be we should rename to something like:
> > "INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP
> START LOCATION"
> > to make it more intuitive?
>
> So, I think that you are right that PREVIOUS WAL LOCATION might not be
> entirely clear, but at least in my view, INCREMENTAL BACKUP START WAL
> LOCATION is definitely not clear.  This backup is an incremental
> backup, and it has a start WAL location, so you'd end up with START
> WAL LOCATION and INCREMENTAL BACKUP START WAL LOCATION and those sound
> like they ought to both be the same thing, but they're not.  Perhaps
> something like REFERENCE WAL LOCATION or REFERENCE WAL LOCATION FOR
> INCREMENTAL BACKUP would be clearer.
>

Agree, how about INCREMENTAL BACKUP REFERENCE WAL LOCATION ?


> > File header structure is defined in both the files basebackup.c and
> > pg_combinebackup.c. I think it is better to move this to
> replication/basebackup.h.
>
> Or some other header, but yeah, definitely don't duplicate the struct
> definition (or any other kind of definition).
>

Thanks.


> > IMHO, while labels are not advisable in general, it may be better to use
> a label
> > here rather than a while(1) loop, so that we can move to the label in
> case we
> > want to retry once. I think here it opens doors for future bugs if
> someone
> > happens to add code here, ending up adding some condition and then the
> > break becomes conditional. That will leave us in an infinite loop.
>
> I'm not sure which style is better here, but I don't really buy this
> argument.


No issues. I am ok either way.

Regards,
Jeevan Ladhe


Re: block-level incremental backup

2019-08-12 Thread Jeevan Chalke
On Fri, Aug 9, 2019 at 6:36 PM Robert Haas  wrote:

> On Wed, Aug 7, 2019 at 5:46 AM Jeevan Chalke
>  wrote:
> > So, do you mean we should just do fread() and fwrite() for the whole
> file?
> >
> > I thought it is better if it was done by the OS itself instead of
> reading 1GB
> > into the memory and writing the same to the file.
>
> Well, 'cp' is just a C program.  If they can write code to copy a
> file, so can we, and then we're not dependent on 'cp' being installed,
> working properly, being in the user's path or at the hard-coded
> pathname we expect, etc.  There's an existing copy_file() function in
> src/backed/storage/file/copydir.c which I'd probably look into
> adapting for frontend use.  I'm not sure whether it would be important
> to adapt the data-flushing code that's present in that routine or
> whether we could get by with just the loop to read() and write() data.
>

I agree that we can certainly use open(), read(), write(), and close() here,
but given that pg_basebackup.c and basebackup.c are using file operations, I
think using fopen(), fread(), fwrite(), and fclose() will be better here, at
least for consistency.

Let me know if we still want to go with native OS calls.
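
For what it's worth, the stdio-based inner loop would be roughly as below;
a sketch only, with a fixed-size buffer rather than reading a whole 1GB
segment at once, and srcfp, dstfp and the filenames as placeholders:

    char        buf[64 * 1024];
    size_t      nread;

    while ((nread = fread(buf, 1, sizeof(buf), srcfp)) > 0)
    {
        if (fwrite(buf, 1, nread, dstfp) != nread)
        {
            pg_log_error("could not write to file \"%s\": %m", dstfn);
            exit(1);
        }
    }
    if (ferror(srcfp))
    {
        pg_log_error("could not read file \"%s\": %m", srcfn);
        exit(1);
    }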


>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: block-level incremental backup

2019-08-12 Thread Jeevan Chalke
On Fri, Aug 9, 2019 at 11:56 PM Jeevan Ladhe 
wrote:

> Hi Robert,
>
> On Fri, Aug 9, 2019 at 6:40 PM Robert Haas  wrote:
>
>> On Thu, Aug 8, 2019 at 8:37 PM Jeevan Ladhe
>>  wrote:
>> > +   if (!XLogRecPtrIsInvalid(previous_lsn))
>> > +   appendStringInfo(labelfile, "PREVIOUS WAL LOCATION:
>> %X/%X\n",
>> > +(uint32) (previous_lsn >> 32), (uint32)
>> previous_lsn);
>> >
>> > May be we should rename to something like:
>> > "INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP
>> START LOCATION"
>> > to make it more intuitive?
>>
>> So, I think that you are right that PREVIOUS WAL LOCATION might not be
>> entirely clear, but at least in my view, INCREMENTAL BACKUP START WAL
>> LOCATION is definitely not clear.  This backup is an incremental
>> backup, and it has a start WAL location, so you'd end up with START
>> WAL LOCATION and INCREMENTAL BACKUP START WAL LOCATION and those sound
>> like they ought to both be the same thing, but they're not.  Perhaps
>> something like REFERENCE WAL LOCATION or REFERENCE WAL LOCATION FOR
>> INCREMENTAL BACKUP would be clearer.
>>
>
> Agree, how about INCREMENTAL BACKUP REFERENCE WAL LOCATION ?
>

+1 for INCREMENTAL BACKUP REFERENCE WAL LOCATION.


>

-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

