On 10.04.2019 19:51, Robert Haas wrote:
On Wed, Apr 10, 2019 at 10:22 AM Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:
Some time ago I implemented an alternative version of the ptrack utility
(not the one used in pg_probackup)
which detects updated blocks at the file level. It is very simple and maybe
it can someday be integrated into master.
I don't think this is completely crash-safe.  It looks like it
arranges to msync() the ptrack file at appropriate times (although I
haven't exhaustively verified the logic), but it uses MS_ASYNC, so
it's possible that the ptrack file could get updated on disk either
before or after the relation file itself.  I think before is probably
OK -- it just risks having some blocks look modified when they aren't
really -- but after seems like it is very much not OK.  And changing
this to use MS_SYNC would probably be really expensive.  Likely a
better approach would be to hook into the new fsync queue machinery
that Thomas Munro added to PostgreSQL 12.

I do not think that MS_SYNC or the fsync queue is needed here.
If a power failure or OS crash causes some writes to the ptrack map to be lost, Postgres will in any case perform recovery, and replaying page updates from WAL will mark those pages in the ptrack map once again. So, as with CLOG and many other Postgres files, it is not critical to lose some writes, because they will be restored from WAL. And before truncating WAL, Postgres performs a checkpoint which flushes all changes to disk, including the ptrack map updates.
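To make the argument concrete, here is a minimal, purely illustrative sketch of the scheme (the names and layout are hypothetical, not the code from the patch): the map is a file-backed, mmap-ed array of LSNs, and the marking call sits on the file-level write path that WAL replay also goes through, so a map update lost in a crash is simply repeated during recovery.

/*
 * Illustrative sketch only (hypothetical names, not the actual patch):
 * the ptrack map is a file-backed, mmap-ed array of LSNs indexed by a
 * hash of the block address.  Crash safety does not require MS_SYNC:
 * WAL replay re-marks every page it restores, so a map update lost in a
 * crash is redone during recovery, and a checkpoint flushes everything
 * before WAL is truncated.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define PTRACK_MAP_SIZE 1000003

typedef uint64_t XLogRecPtr;

static XLogRecPtr *ptrack_map;          /* mmap-ed over the ptrack map file */

static void
ptrack_init(const char *path)
{
    size_t  size = PTRACK_MAP_SIZE * sizeof(XLogRecPtr);
    int     fd = open(path, O_RDWR | O_CREAT, 0600);

    if (fd < 0 || ftruncate(fd, size) < 0)
        abort();                        /* real code would report the error */

    ptrack_map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ptrack_map == MAP_FAILED)
        abort();
    close(fd);
}

static inline size_t
ptrack_hash(uint32_t relfilenode, uint32_t blockno)
{
    return ((uint64_t) relfilenode * 92821 + blockno) % PTRACK_MAP_SIZE;
}

/* Called from the file-level write path; WAL replay goes through the same
 * path when it restores pages, which is what makes lost updates harmless. */
static void
ptrack_mark_block(uint32_t relfilenode, uint32_t blockno, XLogRecPtr page_lsn)
{
    size_t  slot = ptrack_hash(relfilenode, blockno);

    if (ptrack_map[slot] < page_lsn)
        ptrack_map[slot] = page_lsn;
}

/* Scheduled at appropriate times (e.g. around checkpoint); MS_ASYNC is
 * enough, for the reasons given above. */
static void
ptrack_sync(void)
{
    msync(ptrack_map, PTRACK_MAP_SIZE * sizeof(XLogRecPtr), MS_ASYNC);
}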


It looks like your system maps all the blocks in the system into a
fixed-size map using hashing.  If the number of modified blocks
between the full backup and the incremental backup is large compared
to the size of the ptrack map, you'll start to get a lot of
false-positives.  It will look as if much of the database needs to be
backed up.  For example, in your sample configuration, you have
ptrack_map_size = 1000003. If you've got a 100GB database with 20%
daily turnover, that's about 2.6 million blocks.  If you bump a
random entry ~2.6 million times in a map with 1000003 entries, on the
average ~92% of the entries end up getting bumped, so you will get
very little benefit from incremental backup.  This problem drops off
pretty fast if you raise the size of the map, but it's pretty critical
that your map is large enough for the database you've got, or you may
as well not bother.
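(As a rough check of that figure: after ~2.6 million random bumps, the chance
that a particular slot is never hit is about (1 - 1/1000003)^2600000 ≈ e^-2.6
≈ 0.074, so roughly 93% of the slots end up bumped.)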
This is why the ptrack block size should be larger than the page size.
Assume that it is 1 MB. 1 MB is often considered an optimal unit of disk I/O, at which frequent seeks no longer dominate read speed (this matters most for HDDs). In other words, reading 10 random pages (20%) from this 1 MB block takes almost the same amount of time as (or even longer than) reading the whole 1 MB in one operation.
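As a back-of-envelope illustration (rough, assumed numbers, not measurements): with ~8 ms per random seek and ~100 MB/s sequential throughput on an HDD, 10 random 8 kB reads cost roughly 10 × 8 ms = 80 ms, while reading the whole 1 MB sequentially costs about one seek plus ~10 ms of transfer, i.e. under 20 ms.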

There will then be just 100,000 used entries in the ptrack map, with a very small probability of collision. Actually, I chose this size (1000003) for the ptrack map because with a 1 MB block size it allows a 1 TB database to be mapped without a noticeable number of collisions, which seems to be enough for most Postgres installations. But increasing the ptrack map size 10 or even 100 times should not cause problems either, given modern RAM sizes.
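To spell out the sizing (with one assumption: roughly 8 bytes per map entry, e.g. an LSN): a 1 TB database at 1 MB granularity contains about 10^6 trackable blocks, which is why ~1000003 slots roughly covers it, and the map itself is then only ~8 MB, so growing it 10 or 100 times means 80-800 MB, still modest on modern servers.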


It also appears that your system can't really handle resizing of the
map in any friendly way.  So if your data size grows, you may be faced
with either letting the map become progressively less effective, or
throwing it out and losing all the data you have.

None of that is to say that what you're presenting here has no value,
but I think it's possible to do better (and I think we should try).

Definitely, I did not consider the proposed patch a perfect solution, and it certainly requires improvements (and maybe a complete redesign). I just wanted to present this approach: maintaining a hash of block LSNs in mapped memory and keeping track of modified blocks at the file level (unlike the current ptrack implementation, which logs changes in all the places in the Postgres code where data is updated).

Also, despite the fact that this patch may be considered a raw prototype, I have spent some time thinking about all aspects of this approach, including fault tolerance and false positives.


