Craig Barratt wrote at about 00:32:15 -0800 on Wednesday, March 2, 2011:

> In 3.x hardlinks are used for reference counting of pool files.
>
> In 4.x hardlinks are not used. Reference counting is done using simple
> flat file databases. For every file (digest) in the pool a reference
> count is maintained. This count is the number of backup instances that
> refer to (ie: use) this file. The reference counts need to be updated as
> backups are completed and as backups are removed. The sole purpose of
> the reference counts is to determine when a pool file is no longer used
> by any backup and can therefore be removed.
>
> The main risk of using application-level reference counting, rather
> than file-system reference counting with hardlinks, is the risk of
> inconsistency: bugs, abnormally terminated backup processes, race
> conditions, or a file system error could cause the entire reference
> count database to be corrupted.
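To make sure I follow the lifecycle described above, here is a minimal Python sketch (the real implementation is in Perl; all names here are mine, purely illustrative):

```python
# Hypothetical sketch of the pool reference-count lifecycle: completed
# backups increment counts, deleted backups decrement them, and a count
# of 0 marks a pool file as removable.

def finish_backup(counts, digests_used):
    """A completed backup increments the count of every pool file it uses."""
    for d in digests_used:
        counts[d] = counts.get(d, 0) + 1

def delete_backup(counts, digests_used):
    """Deleting a backup decrements the counts of the pool files it used."""
    for d in digests_used:
        counts[d] -= 1

def expirable(counts):
    """Pool files no longer used by any backup (count 0) can be removed."""
    return [d for d, n in counts.items() if n == 0]

counts = {}
finish_backup(counts, ["aa11", "bb22"])
finish_backup(counts, ["aa11"])
delete_backup(counts, ["aa11", "bb22"])
print(expirable(counts))   # -> ['bb22']; 'aa11' is still used by one backup
```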
One possible extension that might add some robustness would be to allow
the N most recent reference databases to be saved, and to record which
files were added/deleted between reference database versions. Presumably
this "delta" already exists implicitly, since it is what is used to update
the current reference counts on each run of BackupPC_nightly. Then, if the
current database became corrupted, the current state could be recreated
from an earlier version merged with the deltas.

> However, the benefits significantly outweigh the drawbacks:
>
> - eliminating hardlinks means the backup storage is much easier
>   to replicate, copy or restore.

TOTALLY AWESOME. This should reduce the traffic on the BackupPC mailing
list by about 25% just by eliminating this FAQ and complaint.

> - determining which pool files can be deleted is much more
>   efficient, since only the reference count database needs
>   to be searched for reference counts of 0. It is no longer
>   necessary to stat() every file in the pool, which is very
>   time consuming on large pools.

Love these improvements in performance.

> It is not necessary to update the reference counts in real time, so
> the implementation is a lot simpler and more efficient. In fact,
> the reference count updating is done as part of the BackupPC_nightly
> process.
>
> The reference count database is stored in 128 different files,
> based on the first byte of the digest ANDed with 0xfe. Therefore
> the file:
>
>     CPOOL_DIR/4e/poolCnt
>
> stores all the reference counts for digests that start with 0x4e or
> 0x4f. The file itself is the result of using Storable::store() on a
> hash whose key is the digest and value is the reference count. This
> is a compact format for storing the perl data structure. The entire
> file is read or written in a batch-like manner - it is not intended
> for dynamic updates of individual entries.

Why use only 7 bits (ANDing with 0xfe) rather than 8 bits (ANDing with
0xff)?
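For concreteness, the bucketing I'm asking about would look something like this (a Python sketch only - the real code is Perl and the files are Storable hashes; the directory layout follows the CPOOL_DIR/4e/poolCnt example above):

```python
import os

def pool_cnt_path(cpool_dir, digest_hex):
    """Return the poolCnt file holding the refcount for this digest.

    The first byte of the digest is ANDed with 0xfe, giving 128 buckets,
    so digests beginning 0x4e and 0x4f both map to CPOOL_DIR/4e/poolCnt.
    """
    first_byte = int(digest_hex[:2], 16)
    return os.path.join(cpool_dir, "%02x" % (first_byte & 0xfe), "poolCnt")

print(pool_cnt_path("/data/cpool", "4e17ab"))  # -> /data/cpool/4e/poolCnt
print(pool_cnt_path("/data/cpool", "4f99cd"))  # -> /data/cpool/4e/poolCnt
```

With 0xff instead of 0xfe, each first byte would get its own bucket (256 files), which is the basis of my 7-bit vs. 8-bit question.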
> When backups are done or backups are deleted, a file is created
> that records the changes in reference counts. For example, if a
> backup is being done, and a new file is matched to an existing
> pool file, then the reference count for that pool file needs to be
> incremented. Similarly, if a backup is deleted so that a given
> pool file is no longer referenced, then that reference count needs
> to be decremented. Remember that backups are stored as reverse-time
> deltas in 4.x, so there are a few subtle issues about how reference
> counts change. For example, if a file was present in the prior backup,
> but has been removed prior to the current backup, then the reference
> count doesn't change - the file is simply moved from the current backup
> to the prior backup.

Are you counting the number of times the file appears in a backup (filled
or unfilled), or the number of times the file appears in an attrib file?
It seems to me that the second notion might be easier to deal with, since
then there is a clear 1-1 correspondence between the count and the number
of times the file is directly referenced in the pc tree.

> Those "pool reference delta" files are stored in each PC's backup
> directory, and also in the trash directory. There could be many of
> these as backups are done and others are deleted. Their name has
> the form "tpoolCntDelta.PID.NNN" as they are being written, where
> PID is the process ID of the writing process, and NNN is a number
> to ensure the file is unique. Once the file is closed, it is renamed
> to "poolCntDelta.PID.NNN". Each PC directory and the trash directory
> could have several or many of these files.
>
> The script bin/BackupPC_refCountUpdate reads all the poolCntDelta*
> files in the PC and trash directories, and updates the poolCnt
> files below CPOOL_DIR and POOL_DIR. If it encounters any errors
> it does its best to restore all the files to their original
> form.

Is this memory-intensive? I.e.,
does efficiency require large poolCntDelta files and potentially multiple
poolCnt files to be held in memory, or are they sorted in a way that
allows the files to be processed chunk-by-chunk? I ask because I have been
able to run BackupPC on ARM-based NASes with as little as 64MB of RAM, and
I am wondering whether 4.x offers speedups in terms of disk accesses at
the expense of more in-RAM storage.

> A script bin/BackupPC_fsck can be used to verify the reference
> counts and/or to fix them. BackupPC cannot be running when
> BackupPC_fsck is used.

Does this script essentially have to go through all the attrib files
across all shares/backups/hosts and tabulate the counts?
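If I understand the description correctly, the merge step amounts to something like this (a Python sketch with made-up in-memory structures - the real poolCnt/poolCntDelta files are Perl Storable hashes):

```python
def apply_deltas(counts, delta_files):
    """Merge poolCntDelta-style records into one poolCnt bucket.

    counts: dict digest -> current refcount (one poolCnt bucket)
    delta_files: iterable of dicts, each mapping digest -> +/- change,
    as accumulated by completed and deleted backups.
    """
    for delta in delta_files:
        for digest, change in delta.items():
            counts[digest] = counts.get(digest, 0) + change
    return counts

counts = {"4e17": 2, "4f99": 1}
deltas = [{"4e17": 1}, {"4f99": -1, "4eab": 1}]
print(apply_deltas(counts, deltas))
# -> {'4e17': 3, '4f99': 0, '4eab': 1}; '4f99' is now a removal candidate
```

Since each poolCnt file is read and written whole, I would guess peak memory scales with the largest bucket plus the pending deltas rather than the whole pool - but that's exactly what I'd like confirmed for the 64MB case.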
_______________________________________________
BackupPC-devel mailing list
[email protected]
List: https://lists.sourceforge.net/lists/listinfo/backuppc-devel
Wiki: http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/
