In the past, we have had multiple discussions about adding full file checksums (e.g., md5, SHA-1) and/or path names to pool files to allow for integrity checking and reverse file look-up from the pc directory.
On the other hand, I know some people are not interested in that feature or its overhead. So I would like to suggest the following compromise for discussion and improvement:

1. Add three new "first byte" character types corresponding 1-1 to the three existing ones (0x78, 0xd6, 0xd7). You may only need two, since it seems that 0xd6 is obsolete(?).

2. For pool files with the new first-byte characters, extend the envelope footer at the end of the file to include space for the checksum (128 bits if md5sum) and for the pool file name (32 hex chars, i.e. 128 bits when packed, plus say another 32 bits to code the chain number - 4 billion chain collisions should leave enough room - famous last words). The total would be 288 bits if this scheme is used.

3. Modify the handful of routines in FileZIO.pm (and maybe also RsyncDigest.pm) that do raw reads/writes of pool files so that they recognize the new first-byte flags.

4. Create access routines that can read/write the new footer information.

5. Modify BackupPC_nightly to update the pool path in the footer whenever a file with one of the new first-byte types is renumbered within its chain (this should not be intensive since chain renumbering is relatively rare).

6. Either write the footer information as new pool files are created, by modifying the relevant routines (again only a couple), and/or create a separate routine that recurses through the pool directories and adds the new footer information in batch.

7. Create Config variables to allow the user to turn writing and tracking of the new footer information on or off. Checksums and pool paths could be switched separately for those worried about the overhead of the checksum (adding the pool path has trivial overhead). (Note that a zero checksum or a zero pool path would signal that the information is not available.)

8. More generally, though not strictly necessary, it may be good to design the footers corresponding to these new first bytes to be extensible, so that other information can be added in the future if ever desired (e.g., other checksums, file-level encryption keys, etc.). This would require some forethought and would add a little overhead in the storage and access routines.

I believe that this proposal has several advantages:

A. Users not interested in this functionality wouldn't be affected. They wouldn't turn it on, so none of their pool files would have the new first-byte flags. In particular, there would be *no* change to their pool and no added backup overhead (even the tests for the new first bytes would come after the existing ones).

B. Changes are pretty small, limited in extent, and easy to code. I am happy to help but would prefer to leave #3 to someone who knows the code best, to make sure all routines are patched. Also, I don't want to start patching basic routines unless there is consensus that this can be merged into the tree, since I don't want to create a fork. Discussion would also be helpful to make sure we have a robust and potentially extensible design.

C. Presence of path names greatly facilitates pool backup. Backups would now happen as follows:
   - Prevent BackupPC_nightly from running...
   - Rsync the pool (without hard links).
   - For the pc directory, just rsync (or otherwise copy) the directory structure and copy over the files with only 1 link (almost exclusively zero-length files anyway).
   - Run a simple Perl routine that recurses through the pc directory. For each non-directory file with >1 link (finding these is *very* fast using perl find), use the file itself to read its pool path name from the footer and print out a two-column link list of the file's pc path name and its pool path (this is a very simple routine to code).
   - On the new backup directory, run a simple shell or Perl script that reads the link list and creates the links.
   The total process would be about as fast as just doing an rsync without hard links on $Topdir, and there would be no scaling issues due to hard links.

D. Presence of checksums allows for file integrity checking, either as needed or on a regular basis. Of course, I know that the rsync(d) method includes md4sums, but that is limited to rsync(d). Also, those checksums are only inserted on the second backup. Finally, newer rsync versions use md5sums, so rsync's md4sums will (hopefully) soon be obsolete.

E. Pool entries with or without the added footer could co-exist in a single pool - just that the information wouldn't be available for use if the new first bytes aren't present. The look-up routines would just return an error code signalling "not available". Also, existing pool entries could be converted to the new format at any time without affecting pool integrity or touching the pc hierarchy.

_______________________________________________
BackupPC-users mailing list
BackupPC-users@lists.sourceforge.net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/
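To make the 288-bit footer from item 2 a bit more concrete, here is a rough sketch of what the read/write access routines from item 4 might look like. It is written in Python purely for illustration (the real routines would live in the Perl code), and everything in it is an assumption on my part: the field order, the big-endian byte order, and the names FOOTER, pack_footer and unpack_footer are all hypothetical. It also says nothing about where the footer sits inside the compressed file or which first-byte values would flag it. An all-zero field stands for "not available", as in item 7.

```python
import hashlib
import struct

# Hypothetical layout: 16-byte MD5 | 16-byte packed pool digest | 4-byte chain number.
# ">" selects big-endian with no padding, so the record is exactly 36 bytes (288 bits).
FOOTER = struct.Struct(">16s16sI")
ZERO16 = bytes(16)


def pack_footer(md5_digest=None, pool_hex=None, chain=0):
    """Build the 36-byte footer; a field that is not supplied is stored as zeros."""
    md5_raw = md5_digest if md5_digest is not None else ZERO16
    pool_raw = bytes.fromhex(pool_hex) if pool_hex is not None else ZERO16
    if len(md5_raw) != 16 or len(pool_raw) != 16:
        raise ValueError("checksum and pool digest must be 16 bytes each")
    return FOOTER.pack(md5_raw, pool_raw, chain)


def unpack_footer(raw):
    """Return (md5_digest, pool_hex, chain); all-zero fields come back as None."""
    md5_raw, pool_raw, chain = FOOTER.unpack(raw)
    md5_digest = None if md5_raw == ZERO16 else md5_raw
    if pool_raw == ZERO16:
        # No pool path recorded, so the chain number carries no meaning either.
        return md5_digest, None, None
    return md5_digest, pool_raw.hex(), chain


if __name__ == "__main__":
    data = b"example file contents"
    footer = pack_footer(hashlib.md5(data).digest(),
                         "0123456789abcdef0123456789abcdef", chain=2)
    print(len(footer) * 8)            # 288
    print(unpack_footer(footer)[1:])  # ('0123456789abcdef0123456789abcdef', 2)
```

A fixed-size record like this is the simplest thing that matches the numbers above. The extensible design mentioned in item 8 would presumably need at least a version or length field in front of it, which this sketch leaves out.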