Hello community, here is the log from the commit of package duperemove for openSUSE:Factory checked in at 2014-12-09 09:14:30 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Comparing /work/SRC/openSUSE:Factory/duperemove (Old) and /work/SRC/openSUSE:Factory/.duperemove.new (New) ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "duperemove" Changes: -------- --- /work/SRC/openSUSE:Factory/duperemove/duperemove.changes 2014-11-19 20:30:29.000000000 +0100 +++ /work/SRC/openSUSE:Factory/.duperemove.new/duperemove.changes 2014-12-09 09:14:07.000000000 +0100 @@ -1,0 +2,13 @@ +Tue Dec 9 04:12:43 UTC 2014 - mfas...@suse.com + +- Update to duperemove v0.09.beta5 + - Documentation updates + - FAQ and README are more relevant now + - added man pages for show-shared-extents and hashstats programs + - updated duperemove man page, and duperemove usage() function + - Have show-shared-extents take a file list as arguments. + - Change default of --lookup-extents option back to 'no' + - Write hash type into hashfile header, check against what hash we were + compiled with. + +------------------------------------------------------------------- Old: ---- duperemove-v0.09.beta3.tar.gz New: ---- duperemove-v0.09.beta5.tar.gz ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Other differences: ------------------ ++++++ duperemove.spec ++++++ --- /var/tmp/diff_new_pack.s8n353/_old 2014-12-09 09:14:09.000000000 +0100 +++ /var/tmp/diff_new_pack.s8n353/_new 2014-12-09 09:14:09.000000000 +0100 @@ -17,13 +17,13 @@ %define modname duperemove -%define tar_version v0.09.beta3 +%define tar_version v0.09.beta5 Name: duperemove BuildRequires: gcc-c++ BuildRequires: glib2-devel BuildRequires: libgcrypt-devel -Version: 0.09~beta3 +Version: 0.09~beta5 Release: 0 Summary: Software to find duplicate extents in files and remove them License: GPL-2.0 @@ -50,8 +50,6 @@ %build make CFLAGS="%optflags" -make hashstats CFLAGS="%optflags" -make btrfs-extent-same CFLAGS="%optflags" %install mkdir -p %{buildroot}/%{_sbindir} @@ -61,6 +59,8 @@ cp %{_builddir}/%{modname}-%{tar_version}/%{samename} %{buildroot}/%{_sbindir} mkdir -p %{buildroot}%{_mandir}/man8 cp %{_builddir}/%{modname}-%{tar_version}/%{modname}.8 %{buildroot}/%{_mandir}/man8/ +cp %{_builddir}/%{modname}-%{tar_version}/hashstats.8 
%{buildroot}/%{_mandir}/man8/ +cp %{_builddir}/%{modname}-%{tar_version}/show-shared-extents.8 %{buildroot}/%{_mandir}/man8/ cp %{_builddir}/%{modname}-%{tar_version}/%{samename}.8 %{buildroot}/%{_mandir}/man8/ %files -n btrfs-extent-same @@ -70,10 +70,12 @@ %files %defattr(-, root, root) -%doc LICENSE README +%doc LICENSE README.md FAQ.md %{_sbindir}/duperemove %{_sbindir}/hashstats %{_sbindir}/show-shared-extents %{_mandir}/man?/%{modname}.8.gz +%{_mandir}/man?/hashstats.8.gz +%{_mandir}/man?/show-shared-extents.8.gz %changelog ++++++ duperemove-v0.09.beta3.tar.gz -> duperemove-v0.09.beta5.tar.gz ++++++ diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/duperemove-v0.09.beta3/FAQ.md new/duperemove-v0.09.beta5/FAQ.md --- old/duperemove-v0.09.beta3/FAQ.md 1970-01-01 01:00:00.000000000 +0100 +++ new/duperemove-v0.09.beta5/FAQ.md 2014-12-09 05:05:30.000000000 +0100 @@ -0,0 +1,53 @@ +# Duperemove: Frequently Asked Questions + +### Is there an upper limit to the amount of data duperemove can process? + +v0.08 of duperemove has been tested on small numbers of VMs or iso +files (5-10); it can probably scale up to 50 or so. + +v0.09 is much faster at hashing and cataloging extents and therefore +can handle a larger data set. My own testing is typically with a +filesystem of about 750 gigabytes and millions of files. + + +### Why does it not print out all duplicate extents? + +Internally duperemove is classifying extents based on various criteria +like length, number of identical extents, etc. The printout we give is +based on the results of that classification. + + +### How can I find out my space savings after a dedupe? + +Duperemove will print out an estimate of the saved space after a +dedupe operation for you. You can also do a df before the dedupe +operation, then a df about 60 seconds after the operation. 
It is +common for btrfs space reporting to be 'behind' while delayed updates +get processed, so an immediate df after deduping might not show any +savings. + + +### Why is the total deduped data report an estimate? + +At the moment duperemove can detect that some underlying extents are +shared with other files, but it can not resolve which files those +extents are shared with. + +Imagine duperemove is examining a series of files and it notes a shared +data region in one of them. That data could be shared with a file +outside of the series. Since duperemove can't resolve that information +it will account the shared data against our dedupe operation while in +reality, the kernel might deduplicate it further for us. + + +### Why are my files showing dedupe but my disk space is not shrinking? + +This is a little complicated, but it comes down to a feature in Btrfs +called _bookending_. The Btrfs wiki explains this in [detail] +(http://en.wikipedia.org/wiki/Btrfs#Extents). + +Essentially though, the underlying representation of an extent in +Btrfs can not be split (with small exception). So sometimes we can end +up in a situation where a file extent gets partially deduped (and the +extents marked as shared) but the underlying extent item is not freed +or truncated. 
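The "df before, df after" savings check described in the FAQ diff above can be sketched as a short script. This is an illustrative sketch, not part of the package: the duperemove invocation is commented out so the arithmetic can be demonstrated against any filesystem, and the real run would target your own btrfs mount point.

```shell
# Sketch of the FAQ's df-based savings estimate.
# In real use, set mnt to your btrfs mount point.
mnt=/
before=$(df -k --output=used "$mnt" | tail -n 1 | tr -d ' ')
# duperemove -dhr "$mnt/data"   # the actual dedupe run would go here
sleep 1                          # in practice, wait ~60s for btrfs delayed updates
after=$(df -k --output=used "$mnt" | tail -n 1 | tr -d ' ')
echo "approximate KiB freed: $((before - after))"
```

Without a real dedupe in between, the two df readings will be nearly identical; the point is only that an immediate reading can lag, so the second df should come well after the operation.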
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/duperemove-v0.09.beta3/Makefile new/duperemove-v0.09.beta5/Makefile --- old/duperemove-v0.09.beta3/Makefile 2014-11-17 20:07:48.000000000 +0100 +++ new/duperemove-v0.09.beta5/Makefile 2014-12-09 05:05:30.000000000 +0100 @@ -1,9 +1,9 @@ -RELEASE=v0.09.beta3 +RELEASE=v0.09.beta5 CC = gcc CFLAGS = -Wall -ggdb -MANPAGES=duperemove.8 btrfs-extent-same.8 +MANPAGES=duperemove.8 btrfs-extent-same.8 hashstats.8 show-shared-extents.8 CFILES=duperemove.c hash-tree.c results-tree.c rbtree.c dedupe.c filerec.c \ btrfs-util.c util.c serialize.c memstats.c @@ -16,8 +16,8 @@ HEADERS=csum.h hash-tree.h results-tree.h kernel.h list.h rbtree.h dedupe.h \ btrfs-ioctl.h filerec.h btrfs-util.h debug.h util.h serialize.h \ memstats.h -DIST_SOURCES:=$(DIST_CFILES) $(HEADERS) LICENSE Makefile rbtree.txt README \ - TODO $(MANPAGES) SubmittingPatches +DIST_SOURCES:=$(DIST_CFILES) $(HEADERS) LICENSE Makefile rbtree.txt README.md \ + TODO $(MANPAGES) SubmittingPatches FAQ.md DIST=duperemove-$(RELEASE) DIST_TARBALL=$(DIST).tar.gz TEMP_INSTALL_DIR:=$(shell mktemp -du -p .) diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/duperemove-v0.09.beta3/README new/duperemove-v0.09.beta5/README --- old/duperemove-v0.09.beta3/README 2014-11-17 20:07:48.000000000 +0100 +++ new/duperemove-v0.09.beta5/README 1970-01-01 01:00:00.000000000 +0100 @@ -1,95 +0,0 @@ -Duperemove - -Duperemove is a simple tool for finding duplicated extents and -submitting them for deduplication. When given a list of files it will -hash their contents on a block by block basis and compare those hashes -to each other, finding and categorizing extents that match each -other. When given the -d option, duperemove will submit those -extents for deduplication using the btrfs-extent-same ioctl. - -Duperemove has two major modes of operation one of which is a subset -of the other. 
- - -Readonly / Non-deduplicating Mode - -When run without -d (the default) duperemove will print out one or -more tables of matching extents it has determined would be ideal -candidates for deduplication. As a result, readonly mode is useful for -seeing what duperemove might do when run with '-d'. The output could -also be used by some other software to submit the extents for -deduplication at a later time. - -It is important to note that this mode will not print out *all* -instances of matching extents, just those it would consider for -deduplication. - -Generally, duperemove does not concern itself with the underlying -representation of the extents it processes. Some of them could be -compressed, undergoing I/O, or even have already been deduplicated. In -dedupe mode, the kernel handles those details and therefore we try not -to replicate that work. - - -Deduping Mode - -This functions similarly to readonly mode with the exception that the -duplicated extents found in our "read, hash, and compare" step will -actually be submitted for deduplication. An estimate of the total data -deduplicated will be printed after the operation is complete. This -estimate is calculated by comparing the total amount of shared bytes -in each file before and after the dedupe. - - -See the duperemove man page for further details about running duperemove. - - -REQUIREMENTS - -Kernel: Duperemove needs a kernel version equal to or greater than 2.6.33. - -Libraries: Duperemove uses libgcrypt for hashing. - - -FAQ - -* Is there an upper limit to the amount of data duperemove can process? - -Right now duperemove has been tested on small numbers of VMS or iso -files (5-10). I don't believe there should be a major problem scaling -that up to 50 or so. - - -* Why does it not print out all duplicate extents? - -Internally duperemove is classifying extents based on various criteria -like length, number of identical extents, etc. 
The printout we give is -based on the results of that classification. - - -* How can I find out my space savings after a dedupe? - -Duperemove will print out an estimate of the saved space after a -dedupe operation for you. You can also do a df before the dedupe -operation, then a df about 60 seconds after the operation. It is -common for btrfs space reporting to be 'behind' while delayed updates -get processed, so an immediate df after deduping might not show any -savings. - - -* Why is the total deduped data report an estimate? - -At the moment duperemove can detect that some underlying extents are -shared with other files, but it can not resolve which files those -extents are shared with. - -Imagine duperemove is examing a series of files and it notes a shared -data region in one of them. That data could be shared with a file -outside of the series. Since duperemove can't resolve that information -it will account the shared data against our dedupe operation while in -reality, the kernel might deduplicate it further for us. - - -USAGE EXAMPLES - -TODO diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/duperemove-v0.09.beta3/README.md new/duperemove-v0.09.beta5/README.md --- old/duperemove-v0.09.beta3/README.md 1970-01-01 01:00:00.000000000 +0100 +++ new/duperemove-v0.09.beta5/README.md 2014-12-09 05:05:30.000000000 +0100 @@ -0,0 +1,121 @@ +# Duperemove + +Duperemove is a simple tool for finding duplicated extents and +submitting them for deduplication. When given a list of files it will +hash their contents on a block by block basis and compare those hashes +to each other, finding and categorizing extents that match each +other. When given the -d option, duperemove will submit those +extents for deduplication using the btrfs-extent-same ioctl. + +Duperemove has two major modes of operation one of which is a subset +of the other. 
+ + +## Readonly / Non-deduplicating Mode + +When run without -d (the default) duperemove will print out one or +more tables of matching extents it has determined would be ideal +candidates for deduplication. As a result, readonly mode is useful for +seeing what duperemove might do when run with '-d'. The output could +also be used by some other software to submit the extents for +deduplication at a later time. + +It is important to note that this mode will not print out *all* +instances of matching extents, just those it would consider for +deduplication. + +Generally, duperemove does not concern itself with the underlying +representation of the extents it processes. Some of them could be +compressed, undergoing I/O, or even have already been deduplicated. In +dedupe mode, the kernel handles those details and therefore we try not +to replicate that work. + + +## Deduping Mode + +This functions similarly to readonly mode with the exception that the +duplicated extents found in our "read, hash, and compare" step will +actually be submitted for deduplication. An estimate of the total data +deduplicated will be printed after the operation is complete. This +estimate is calculated by comparing the total amount of shared bytes +in each file before and after the dedupe. + + +See the duperemove man page for further details about running duperemove. + + +# Requirements + +The latest stable code can be found in [v0.09-branch](https://github.com/markfasheh/duperemove/tree/v0.09-branch). + +Kernel: Duperemove needs a kernel version equal to or greater than 3.13 + +Libraries: Duperemove uses glib2 and optionally libgcrypt for hashing. + + +# FAQ + +Please see the FAQ file [provided in the duperemove +source](https://github.com/markfasheh/duperemove/blob/master/FAQ.md) + +# Usage Examples + +Duperemove takes a list of files and directories to scan for +dedupe. If a directory is specified, all regular files within it will +be scanned. 
Duperemove can also be told to recursively scan +directories with the '-r' switch. If '-h' is provided, duperemove will +print numbers in powers of 1024 (e.g., "128K"). + +Assume this arbitrary layout for the following examples. + + . + ├── dir1 + │ ├── file3 + │ ├── file4 + │ └── subdir1 + │ └── file5 + ├── file1 + └── file2 + +This will dedupe files 'file1' and 'file2': + + duperemove -dh file1 file2 + +This does the same but adds any files in dir1 (file3 and file4): + + duperemove -dh file1 file2 dir1 + +This will dedupe exactly the same as above but will recursively walk +dir1, thus adding file5: + + duperemove -dhr file1 file2 dir1/ + + +An actual run; output will differ according to duperemove version. + + duperemove -dhr file1 file2 dir1 + Using 128K blocks + Using hash: SHA256 + Using 2 threads for file hashing phase + csum: file1 [1/5] + csum: file2 [2/5] + csum: dir1/file3 [3/5] + csum: dir1/subdir1/file5 [4/5] + csum: dir1/file4 [5/5] + Hashed 80 blocks, resulting in 17 unique hashes. Calculating duplicate + extents - this may take some time. + [########################################] + Search completed with no errors. + Simple read and compare of file data found 2 instances of extents that might + benefit from deduplication. 
+ Start Length Filename (2 extents) + 0.0 2.0M "file2" + 0.0 2.0M "dir1//file4" + Start Length Filename (3 extents) + 0.0 2.0M "file1" + 0.0 2.0M "dir1//file3" + 0.0 2.0M "dir1//subdir1/file5" + Dedupe 1 extents with target: (0.0, 2.0M), "file2" + Dedupe 2 extents with target: (0.0, 2.0M), "file1" + Kernel processed data (excludes target files): 6.0M + Comparison of extent info shows a net change in shared extents of: 10.0M diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/duperemove-v0.09.beta3/csum-gcrypt.c new/duperemove-v0.09.beta5/csum-gcrypt.c --- old/duperemove-v0.09.beta3/csum-gcrypt.c 2014-11-17 20:07:48.000000000 +0100 +++ new/duperemove-v0.09.beta5/csum-gcrypt.c 2014-12-09 05:05:30.000000000 +0100 @@ -29,6 +29,8 @@ GCRY_THREAD_OPTION_PTHREAD_IMPL; unsigned int digest_len = 0; +#define HASH_TYPE "SHA256 " +char hash_type[8]; void checksum_block(char *buf, int len, unsigned char *digest) { @@ -59,6 +61,8 @@ if (!digest_len) return 1; + strncpy(hash_type, HASH_TYPE, 8); + abort_on(digest_len == 0 || digest_len > DIGEST_LEN_MAX); return 0; diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/duperemove-v0.09.beta3/csum-mhash.c new/duperemove-v0.09.beta5/csum-mhash.c --- old/duperemove-v0.09.beta3/csum-mhash.c 2014-11-17 20:07:48.000000000 +0100 +++ new/duperemove-v0.09.beta5/csum-mhash.c 2014-12-09 05:05:30.000000000 +0100 @@ -27,6 +27,8 @@ #define HASH_FUNC MHASH_SHA256 uint32_t digest_len = 0; +#define HASH_TYPE "SHA256 " +char hash_type[8]; void checksum_block(char *buf, int len, unsigned char *digest) { @@ -43,6 +45,8 @@ if (!digest_len) return 1; + strncpy(hash_type, HASH_TYPE, 8); + abort_on(digest_len == 0 || digest_len > DIGEST_LEN_MAX); return 0; diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/duperemove-v0.09.beta3/csum-test.c new/duperemove-v0.09.beta5/csum-test.c --- old/duperemove-v0.09.beta3/csum-test.c 2014-11-17 
20:07:48.000000000 +0100 +++ new/duperemove-v0.09.beta5/csum-test.c 2014-12-09 05:05:30.000000000 +0100 @@ -38,7 +38,7 @@ { char *fname = argv[1]; int fd, ret; - size_t len; + ssize_t len; struct stat s; struct running_checksum *csum; diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/duperemove-v0.09.beta3/csum.h new/duperemove-v0.09.beta5/csum.h --- old/duperemove-v0.09.beta3/csum.h 2014-11-17 20:07:48.000000000 +0100 +++ new/duperemove-v0.09.beta5/csum.h 2014-12-09 05:05:30.000000000 +0100 @@ -6,6 +6,7 @@ #define DIGEST_LEN_MAX 32 extern unsigned int digest_len; +extern char hash_type[8]; /* Init / debug */ int init_hash(void); diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/duperemove-v0.09.beta3/duperemove.8 new/duperemove-v0.09.beta5/duperemove.8 --- old/duperemove-v0.09.beta3/duperemove.8 2014-11-17 20:07:48.000000000 +0100 +++ new/duperemove-v0.09.beta5/duperemove.8 2014-12-09 05:05:30.000000000 +0100 @@ -93,22 +93,26 @@ .TP \fB\--lookup-extents=[yes|no]\fR -While checksumming a file, duperemove will lookup file extent -state. This information is later used to optimize the search for -duplicate extents. This defaults to on. However, if you use duperemove on a -subvolume that has been snapshotted you will want to read below. - -On btrfs, extents which have been snapshotted are reported as -shared. Internally duperemove considers shared extents as -deduped. When run on a subvolume with snapshots then, duperemove may -skip some or all extents, depending on when the most recent snapshot -was taken. - -The workaround is to run duperemove at least once with -\fB\--lookup-extents=no\fR so that it considers all extents for -dedupe. You can then run with extent lookups on until your next snapshot. +Defaults to no. While checksumming a file, duperemove can optionally +lookup file extent state to see whether a given file block is already +shared. 
This information can later be used to optimize the search for +duplicate extents. There are some caveats to this, so please read +below. + +On btrfs, extents which have been snapshotted are reported as shared, +as more than one inode points to them. A deduped extent also gets +reported as shared for the same reasons. Internally duperemove can not +yet make the distinction between the two. If \fB--lookup-extents\fR is +turned on, duperemove will consider a shared extent to have already +been deduped. On a snapshotted file system this might cause all or +most of the extents to be skipped for dedupe. + +If you are not making snapshots on the fs you are deduping, this +option will allow duperemove to make better decisions on which extents +to dedupe. -We plan to remove this restriction in a future version of duperemove. +A future version of duperemove will remove this restriction, allowing +us to default this option to on. .TP \fB\-?, --help\fR @@ -116,39 +120,7 @@ .SH "FAQ" -.B "Is there an upper limit to the amount of data duperemove can process?" - -Right now duperemove has been tested on small numbers of VMS or iso -files (5-10). I don't believe there should be a major problem scaling -that up to 50 or so. - -.B "Why does it not print out all duplicate extents?" - -Internally duperemove is classifying extents based on various criteria -like length, number of identical extents, etc. The printout we give is -based on the results of that classification. - -.B "How can I find out my space savings after a dedupe?" - -\fBDuperemove\fR will print out an estimate of the saved space after a -dedupe operation. You can also do a \fBdf\fR before the dedupe -operation, then a \fBdf\fR about 60 seconds after the operation. It is -common for \fIbtrfs\fR space reporting to be 'behind' while delayed -updates get processed, so an immediate df after deduping might not -show any savings. - -.B "Why is the total deduped data report an estimate?" 
- -At the moment \fBduperemove\fR can detect that some underlying extents are -shared with other files, but it can not resolve which files those -extents are shared with. - -Imagine \fBduperemove\fR is examing a series of files and it notes a -shared data region in one of them. That data could be shared with a -file outside of the series. Since \fBduperemove\fR can't resolve that -information it will account the shared data against our dedupe -operation while in reality, the kernel might deduplicate it further -for us. +Please see the \fBFAQ.md\fR file which should have been included with your duperemove package. .SH "NOTES" Deduplication is currently only supported by the \fIbtrfs\fR filesystem. diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/duperemove-v0.09.beta3/duperemove.c new/duperemove-v0.09.beta5/duperemove.c --- old/duperemove-v0.09.beta3/duperemove.c 2014-11-17 20:07:48.000000000 +0100 +++ new/duperemove-v0.09.beta5/duperemove.c 2014-12-09 05:05:30.000000000 +0100 @@ -70,11 +70,10 @@ static dev_t one_fs_dev = 0; static int write_hashes = 0; -static int scramble_filenames = 0; static int read_hashes = 0; static char *serialize_fname = NULL; static unsigned int hash_threads = 0; -static int do_lookup_extents = 1; +static int do_lookup_extents = 0; static int fancy_status = 0; @@ -597,16 +596,9 @@ printf("\t-h\t\tPrint numbers in human-readable format.\n"); printf("\t-x\t\tDon't cross filesystem boundaries.\n"); printf("\t-v\t\tBe verbose.\n"); - printf("\t--hash-threads=N\n\t\t\tUse N threads for hashing phase. " - "Default is automatically detected.\n"); - printf("\t--read-hashes=hashfile\n\t\t\tRead hashes from a hashfile. " - "A file list is not required with this option.\n"); - printf("\t--write-hashes=hashfile\n\t\t\tWrite hashes to a hashfile. " - "These can be read in at a later date and deduped from.\n"); - printf("\t--lookup-extents=[yes|no]\n\t\t\tLookup extent info during " - "checksum phase. 
Defaults to yes.\n"); printf("\t--debug\t\tPrint debug messages, forces -v if selected.\n"); printf("\t--help\t\tPrints this help text.\n"); + printf("\nPlease see the duperemove(8) manpage for more options.\n"); } static int add_file(const char *name, int dirfd); @@ -786,7 +778,6 @@ HELP_OPTION, VERSION_OPTION, WRITE_HASHES_OPTION, - WRITE_HASHES_SCRAMBLE_OPTION, READ_HASHES_OPTION, HASH_THREADS_OPTION, LOOKUP_EXTENTS_OPTION, @@ -804,7 +795,6 @@ { "help", 0, 0, HELP_OPTION }, { "version", 0, 0, VERSION_OPTION }, { "write-hashes", 1, 0, WRITE_HASHES_OPTION }, - { "write-hashes-scramble", 1, 0, WRITE_HASHES_SCRAMBLE_OPTION }, { "read-hashes", 1, 0, READ_HASHES_OPTION }, { "hash-threads", 1, 0, HASH_THREADS_OPTION }, { "lookup-extents", 1, 0, LOOKUP_EXTENTS_OPTION }, @@ -846,8 +836,6 @@ case 'h': human_readable = 1; break; - case WRITE_HASHES_SCRAMBLE_OPTION: - scramble_filenames = 1; case WRITE_HASHES_OPTION: write_hashes = 1; serialize_fname = strdup(optarg); @@ -1236,7 +1224,8 @@ fancy_status = 1; if (read_hashes) { - ret = read_hash_tree(serialize_fname, &tree, &blocksize, NULL); + ret = read_hash_tree(serialize_fname, &tree, &blocksize, NULL, + 0); if (ret == FILE_VERSION_ERROR) { fprintf(stderr, "Hash file \"%s\": " @@ -1250,6 +1239,12 @@ "Bad magic.\n", serialize_fname); goto out; + } else if (ret == FILE_HASH_TYPE_ERROR) { + fprintf(stderr, + "Hash file \"%s\": Unkown hash type \"%.*s\".\n" + "(we use \"%.*s\").\n", serialize_fname, + 8, unknown_hash_type, 8, hash_type); + goto out; } else if (ret) { fprintf(stderr, "Hash file \"%s\": " "Error %d while reading: %s.\n", @@ -1259,6 +1254,7 @@ } printf("Using %uK blocks\n", blocksize/1024); + printf("Using hash: %.*s\n", 8, hash_type); if (!read_hashes) { ret = populate_hash_tree(&tree); @@ -1271,8 +1267,7 @@ debug_print_tree(&tree); if (write_hashes) { - ret = serialize_hash_tree(serialize_fname, &tree, blocksize, - scramble_filenames); + ret = serialize_hash_tree(serialize_fname, &tree, blocksize); if (ret) 
fprintf(stderr, "Error %d while writing to hash file\n", ret); goto out; diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/duperemove-v0.09.beta3/filerec.c new/duperemove-v0.09.beta5/filerec.c --- old/duperemove-v0.09.beta3/filerec.c 2014-11-17 20:07:48.000000000 +0100 +++ new/duperemove-v0.09.beta5/filerec.c 2014-12-09 05:05:30.000000000 +0100 @@ -477,8 +477,10 @@ memset(fiemap, 0, sizeof(struct fiemap)); do { +#ifndef FILEREC_TEST dprintf("(fiemap) %s: start: %"PRIu64", len: %"PRIu64"\n", file->filename, start, len); +#endif /* * Do search from 0 to EOF. btrfs was doing some weird @@ -651,42 +653,40 @@ int main(int argc, char **argv) { - int ret; + int ret, i; struct filerec *file; - uint64_t loff, len; uint64_t shared = 0; init_filerec(); - /* test_filerec filename loff len */ - if (argc < 4) { - printf("Usage: filerec_test filename loff len\n"); - return 1; - } - - file = filerec_new(argv[1], 500, 1); /* Use made up ino */ - if (!file) { - fprintf(stderr, "filerec_new(): malloc error\n"); + if (argc < 2) { + printf("Usage: show_shared_extents filename1 filename2 ...\n"); return 1; } - ret = filerec_open(file, 0); - if (ret) - goto out; + for (i = 1; i < argc; i++) { + file = filerec_new(argv[i], 500 + i, 1); /* Use made up ino */ + if (!file) { + fprintf(stderr, "filerec_new(): malloc error\n"); + return 1; + } - loff = atoll(argv[2]); - len = atoll(argv[3]); + ret = filerec_open(file, 0); + if (ret) + goto out; + + ret = filerec_count_shared(file, 0, -1ULL, &shared); + filerec_close(file); + if (ret) { + fprintf(stderr, "fiemap error %d: %s\n", ret, strerror(ret)); + goto out; + } - ret = filerec_count_shared(file, loff, len, &shared); - if (ret) { - fprintf(stderr, "fiemap error %d: %s\n", ret, strerror(ret)); - goto out_close; + printf("%s: %"PRIu64" shared bytes\n", file->filename, shared); + filerec_free(file); + file = NULL; } - printf("%s: %"PRIu64" shared bytes\n", file->filename, shared); - -out_close: - 
filerec_close(file); out: filerec_free(file); return ret; diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/duperemove-v0.09.beta3/hashstats.8 new/duperemove-v0.09.beta5/hashstats.8 --- old/duperemove-v0.09.beta3/hashstats.8 1970-01-01 01:00:00.000000000 +0100 +++ new/duperemove-v0.09.beta5/hashstats.8 2014-12-09 05:05:30.000000000 +0100 @@ -0,0 +1,30 @@ +.TH "hashstats" "8" "March 2014" "Version 0.09" +.SH "NAME" +hashstats \- Print information about a duperemove hashfile +.SH "SYNOPSIS" +\fBhashstats\fR \fI[options]\fR \fIhashfile\fR +.SH "DESCRIPTION" +.PP +\fIhashfile\fR should be a file generated by running duperemove with +the --write-hashes option. + +.SH "OPTIONS" + +.TP +\fB\-n NUM\fR +Print top \fINUM\fR hashes, sorted by bucket size. Default is 10. + +.TP +\fB\-a\fR +Print all hashes (overrides \fB-n\fR, above) +.TP + +\fB\-b\fR +Print info on each block within our hash buckets. + +.TP +\fB\-l\fR +Print a list of all files + +.SH "SEE ALSO" +.BR duperemove(8) diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/duperemove-v0.09.beta3/hashstats.c new/duperemove-v0.09.beta5/hashstats.c --- old/duperemove-v0.09.beta3/hashstats.c 2014-11-17 20:07:48.000000000 +0100 +++ new/duperemove-v0.09.beta5/hashstats.c 2014-12-09 05:05:30.000000000 +0100 @@ -264,7 +264,7 @@ return EINVAL; } - ret = read_hash_tree(serialize_fname, &tree, &blocksize, &h); + ret = read_hash_tree(serialize_fname, &tree, &blocksize, &h, 0); if (ret == FILE_VERSION_ERROR) { fprintf(stderr, "Hash file \"%s\": " diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/duperemove-v0.09.beta3/serialize.c new/duperemove-v0.09.beta5/serialize.c --- old/duperemove-v0.09.beta3/serialize.c 2014-11-17 20:07:48.000000000 +0100 +++ new/duperemove-v0.09.beta5/serialize.c 2014-12-09 05:05:30.000000000 +0100 @@ -39,6 +39,9 @@ #include "serialize.h" +char unknown_hash_type[8]; +#define 
hash_type_v1_0 "\0\0\0\0\0\0\0\0" + #if __BYTE_ORDER == __LITTLE_ENDIAN #define swap16(_x) ((uint16_t)_x) #define swap32(_x) ((uint32_t)_x) @@ -60,6 +63,7 @@ dprintf("num_files: %"PRIu64"\t", h->num_files); dprintf("num_hashes: %"PRIu64"\t", h->num_hashes); dprintf("block_size: %u\t", h->block_size); + dprintf("hash_type: %.*s\t", 8, h->hash_type); dprintf(" ]\n"); } @@ -91,6 +95,7 @@ disk.num_files = swap64(h->num_files); disk.num_hashes = swap64(h->num_hashes); disk.block_size = swap32(h->block_size); + memcpy(&disk.hash_type, hash_type, 8); ret = lseek(fd, 0, SEEK_SET); if (ret == (loff_t)-1) @@ -104,28 +109,10 @@ return 0; } -/* name scrambling code taken from e2fsprogs */ -static int name_id[256]; -static void scramble_name(char *name, int len) -{ - int id; - char *cp = name; - - memset(cp, 'A', len); - id = name_id[len]++; - while ((len > 0) && (id > 0)) { - *cp += id % 26; - id = id / 26; - cp++; - len--; - } -} - -static int write_file_info(int fd, struct filerec *file, int scramble) +static int write_file_info(int fd, struct filerec *file) { int written, name_len; struct file_info finfo = { 0, }; - char fname[PATH_MAX+1]; char *n; finfo.ino = swap64(file->inum); @@ -144,11 +131,6 @@ return EIO; n = file->filename; - if (scramble) { - strcpy(fname, file->filename); - n = fname; - scramble_name(n, name_len); - } written = write(fd, n, name_len); if (written == -1) @@ -179,7 +161,7 @@ } int serialize_hash_tree(char *filename, struct hash_tree *tree, - unsigned int block_size, int scramble) + unsigned int block_size) { int ret, fd; struct hash_file_header *h = calloc(1, sizeof(*h)); @@ -206,7 +188,7 @@ if (list_empty(&file->block_list)) continue; - ret = write_file_info(fd, file, scramble); + ret = write_file_info(fd, file); if (ret) goto out; tot_files++; @@ -332,12 +314,14 @@ h->num_files = swap64(disk.num_files); h->num_hashes = swap64(disk.num_hashes); h->block_size = swap32(disk.block_size); + memcpy(&h->hash_type, &disk.hash_type, 8); return 0; } int 
read_hash_tree(char *filename, struct hash_tree *tree, - unsigned int *block_size, struct hash_file_header *ret_hdr) + unsigned int *block_size, struct hash_file_header *ret_hdr, + int ignore_hash_type) { int ret, fd; uint32_t i; @@ -361,6 +345,22 @@ goto out; } + if (!ignore_hash_type) { + /* + * v1.0 hash files were SHA256 but wrote out hash_type + * as nulls + */ + if (h.minor == 0 && memcmp(hash_type_v1_0, h.hash_type, 8)) { + ret = FILE_HASH_TYPE_ERROR; + memcpy(unknown_hash_type, hash_type_v1_0, 8); + goto out; + } else if (h.minor > 0 && memcmp(h.hash_type, hash_type, 8)) { + ret = FILE_HASH_TYPE_ERROR; + memcpy(unknown_hash_type, h.hash_type, 8); + goto out; + } + } + *block_size = h.block_size; dprintf("Load %"PRIu64" files from \"%s\"\n", diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/duperemove-v0.09.beta3/serialize.h new/duperemove-v0.09.beta5/serialize.h --- old/duperemove-v0.09.beta3/serialize.h 2014-11-17 20:07:48.000000000 +0100 +++ new/duperemove-v0.09.beta5/serialize.h 2014-12-09 05:05:30.000000000 +0100 @@ -17,7 +17,7 @@ #define __SERIALIZE__ #define HASH_FILE_MAJOR 1 -#define HASH_FILE_MINOR 0 +#define HASH_FILE_MINOR 1 #define HASH_FILE_MAGIC "dupehash" struct hash_file_header { @@ -28,7 +28,8 @@ /*20*/ uint64_t num_hashes; uint32_t block_size; /* In bytes */ uint32_t pad0; - uint64_t pad1[10]; + char hash_type[8]; + uint64_t pad1[9]; }; #define DISK_DIGEST_LEN 32 @@ -53,11 +54,14 @@ }; int serialize_hash_tree(char *filename, struct hash_tree *tree, - unsigned int block_size, int scramble); + unsigned int block_size); #define FILE_VERSION_ERROR 1001 #define FILE_MAGIC_ERROR 1002 +#define FILE_HASH_TYPE_ERROR 1003 +extern char unknown_hash_type[8]; int read_hash_tree(char *filename, struct hash_tree *tree, - unsigned int *block_size, struct hash_file_header *ret_hdr); + unsigned int *block_size, struct hash_file_header *ret_hdr, + int ignore_hash_type); #endif /* __SERIALIZE__ */ diff -urN 
'--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/duperemove-v0.09.beta3/show-shared-extents.8 new/duperemove-v0.09.beta5/show-shared-extents.8 --- old/duperemove-v0.09.beta3/show-shared-extents.8 1970-01-01 01:00:00.000000000 +0100 +++ new/duperemove-v0.09.beta5/show-shared-extents.8 2014-12-09 05:05:30.000000000 +0100 @@ -0,0 +1,15 @@ +.TH "show-shared-extents" "8" "December 2014" "Version 0.09" +.SH "NAME" +show-shared-extents \- Show extents that are shared. +.SH "SYNOPSIS" +\fBshow-shared-extents\fR \fIfiles...\fR +.SH "DESCRIPTION" +.PP +Print all the extents in \fIfiles\fR that are shared. A sum of shared +extents is also printed. + +On btrfs, an extent is reported as shared if it has more than one reference. + +.SH "SEE ALSO" +.BR duperemove(8) +.BR btrfs(8) -- To unsubscribe, e-mail: opensuse-commit+unsubscr...@opensuse.org For additional commands, e-mail: opensuse-commit+h...@opensuse.org