Re: rsync as a de-duplication-only tool, using --link-dest

2024-05-01 Thread Kevin Korb via rsync
I don't believe that what you are asking for can be done with rsync.  At 
first thought you can't mix --ignore-existing with --ignore-non-existing 
as that would ignore everything.  Something would have to at least exist 
and not be ignored for rsync to link to it.


Anyway, for a laugh, I asked ChatGPT to make something to do this.
After I got my laugh I cleaned up some of the silly stuff it did and
came up with this:


#!/bin/bash

# Define the directories to compare (a trailing slash is dropped so
# that the relative paths line up)
dir1="${1%/}"
dir2="${2%/}"

# Loop through all regular files in the first directory.
# Null-delimited names survive spaces and newlines in file names.
find "$dir1" -type f -print0 | while IFS= read -r -d '' file1; do
    # Get the path of file1 relative to dir1
    rel_path="${file1#"$dir1"}"
    file2="$dir2$rel_path"

    # Check if the file exists in the second directory
    if [ -f "$file2" ]; then
        # Get metadata (mtime and size) of both files
        metadata1=$(stat -c "%Y %s" "$file1")
        metadata2=$(stat -c "%Y %s" "$file2")

        # Compare metadata as strings; the concatenated digits can
        # overflow an integer comparison for large files
        if [ "$metadata1" = "$metadata2" ]; then
            # Delete file1 and create a hard link to file2
            # rm "$file1"
            # ln "$file2" "$file1"
            echo "Hard linked: $file2 to $file1"
        # else
        #     echo "Different: $file1"
        fi
    fi
done

Note that I only tested it a little bit, which is why anything actually
destructive is commented out.
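
If the commented-out rm and ln lines are enabled as written, there is a short window in which file1 has been deleted, and if ln then fails it stays deleted. One way to narrow that window is to create the link under a temporary name and rename it over the original. The following is only a sketch: it assumes bash, GNU coreutils, and that both trees are on the same filesystem. The function name link_replace and the .dedup-tmp suffix are made up for illustration and are not part of the script above.

```shell
#!/bin/bash

# Hypothetical helper: replace "$2" with a hard link to "$1".
# Assumes both paths are on the same filesystem (hard links cannot
# cross filesystems).
link_replace() {
    local src="$1" dst="$2"
    local tmp="$dst.dedup-tmp.$$"   # illustrative temporary name

    # Nothing to do if both names already refer to the same inode
    if [ "$src" -ef "$dst" ]; then
        return 0
    fi

    # Create the link under a temporary name; dst is untouched if this fails
    ln "$src" "$tmp" || return 1

    # Rename the new link over dst; remove the temporary name on failure
    mv -f "$tmp" "$dst" || { rm -f "$tmp"; return 1; }
}
```

In the loop above this would take the place of the rm/ln pair, called as link_replace "$file2" "$file1".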


On 5/1/24 19:34, B via rsync wrote:
Recently I was thinking about --link-dest= and if it was possible to use 
rsync to de-duplicate two nearly-identical directory structures.


Normally I would use a tool like hardlink, jdupes, or rdfind, but in 
this case the files are huge and numerous, so hashing them would take 
forever. I did a test run and these tools mostly choked to death after a 
few hours.


These directories were made using rsync in the first place, so I know
the files are duplicates, and I would be willing to rely on rsync's
quick-check (path/filename, mtime, size) to treat the files as identical.


My objective is to hard-link files with the same relative path/filename, 
mtime, and size. Nothing more. Files which are different should not be 
touched. Files which exist in the destination but not the source should 
not be deleted. Files which exist in the source but not the destination 
should not be transferred.


The problem is that I don't want to create any new files in the 
destination. That's the sticking point.


I thought maybe I could do something wacky like 'rsync -a 
--ignore-existing --ignore-non-existing --link-dest="../new/" old/ new', 
but that doesn't work. The existing files get ignored and nothing is 
linked.


Is there a way to do this with rsync?





--
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html


rsync as a de-duplication-only tool, using --link-dest

2024-05-01 Thread B via rsync
Recently I was thinking about --link-dest= and if it was possible to use 
rsync to de-duplicate two nearly-identical directory structures.


Normally I would use a tool like hardlink, jdupes, or rdfind, but in 
this case the files are huge and numerous, so hashing them would take 
forever. I did a test run and these tools mostly choked to death after a 
few hours.


These directories were made using rsync in the first place, so I know
the files are duplicates, and I would be willing to rely on rsync's
quick-check (path/filename, mtime, size) to treat the files as identical.


My objective is to hard-link files with the same relative path/filename, 
mtime, and size. Nothing more. Files which are different should not be 
touched. Files which exist in the destination but not the source should 
not be deleted. Files which exist in the source but not the destination 
should not be transferred.
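
The rules above map onto a small decision per relative path, which can be sketched outside rsync. This is an illustration only, assuming bash and GNU stat; the function name classify and its three outcome words are hypothetical. It changes nothing on disk, and because the walk would be driven by the source tree, files that exist only in the destination are never visited and so never deleted.

```shell
#!/bin/bash

# Hypothetical helper: decide what to do with one relative path.
# Arguments: source dir, destination dir, path relative to both.
# Prints "link", "different", or "skip"; nothing on disk is modified.
classify() {
    local src="$1/$3" dst="$2/$3"

    # Present in the source only: must not be transferred
    [ -f "$dst" ] || { echo "skip"; return; }

    # Quick-check: same mtime (whole seconds) and same size in bytes
    if [ "$(stat -c '%Y %s' "$src")" = "$(stat -c '%Y %s' "$dst")" ]; then
        echo "link"
    else
        echo "different"
    fi
}
```

For example, classify old new some/file reports which of the three cases applies to old/some/file.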


The problem is that I don't want to create any new files in the 
destination. That's the sticking point.


I thought maybe I could do something wacky like 'rsync -a 
--ignore-existing --ignore-non-existing --link-dest="../new/" old/ new', 
but that doesn't work. The existing files get ignored and nothing is linked.


Is there a way to do this with rsync?


