Hi,
I recently had to purge files from large Git repos (many files, many commits).
The usual recommendation is to use `git filter-branch --index-filter` to purge
files. However, this is *very* slow for large repos (e.g. it takes 45 minutes
to remove the `builtin` directory from git core). I realized that I can remove
files *way* faster by exporting the repo, dropping the file references from the
export stream, and then re-importing it (see the Perl script below; it takes
~30 seconds to remove the `builtin` directory from git core). Do you see any
problem with this approach?
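
For reference, the `filter-branch` variant I am comparing against is the
usual recipe, roughly along these lines (the exact flags are just one
common choice):

    git filter-branch --force --index-filter \
        'git rm -r --cached --ignore-unmatch builtin' \
        --prune-empty --tag-name-filter cat -- --all
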
Thank you,
Lars
#!/usr/bin/perl
#
# Purge paths from Git repositories.
#
# Usage:
# git-purge-path [path-regex1] [path-regex2] ...
#
# Examples:
# Remove the file "test.bin" from all directories:
# git-purge-path "/test.bin$"
#
# Remove all "*.bin" files from all directories:
# git-purge-path "\.bin$"
#
# Remove all files in the "/foo" directory:
# git-purge-path "^/foo/$"
#
# Attention:
# You want to run this script on a case-sensitive file system (e.g.
# ext4 on Linux). Otherwise the resulting Git repository will not
# contain changes that modify the casing of file paths.
#
use strict;
use warnings;
# Stream the existing history out of the repository ...
open( my $pipe_in, "git fast-export --progress=100 --no-data HEAD |" )
    or die $!;
# ... and stream the filtered history back in, overwriting the existing refs.
open( my $pipe_out, "| git fast-import --force --quiet" ) or die $!;
LOOP: while ( my $cmd = <$pipe_in> ) {
    my $data = "";
    if ( $cmd =~ /^data ([0-9]+)$/ ) {
        # Read the data block (e.g. a commit message) in one go so that
        # its contents are not parsed as commands; it is passed through
        # verbatim below.
        my $data_bytes = $1;
        read( $pipe_in, $data, $data_bytes );
    }
    elsif ( $cmd =~ /^M [0-9]{6} [0-9a-f]{40} (.+)$/ ) {
        # File-modify command: drop it if the path matches any of the
        # regexes given on the command line.
        my $pathname = $1;
        foreach (@ARGV) {
            next LOOP if ( "/" . $pathname ) =~ /$_/;
        }
    }
    print {$pipe_out} $cmd . $data;
}
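
For context, the script only interprets two kinds of lines in the
fast-export stream: "data <n>" headers (so that message payloads are read
and passed through verbatim instead of being parsed as commands) and "M"
file-modify lines, which are dropped when the path (with a leading "/"
prepended) matches one of the given regexes. Because of --no-data, the "M"
lines refer to blobs by their original SHA-1 rather than by marks. A
made-up excerpt of such a stream (names, timestamps, and marks are
illustrative only):

    commit refs/heads/master
    mark :2
    author A U Thor <author@example.com> 1112912413 -0700
    committer C O Mitter <committer@example.com> 1112912413 -0700
    data 18
    Add builtin/foo.c
    from :1
    M 100644 0123456789abcdef0123456789abcdef01234567 builtin/foo.c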