Hi,

I recently had to purge files from large Git repos (many files, many commits). 
The usual recommendation is to use `git filter-branch --index-filter` to purge 
files. However, this is *very* slow for large repos (e.g. it takes 45min to
remove the `builtin` directory from git core). I realized that I can remove
files *way* faster by exporting the repo, removing the file references, 
and then importing the repo (see the Perl script below; it takes ~30sec to remove
the `builtin` directory from git core). Do you see any problem with this 
approach?
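
For reference, the `--index-filter` approach I'm comparing against is
something like this (to drop the `builtin` directory from HEAD's
history):

    git filter-branch --index-filter \
        'git rm -r --cached --ignore-unmatch builtin' \
        --prune-empty HEAD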

Thank you,
Lars



#!/usr/bin/perl
#
# Purge paths from Git repositories.
#
# Usage:
#     git-purge-path [path-regex1] [path-regex2] ...
#
# Examples:
#    Remove the file "test.bin" from all directories:
#    git-purge-path "/test.bin$"
#
#    Remove all "*.bin" files from all directories:
#    git-purge-path "\.bin$"
#
#    Remove all files in the "/foo" directory:
#    git-purge-path "^/foo/$"
#
# Attention:
#     You want to run this script on a case-sensitive file system (e.g.
#     ext4 on Linux). Otherwise the resulting Git repository will not
#     contain changes that modify the casing of file paths.
#

use strict;
use warnings;

open( my $pipe_in, "git fast-export --progress=100 --no-data HEAD |" ) or die 
$!;
open( my $pipe_out, "| git fast-import --force --quiet" ) or die $!;

LOOP: while ( my $cmd = <$pipe_in> ) {
    my $data = "";
    if ( $cmd =~ /^data ([0-9]+)$/ ) {
        # read data blocks (e.g. commit messages) verbatim so their
        # contents are never misinterpreted as stream commands
        my $skip_bytes = $1;
        read($pipe_in, $data, $skip_bytes);
    }
    # Drop "filemodify" commands whose path matches any of the given regexes.
    elsif ( $cmd =~ /^M [0-9]{6} [0-9a-f]{40} (.+)$/ ) {
        my $pathname = $1;
        foreach my $regex (@ARGV) {
            next LOOP if ( "/" . $pathname ) =~ /$regex/;
        }
    }
    print {$pipe_out} $cmd . $data;
}

close($pipe_in)  or die "closing fast-export pipe failed: $! $?";
close($pipe_out) or die "closing fast-import pipe failed: $! $?";
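
For example, removing the `builtin` directory mentioned above would be
something like this (assuming the script is saved as `git-purge-path`
somewhere on $PATH and run from the top of the work tree):

    git-purge-path "^/builtin/"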
