Re: Every Version of Every File in a Repository

Alexey Neyman Tue, 07 Oct 2014 17:36:48 -0700

On Wednesday, October 08, 2014 12:41:01 AM Branko Čibej wrote:
> On 07.10.2014 22:36, Andreas Mohr wrote:
> > Hi,
> > 
> > That's certainly a somewhat tough one.
> > 
> > 
> > I will get tarred and feathered here for my way of trying to solve this,
> > and possibly even rightfully so, but... ;)
> 
> Well, I certainly won't skin you alive for suggesting this; but ... I
> would imagine that "git svn fetch" has to essentially do just what the
> OP doesn't want to do, i.e., successively retrieve each revision of
> every file in the Subversion repository to populate the Git repository.
> There's not much chance this would be faster than just doing the same
> with Subversion, especially since, once you're done you /still/ have to
> scan the files in the resulting Git repo.
> 
> 
> Going back to the original question ...
> 
> >    Aside from the brute-force method of checking out the entire repository
> >    starting at revision 1 , performing a scan, updating to the next
> >    revision,
> >    and repeating until I reach the head, I don’t know of a way to do this.
> 
> This is, in fact, likely to be (almost) the most efficient way to do
> this, since you can just use the existing Subversion client to deal with
> the repository contents and version discrepancies.
> 
> But there is an alternative that might be more efficient in your case:
> Create a dumpstream of the repository using "svnadmin dump",
> non-incremental and not using deltas, then pipe the stream to a custom
> tool that extracts the file contents from the stream and either writes them
> to disk, or passes them to your scanning tool in some other way.
> 
> The reason why this could be faster than the checkout+repeated update is
> that you do not have the overhead of a working copy, directory tracking,
> property handling, etc. etc., and you can probably save on disk space by
> keeping the file contents around only as long as they're being scanned.
> It does mean that your custom tool will have to parse the dumpfile
> format, but that's really not so hard, the format is quite simple, and
> there are a number of example scripts that do that in our repository.
> Another alternative is to use our API directly, possibly through one of
> the bindings, to get file contents straight from the repository; but I
> suspect it's harder than parsing the dump file.


The Python bindings for parsing the dumpstream currently do not work, as
I described on svn-dev@ some time ago: the layer which does "thunking"
of the C calls back to Python code is not implemented right now. As far
as I can see, the Perl/Ruby bindings have the same problem.

That, and the way to create a stream in Python does not seem to be
working - see the email I just sent to svn-dev@ a few minutes ago.
Ironically, I found that when I tried to test the implementation of this
"thunking" code for parsing the dumpstream :) Not sure if this affects
Perl/Ruby.

So, back to your advice - it's either using the C library directly, or
implementing the parser for the stream. Which isn't hard, I admit.
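
For illustration only, here is a minimal sketch of such a parser in
Python. It assumes the stream comes from a plain "svnadmin dump" (no
--deltas), so every changed file carries its full text. The function
names (read_headers, iter_file_contents, scan_stdin) are made up for
this example and are not part of Subversion or its bindings. Nodes that
are pure copies carry no text in the stream and are skipped here.

```python
import io
import sys


def read_headers(stream: io.BufferedIOBase):
    """Read one 'Key: value' header block; return a dict, or None at EOF."""
    headers = {}
    while True:
        line = stream.readline()
        if not line:                      # end of stream
            return headers or None
        line = line.rstrip(b"\n")
        if not line:
            if headers:                   # blank line ends the header block
                return headers
            continue                      # skip blank separators between records
        key, _, value = line.partition(b": ")
        headers[key.decode("utf-8")] = value.decode("utf-8")


def iter_file_contents(stream: io.BufferedIOBase):
    """Yield (revision, path, content_bytes) for each file text in the dump."""
    rev = None
    while True:
        headers = read_headers(stream)
        if headers is None:
            return
        prop_len = int(headers.get("Prop-content-length", 0))
        text_len = int(headers.get("Text-content-length", 0))
        # The body is the property section followed by the file text.
        body_len = int(headers.get("Content-length", prop_len + text_len))
        body = stream.read(body_len)
        if "Revision-number" in headers:
            rev = int(headers["Revision-number"])
        elif (headers.get("Node-kind") == "file"
              and "Text-content-length" in headers):
            if headers.get("Text-delta") == "true":
                raise ValueError("delta dumps are not handled by this sketch")
            yield rev, headers["Node-path"], body[prop_len:prop_len + text_len]


def scan_stdin():
    """Hypothetical driver: replace the print with a call to the real scanner."""
    for rev, path, data in iter_file_contents(sys.stdin.buffer):
        print(f"r{rev} {path} ({len(data)} bytes)")
```

Calling scan_stdin() at the bottom of the script and running it as
"svnadmin dump /path/to/repo | python scan.py" (the script name is
arbitrary) would list every file text in revision order, holding only
one file's contents in memory at a time.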

Regards,
Alexey.
> 
> -- Brane
