On Wednesday, October 08, 2014 12:41:01 AM Branko Čibej wrote:
> On 07.10.2014 22:36, Andreas Mohr wrote:
> > Hi,
> >
> > That's certainly a somewhat tough one.
> >
> >
> > I will get tarred and feathered here for my way of trying to solve this,
> > and possibly even rightfully so, but... ;)
>
> Well, I certainly won't skin you alive for suggesting this; but ... I
> would imagine that "git svn fetch" has to essentially do just what the
> OP doesn't want to do, i.e., successively retrieve each revision of
> every file in the Subversion repository to populate the Git repository.
> There's not much chance this would be faster than just doing the same
> with Subversion, especially since, once you're done, you /still/ have to
> scan the files in the resulting Git repo.
>
>
> Going back to the original question ...
>
> > Aside from the brute-force method of checking out the entire repository
> > starting at revision 1, performing a scan, updating to the next revision,
> > and repeating until I reach the head, I don’t know of a way to do this.
>
> This is, in fact, likely to be (almost) the most efficient way to do
> this, since you can just use the existing Subversion client to deal with
> the repository contents and version discrepancies.
>
> But there is an alternative that might be more efficient in your case:
> Create a dumpstream of the repository using "svnadmin dump",
> non-incremental and not using deltas, then pipe the stream to a custom
> tool that extracts the file contents from the stream and either writes
> them to disk, or passes them to your scanning tool in some other way.
>
> The reason why this could be faster than the checkout+repeated update is
> that you do not have the overhead of a working copy, directory tracking,
> property handling, etc. etc., and you can probably save on disk space by
> keeping the file contents around only as long as they're being scanned.
> It does mean that your custom tool will have to parse the dumpfile
> format, but that's really not so hard, the format is quite simple, and
> there are a number of example scripts that do that in our repository.
> Another alternative is to use our API directly, possibly through one of
> the bindings, to get file contents straight from the repository; but I
> suspect it's harder than parsing the dump file.

The Python bindings for parsing the dumpstream currently do not work, as I
described on svn-dev@ some time ago: the layer which does "thunking" of the
C calls back to Python code is not implemented right now. As far as I can
see, the Perl/Ruby bindings have the same problem.

On top of that, the way to create a stream in Python does not seem to be
working either - see the email I just sent to svn-dev@ a few minutes ago.
Ironically, I found that out when I tried to test the implementation of this
"thunking" code for parsing the dumpstream :) Not sure if this one affects
Perl/Ruby.

So, back to your advice - it's either using the C library directly, or
implementing the parser for the stream. Which isn't hard, I admit.

Regards,
Alexey.

>
> -- Brane
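
P.S. To give an idea of how little the hand-written parser route needs, here
is a minimal sketch in plain Python (no Subversion bindings involved). It
assumes a dump made by "svnadmin dump" without --deltas, so every file text
in the stream is a full text. The function names iter_file_contents and
_read_headers are made up for this illustration and are not part of any
Subversion tool. Files that were only copied, and not modified in the same
revision, carry no text in the stream and are therefore not yielded here.

```python
import io


def _read_headers(stream):
    """Read one "Key: value" header block; return None at end of stream."""
    headers = {}
    while True:
        line = stream.readline()
        if not line:  # end of stream
            return headers or None
        line = line.rstrip(b"\n")
        if not line:
            if headers:
                return headers  # a blank line terminates the header block
            continue  # blank separator lines between records
        key, _, value = line.decode("utf-8").partition(": ")
        headers[key] = value


def iter_file_contents(stream):
    """Yield (revision, path, data) for every file node that carries text."""
    revision = None
    while True:
        headers = _read_headers(stream)
        if headers is None:
            return
        # The body is the property section followed by the text section.
        body = stream.read(int(headers.get("Content-length", 0)))
        if "Revision-number" in headers:
            revision = int(headers["Revision-number"])
            continue
        if headers.get("Node-kind") != "file":
            continue
        if "Text-content-length" not in headers:
            continue  # e.g. a property-only change or an unmodified copy
        if headers.get("Text-delta") == "true":
            raise ValueError("delta dump; run svnadmin dump without --deltas")
        prop_len = int(headers.get("Prop-content-length", 0))
        text_len = int(headers["Text-content-length"])
        yield revision, headers["Node-path"], body[prop_len:prop_len + text_len]
```

Something like "svnadmin dump REPO | python scan_dump.py" (script name
hypothetical) would then loop over iter_file_contents(sys.stdin.buffer) and
hand each data blob to the scanner. This sketch reads one record body into
memory at a time, which is probably fine unless the repository holds very
large files.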