Re: Every Version of Every File in a Repository
Branko Čibej wrote on Wed, Oct 08, 2014 at 00:41:01 +0200:
> On 07.10.2014 22:36, Andreas Mohr wrote:
>> Aside from the brute-force method of checking out the entire repository starting at revision 1, performing a scan, updating to the next revision, and repeating until I reach the head, I don’t know of a way to do this.
>
> This is, in fact, likely to be (almost) the most efficient way to do this, since you can just use the existing Subversion client to deal with the repository contents and version discrepancies. But there is an alternative that might be more efficient in your case: Create a dumpstream of the repository using svnadmin dump, non-incremental and not using deltas, then pipe the stream to a custom tool that extracts the file contents from the stream and either writes them to disk, or passes them to your scanning tool in some other way.

A non-incremental dump will dump the entire tree anew for every revision. An incremental, non-deltas dump should suffice here.
RE: Every Version of Every File in a Repository
You know, the files aren't really stored as files per se. Also, if you use correct ACLs in your repository, there is no way any of these files can be executed. I assume by scan you are talking about virus scanning. I would question the need to do this. Yea, I know... but still, many requests come from a lack of understanding of a technology.

From: jt.mil...@l-3com.com [mailto:jt.mil...@l-3com.com]
Sent: Tuesday, October 07, 2014 4:03 PM
To: users@subversion.apache.org
Subject: Every Version of Every File in a Repository

Is there a way to check out every version of a file in a repository? We just had a requirement levied to perform a scan of every file in a repository. The scan tool must have each file in a stand-alone format. Thus, I need a way to extract every version of every file within a repository.

Aside from the brute-force method of checking out the entire repository starting at revision 1, performing a scan, updating to the next revision, and repeating until I reach the head, I don't know of a way to do this.

Thanks,
JT Miller
Re: Every Version of Every File in a Repository
Hi,

On 08/10/14 21:08, Bob Archer wrote:
> I assume by “scan” you are talking about virus scanning. I would question the need to do this. Yea, I know… but still, many requests come from a lack of understanding of a technology.

It is more likely that this is about a legal discovery or license/code review. Here then is a hint.

#1 Fetch the verbose XML log, including changed paths, for the root of the repository or the path you want to examine:
svn log -r1:HEAD --xml -v ^/

#2 Run XSLT over that to find logentry/paths/path nodes with relevant actions (added, modified, moved here), kind (file) and, when modified, the relevant modifications (text-mods=true). The result will be a machine-readable list of path/revision coordinates/URLs. Depending on whether you included branches, this may even be unique in terms of content.

#3 Run that list through your favorite svn client or plain HTTP user agent, as required and suitable. Use path + peg revisions.
https://svn.example.com/svn/repo/path/file.txt?p=N
http://svnbook.red-bean.com/nightly/en/svn.serverconfig.httpd.html#svn.serverconfig.httpd.extra.browsing

#4 Run your scan as per your requirements.

You can adjust this to fit your (disk, process, scan) needs and resources.

Andreas
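Steps #1 to #3 above could be sketched roughly as follows. This is only an illustration: it swaps the XSLT pass for Python's standard xml.etree.ElementTree, the function names file_revisions and svn_cat_target are made up for this example, and it assumes the server reports the kind and text-mods attributes on each path element (older servers may omit them, in which case nothing is filtered on that basis).

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Log actions that can introduce new file content:
# A = added (this also covers "moved here" / copied), M = modified, R = replaced.
WANTED_ACTIONS = {"A", "M", "R"}


def file_revisions(log_xml):
    """Yield (revision, path) pairs from `svn log --xml -v` output.

    Hypothetical helper: keeps file paths whose action is A, M or R and
    skips modifications flagged text-mods="false" (property-only changes).
    """
    root = ET.fromstring(log_xml)
    for entry in root.iter("logentry"):
        rev = int(entry.get("revision"))
        for node in entry.iterfind("paths/path"):
            if node.get("kind") != "file":
                continue
            if node.get("action") not in WANTED_ACTIONS:
                continue
            if node.get("action") == "M" and node.get("text-mods") == "false":
                continue
            yield rev, node.text


def svn_cat_target(repo_root_url, path, rev):
    """Build a URL@REV peg-revision target for `svn cat`.

    The path is percent-encoded so that an '@' inside a file name is not
    mistaken for the peg-revision separator.
    """
    return f"{repo_root_url}{quote(path)}@{rev}"
```

Each resulting target can then be handed to `svn cat`, or the path and revision can be turned into the `?p=N` HTTP form shown in step #3, before the scan in step #4.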
Re: Every Version of Every File in a Repository
On Wed, Oct 8, 2014 at 3:38 PM, Andreas Stieger andreas.stie...@gmx.de wrote:
> Hi,
>
> On 08/10/14 21:08, Bob Archer wrote:
>> I assume by “scan” you are talking about virus scanning. I would question the need to do this. Yea, I know… but still, many requests come from a lack of understanding of a technology.
>
> It is more likely that this is about a legal discovery or license/code review. Here then is a hint.

If you are looking to make it searchable, FishEye from Atlassian knows how to do that. Be prepared to wait a couple of weeks for a large repository while it does an 'svn cat' of every revision of every file to feed to its indexer, though.

--
Les Mikesell
lesmikes...@gmail.com
Re: Every Version of Every File in a Repository
Hi,

On Tue, Oct 07, 2014 at 03:03:13PM -0500, jt.mil...@l-3com.com wrote:
> Is there a way to check out every version of a file in a repository? We just had a requirement levied to perform a scan of every file in a repository. The scan tool must have each file in a stand-alone format. Thus, I need a way to extract every version of every file within a repository. Aside from the brute-force method of checking out the entire repository starting at revision 1, performing a scan, updating to the next revision, and repeating until I reach the head, I don’t know of a way to do this.

That's certainly a somewhat tough one. I will get tarred and feathered here for my way of trying to solve this, and possibly even rightfully so, but... ;)

OK, here it goes: you could do a git-svn on your repo, then get all files ever existing via http://stackoverflow.com/a/12090812 , then for each such file do a
git log --all --something --someveryshortformat
to get all its revisions, then do a
file_content=$(git show revision:./path/to/file)
(alternatively do git show ... > $TMPDIR/mytmp since that ought to be more reliable for largish files), then scan that (but ideally you'd be able to pipe the git show stream directly into your scan tool).

That ought to give you a scan result for *all* revisions of *all* files in *all* branches of your repo (you might want to apply a uniq at some place or another, to ensure that you're not doing wasteful duplicate processing of certain items). OK, possibly scratch the *all* branches part, since this may require some extra effort in the case of git-svn...

However, this high-level complex lookup solution might be both rather crude and much less precise compared to a parse-each-object kind of solution at git plumbing level, if this is possible (and I'd very much guess it is).
Hmm, that could be a git rev-list, and that would then list changed files for each commit, and AFAICS globally (i.e., on the global commit tree, rather than specific human-tagged branch names). So that operation mode, once successfully scripted, ought to be *a lot* better than the "list all files, then rev-log each file" algo.

And you could then safety-check your algorithm by having it spit out a full list of all commit hash / file combos (this happens to be the same list which you would then feed into git show, entry by entry), and then try hard to figure out a way to pick a repo-side file version which accidentally is NOT contained in that list -- algo error!

Oh, and BTW: all this *without* having to do a filesystem-based checkout (i.e., working copy modification) of any repo item, even once. (I.e., this is actually going *against* your initially stated requirement of "Is there a way to check out every version of a file in a repository?", and rightfully so ;)

HTH,
Andreas Mohr
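The "list of all commit hash / file combos" idea above might look something like the sketch below. It is not a tested recipe: it uses git log rather than git rev-list to get the per-commit file names, the function names are invented for this example, and it assumes a git-svn clone with mostly linear history (by default git log prints no file list for merge commits, so content introduced only by a merge would be missed). Unusual path names that git quotes in its output are not handled.

```python
import subprocess


def parse_name_only_log(text):
    """Turn `git log --format=commit:%H --name-only` output into
    a list of (commit, path) pairs."""
    pairs = []
    commit = None
    for line in text.splitlines():
        if line.startswith("commit:"):
            commit = line[len("commit:"):]
        elif line and commit is not None:
            # Any other non-blank line is a path touched by that commit.
            pairs.append((commit, line))
    return pairs


def list_file_versions(repo_dir):
    """List every (commit, path) where a file was added or modified,
    across all refs of the repository at repo_dir."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "log", "--all", "--no-renames",
         "--diff-filter=AM", "--format=commit:%H", "--name-only"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_name_only_log(out)


def show_file(repo_dir, commit, path):
    """Return the bytes of one file version without touching a working copy."""
    return subprocess.run(
        ["git", "-C", repo_dir, "show", f"{commit}:{path}"],
        check=True, capture_output=True,
    ).stdout
```

The list returned by list_file_versions is the one to safety-check by hand, and each entry can be fed to show_file and from there to the scan tool.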
Re: Every Version of Every File in a Repository
On 07.10.2014 22:36, Andreas Mohr wrote:
> Hi,
>
> That's certainly a somewhat tough one. I will get tarred and feathered here for my way of trying to solve this, and possibly even rightfully so, but... ;)

Well, I certainly won't skin you alive for suggesting this; but ... I would imagine that git svn fetch has to essentially do just what the OP doesn't want to do, i.e., successively retrieve each revision of every file in the Subversion repository to populate the Git repository. There's not much chance this would be faster than just doing the same with Subversion, especially since, once you're done, you /still/ have to scan the files in the resulting Git repo.

Going back to the original question ...

> Aside from the brute-force method of checking out the entire repository starting at revision 1, performing a scan, updating to the next revision, and repeating until I reach the head, I don’t know of a way to do this.

This is, in fact, likely to be (almost) the most efficient way to do this, since you can just use the existing Subversion client to deal with the repository contents and version discrepancies.

But there is an alternative that might be more efficient in your case: Create a dumpstream of the repository using svnadmin dump, non-incremental and not using deltas, then pipe the stream to a custom tool that extracts the file contents from the stream and either writes them to disk, or passes them to your scanning tool in some other way.

The reason why this could be faster than the checkout+repeated update is that you do not have the overhead of a working copy, directory tracking, property handling, etc. etc., and you can probably save on disk space by keeping the file contents around only as long as they're being scanned. It does mean that your custom tool will have to parse the dumpfile format, but that's really not so hard: the format is quite simple, and there are a number of example scripts that do that in our repository.
Another alternative is to use our API directly, possibly through one of the bindings, to get file contents straight from the repository; but I suspect it's harder than parsing the dump file.

-- Brane
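To give an idea of how small such a custom tool can be, here is a rough sketch of a dumpstream reader. It is an illustration only, not one of the example scripts mentioned above: the function names are made up, and it assumes a non-deltas dump, in which each record is a block of "Key: value" headers, a blank line, and then Content-length bytes of body (properties first, then the file text).

```python
def read_headers(stream):
    """Read one block of "Key: value" header lines from a binary stream.

    Returns a dict, or None once the stream is exhausted. Blank lines
    before the block (padding between records) are skipped.
    """
    headers = {}
    while True:
        line = stream.readline()
        if not line:                       # end of stream
            return headers or None
        line = line.rstrip(b"\n")
        if not line:
            if headers:                    # blank line closes the block
                return headers
            continue                       # padding between records
        key, _, value = line.partition(b": ")
        headers[key.decode()] = value.decode()


def iter_file_texts(stream):
    """Yield (revision, path, content_bytes) for every file node that
    carries text in an `svnadmin dump` stream (no --deltas)."""
    rev = None
    while True:
        headers = read_headers(stream)
        if headers is None:
            return
        body = stream.read(int(headers.get("Content-length", 0)))
        if "Revision-number" in headers:
            rev = int(headers["Revision-number"])
        elif headers.get("Node-kind") == "file" and "Text-content-length" in headers:
            if headers.get("Text-delta") == "true":
                raise ValueError("delta dumps are not handled by this sketch")
            # The body is the property block followed by the file text.
            props_len = int(headers.get("Prop-content-length", 0))
            yield rev, headers["Node-path"], body[props_len:]
```

The stream argument could be the stdout pipe of an `svnadmin dump` process or a dumpfile opened in binary mode. Each yielded content can be written to a temporary file, scanned, and deleted again, which is where the disk-space saving comes from.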
Re: Every Version of Every File in a Repository
On Wednesday, October 08, 2014 12:41:01 AM Branko Čibej wrote:
> On 07.10.2014 22:36, Andreas Mohr wrote:
>> Hi,
>>
>> That's certainly a somewhat tough one. I will get tarred and feathered here for my way of trying to solve this, and possibly even rightfully so, but... ;)
>
> Well, I certainly won't skin you alive for suggesting this; but ... I would imagine that git svn fetch has to essentially do just what the OP doesn't want to do, i.e., successively retrieve each revision of every file in the Subversion repository to populate the Git repository. There's not much chance this would be faster than just doing the same with Subversion, especially since, once you're done, you /still/ have to scan the files in the resulting Git repo.
>
> Going back to the original question ...
>
>> Aside from the brute-force method of checking out the entire repository starting at revision 1, performing a scan, updating to the next revision, and repeating until I reach the head, I don’t know of a way to do this.
>
> This is, in fact, likely to be (almost) the most efficient way to do this, since you can just use the existing Subversion client to deal with the repository contents and version discrepancies.
>
> But there is an alternative that might be more efficient in your case: Create a dumpstream of the repository using svnadmin dump, non-incremental and not using deltas, then pipe the stream to a custom tool that extracts the file contents from the stream and either writes them to disk, or passes them to your scanning tool in some other way.
>
> The reason why this could be faster than the checkout+repeated update is that you do not have the overhead of a working copy, directory tracking, property handling, etc. etc., and you can probably save on disk space by keeping the file contents around only as long as they're being scanned.
> It does mean that your custom tool will have to parse the dumpfile format, but that's really not so hard: the format is quite simple, and there are a number of example scripts that do that in our repository.
>
> Another alternative is to use our API directly, possibly through one of the bindings, to get file contents straight from the repository; but I suspect it's harder than parsing the dump file.

The Python bindings for parsing the dumpstream currently do not work, as I described on svn-dev@ some time ago: the layer which thunks the C calls back to Python code is not implemented right now. As far as I can see, the Perl/Ruby bindings have the same problem.

That, and the way to create a stream in Python does not seem to be working either; see the email I sent to svn-dev@ a few minutes ago. Ironically, I found that when I tried to test the implementation of this thunking code for parsing the dumpstream :) Not sure if this affects Perl/Ruby.

So, back to your advice: it's either using the C library directly, or implementing a parser for the stream. Which isn't hard, I admit.

Regards,
Alexey.

> -- Brane