Re: Every Version of Every File in a Repository

2014-10-09 Thread Daniel Shahaf
Branko Čibej wrote on Wed, Oct 08, 2014 at 00:41:01 +0200:
 On 07.10.2014 22:36, Andreas Mohr wrote:
 Aside from the brute-force method of checking out the entire repository
 starting at revision 1, performing a scan, updating to the next revision,
 and repeating until I reach the head, I don’t know of a way to do this.
 
 This is, in fact, likely to be (almost) the most efficient way to do
 this, since you can just use the existing Subversion client to deal with
 the repository contents and version discrepancies.
 
 But there is an alternative that might be more efficient in your case:
 Create a dumpstream of the repository using svnadmin dump,
 non-incremental and not using deltas, then pipe the stream to a custom
 tool that extracts the file contents from the stream and either writes them
 to disk, or passes them to your scanning tool in some other way.

Non-incremental dump will dump the entire tree anew for every revision.
An incremental, non-deltas dump should suffice here.
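
A minimal sketch of launching such a dump from a script, assuming the
svnadmin binary is on PATH and the script has read access to the repository
directory. The function names and the repository path in the usage note are
hypothetical.

```python
import subprocess
from typing import List


def svnadmin_dump_command(repo_path: str) -> List[str]:
    """Build the argument list for an incremental, non-deltas dump."""
    # --incremental: every revision is dumped as changes against its predecessor.
    # No --deltas flag: changed file contents are written out as full text.
    # --quiet: suppress the per-revision progress messages on stderr.
    return ["svnadmin", "dump", "--incremental", "--quiet", repo_path]


def open_dump_stream(repo_path: str) -> subprocess.Popen:
    """Start svnadmin dump; the dumpstream can be read from .stdout."""
    return subprocess.Popen(svnadmin_dump_command(repo_path),
                            stdout=subprocess.PIPE)
```

For example, open_dump_stream("/srv/svn/repo").stdout would give a binary
stream that a dumpfile parser can consume revision by revision.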


RE: Every Version of Every File in a Repository

2014-10-08 Thread Bob Archer
You know, the files aren't really stored as files per se. Also, if using 
correct ACLs in your repository, there is no way any of these files can be 
executed.

I assume by "scan" you are talking about virus scanning.  I would question the 
need to do this. Yeah, I know... but still, many requests come from a lack of 
understanding of a technology.

From: jt.mil...@l-3com.com [mailto:jt.mil...@l-3com.com]
Sent: Tuesday, October 07, 2014 4:03 PM
To: users@subversion.apache.org
Subject: Every Version of Every File in a Repository

Is there a way to check out every version of a file in a repository? We just 
had a requirement levied to perform a scan of every file in a repository. The 
scan tool must have each file in a stand-alone format. Thus, I need a way to 
extract every version of every file within a repository.

Aside from the brute-force method of checking out the entire repository 
starting at revision 1 , performing a scan, updating to the next revision, and 
repeating until I reach the head, I don't know of a way to do this.

Thanks,
JT Miller



Re: Every Version of Every File in a Repository

2014-10-08 Thread Andreas Stieger
Hi,

On 08/10/14 21:08, Bob Archer wrote:
 I assume by “scan” you are talking about virus scanning.  I would
 question the need to do this. Yeah, I know… but still, many requests come
 from a lack of understanding of a technology.

It is more likely that this is about a legal discovery or license/code
review. Here then is a hint.

#1 Fetch the global verbose xml log with files of the root of the
repository or the path you want to examine:

  svn log -r1:HEAD --xml -v ^/

#2 XSLT over that to find logentry/paths/path nodes with relevant
actions (add, modified, moved here), kind (file) and, when modified, the
relevant modifications (text-mods=true). The result will be a machine
readable list with path / revision coordinates/URLs. Depending on
whether you included branches this may even be unique in terms of content.

#3 Run that list through your favorite svn client or plain HTTP user
agent, as required and suitable. Use path + peg revisions.

https://svn.example.com/svn/repo/path/file.txt?p=N
http://svnbook.red-bean.com/nightly/en/svn.serverconfig.httpd.html#svn.serverconfig.httpd.extra.browsing

#4 Run your scan as per your requirements.

You can adjust this to fit your (disk,process,scan) needs and resources.

Andreas
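
A rough sketch of steps #1 to #3 above, using Python's ElementTree in place
of XSLT. It assumes the log was produced by "svn log -r1:HEAD --xml -v" from a
client and server recent enough to emit the kind and text-mods attributes.
The function names are hypothetical, and the repository root URL is whatever
"svn info" reports for your repository.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def changed_files(log_xml: str) -> List[Tuple[str, int]]:
    """Return (path, revision) pairs for file content added or modified."""
    coords = []
    for entry in ET.fromstring(log_xml).iter("logentry"):
        rev = int(entry.get("revision"))
        for path in entry.iter("path"):
            if path.get("kind") != "file":
                continue  # skip directories (and entries with unknown kind)
            action = path.get("action")
            text_changed = path.get("text-mods") == "true"
            # A = added, R = replaced, M = modified; deletions (D) are skipped.
            if action in ("A", "R") or (action == "M" and text_changed):
                coords.append((path.text, rev))
    return coords


def cat_command(repo_root_url: str, path: str, rev: int) -> List[str]:
    """Build an 'svn cat' call using path + peg revision (URL@REV)."""
    return ["svn", "cat", f"{repo_root_url}{path}@{rev}"]
```

Note that a branch created by copying a directory shows up as a single
directory entry in the log, so the files under it are not listed again, which
is what keeps the list close to unique in terms of content.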


Re: Every Version of Every File in a Repository

2014-10-08 Thread Les Mikesell
On Wed, Oct 8, 2014 at 3:38 PM, Andreas Stieger andreas.stie...@gmx.de wrote:
 Hi,

 On 08/10/14 21:08, Bob Archer wrote:
 I assume by “scan” you are talking about virus scanning.  I would
 question the need to do this. Yeah, I know… but still, many requests come
 from a lack of understanding of a technology.

 It is more likely that this is about a legal discovery or license/code
 review. Here then is a hint.

If you are looking to make it searchable, fisheye from Atlassian knows
how to do that.  Be prepared to wait a couple weeks for a large
repository while it does an 'svn cat' of every revision of every file
to feed to its indexer, though.

-- 
   Les Mikesell
  lesmikes...@gmail.com


Re: Every Version of Every File in a Repository

2014-10-07 Thread Andreas Mohr
Hi,

On Tue, Oct 07, 2014 at 03:03:13PM -0500, jt.mil...@l-3com.com wrote:
 Is there a way to check out every version of a file in a repository? We
 just had a requirement levied to perform a scan of every file in a
 repository. The scan tool must have each file in a stand-alone format.
 Thus, I need a way to extract every version of every file within a
 repository.

 Aside from the brute-force method of checking out the entire repository
 starting at revision 1, performing a scan, updating to the next revision,
 and repeating until I reach the head, I don’t know of a way to do this.

That's certainly a somewhat tough one.


I will get tarred and feathered here for my way of trying to solve this,
and possibly even rightfully so, but... ;)

OK, here it goes:
you could do a git-svn on your repo,
then get all files ever existing via http://stackoverflow.com/a/12090812,
then for each such file do a git log --all --something --someveryshortformat
to get all its revisions,
then do a
file_content=$(git show revision:./path/to/file)
(alternatively do git show ... > $TMPDIR/mytmp since that ought to be more
reliable for largish files),
then scan that
(but ideally you'd be able to directly pipe the git show stream into your
scan tool).

That ought to give you a scan result for *all* revisions of *all* files
in *all* branches of your repo (you might want to decorate things with a
uniq applied at some place or another, to ensure that you're indeed
not doing wasteful duplicate processing of certain items).
OK possibly scratch the *all* branches part, since this may require
some extra effort in the case of git-svn...


However this high-level complex lookup solution
might be both rather crude and much less precise
compared to a parse-each-object kind of solution at git plumbing level,
if this is possible (and I'd very much guess it is).
Hmm, that could be a git rev-list, and that would then list changed files
for each commit,
and AFAICS globally (i.e., on the global commit tree, rather than specific
human-tagged branch names). So that operation mode once successfully scripted
ought to be *a lot* better than the "list all files, then rev-log each file"
algo.

And you could then safety check your algorithm
by having it spit out a full list of all commit hash / file combos
(this happens to be the same list which you would then feed into git show,
entry by entry),
and then try hard to figure out a way
to pick a repo-side file version which accidentally is NOT contained in that
list -- algo error!


Oh, and BTW: all this *without* having to do a filesystem-based checkout
(i.e., working copy modification)
of any repo item, even once.
(i.e., this is actually going *against* your initially stated requirement of
"Is there a way to check out every version of a file in a repository?",
and rightfully so ;)

HTH,

Andreas Mohr
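
One possible shape for the parse-each-object idea, as a sketch only: instead
of walking paths and revisions, ask Git for every blob in the object store.
Blobs are content-addressed, so each distinct file content appears once and
no separate uniq pass is needed. This assumes a Git version that supports
"git cat-file --batch-all-objects" and an already converted git-svn
repository. It also returns blobs that are no longer reachable from any
commit, and it does not report which path or revision a blob belongs to. The
function names are hypothetical.

```python
import subprocess
from typing import Iterator, List


def blob_ids(batch_check_output: str) -> List[str]:
    """Pick blob object ids out of 'git cat-file --batch-check' output.

    Each line has the form '<object id> <type> <size>'.
    """
    ids = []
    for line in batch_check_output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] == "blob":
            ids.append(parts[0])
    return ids


def iter_blob_contents(repo_dir: str) -> Iterator[bytes]:
    """Yield the raw content of every blob in the repository, one at a time."""
    listing = subprocess.run(
        ["git", "cat-file", "--batch-all-objects", "--batch-check"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout
    for oid in blob_ids(listing):
        # No working copy is touched; content comes straight from the store.
        yield subprocess.run(
            ["git", "cat-file", "blob", oid],
            cwd=repo_dir, check=True, capture_output=True,
        ).stdout
```

Each yielded bytes object can be piped into the scan tool or written to a
temporary file, as suggested above for largish files.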


Re: Every Version of Every File in a Repository

2014-10-07 Thread Branko Čibej
On 07.10.2014 22:36, Andreas Mohr wrote:
 Hi,

 That's certainly a somewhat tough one.


 I will get tarred and feathered here for my way of trying to solve this,
 and possibly even rightfully so, but... ;)

Well, I certainly won't skin you alive for suggesting this; but ... I
would imagine that git svn fetch has to essentially do just what the
OP doesn't want to do, i.e., successively retrieve each revision of
every file in the Subversion repository to populate the Git repository.
There's not much chance this would be faster than just doing the same
with Subversion, especially since, once you're done, you /still/ have to
scan the files in the resulting Git repo.


Going back to the original question ...

 Aside from the brute-force method of checking out the entire repository
 starting at revision 1, performing a scan, updating to the next revision,
 and repeating until I reach the head, I don’t know of a way to do this.

This is, in fact, likely to be (almost) the most efficient way to do
this, since you can just use the existing Subversion client to deal with
the repository contents and version discrepancies.

But there is an alternative that might be more efficient in your case:
Create a dumpstream of the repository using svnadmin dump,
non-incremental and not using deltas, then pipe the stream to a custom
tool that extracts the file contents from the stream and either writes them
to disk, or passes them to your scanning tool in some other way.

The reason why this could be faster than the checkout+repeated update is
that you do not have the overhead of a working copy, directory tracking,
property handling, etc. etc., and you can probably save on disk space by
keeping the file contents around only as long as they're being scanned.
It does mean that your custom tool will have to parse the dumpfile
format, but that's really not so hard, the format is quite simple, and
there are a number of example scripts that do that in our repository.
Another alternative is to use our API directly, possibly through one of
the bindings, to get file contents straight from the repository; but I
suspect it's harder than parsing the dump file.

-- Brane
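
To illustrate how little is needed for such a custom tool, here is a minimal
sketch of a dumpfile reader. It is not one of the example scripts from the
Subversion repository. It assumes the dump was made without --deltas (so node
text is stored as full content) and relies only on the Content-length,
Prop-content-length and Text-content-length headers of each record. The
function names are hypothetical.

```python
from typing import BinaryIO, Dict, Iterator, Tuple


def read_headers(stream: BinaryIO) -> Dict[str, str]:
    """Read one block of 'Key: value' lines; return {} at end of stream."""
    headers: Dict[str, str] = {}
    while True:
        line = stream.readline()
        if not line:
            return headers          # end of stream
        line = line.rstrip(b"\n")
        if not line:
            if headers:
                return headers      # a blank line ends a header block
            continue                # skip blank padding between records
        key, _, value = line.partition(b": ")
        headers[key.decode("utf-8")] = value.decode("utf-8")


def iter_file_texts(stream: BinaryIO) -> Iterator[Tuple[int, str, bytes]]:
    """Yield (revision, path, content) for every node that carries text."""
    revision = -1
    while True:
        headers = read_headers(stream)
        if not headers:
            return
        if "Revision-number" in headers:
            revision = int(headers["Revision-number"])
        # Records such as deletions have no content section at all.
        body = stream.read(int(headers.get("Content-length", 0)))
        text_len = headers.get("Text-content-length")
        if "Node-path" in headers and text_len is not None:
            # The body is the property section followed by the file text.
            prop_len = int(headers.get("Prop-content-length", 0))
            text = body[prop_len:prop_len + int(text_len)]
            yield revision, headers["Node-path"], text
```

A script built on this could read the dumpstream from sys.stdin.buffer and
hand each content to the scanner, keeping it only as long as the scan takes.
Plain copies carry no text in the dumpstream, so a file that was merely
copied (for example into a branch) is not emitted a second time.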


Re: Every Version of Every File in a Repository

2014-10-07 Thread Alexey Neyman
On Wednesday, October 08, 2014 12:41:01 AM Branko Čibej wrote:
 On 07.10.2014 22:36, Andreas Mohr wrote:
  Hi,
  
  That's certainly a somewhat tough one.
  
  
  I will get tarred and feathered here for my way of trying to solve this,
  and possibly even rightfully so, but... ;)
 
 Well, I certainly won't skin you alive for suggesting this; but ... I
 would imagine that git svn fetch has to essentially do just what the
 OP doesn't want to do, i.e., successively retrieve each revision of
 every file in the Subversion repository to populate the Git repository.
 There's not much chance this would be faster than just doing the same
 with Subversion, especially since, once you're done, you /still/ have to
 scan the files in the resulting Git repo.
 
 
 Going back to the original question ...
 
 Aside from the brute-force method of checking out the entire repository
 starting at revision 1 , performing a scan, updating to the next
 revision,
 and repeating until I reach the head, I don’t know of a way to do this.
 
 This is, in fact, likely to be (almost) the most efficient way to do
 this, since you can just use the existing Subversion client to deal with
 the repository contents and version discrepancies.
 
 But there is an alternative that might be more efficient in your case:
 Create a dumpstream of the repository using svnadmin dump,
 non-incremental and not using deltas, then pipe the stream to a custom
 tool that extracts the file contents from the stream and either writes them
 to disk, or passes them to your scanning tool in some other way.
 
 The reason why this could be faster than the checkout+repeated update is
 that you do not have the overhead of a working copy, directory tracking,
 property handling, etc. etc., and you can probably save on disk space by
 keeping the file contents around only as long as they're being scanned.
 It does mean that your custom tool will have to parse the dumpfile
 format, but that's really not so hard, the format is quite simple, and
 there are a number of example scripts that do that in our repository.
 Another alternative is to use our API directly, possibly through one of
 the bindings, to get file contents straight from the repository; but I
 suspect it's harder than parsing the dump file.

The Python bindings for parsing the dumpstream currently do not work, as I
described on svn-dev@ some time ago: the layer which does thunking of the C
calls back to Python code is not implemented right now. As far as I can see,
the Perl/Ruby bindings have the same problem.

That, and the way to create a stream in Python does not seem to be working -
see the email I just sent to svn-dev@ a few minutes ago. Ironically, I found
that when I tried to test the implementation of this thunking code for parsing
the dumpstream :) Not sure if this affects Perl/Ruby.

So, back to your advice - it's either using the C library directly, or
implementing the parser for the stream. Which isn't hard, I admit.

Regards,
Alexey.
 
 -- Brane