Answered my own question, to a degree. For the benefit of the group, here is how to do it:

    rsync -a -v --include '*/' --include '*.pom' --include '*.xml' --exclude '*' --bwlimit=1000 mirrors.ibiblio.org::maven2/ maven2

That will retrieve all of the pom and xml metadata files for the maven central repository. The patterns are quoted so the shell doesn't expand them before rsync sees them.

At first I tried to just do a full rsync, but ibiblio cut me off after about 3.4G of transfer. After an hour or so they let me back in, hence the bwlimit of 1000 KB/s to try not to hog their bandwidth. Unfortunately they don't seem to publish what their limits are, so I guess I'll have to play with it to see how long it takes me to get all the data.

After I get all the poms I'll start in on the full repository via a slow slurp. I'm OK with it taking weeks to get the jars for the first sync, and once I have the full repo, getting the updates shouldn't be so taxing.

Progress!

Matt

On Mon, Apr 9, 2012 at 9:20 PM, Matt Taylor <[email protected]> wrote:

> Perhaps this is already in existence somewhere. If so please point me in
> the right direction.
>
> I want to know what the most popular dependencies are, not based on
> downloads, but based on dependencies from other projects.
> I want to explore the full dependency graph and see its evolution over
> 'time' (for instance seeing how fast versions of artifacts are adopted).
> I want to create a visual representation of all the dependencies just
> because it would look cool.
>
> In general I want total access to all the metadata (pom files essentially)
> in the maven central repo, so I can see how the world's software fits
> together on a 'global' scale.
>
> Eventually I would like to explore the jar artifacts as well to get deeper
> insights into what methods/classes are being referenced, but that
> is phase 2. :)
>
> From googling around it appears that, understandably, it is improper to
> simply wget the entire repo. However, there don't seem to be any publicly
> available torrents, or other resources for me to get access to this data.
>
> http://search.maven.org/#stats
>
> 457GB is a lot of data, but it isn't an unimaginable amount, and most of
> that is no doubt the artifacts, not the metadata (pom files).
>
> So I really have two questions:
>
> 1. What is the easiest path to getting rsync type access of the full repo
> (I'd quite understand if I needed to pay a fee for this level of access).
> 2. Failing that, what would be a legitimate way of just getting all the
> pom files?
>
> Basically I want to be a good guy and not put undue load on the servers,
> but at the same time I really want the data.
>
> Thanks,
>
> Matt Taylor
> http://blog.matthewjosephtaylor.com
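
For the dependency-popularity question in the original post, here is a rough Python sketch of how the counting might go once the poms are on disk. It assumes the mirror lives in the local maven2 directory from the rsync command; the function names (count_dependencies, child_text, local_name) are made up for illustration. It counts every <dependency> element it finds, including ones under dependencyManagement and plugin sections, and it does not resolve parent poms or ${property} placeholders, so treat the numbers as approximate.

```python
import collections
import pathlib
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://maven.apache.org/POM/4.0.0}'."""
    return tag.rsplit("}", 1)[-1]


def child_text(element, name):
    """Return the stripped text of the first direct child called `name`, or None."""
    for child in element:
        if local_name(child.tag) == name and child.text:
            return child.text.strip()
    return None


def count_dependencies(repo_root):
    """Count how many pom files mention each groupId:artifactId as a dependency.

    Hypothetical helper: each pom contributes at most 1 per coordinate,
    and versions are ignored.
    """
    counts = collections.Counter()
    for pom in pathlib.Path(repo_root).rglob("*.pom"):
        try:
            root = ET.parse(pom).getroot()
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        seen = set()
        for element in root.iter():
            if local_name(element.tag) != "dependency":
                continue
            group = child_text(element, "groupId")
            artifact = child_text(element, "artifactId")
            if group and artifact:
                seen.add(f"{group}:{artifact}")
        counts.update(seen)
    return counts
```

Something like count_dependencies("maven2").most_common(20) would then list the twenty most referenced coordinates. Matching on the local tag name means the same code handles poms with and without the POM namespace declaration.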
