Answered my own question to a degree.  For the benefit of the group, here is
how to do it:

rsync -a -v --include '*/' --include '*.pom' --include '*.xml' --exclude '*' \
    --bwlimit=1000 mirrors.ibiblio.org::maven2/ maven2

That will retrieve all of the pom and xml metadata files for the maven
central repository.
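
One way to check the filter rules without touching ibiblio at all is to run
the same includes and excludes against a throwaway local tree.  What follows
is only a sketch: the function name, directory layout and file names are made
up for illustration, and it assumes rsync is installed locally.  rsync tests
each file against the rules in order and the first match wins, so the '*/'
include lets it descend into every directory before the final exclude drops
everything that is not a pom or xml file.

```shell
#!/bin/sh
# Copy only *.pom and *.xml files from src to dst, keeping the directory tree.
# Same filter rules as the command above, but with local paths and no bwlimit.
# The patterns are quoted so the shell hands them to rsync unexpanded.
mirror_metadata() {
    src=$1
    dst=$2
    rsync -a \
        --include '*/' --include '*.pom' --include '*.xml' --exclude '*' \
        "$src/" "$dst"
}

if command -v rsync >/dev/null 2>&1; then
    # Hypothetical sample tree, just for trying the filters out.
    demo=$(mktemp -d)
    mkdir -p "$demo/src/org/example/foo/1.0"
    touch "$demo/src/org/example/foo/maven-metadata.xml" \
          "$demo/src/org/example/foo/1.0/foo-1.0.pom" \
          "$demo/src/org/example/foo/1.0/foo-1.0.jar"

    mirror_metadata "$demo/src" "$demo/dst"

    # List what was copied; the jar should be absent.
    (cd "$demo/dst" && find . -type f | sort)
    rm -rf "$demo"
else
    echo "rsync not installed; skipping demo" >&2
fi
```

Because of the '*/' include, directories that contain no pom or xml files are
still created at the destination; adding -m (--prune-empty-dirs) should leave
those out if they are unwanted.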

At first I tried to just do a full rsync, but ibiblio cut me off after
about 3.4G of transfer.  After an hour or so they let me back in, hence the
bwlimit of 1000KB/s to try not to hog their bandwidth.  Unfortunately
they don't seem to publish what their limits are, so I guess I'll have to
play with it to see how long it takes me to get all the data.

After I get all the poms I'll start in on the full repository via a slow
slurp.  I'm OK with it taking weeks to get the jars for the first sync, and
then once I have the full repo getting the updates shouldn't be so taxing.

Progress!

Matt


On Mon, Apr 9, 2012 at 9:20 PM, Matt Taylor <[email protected]> wrote:

> Perhaps this is already in existence somewhere.  If so please point me in
> the right direction.
>
> I want to know what the most popular dependencies are, not based on
> downloads, but based on dependencies from other projects.
> I want to explore the full dependency graph and see its evolution over
> 'time' (for instance seeing how fast versions of artifacts are adopted).
> I want to create a visual representation of all the dependencies just
> because it would look cool.
>
> In general I want total access to all the metadata (pom files essentially)
> in the maven central repo, so I can see how the world's software fits
> together on a 'global' scale.
>
> Eventually I would like to explore the jar artifacts as well to get deeper
> insights into what methods/classes are being referenced as well, but that
> is phase 2. :)
>
> From googling around it appears that understandably it is improper to
> simply wget the entire repo.  However, there don't seem to be any publicly
> available torrents, or other resources for me to get access to this data.
>
> http://search.maven.org/#stats
>
> 457GB is a lot of data, but it isn't an unimaginable amount, and most of
> that is no doubt the artifacts, not the metadata (pom files).
>
> So I really have two questions:
>
> 1. What is the easiest path to getting rsync-type access to the full repo?
> (I'd quite understand if I needed to pay a fee for this level of access.)
> 2. Failing that, what would be a legitimate way of just getting all the
> pom files?
>
> Basically I want to be a good guy and not put undue load on the servers,
> but at the same time I really want the data.
>
> Thanks,
>
> Matt Taylor
> http://blog.matthewjosephtaylor.com
>
