Re: How to get access to ALL the data in maven central?

Matt Taylor Mon, 09 Apr 2012 22:06:43 -0700

I agree it is definitely going to be imperfect, and in the end it will only
be a sampling of the real usage, but I think it will still provide
interesting information.  As for bogus conclusions reached by others:
I plan on putting some effort into explaining what the results are, what
they mean, and making them accessible. Hopefully I'll get it mostly right
and/or attract other, smarter people who will carry on from me.  Time will
tell on that one. :)


I agree that figuring out the temporal aspects of the graph will be a hard
problem (but rewarding as well if I can tease out the evolution of the
ecosystem).  Version numbers provide a sort of ordering, but it's messy.
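
To illustrate why the ordering is messy, here is a minimal sketch in Python. It is not Maven's own comparison logic: the function name version_key and the QUALIFIER_RANK table are assumptions made up for this example, and the qualifier handling is deliberately simplified. It only shows that plain string sorting gets versions like 1.10 and 1.9 wrong, and that pre-release qualifiers need special treatment.

```python
import re

# Hypothetical, simplified qualifier ranking for this sketch only.
# Real Maven version ordering has more rules than this.
QUALIFIER_RANK = {"alpha": 0, "beta": 1, "rc": 2, "snapshot": 3, "": 4}


def version_key(version):
    """Return a sortable key for a version string (simplified sketch)."""
    numbers = []            # numeric components before any qualifier
    qualifier = ""          # e.g. "beta", "rc"; empty means a plain release
    qualifier_number = 0    # e.g. the 2 in "beta-2" or the 1 in "rc1"
    seen_qualifier = False

    for part in re.split(r"[.\-]", version.lower()):
        if not part:
            continue
        if part.isdigit():
            if seen_qualifier:
                qualifier_number = int(part)
            else:
                numbers.append(int(part))
        else:
            # Split something like "rc1" into letters and trailing digits.
            match = re.match(r"([a-z]*)(\d*)", part)
            qualifier = match.group(1)
            if match.group(2):
                qualifier_number = int(match.group(2))
            seen_qualifier = True

    # Treat 1.0 and 1.0.0 as the same version by dropping trailing zeros.
    while numbers and numbers[-1] == 0:
        numbers.pop()

    # Unknown qualifiers are ranked like a plain release in this sketch.
    rank = QUALIFIER_RANK.get(qualifier, QUALIFIER_RANK[""])
    return (numbers, rank, qualifier_number)


versions = ["1.10", "1.9", "1.0-rc1", "1.0", "1.0-beta-2"]
by_string = sorted(versions)
by_key = sorted(versions, key=version_key)
print(by_string)  # plain string sort puts 1.10 before 1.9
print(by_key)     # numeric-aware sort puts pre-releases before 1.0
```

Even with a key like this, version order is only a proxy for release time; actual timestamps would have to come from the repository metadata itself.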

All in all I think you make some valid points about the difficulty, but
the challenges are part of what attracts me to this.  Even if I
fail miserably, I'll still learn a ton, and hopefully have some fun along
the way.

Matt

On Mon, Apr 9, 2012 at 10:01 PM, Ron Wheeler <[email protected]> wrote:

> You are going to be missing the key ingredient which is the application
> POMs that tell you what artifacts are actually used.
>
> You might get some interesting information about things like log4j which
> is probably used by lots of things inside Maven Central.
> You will be grossly misled about the use of things like CXF since it is
> hardly ever called by a library that would be submitted to Maven Central
> but is frequently used by projects that are in private repositories.
>
> You may be able to visualize a "where used" between libraries but you will
> have a lot of nodes that are "never used" which is not true.
>
> You will have to figure out a way to separate projects that are still used
> and produced a ton of revisions 5 years ago but nothing since, from
> projects that are mature yet still active but only produce new versions
> every 18 months since they are stable and work, from projects that were
> very active and then died as they became unnecessary due to newer
> technologies being introduced.
>
> You will also have trouble with projects that repackage their artifacts
> between major releases and change the GAV structure by redistributing the
> functionality.
>
> Not sure that your project is going to produce any useful information and
> I fear that it will be misleading to anyone who does not look deeper into
> the raw data.
>
> Visualization may just make it easier for incorrect conclusions to be
> developed.
>
> Ron
>
>
> On 09/04/2012 10:20 PM, Matt Taylor wrote:
>
>> Perhaps this is already in existence somewhere.  If so please point me in
>> the right direction.
>>
>> I want to know what the most popular dependencies are, not based on
>> downloads, but based on dependencies from other projects.
>> I want to explore the full dependency graph and see its evolution over
>> 'time' (for instance seeing how fast versions of artifacts are adopted).
>> I want to create a visual representation of all the dependencies just
>> because it would look cool.
>>
>> In general I want total access to all the metadata (pom files essentially)
>> in the maven central repo, so I can see how the world's software fits
>> together on a 'global' scale.
>>
>> Eventually I would like to explore the jar artifacts as well to get deeper
>> insights into what methods/classes are being referenced as well, but that
>> is phase 2. :)
>>
>> From googling around it appears that, understandably, it is improper to
>> simply wget the entire repo.  However, there don't seem to be any publicly
>> available torrents, or other resources for me to get access to this data.
>>
>> http://search.maven.org/#stats
>>
>> 457GB is a lot of data, but it isn't an unimaginable amount, and most of
>> that is no doubt the artifacts, not the metadata (pom files).
>>
>> So I really have two questions:
>>
>> 1. What is the easiest path to getting rsync-type access to the full repo
>> (I'd quite understand if I needed to pay a fee for this level of access)?
>> 2. Failing that, what would be a legitimate way of just getting all the
>> pom files?
>>
>> Basically I want to be a good guy and not put undue load on the servers,
>> but at the same time I really want the data.
>>
>> Thanks,
>>
>> Matt Taylor
>> http://blog.matthewjosephtaylor.com
>>
>>
>
> --
> Ron Wheeler
> President
> Artifact Software Inc
> email: [email protected]
> skype: ronaldmwheeler
> phone: 866-970-2435, ext 102
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
