I agree it is definitely going to be imperfect, and in the end it will only be a sampling of the real usage, but I think it will still provide interesting information. As for bogus conclusions reached by others: I plan on putting some effort into explaining what the results are and what they mean, and into making them accessible. Hopefully I'll get it mostly right and/or attract other smarter people who will carry on from me. Time will tell on that one. :)
I agree that figuring out the temporal aspects of the graph will be a hard problem (but rewarding as well if I can tease out the evolution of the ecosystem). Version numbers provide a sort of ordering, but it's messy.

All in all I think you make some valid points as far as the difficulty, but the challenges are part of what attracts me to this. Even if I fail miserably, I'll still learn a ton, and hopefully have some fun along the way.

Matt

On Mon, Apr 9, 2012 at 10:01 PM, Ron Wheeler <[email protected]> wrote:

> You are going to be missing the key ingredient which is the application
> POMs that tell you what artifacts are actually used.
>
> You might get some interesting information about things like log4j which
> is probably used by lots of things inside Maven Central.
> You will be grossly misled about the use of things like CXF since it is
> hardly ever called by a library that would be submitted to Maven Central
> but is frequently used by projects that are in private repositories.
>
> You may be able to visualize a "where used" between libraries but you will
> have a lot of nodes that are "never used" which is not true.
>
> You will have to figure out a way to separate projects that are still used
> and produced a ton of revisions 5 years ago but nothing since, from
> projects that are mature yet still active but only produce new versions
> every 18 months since they are stable and work, from projects that were
> very active and then died as they became unnecessary due to newer
> technologies being introduced.
>
> You will also have trouble with projects that repackage their artifacts
> between major releases and change the GAV structure by redistributing the
> functionality.
>
> Not sure that your project is going to produce any useful information and
> I fear that it will be misleading to anyone who does not look deeper into
> the raw data.
>
> Visualization may just make it easier for incorrect conclusions to be
> developed.
>
> Ron
>
> On 09/04/2012 10:20 PM, Matt Taylor wrote:
>
>> Perhaps this is already in existence somewhere. If so please point me in
>> the right direction.
>>
>> I want to know what the most popular dependencies are, not based on
>> downloads, but based on dependencies from other projects.
>> I want to explore the full dependency graph and see its evolution over
>> 'time' (for instance seeing how fast versions of artifacts are adopted).
>> I want to create a visual representation of all the dependencies just
>> because it would look cool.
>>
>> In general I want total access to all the metadata (pom files essentially)
>> in the maven central repo, so I can see how the world's software fits
>> together on a 'global' scale.
>>
>> Eventually I would like to explore the jar artifacts as well to get deeper
>> insights into what methods/classes are being referenced as well, but that
>> is phase 2. :)
>>
>> From googling around it appears that understandably it is improper to
>> simply wget the entire repo. However, there don't seem to be any publicly
>> available torrents, or other resources for me to get access to this data.
>>
>> http://search.maven.org/#stats
>>
>> 457GB is a lot of data, but it isn't an unimaginable amount, and most of
>> that is no doubt the artifacts, not the metadata (pom files).
>>
>> So I really have two questions:
>>
>> 1. What is the easiest path to getting rsync-type access of the full repo
>> (I'd quite understand if I needed to pay a fee for this level of access).
>> 2. Failing that, what would be a legitimate way of just getting all the
>> pom files?
>>
>> Basically I want to be a good guy and not put undue load on the servers,
>> but at the same time I really want the data.
>>
>> Thanks,
>>
>> Matt Taylor
>> http://blog.matthewjosephtaylor.com
>
> --
> Ron Wheeler
> President
> Artifact Software Inc
> email: [email protected]
> skype: ronaldmwheeler
> phone: 866-970-2435, ext 102
