FYI, running another mirror is a noble goal, and I'm sure a lot of good could come of finding innovative ways to index. I did want to let you know, though, that there is already a searchable index that gives a number of tools a way to search by class name, GAV coordinates, etc.
Might it make more sense to find new information to include in the Nexus index? See http://docs.codehaus.org/display/M2ECLIPSE/Nexus+Indexer for more information about the index and how to generate one from the command line.

References:

[1] http://weblogs.java.net/blog/kohsuke/archive/2008/05/nexus_index_is.html
[2] http://wiki.netbeans.org/MavenBestPractices

On Tue, Jul 14, 2009 at 4:08 PM, Geoff Clitheroe<[email protected]> wrote:
> Hi,
>
> I'm interested in hosting a maven mirror in New Zealand.  As far as I
> know there is not one available in this region.  Any comments on this
> being a good idea?  If so, what is the preferred method (and source)
> for creating a mirror and keeping it in sync?  I work for
> http://www.gns.cri.nz and we could either host here (good because we
> are on a research network as well) or at one of our remote sites (good
> because they are on the main NZ peering points).  I've got 'in
> principle agreement' but I need to work out some real numbers (disk,
> bandwidth etc).
>
> Aside from the obvious local mirror reasons I'm also interested in
> adding class name search to a repo, similar to
> http://www.findjar.com/.  I'll include, at the end of this message, a
> discussion I've been having with one of the Tattletale developers
> (http://www.jboss.org/tattletale) about this idea and some testing
> I've been doing.  If anyone has any comments on the idea (validity,
> necessity, obvious pitfalls etc) it would be greatly appreciated.
>
> Cheers,
> Geoff
>
>
> Hi Jesper,
>
> Just following up on searching for classes.  I'm imagining that when
> Tattletale reports on missing classes it would then be possible
> to provide a link to a list of jars that contain that class (like
> http://www.findjar.com/).  You mentioned profiles for Tattletale, which
> I will get back to at the end.
>
> I've done a spike test for implementing class name level searching for
> public repos.  I would hope to develop a search function that is
> embeddable and could also be a web service (it's actually more complex
> to make it embeddable than to provide a web service).  From what I've
> done I see the main issues as being bandwidth and political, and I
> don't think they would be insurmountable.
>
> What I've done is scrape about 2000 jars from
> http://mirrors.ibiblio.org/pub/mirrors/maven2/
> and http://repository.jboss.com/maven2/.  I targeted Spring, Hibernate,
> Seam, Webbeans, Tapestry, and a few apache-commons projects.  I then
> analyse the jars using 'jar tf ...' to extract the class names and
> populate a lucene index using solr.  The resulting index is about 3.8M
> (for 880M of jars) with no thought to space saving in the index yet.
> Analysis and indexing takes about 10 mins on my aged laptop and again
> I've done nothing to optimize this as yet.
>
> I've added two search methods to access the index:
> public List<JarName> findJarsByClassName(String className);
> public List<JarLocation> findJarsByJarName(String jarName);
>
> So findJarsByClassName() returns a list of JarName that contain that
> class and from that the jar name can be listed:
>
> findJarsByClassName("net.sf.hibernate.Hibernate")
> hibernate-2.0.1.jar
> hibernate-2.0.2.jar
> hibernate-2.0.3.jar
> hibernate-2.0-beta-6.jar
> hibernate-2.0-final.jar
> hibernate-2.0-final.jar
> hibernate-2.0-beta-5.jar
> hibernate-2.1.1.jar
> hibernate-2.1.2.jar
> hibernate-2.1.3.jar
>
> Searching for a class name is case insensitive and can match part of a
> class name down to the punctuation token level (e.g., net.sf but not
> net.sf.hiber).  For implementing search the devil is always in the
> analysis but I think this is a fairly well defined problem.
>
> Then findJarsByJarName() can be used to find where a particular jar
> is and return a URL and containing directory URL (often more useful as
> the first thing to do is usually look at the POM).  Ultimately this
> could return links to several repos:
>
> findJarsByJarName("hibernate-2.1.3.jar")
> hibernate-2.1.3.jar
> http://mirrors.ibiblio.org/pub/mirrors/maven2/hibernate/hibernate/2.1.3/hibernate-2.1.3.jar
> http://mirrors.ibiblio.org/pub/mirrors/maven2/hibernate/hibernate/2.1.3
>
> The search is very fast (milliseconds per query in my spike) and I'm
> going indirectly through solr - this can be made quicker (but more
> complex) by working with lucene directly.
>
> If this looks interesting I'd be happy to provide the spike code for
> any comments, feedback etc.  Do you use Linux (Unix), Windows, or Mac?
> Then I could make sure there are working examples.
>
> For the project I'm proposing there are some implementation questions,
> mainly around getting all the jars and syncing out new indexes (but
> these issues have been well addressed by Lucene before).  In the first
> instance I would approach the owners of http://repo2.maven.org/maven2/
> about scraping and hosting a mirror in New Zealand.  I certainly think
> it could be done.  I guess I'd imagine a subproject of lucene but that
> is yet to be thought through.
>
> Back to the Tattletale profile question.  Are you referring to files like
> this:
> http://fisheye.jboss.org/browse/Tattletale/trunk/src/etc/sunjdk5-jsse.clz?r=trunk
>
> If this API is a snapshot of the classes contained in a jar at some
> release point then it's easy to do with 'jar tf' and a script (as long
> as the jar spec stays about the same and people don't obfuscate, or
> use a custom class loader etc).  I could produce this as a side effect
> of indexing a repo but I wonder how this will scale.
>
> I look forward to hearing from you.
>
> Cheers,
> Geoff
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]

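For anyone following the thread, here is a minimal sketch of the class-name lookup described in the quoted message. It is not Geoff's spike code and not the Nexus indexer. It assumes a plain in-memory map in place of the Lucene/Solr index, reads jar entries with java.util.jar instead of shelling out to 'jar tf', and returns jar names as strings rather than the JarName type mentioned above. The class name ClassNameIndex and the helpers toClassName, addJar, and addClass are hypothetical. Only leading package prefixes are matched, which is one possible reading of "punctuation token level".

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;

/** Hypothetical in-memory stand-in for the Lucene/Solr index in the thread. */
public class ClassNameIndex {

    // Lower-cased class name or package prefix -> names of jars containing it.
    private final Map<String, Set<String>> index = new TreeMap<>();

    /** Maps a jar entry such as "a/b/C.class" to "a.b.C"; null for non-class entries. */
    public static String toClassName(String entryName) {
        String suffix = ".class";
        if (!entryName.endsWith(suffix)) {
            return null;
        }
        String path = entryName.substring(0, entryName.length() - suffix.length());
        return path.replace('/', '.');
    }

    /** Reads a jar from a stream (roughly what 'jar tf' lists) and indexes its classes. */
    public void addJar(String jarName, InputStream in) throws IOException {
        try (JarInputStream jar = new JarInputStream(in)) {
            for (JarEntry e = jar.getNextJarEntry(); e != null; e = jar.getNextJarEntry()) {
                String className = toClassName(e.getName());
                if (className != null) {
                    addClass(jarName, className);
                }
            }
        }
    }

    /** Indexes every dot-delimited leading prefix: net, net.sf, net.sf.hibernate, ... */
    public void addClass(String jarName, String className) {
        StringBuilder prefix = new StringBuilder();
        for (String token : className.toLowerCase(Locale.ROOT).split("\\.")) {
            if (prefix.length() > 0) {
                prefix.append('.');
            }
            prefix.append(token);
            index.computeIfAbsent(prefix.toString(), k -> new TreeSet<>()).add(jarName);
        }
    }

    /** Case-insensitive lookup; whole tokens only, so "net.sf.hiber" finds nothing. */
    public List<String> findJarsByClassName(String className) {
        Set<String> jars = index.get(className.toLowerCase(Locale.ROOT));
        return jars == null ? new ArrayList<>() : new ArrayList<>(jars);
    }
}
```

With hibernate-2.1.3.jar indexed this way, a query for "net.sf" or "NET.SF.HIBERNATE.Hibernate" would return that jar, while "net.sf.hiber" would return an empty list. A real index would also need to record where each jar lives to support findJarsByJarName().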