Hosting (and synching) a maven mirror

Geoff Clitheroe Tue, 14 Jul 2009 14:09:16 -0700

Hi,

I'm interested in hosting a maven mirror in New Zealand.  As far as I
know there is not one available in this region.  Any comments on this
being a good idea?  If so what is the preferred method (and source)
for creating a mirror and keeping it in sync?  I work for
http://www.gns.cri.nz and we could either host here (good because we
are on the research network as well) or at one of our remote sites (good
because they are on the main NZ peering points).  I've got agreement
in principle but I need to work out some real numbers (disk,
bandwidth etc).


Aside from the obvious local mirror reasons I'm also interested in
adding class name search to a repo, similar to
http://www.findjar.com/.  I'll include, at the end of this message, a
discussion I've been having with one of the Tattletale developers
(http://www.jboss.org/tattletale) about this idea and some testing
I've been doing.  If anyone has any comments on the idea (validity,
necessity, obvious pitfalls etc) it would be greatly appreciated.

Cheers,
Geoff


Hi Jesper,

Just following up on searching for classes.  I'm imagining that when
Tattletale reports on missing classes that it would then be possible
to provide a link to a list of jars that contain that class (like
http://www.findjar.com/).  You mentioned profiles for Tattletale which
I will get back to at the end.

I've done a spike test for implementing class name level searching for
public repos.  I would hope to develop a search function that is
embeddable and could also be a web service (it's actually more complex
to make it embeddable than to provide a web service).  From what I've
done I see the main issues as being bandwidth and politics, and I
don't think they would be insurmountable.

What I've done is scrape about 2000 jars from
http://mirrors.ibiblio.org/pub/mirrors/maven2/
and http://repository.jboss.com/maven2/  I targeted Spring, Hibernate,
Seam, Webbeans, Tapestry, and a few apache-commons projects.  I then
analysed the jars using 'jar tf ...' to extract the class names and
populated a Lucene index using Solr.  The resulting index is about 3.8M
(for 880M of jars) with no thought to space saving in the index yet.
Analysis and indexing takes about 10 mins on my aged laptop and again
I've done nothing to optimize this as yet.
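
A minimal sketch of the extraction step, assuming it is done in Java
rather than by shelling out to 'jar tf'.  The class and method names
(JarClassLister, toClassName, listClassNames) are hypothetical and are
not taken from the spike code.  It only shows how jar entry paths could
be turned into class names ready for indexing:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not the spike code: lists the classes in a jar,
// roughly what filtering the output of 'jar tf' for .class entries gives.
final class JarClassLister {

    // Converts an entry path such as "net/sf/hibernate/Hibernate.class"
    // to "net.sf.hibernate.Hibernate"; returns null for non-class entries.
    static String toClassName(String entryName) {
        if (!entryName.endsWith(".class")) {
            return null;
        }
        String withoutSuffix =
                entryName.substring(0, entryName.length() - ".class".length());
        return withoutSuffix.replace('/', '.');
    }

    // Reads a jar (which is a zip stream) and collects its class names.
    static List<String> listClassNames(InputStream jarStream) throws IOException {
        List<String> classNames = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(jarStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String className =
                        entry.isDirectory() ? null : toClassName(entry.getName());
                if (className != null) {
                    classNames.add(className);
                }
            }
        }
        return classNames;
    }
}
```

Reading from an InputStream means the same method would work on a jar
on disk or one streamed straight from a repository.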

I've added two search methods to access the index:
public List<JarName> findJarsByClassName(String className);
public List<JarLocation> findJarsByJarName(String jarName);

So findJarsByClassName() returns a list of JarName objects for the jars
that contain that class, and from those the jar names can be listed:

findJarsByClassName("net.sf.hibernate.Hibernate")
hibernate-2.0.1.jar
hibernate-2.0.2.jar
hibernate-2.0.3.jar
hibernate-2.0-beta-6.jar
hibernate-2.0-final.jar
hibernate-2.0-final.jar
hibernate-2.0-beta-5.jar
hibernate-2.1.1.jar
hibernate-2.1.2.jar
hibernate-2.1.3.jar
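
To illustrate the shape of this lookup without Solr or Lucene, here is
a rough sketch using a plain in-memory map from class name to jar
names.  InMemoryJarIndex and its add() method are hypothetical, it
returns plain strings rather than JarName objects, and it only matches
full class names (case insensitively), so it is a stand-in for the
idea rather than the spike implementation:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the Solr-backed index: an inverted map from
// lower-cased class name to the jars that contain that class.
final class InMemoryJarIndex {

    private final Map<String, List<String>> jarsByClassName = new HashMap<>();

    // Records that the given jar contains each of the given classes.
    void add(String jarName, List<String> classNames) {
        for (String className : classNames) {
            jarsByClassName
                    .computeIfAbsent(className.toLowerCase(Locale.ROOT),
                            key -> new ArrayList<>())
                    .add(jarName);
        }
    }

    // Returns the jars containing the class, in insertion order, or an
    // empty list when the class is unknown.
    List<String> findJarsByClassName(String className) {
        List<String> jars = jarsByClassName.get(className.toLowerCase(Locale.ROOT));
        return jars == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(jars);
    }
}
```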

Searching for a class name is case insensitive and can match part of a
class name, down to the level of punctuation-delimited tokens (e.g.,
net.sf matches but net.sf.hiber does not).  For implementing search the
devil is always in the analysis but I think this is a fairly well
defined problem.
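
The matching rule described above can be sketched roughly as follows.
This assumes that tokens are split on any non-alphanumeric character
and that a query matches when its tokens appear as a contiguous run in
the class name; the actual Solr analysis chain is not described here,
so the ClassNameTokens class is purely illustrative:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of token-level, case-insensitive matching.
final class ClassNameTokens {

    // Splits on runs of non-alphanumeric characters (an assumption) and
    // lower-cases each token.
    static List<String> tokenize(String name) {
        List<String> tokens = new ArrayList<>();
        for (String part : name.split("[^A-Za-z0-9]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // True when the query's tokens occur as a contiguous run within the
    // class name's tokens; partial tokens such as "hiber" do not match.
    static boolean matches(String className, String query) {
        List<String> queryTokens = tokenize(query);
        return !queryTokens.isEmpty()
                && Collections.indexOfSubList(tokenize(className), queryTokens) >= 0;
    }
}
```

With this rule, matches("net.sf.hibernate.Hibernate", "net.sf") is
true, while the same call with "net.sf.hiber" is false because "hiber"
is not a whole token.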

Then findJarsByJarName() can be used to find where a particular jar
is and return its URL and the URL of the containing directory (often
more useful, as the first thing to do is usually look at the POM).
Ultimately this could return links to several repos:

findJarsByJarName("hibernate-2.1.3.jar")
hibernate-2.1.3.jar
http://mirrors.ibiblio.org/pub/mirrors/maven2/hibernate/hibernate/2.1.3/hibernate-2.1.3.jar
http://mirrors.ibiblio.org/pub/mirrors/maven2/hibernate/hibernate/2.1.3
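
The pairing of jar URL and containing directory URL shown above could
be derived with something as simple as the sketch below.  The class
name JarLocationSketch and its methods are hypothetical; the fields of
the real JarLocation type are not given in this message:

```java
// Hypothetical helper: derives the pieces of a jar location from its URL.
final class JarLocationSketch {

    // Returns everything before the last '/', i.e. the directory URL.
    static String containingDirectory(String jarUrl) {
        int lastSlash = jarUrl.lastIndexOf('/');
        if (lastSlash < 0) {
            throw new IllegalArgumentException("Not a URL with a path: " + jarUrl);
        }
        return jarUrl.substring(0, lastSlash);
    }

    // Returns everything after the last '/', i.e. the jar file name.
    static String fileName(String jarUrl) {
        return jarUrl.substring(jarUrl.lastIndexOf('/') + 1);
    }
}
```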

The search is very fast (milliseconds per query in my spike) and I'm
going indirectly through solr - this can be made quicker (but more
complex) by working with lucene directly.

If this looks interesting I'd be happy to provide the spike code for
any comments, feedback etc.  Do you use Linux (Unix), Windows, or Mac?
Then I could make sure there are working examples.

For the project I'm proposing there are some implementation questions
mainly around getting all the jars and syncing out new indexes (but
these issues have been well addressed by Lucene before).  In the first
instance I would approach the owners of http://repo2.maven.org/maven2/
about scraping and hosting a mirror in New Zealand.  I certainly think
it could be done.  I guess I'd imagine a subproject of lucene but that
is yet to be thought through.

Back to the Tattletale profile question.  Are you referring to files like this:
http://fisheye.jboss.org/browse/Tattletale/trunk/src/etc/sunjdk5-jsse.clz?r=trunk

If this API is a snapshot of the classes contained in a jar at some
release point then it's easy to do with 'jar tf' and a script (as long
as the jar spec stays about the same and people don't obfuscate, or
use a custom class loader etc).  I could produce this as a side effect
of indexing a repo but I wonder how this will scale.

I look forward to hearing from you.

Cheers,
Geoff

