Re: Taking a step back

Marvin Humphrey Thu, 11 May 2006 10:09:30 -0700


On May 10, 2006, at 8:02 AM, Robert Engels wrote:

The file format issue, however, is a non-issue. If you want interoperability
between systems, do it via remote invocation and IIOP, or some HTTP
interface. This is far easier to control, especially through version
change cycles - otherwise all platforms need to be updated together - which
is very hard to do (unless you are using Java with WORA!).
I also don't understand why Lucene doesn't focus on being THE JAVA search
engine. I think anything that detracts from that moving forward should be
out of scope.

I really don't relish the prospect that this might degenerate into a language argument, but I think it falls to me to respond, since the patch I submitted on Monday opens up a lot of possibilities for interop.


I don't necessarily disagree.

Abandoning all attempts at interop has its advantages. One unfortunate albeit unavoidable aspect of Lucene is that it is tightly bound to its file format. In a perfect world, the file reading/writing apparatus would be modular: the index would be read into memory using a plugin, manipulated, then saved using another plugin. That doesn't work, obviously, because indexes are commonly too large to be read into available RAM, and so the I/O stuff is scattered over the entire library, which makes maintaining compatibility laborious.

However, Lucene has to make some effort to track its file format definition, so that it may live up to the commitments for backwards compatibility codified earlier in this thread. This is currently done using the File Formats document (though that document is incomplete and buggy). There's not much difference between supporting the files written by an earlier version of Lucene and supporting the files written by another implementation of Lucene which adhere to the same spec.

The only question is whether there are Java-specific optimizations which are so advantageous that they outweigh the benefits of interchange. There is no inherent advantage in using Modified UTF-8 over standard UTF-8, and the UTF-8 code I supplied actually speeds up Lucene by a couple percent because it simplifies some conditionals -- all of the performance hit comes from using a bytecount as the String prefix. I have good reasons to believe that this can go away, not the least of which is I've actually written a working implementation in Perl/C which uses bytecounts and I know where all the bottlenecks are.
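
To make the encoding difference concrete, here is a minimal Python sketch. It is not Lucene code and not the patch itself; the function names are made up for illustration. It relies only on the two documented ways Java's Modified UTF-8 departs from standard UTF-8: U+0000 is written as the two bytes C0 80, and characters outside the Basic Multilingual Plane are written as two 3-byte sequences, one per UTF-16 surrogate. It also shows why a length prefix counted in bytes differs from one counted in UTF-16 code units, which I am assuming is the alternative to the bytecount prefix mentioned above.

```python
import struct


def modified_utf8(text: str) -> bytes:
    """Hand-rolled sketch of Java-style Modified UTF-8 (illustrative only)."""
    out = bytearray()
    # Modified UTF-8 encodes each UTF-16 code unit on its own,
    # so walk the string as 16-bit units rather than code points.
    utf16 = text.encode("utf-16-be")
    for (unit,) in struct.iter_unpack(">H", utf16):
        if unit == 0:
            # U+0000 becomes two bytes, so the output never holds a raw NUL.
            out += b"\xc0\x80"
        else:
            # "surrogatepass" lets each surrogate half become a 3-byte sequence.
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


def utf16_length(text: str) -> int:
    """Number of UTF-16 code units, i.e. what Java's String.length() reports."""
    return len(text.encode("utf-16-be")) // 2


# 'a', NUL, e-acute, and a musical symbol outside the BMP (U+1D11E).
sample = "a\u0000\u00e9\U0001D11E"
standard = sample.encode("utf-8")
modified = modified_utf8(sample)

print("UTF-16 code units:     ", utf16_length(sample))
print("standard UTF-8 bytes:  ", len(standard))
print("Modified UTF-8 bytes:  ", len(modified))
```

For plain ASCII text the two encodings produce identical bytes; they diverge only on NUL and on supplementary characters, which is where a reader written in another language would trip over the Java-specific form.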

There are also advantages to keeping the file format public, both for Java Lucene and for the larger Apache Lucene project. Of course there's the raw usefulness of interchange. For instance, it might be nice to whip up a little script in Perl or Ruby which works with your existing rig -- especially if there's a CPAN module that offers functionality you need which isn't available yet in Java, or you'd benefit from a near-instantaneous startup time.

But more important, I'd argue, is that having all implementations share a common file format means that all the authors have an amplified interest in coordinating, communicating, and contributing. Just as learning new languages, programming or natural, broadens an individual's horizons, so does working out an implementation based on Lucene's data structures in another language lead to fresh thinking. The more cross-pollination of ideas from various authors and by proxy, their extended communities, the more all of the sub-projects gain and the faster Apache Lucene as a whole advances.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

