On May 10, 2006, at 8:02 AM, Robert Engels wrote:

The file format issue, however, is a non-issue. If you want interoperability
between systems do it via remote invocation and IIOP, or some HTTP
interface. This is far easier to control, especially through version change cycles - otherwise all platforms need to be updated together - which
is very hard to do (unless you are using Java with WORA !).

I also don't understand why Lucene doesn't focus on being THE JAVA search engine. Anything that I think detracts from that moving forward should be
out of scope.

I really don't relish the prospect that this might degenerate into a language argument, but I think it falls to me to respond, since the patch I submitted on Monday opens up a lot of possibilities for interop.

I don't necessarily disagree.

Abandoning all attempts at interop has its advantages. One unfortunate albeit unavoidable aspect of Lucene is that it is tightly bound to its file format. In a perfect world, the file reading/writing apparatus would be modular: the index would be read into memory using a plugin, manipulated, then saved using another plugin. That doesn't work, obviously, because indexes are commonly too large to be read into available RAM, and so the I/O stuff is scattered over the entire library, which makes maintaining compatibility laborious.

However, Lucene has to make some effort to track its file format definition, so that it may live up to the commitments for backwards-compatibility codified earlier in this thread. This is currently done using the File Formats document (though that document is incomplete and buggy). There's not much difference between supporting the files written by an earlier version of Lucene and supporting the files written by another implementation of Lucene which adhere to the same spec.

The only question is whether there are Java-specific optimizations which are so advantageous that they outweigh the benefits of interchange. There is no inherent advantage in using Modified UTF-8 over standard UTF-8, and the UTF-8 code I supplied actually speeds up Lucene by a couple percent because it simplifies some conditionals -- all of the performance hit comes from using a bytecount as the String prefix. I have good reasons to believe that this can go away, not the least of which is I've actually written a working implementation in Perl/C which uses bytecounts and I know where all the bottlenecks are.
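To make the encoding difference concrete, here is a minimal sketch. It uses the JDK's DataOutputStream.writeUTF as a stand-in for Modified UTF-8; that is an assumption for illustration, not Lucene's own writer code, and the class name Utf8Comparison is hypothetical. It compares the Java char count with the byte counts of the two encodings for a string containing a NUL and a character outside the Basic Multilingual Plane, which are the two cases where the encodings differ.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
class Utf8Comparison {

    // Modified UTF-8 payload as written by DataOutputStream.writeUTF,
    // with the leading 2-byte length header stripped off.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Standard UTF-8 bytes, as any non-Java implementation would produce.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'a', NUL, and U+1D11E (stored in Java as a surrogate pair).
        String s = "a\u0000\uD834\uDD1E";
        System.out.println("Java chars:           " + s.length());
        System.out.println("standard UTF-8 bytes: " + standardUtf8(s).length);
        System.out.println("modified UTF-8 bytes: " + modifiedUtf8(s).length);
    }
}
```

For this string the char count is 4, standard UTF-8 takes 6 bytes, and Modified UTF-8 takes 9 bytes: the NUL becomes two bytes and each half of the surrogate pair becomes three. A reader in another language has to special-case both, and a length prefix counted in Java chars tells it neither how many bytes nor how many code points follow.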

There are also advantages to keeping the file format public, both for Java Lucene and for the larger Apache Lucene project. Of course there's the raw usefulness of interchange. For instance, it might be nice to whip up a little script in Perl or Ruby which works with your existing rig -- especially if there's a CPAN module that offers functionality you need which isn't available yet in Java, or you'd benefit from a near-instantaneous startup time.

But more important, I'd argue, is that having all implementations share a common file format means that all the authors have an amplified interest in coordinating, communicating, and contributing. Just as learning new languages, programming or natural, broadens an individual's horizons, so does working out an implementation based on Lucene's data structures in another language lead to fresh thinking. The more cross-pollination of ideas from various authors and, by proxy, their extended communities, the more all of the sub-projects gain and the faster Apache Lucene as a whole advances.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

