On May 10, 2006, at 8:02 AM, Robert Engels wrote:
> The file format issue, however, is a non-issue. If you want
> interoperability between systems, do it via remote invocation and
> IIOP, or some HTTP interface. This is far easier to control,
> especially through version change cycles - otherwise all platforms
> need to be updated together, which is very hard to do (unless you are
> using Java with WORA!).
>
> I also don't understand why Lucene doesn't focus on being THE JAVA
> search engine. Anything that I think detracts from moving that
> forward should be out of scope.

I really don't relish the prospect that this might degenerate into a
language argument, but I think it falls to me to respond, since the
patch I submitted on Monday opens up a lot of possibilities for interop.
I don't necessarily disagree.
Abandoning all attempts at interop has its advantages. One
unfortunate, albeit unavoidable, aspect of Lucene is that it is tightly
bound to its file format. In a perfect world, the file reading/writing
apparatus would be modular: the index would be read into
memory using a plugin, manipulated, then saved using another plugin.
That doesn't work, obviously, because indexes are commonly too large
to be read into available RAM, and so the I/O stuff is scattered over
the entire library, which makes maintaining compatibility laborious.
However, Lucene has to make some effort to track its file format
definition, so that it may live up to the commitments for backwards
compatibility codified earlier in this thread. This is currently
done using the File Formats document (though that document is
incomplete and buggy). There's not much difference between
supporting the files written by an earlier version of Lucene and
supporting the files written by another implementation of Lucene
that adheres to the same spec.
The only question is whether there are Java-specific optimizations
which are so advantageous that they outweigh the benefits of
interchange. There is no inherent advantage in using Modified UTF-8
over standard UTF-8, and the UTF-8 code I supplied actually speeds up
Lucene by a couple percent because it simplifies some conditionals --
all of the performance hit comes from using a bytecount as the String
prefix. I have good reasons to believe that this can go away, not
the least of which is that I've actually written a working implementation
in Perl/C which uses bytecounts and I know where all the bottlenecks
are.
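
For readers who haven't compared the two encodings, here is a minimal
Java sketch of where they differ and of why a char-count prefix and a
byte-count prefix are not interchangeable. It is an illustration only,
not code from the patch: the class name Utf8Comparison and its helper
methods are hypothetical, and it borrows DataOutputStream.writeUTF
(which emits Modified UTF-8 after a 2-byte length header) purely as a
convenient way to produce Modified UTF-8 bytes. It also assumes that
the existing string prefix counts Java chars, which is not stated
above.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of Lucene or of the submitted patch.
class Utf8Comparison {

    // Standard UTF-8 encoding of s.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 encoding of s, taken from DataOutputStream.writeUTF
    // with its 2-byte length header stripped off.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        String nul = String.valueOf((char) 0);                 // U+0000
        String clef = new String(Character.toChars(0x1D11E));  // outside the BMP
        String accented = "h\u00e9llo";                        // one 2-byte char

        // Print the three candidate "lengths" for each sample string.
        for (String s : new String[] { nul, clef, accented }) {
            System.out.printf("chars=%d standard=%d modified=%d%n",
                    s.length(), standardUtf8(s).length, modifiedUtf8(s).length);
        }
    }
}
```

The two encodings disagree only on U+0000 (one byte in standard UTF-8,
two in Modified UTF-8) and on characters outside the BMP (four bytes
versus six, because Modified UTF-8 encodes each surrogate separately).
For ordinary text such as "h\u00e9llo" they produce identical bytes,
but the char count (5) and the byte count (6) still differ, which is
why a reader has to know which of the two the prefix holds.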
There are also advantages to keeping the file format public, both for
Java Lucene and for the larger Apache Lucene project. Of course
there's the raw usefulness of interchange. For instance, it
might be nice to whip up a little script in Perl or Ruby which works
with your existing rig -- especially if there's a CPAN module that
offers functionality you need which isn't available yet in Java, or
you'd benefit from a near-instantaneous startup time.
But more important, I'd argue, is that having all implementations
share a common file format means that all the authors have an
amplified interest in coordinating, communicating, and contributing.
Just as learning new languages, programming or natural, broadens an
individual's horizons, so does working out an implementation based on
Lucene's data structures in another language lead to fresh thinking.
The more cross-pollination of ideas from various authors and, by
proxy, their extended communities, the more all of the sub-projects
gain and the faster Apache Lucene as a whole advances.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/