On 04/15/2010 09:49 AM, Robert Muir wrote:
> wrong, it doesnt fix the analyzers problem.
> you need to reindex.
>
> On Thu, Apr 15, 2010 at 9:39 AM, Earwin Burrfoot <ear...@gmail.com> wrote:
>> On Thu, Apr 15, 2010 at 17:17, Yonik Seeley <yo...@lucidimagination.com> wrote:
>>> Seamless online upgrades have their place too... say you are
>>> upgrading one server at a time in a cluster.
>> Nothing here that can't be solved with an upgrade tool. Down one
>> server, upgrade index, upgrade software, up.
Having read the thread, I have a few comments; much of this is summary.
The current proposal requires re-indexing on every upgrade to Lucene.
Plain and simple.
Robert is right about the analyzers.
There are three levels of backward compatibility, though we usually talk about only two.
First, the index format. IMHO, it is a good thing for a major release to
be able to read the prior major release's index. And the ability to
convert it to the current format via optimize is also good. Whatever is
decided on this thread should take this seriously.
Second, the API. The current mechanism of using deprecations to migrate
users to a new API is both a blessing and a curse. It is a blessing to
end users because it gives them a clear migration path. It is a curse to
development because the API is bloated with both the old and the new;
further, it leads to unfortunate class naming, with a tendency to
migrate away from the good name. And it is a curse to end users because
it can cause confusion.
While I like the mechanism of deprecations to migrate me from one
release to another, I'd be open to another mechanism. So much effort is
put into API bw compat that it might be better spent elsewhere, e.g. on
thorough documentation.
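The deprecation mechanism itself is simple enough to sketch (hypothetical names, not actual Lucene classes): the old entry point survives one major release, delegating to its replacement, and is then removed.

```java
class DemoSearcher {
    // New API: callers should migrate to this signature.
    public String find(String query, int maxHits) {
        return "found:" + query + ":" + maxHits;
    }

    /**
     * @deprecated use {@link #find(String, int)}; kept for one major
     *             release so existing callers keep compiling, then removed.
     */
    @Deprecated
    public String search(String query) {
        return find(query, 10); // old entry point just delegates
    }
}
```

This is exactly the bloat complained about above: both names ship in the same jar until the next major release, and the "good name" is often the one that had to be abandoned.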
Third, the behavior. WRT analyzers (consisting of tokenizers, stemmers,
stop words, ...): if the token stream changes, the index is no longer
valid. It may appear to work, but it is broken. The token stream applies
not only to the indexed documents, but also to the user-supplied query.
A simple example: if from one release to the next the stop word 'a' is
dropped, then phrase searches including 'a' won't work, as 'a' is not in
the index. Even a simple, obvious bug fix that changes the stream is bad.
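The stop-word example can be sketched with a toy analyzer and phrase matcher (plain Java, hypothetical names, not Lucene's actual classes; real Lucene also tracks position increments, which this sketch ignores):

```java
import java.util.*;

class StopWordDemo {
    // Toy analyzer: lowercase, split on whitespace, drop stop words.
    // Positions are assigned to surviving tokens only (a simplification;
    // Lucene's StopFilter can leave positional gaps instead).
    static Map<String, List<Integer>> index(String text, Set<String> stopWords) {
        Map<String, List<Integer>> postings = new HashMap<>();
        int pos = 0;
        for (String tok : text.toLowerCase().split("\\s+")) {
            if (stopWords.contains(tok)) continue;
            postings.computeIfAbsent(tok, k -> new ArrayList<>()).add(pos++);
        }
        return postings;
    }

    // Phrase match: analyze the query with the *current* stop list, then
    // require every term at consecutive positions in the index.
    static boolean phraseMatches(Map<String, List<Integer>> postings,
                                 String phrase, Set<String> stopWords) {
        List<String> terms = new ArrayList<>();
        for (String tok : phrase.toLowerCase().split("\\s+"))
            if (!stopWords.contains(tok)) terms.add(tok);
        if (terms.isEmpty()) return false;
        List<Integer> starts = postings.getOrDefault(terms.get(0), List.of());
        outer:
        for (int start : starts) {
            for (int i = 1; i < terms.size(); i++) {
                List<Integer> p = postings.get(terms.get(i));
                if (p == null || !p.contains(start + i)) continue outer;
            }
            return true; // every term found, at consecutive positions
        }
        return false;
    }
}
```

Index "it is a brave new world" with stop words {it, is, a}: the phrase query "a brave new world" matches as long as the query is analyzed with the same stop list, but fails the moment a release drops 'a' from that list, because 'a' was never written to the index.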
Another behavior change is an upgrade in the Java version. By forcing
users to go to Java 5 with Lucene 3, the supported Unicode version
changed, and this by itself changes some token streams.
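To make that dependency concrete, here is a minimal letter-based tokenizer in the spirit of Lucene's LetterTokenizer (a sketch, not the real class). It delegates token boundaries to Character.isLetter, whose answer for some code points depends on the Unicode tables of the running JRE: Java 1.4 tracked Unicode 3.0, Java 5 moved to Unicode 4.0, so the same text can tokenize differently across JREs.

```java
import java.util.*;

class LetterTokenizerDemo {
    // Split on anything Character.isLetter() rejects. For some code
    // points the answer depends on which Unicode version the running
    // JRE's character tables implement, so token boundaries can shift
    // with a Java upgrade even though this code never changes.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                cur.appendCodePoint(cp);
            } else if (cur.length() > 0) {
                tokens.add(cur.toString());
                cur.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (cur.length() > 0) tokens.add(cur.toString());
        return tokens;
    }
}
```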
With a change to a token stream, the index must be re-created to ensure
expected behavior. If the original input is no longer available or the
index cannot be rebuilt for whatever reason, then Lucene should not be
upgraded.
It is my observation, though possibly not correct, that core only has
rudimentary analysis capabilities, handling English very well. To handle
other languages well, "contrib/analyzers" is required. Until recently it
did not get much love, and there have been many bw-compat-breaking
changes (though with Version one can probably get the prior behavior
back). IMHO, most of contrib/analyzers should be core. My guess is that
most non-trivial applications will use contrib/analyzers.
The other problem I have is the assumption that re-indexing is feasible
and that indexes are always server-based. Re-index feasibility has
already been well discussed on this thread from a server-side
perspective. There are many client-side applications, like mine, where
the index is built and used on the client's computer. In my scenario the
user builds indexes individually for books. From the index perspective,
the sentence is the Lucene document and the book is the index. Building
an index is voluntary and takes time proportional to the size of the
book and inversely proportional to the power of the computer. Our user
base includes people with ancient, underpowered laptops in third-world
countries. On
those machines it might take 10 minutes to create an index and during
that time the machine is fairly unresponsive. There is no opportunity to
"do it in the background."
So what are my choices? (rhetorical) With each new release of my app,
I'd like to exploit the latest and greatest features of Lucene. And I'm
going to change my app with features which may or may not be related to
the use of Lucene. Those latter features are what matter the most to my
user base. They don't care what technologies are used to do searches. If
the latest Lucene jar does not let me use Version (or some other
mechanism) to maintain compatibility with an older index, the user will
have to re-index. Or I can forgo any future upgrades with Lucene.
Neither is very palatable.
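For what it's worth, the Version mechanism I mean is the matchVersion parameter that Lucene 2.9 added to its analyzers. Its shape can be sketched in a few lines (hypothetical names, and the possessive-stripping change is purely illustrative, not Lucene's actual behavior change):

```java
class VersionedAnalyzerDemo {
    enum Version { LUCENE_29, LUCENE_30 } // stand-in for org.apache.lucene.util.Version

    // Emulate a release whose analyzer changed: callers pass the version
    // their index was built with and get the matching token stream back.
    static String[] analyze(Version matchVersion, String text) {
        String[] tokens = text.toLowerCase().split("\\s+");
        if (matchVersion == Version.LUCENE_29) {
            return tokens; // old behavior, bug-for-bug, for old indexes
        }
        // New behavior (illustrative only): strip a trailing possessive 's.
        for (int i = 0; i < tokens.length; i++)
            if (tokens[i].endsWith("'s"))
                tokens[i] = tokens[i].substring(0, tokens[i].length() - 2);
        return tokens;
    }
}
```

This is what would let one jar keep serving an old index while new indexes get the new behavior, at the cost of the library carrying both behaviors forward, which is exactly the trade-off being argued about in this thread.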
-- DM Smith