On 04/15/2010 09:49 AM, Robert Muir wrote:
> wrong, it doesnt fix the analyzers problem.
> you need to reindex.
>
> On Thu, Apr 15, 2010 at 9:39 AM, Earwin Burrfoot <ear...@gmail.com> wrote:
>> On Thu, Apr 15, 2010 at 17:17, Yonik Seeley <yo...@lucidimagination.com> wrote:
>>> Seamless online upgrades have their place too... say you are
>>> upgrading one server at a time in a cluster.
>> Nothing here that can't be solved with an upgrade tool. Down one
>> server, upgrade index, upgrade software, up.
Having read the thread, I have a few comments; much of this is summary.
The current proposal requires re-indexing on every upgrade to Lucene.
Plain and simple.
Robert is right about the analyzers.
There are three levels of backward compatibility, though we usually talk about only two.
First, the index format. IMHO, it is a good thing for a major release to
be able to read the prior major release's index. And the ability to
convert it to the current format via optimize is also good. Whatever is
decided on this thread should take this seriously.
Second, the API. The current mechanism of using deprecations to migrate
users to a new API is both a blessing and a curse. It is a blessing to
end users because it gives them a clear migration path. It is a curse to
development because the API is bloated with both the old and the new;
further, it leads to unfortunate class naming, with a tendency to
migrate away from the good name. And it is a curse to end users because
it can cause confusion.
While I like the mechanism of deprecations to migrate me from one
release to another, I'd be open to another mechanism. So much effort is
put into API bw compat that it might be better spent elsewhere, e.g. on
thorough documentation.
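The deprecation mechanism itself is simple enough to sketch (hypothetical names, not actual Lucene classes): the old entry point survives one major release, delegating to its replacement, and is then removed.

```java
class DemoSearcher {
    // New API: callers should migrate to this signature.
    public String find(String query, int maxHits) {
        return "found:" + query + ":" + maxHits;
    }

    /**
     * @deprecated use {@link #find(String, int)}; kept for one major
     *             release so existing callers keep compiling, then removed.
     */
    @Deprecated
    public String search(String query) {
        return find(query, 10); // old entry point just delegates
    }
}
```

This is exactly the bloat complained about above: both names ship in the same jar until the next major release, and the "good name" is often the one that had to be abandoned.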
Third, the behavior. WRT analyzers (consisting of tokenizers, stemmers,
stop words, ...): if the token stream changes, the index is no longer
valid. It may appear to work, but it is broken. The token stream applies
not only to the indexed documents, but also to the user-supplied query.
A simple example: if from one release to the next the stop word 'a' is
dropped, then phrase searches including 'a' won't work, as 'a' is not in
the index. Even a simple, obvious bug fix that changes the stream is bad.
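The stop-word example can be sketched with a toy analyzer and phrase matcher (plain Java, hypothetical names, not Lucene's actual classes; real Lucene also tracks position increments, which this sketch ignores):

```java
import java.util.*;

class StopWordDemo {
    // Toy analyzer: lowercase, split on whitespace, drop stop words.
    // Positions are assigned to surviving tokens only (a simplification;
    // Lucene's StopFilter can leave positional gaps instead).
    static Map<String, List<Integer>> index(String text, Set<String> stopWords) {
        Map<String, List<Integer>> postings = new HashMap<>();
        int pos = 0;
        for (String tok : text.toLowerCase().split("\\s+")) {
            if (stopWords.contains(tok)) continue;
            postings.computeIfAbsent(tok, k -> new ArrayList<>()).add(pos++);
        }
        return postings;
    }

    // Phrase match: analyze the query with the *current* stop list, then
    // require every term at consecutive positions in the index.
    static boolean phraseMatches(Map<String, List<Integer>> postings,
                                 String phrase, Set<String> stopWords) {
        List<String> terms = new ArrayList<>();
        for (String tok : phrase.toLowerCase().split("\\s+"))
            if (!stopWords.contains(tok)) terms.add(tok);
        if (terms.isEmpty()) return false;
        List<Integer> starts = postings.getOrDefault(terms.get(0), List.of());
        outer:
        for (int start : starts) {
            for (int i = 1; i < terms.size(); i++) {
                List<Integer> p = postings.get(terms.get(i));
                if (p == null || !p.contains(start + i)) continue outer;
            }
            return true; // every term found, at consecutive positions
        }
        return false;
    }
}
```

Index "it is a brave new world" with stop words {it, is, a}: the phrase query "a brave new world" matches as long as the query is analyzed with the same stop list, but fails the moment a release drops 'a' from that list, because 'a' was never written to the index.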
Another behavior change is an upgrade in the Java version. By forcing
users to go to Java 5 with Lucene 3, the supported Unicode version
changed, and this by itself changes some token streams.
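To make that dependency concrete, here is a minimal letter-based tokenizer in the spirit of Lucene's LetterTokenizer (a sketch, not the real class). It delegates token boundaries to Character.isLetter, whose answer for some code points depends on the Unicode tables of the running JRE: Java 1.4 tracked Unicode 3.0, Java 5 moved to Unicode 4.0, so the same text can tokenize differently across JREs.

```java
import java.util.*;

class LetterTokenizerDemo {
    // Split on anything Character.isLetter() rejects. For some code
    // points the answer depends on which Unicode version the running
    // JRE's character tables implement, so token boundaries can shift
    // with a Java upgrade even though this code never changes.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                cur.appendCodePoint(cp);
            } else if (cur.length() > 0) {
                tokens.add(cur.toString());
                cur.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (cur.length() > 0) tokens.add(cur.toString());
        return tokens;
    }
}
```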
With a change to a token stream, the index must be re-created to ensure
expected behavior. If the original input is no longer available or the
index cannot be rebuilt for whatever reason, then Lucene should not be
upgraded.
It is my observation, though possibly not correct, that core only has
rudimentary analysis capabilities, handling English very well. To handle
other languages well, "contrib/analyzers" is required. Until recently it
did not get much love, and there have been many bw-compat-breaking
changes (though with Version one can probably get the prior behavior
back). IMHO, most of contrib/analyzers should be core. My guess is that
most non-trivial applications will use contrib/analyzers.
The other problem I have is the assumption that re-indexing is feasible
and that indexes are always server-based. Re-index feasibility has
already been well discussed on this thread from a server-side
perspective. There are many client-side applications, like mine, where
the index is built and used on the client's computer. In my scenario the
user builds indexes individually for books. From the index perspective,
the sentence is the Lucene document and the book is the index. Building
an index is voluntary and takes time proportional to the size of the
book and inversely proportional to the power of the computer. Our user
base includes people with ancient, underpowered laptops in third-world
countries. On
those machines it might take 10 minutes to create an index and during
that time the machine is fairly unresponsive. There is no opportunity to
"do it in the background."
So what are my choices? (rhetorical) With each new release of my app,
I'd like to exploit the latest and greatest features of Lucene. And I'm
going to change my app with features which may or may not be related to
the use of Lucene. Those latter features are what matter the most to my
user base. They don't care what technologies are used to do searches. If
the latest Lucene jar does not let me use Version (or some other
mechanism) to maintain compatibility with an older index, the user will
have to re-index. Or I can forgo any future upgrades with Lucene.
Neither is very palatable.
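For what it's worth, the Version mechanism I mean is the matchVersion parameter that Lucene 2.9 added to its analyzers. Its shape can be sketched in a few lines (hypothetical names, and the possessive-stripping change is purely illustrative, not Lucene's actual behavior change):

```java
class VersionedAnalyzerDemo {
    enum Version { LUCENE_29, LUCENE_30 } // stand-in for org.apache.lucene.util.Version

    // Emulate a release whose analyzer changed: callers pass the version
    // their index was built with and get the matching token stream back.
    static String[] analyze(Version matchVersion, String text) {
        String[] tokens = text.toLowerCase().split("\\s+");
        if (matchVersion == Version.LUCENE_29) {
            return tokens; // old behavior, bug-for-bug, for old indexes
        }
        // New behavior (illustrative only): strip a trailing possessive 's.
        for (int i = 0; i < tokens.length; i++)
            if (tokens[i].endsWith("'s"))
                tokens[i] = tokens[i].substring(0, tokens[i].length() - 2);
        return tokens;
    }
}
```

This is what would let one jar keep serving an old index while new indexes get the new behavior, at the cost of the library carrying both behaviors forward, which is exactly the trade-off being argued about in this thread.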
-- DM Smith