On Mon, Nov 29, 2010 at 9:05 AM, DM Smith <dmsmith...@gmail.com> wrote:
>
> In my project, I don't use any of the Analyzers that Lucene provides, but I 
> have variants of them. (Mine take flags indicating whether to filter 
> stop words and whether to do stemming). The effort recently has been to 
> change these analyzers to follow the new reuse pattern to improve performance.
>
> With a declarative mechanism, I wouldn't have needed to make the changes.

Right, I think this is what we want? To just provide examples so the
user can make what they need to suit their application.

>
> WRT an analyzer, if any of the following changes, all bets are off:
>    Tokenizer (i.e. which tokenizer is used)
>    The rules that a tokenizer uses to break into tokens. (E.g. query parser, 
> break iterator, ...)
>    The type associated with each token (e.g. word, number, url, ...)
>    Presence/Absence of a particular filter
>    Order of filters
>    Tables that a filter uses
>    Rules that a filter encodes
>    The version and implementation of Unicode being used (whether via ICU, 
> Lucene and/or Java)
>    Bugs fixed in these components.
> (This list is adapted from an email I wrote to a user's group explaining why 
> texts need to be re-indexed.)
>

Right, I agree, and some of these things (such as the JVM's Unicode
version) are completely outside of our control.
But for the things inside our control, where are the breaks that
forced you to reindex?

> Additionally, it is the user's responsibility to normalize the text, probably 
> to NFC or NFKC, before index and search. (It may need to precede the 
> Tokenizer if it is not Unicode aware. E.g. what does a LetterTokenizer do if 
> input is NFD and it encounters an accent?)

I would not recommend this approach: NFC doesn't mean it's going to
take letter+accent combinations and compose them into a 'composed'
character with the letter property... especially for non-Latin
scripts!

In some cases, NFC will even cause the codepoint to be expanded: the
NFC form of 0958 (QA) is 0915 + 093C (KA+NUKTA)... of course if you
use LetterTokenizer with any language in this script, you are screwed
anyway :)
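
A minimal sketch of that expansion, assuming only the JDK's
java.text.Normalizer (the class and helper names here are made up for
illustration). It also checks Character.isLetter on the NUKTA, since
that is the kind of test a letter-based tokenizer relies on:

```java
import java.text.Normalizer;

// Hypothetical demo class; not part of Lucene.
class NfcExpansionDemo {

    /** Normalizes s to NFC using the JDK's own Unicode tables. */
    public static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Formats the code points of s as space-separated hex, e.g. "0915 093C". */
    public static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String qa = "\u0958"; // DEVANAGARI LETTER QA
        // One code point in, two code points out.
        System.out.println(hex(qa) + " -> " + hex(nfc(qa))); // 0958 -> 0915 093C
        // The NUKTA is a combining mark, not a letter.
        System.out.println(Character.isLetter(0x093C)); // false
    }
}
```

So after NFC the text contains a non-letter code point that was not
there before, and a tokenizer that splits on non-letters will cut the
word at that point.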

But even for Latin scripts this won't work... not all combinations
have a composed form, and I think composed forms are in general not
being added anymore. For example, see the Lithuanian sequences in
http://www.unicode.org/Public/6.0.0/ucd/NamedSequences.txt:

LATIN SMALL LETTER A WITH OGONEK AND TILDE;0105 0303

You can normalize this all you want, but there is no single composed
form; in NFC it's gonna be 0105 0303.
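
As a rough check of this, again assuming only java.text.Normalizer
(class and method names are hypothetical), three canonically
equivalent spellings of that sequence all normalize to the same two
code points, and the trailing tilde is still not a letter:

```java
import java.text.Normalizer;

// Hypothetical demo class; not part of Lucene.
class NamedSequenceDemo {

    /** Normalizes s to NFC using the JDK's own Unicode tables. */
    public static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Counts code points (not UTF-16 chars) in s. */
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Three canonically equivalent inputs:
        String[] inputs = {
            "a\u0328\u0303",  // a + COMBINING OGONEK + COMBINING TILDE
            "a\u0303\u0328",  // same marks in the other order
            "\u00E3\u0328"    // precomposed a-with-tilde + COMBINING OGONEK
        };
        for (String in : inputs) {
            String out = nfc(in);
            // Every input ends up as U+0105 U+0303: still two code points.
            System.out.println(out.equals("\u0105\u0303") + " " + codePoints(out)); // true 2
        }
        // The leftover COMBINING TILDE is a mark, so a letter-only test rejects it.
        System.out.println(Character.isLetter(0x0303)); // false
    }
}
```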

Instead, you should use a Tokenizer that respects canonical
equivalence (tokenizes text that is canonically equivalent in the same
way), such as UAX29Tokenizer/StandardTokenizer in branch_3x. Ideally
your filters will respect this equivalence too, and you can finally
normalize a single time at the *end* of processing. For example, don't
use LowerCaseFilter + ASCIIFoldingFilter or something like that to
lowercase & remove accents, but use ICUFoldingFilter instead, which
handles all this stuff consistently, even if your text doesn't conform
to any Unicode normalization form...
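
To illustrate the property I mean (this is NOT ICUFoldingFilter, just
a crude JDK-only stand-in with made-up names, and stripping every
combining mark like this is only reasonable for Latin-ish text): a
fold step that respects canonical equivalence gives the same output
for equivalent inputs, whatever form they arrive in.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical stand-in for a folding filter; not Lucene or ICU code.
class FoldingSketch {

    /**
     * Folds one token: decompose, drop combining marks, lowercase,
     * then normalize once at the end. Because the input is decomposed
     * first, composed and decomposed spellings fold identically.
     */
    public static String fold(String token) {
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        String lowered = stripped.toLowerCase(Locale.ROOT);
        return Normalizer.normalize(lowered, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String composed = "\u00C9cole";   // precomposed E-acute
        String decomposed = "E\u0301cole"; // E + COMBINING ACUTE ACCENT
        System.out.println(fold(composed));   // ecole
        System.out.println(fold(decomposed)); // ecole
    }
}
```

The real filter does far more than this (case folding, width
folding, and so on), but the point is the same: the index term does
not depend on which normalization form the input happened to be in.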

>
> Recently, we've seen that there is some mistrust here in JVMs at the same 
> version level from different vendors (Sun, Harmony, IBM) in producing the 
> same results. (IIRC: Thai break iterator. Random tests.)

Right, Sun JDK 7 will ship a new Unicode version. Harmony uses a
different Unicode version than Sun. There's nothing we can do about
this except document it?
Whether or not a special customized break iterator for the Thai locale
exists, and how it works, is just a JVM "feature". There's nothing we
can do about this except document it?

> Within a release of Lucene, a small handful of analyzers may have changed 
> sufficiently to warrant re-index of indexes built with them.

Which ones changed in a backwards-incompatible way that forced you to reindex?

> So basically, I have given up on Lucene being backward compatible where it 
> matters the most to me: Stable analyzer components. The gain I get from this 
> admission is far better. YMMV.
>

Which ones changed in a backwards-incompatible way that forced you to reindex?
