On Mon, Nov 29, 2010 at 12:51 PM, DM Smith <dmsmith...@gmail.com> wrote:
>> Instead, you should use a Tokenizer that respects canonical
>> equivalence (tokenizes text that is canonically equivalent in the same
>> way), such as UAX29Tokenizer/StandardTokenizer in branch_3x. Ideally
>> your filters too, will respect this equivalence, and you can finally
>> normalize a single time at the *end* of processing.
>
> Should it be normalized at all before using these? NFKC?
>

Sorry, I wanted to answer this one too :)

NFKC is definitely a case where it's likely what you want for search, but you don't want to normalize your documents to it, because it removes certain distinctions that are important for display. If you are going to normalize to NFK[CD], that's a good reason to deal with normalization in the analysis process instead of normalizing your docs to these destructive, lossy forms. (I do, however, think it's OK to normalize the docs to NFC for display. This is probably a good thing, because many rendering engines and fonts will display it better.)

The ICUTokenizer/UAX29Tokenizer/StandardTokenizer only respects canonical equivalence, not compatibility equivalence, but I think this is actually good. Have a look at the examples in http://unicode.org/reports/tr15/, such as fractions and subscripts. It's sort of up to the app to determine how it wants to deal with these, so treating 2⁵ the same as "25" by default (that's what NFKC will do!) early in the analysis process is dangerous. An app might want to normalize this to "32".

So it can be better to normalize towards the end of your analysis process. For example, have a look at ICUNormalizer2Filter, which supports the NFKC_CaseFold normal form (NFKC + CaseFold + removing Ignorables) in addition to the standard ones, and ICUFoldingFilter, which is just like that, except it does additional folding for search (like removing diacritics). These foldings are computed recursively up front so they give a stable result.
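
To make the canonical vs. compatibility distinction concrete, here is a minimal sketch. It uses Python's standard unicodedata module rather than Lucene or ICU, purely so it runs on its own; that substitution is an assumption of this example, not how the filters above are implemented. The helper names canonically_equivalent and search_key are hypothetical, and search_key is only a rough stand-in for NFKC_CaseFold (it does not remove ignorable code points).

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """True if a and b normalize to the same NFC string (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def search_key(text: str) -> str:
    """Rough approximation of an NFKC + case-fold key for matching only.

    Not identical to ICU's NFKC_CaseFold: ignorable code points are kept.
    """
    folded = unicodedata.normalize("NFKC", text).casefold()
    # Case folding can produce text that is no longer normalized, so
    # normalize once more at the end.
    return unicodedata.normalize("NFKC", folded)


power = "2\u2075"  # DIGIT TWO followed by SUPERSCRIPT FIVE

# NFC is safe for stored/displayed text: the superscript survives.
nfc_power = unicodedata.normalize("NFC", power)

# NFKC is lossy: the superscript becomes a plain digit, giving "25".
nfkc_power = unicodedata.normalize("NFKC", power)

# Precomposed and decomposed e-acute are canonically equivalent ...
same_accent = canonically_equivalent("\u00e9", "e\u0301")

# ... but the superscript form and "25" are only compatibility equivalent.
same_power = canonically_equivalent(power, "25")

# Fullwidth letters fold to their ASCII lowercase form in the search key.
key = search_key("\uff26\uff2f\uff2f")
```

The point of keeping search_key separate is that the stored document can stay in NFC for display while only the indexed terms go through the lossy form.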