On Tue, Mar 13, 2012 at 1:13 PM, Suat Gonul <[email protected]> wrote:
>> (**) All *_t fields use string as field type. This means that no
>> tokenizer is used AND queries are case sensitive. I do not think this
>> is a good decision and would rather use the already defined "text_ws"
>> type (white space tokenizer, word delimiter and lower case)
>>
>
> Ok, thanks for this suggestion. Indeed, it might be better to set it to
> "text_general". WDYT?
>

I am not sure about "text_general" because it is specific to the English
language. So if you can ensure that such labels will all be in English,
then it might still be OK, but otherwise I would prefer a field type that
is not language specific, such as "text_ws".
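
To make the suggestion concrete, here is a rough sketch of what this could
look like in schema.xml. It assumes that "text_ws" is defined roughly as
described above (whitespace tokenizer, word delimiter, lower case); the
definition actually used in the Contenthub schema may differ, and the
attribute values shown here are only illustrative.

```xml
<!-- Sketch only: an assumed definition of "text_ws"; the existing one may differ -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split on whitespace only, so no language-specific rules apply -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- split tokens on intra-word delimiters such as "-" or case changes -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" splitOnCaseChange="1"/>
    <!-- makes queries case insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- map all *_t fields to the tokenized type instead of "string" -->
<dynamicField name="*_t" type="text_ws" indexed="true" stored="true"/>
```

Note that changing the type of an existing field requires re-indexing the
documents that were already added with the old type.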

We might also want to consider using

* ICUTokenizerFactory instead of the WhitespaceTokenizerFactory, to also
  cover languages that do not use whitespace to separate words.
* ICUFoldingFilterFactory (a combination of ASCIIFoldingFilter,
  LowerCaseFilter, and ICUNormalizer2Filter)

That brings me to another question: how does the Contenthub currently
deal with internationalization?

best
Rupert

-- 
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
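
P.S. A minimal sketch of the ICU-based alternative mentioned above. The
type name "text_icu" is hypothetical and does not exist in the current
schema. As far as I know, both factories are part of the Solr
analysis-extras contrib, so the ICU jars would have to be on the
classpath for this to load.

```xml
<!-- Sketch only: "text_icu" is a hypothetical name, not an existing type -->
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Unicode-aware tokenizer; also handles scripts that do not
         separate words with whitespace -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- folds case, accents and Unicode normalization in one filter -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```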
