Re: Indexing and searching using Apache Stanbol

Rupert Westenthaler Tue, 13 Mar 2012 05:34:47 -0700

On Tue, Mar 13, 2012 at 1:13 PM, Suat Gonul <[email protected]> wrote:
>> (**) All *_t fields use string as field type. This means that no
>> tokenizer is used AND queries are case sensitive. I do not think this
>> is a good decision and would rather use the already defined "text_ws"
>> type (white space tokenizer, word delimiter and lower case)
>>
>>
>
> Ok, thanks for this suggestion. Indeed, it might be better to set it to
> "text_general". WDYT?
>
I am not sure about "text_general", because it is specific to the
English language. If you can ensure that all such labels will be in
English, then it might still be OK; otherwise I would prefer a field
type that is not language specific, such as "text_ws".


We might also want to consider using

* ICUTokenizerFactory instead of the WhitespaceTokenizerFactory, to
also cover languages that do not use whitespace to separate words.
* ICUFoldingFilterFactory (a combination of ASCIIFoldingFilter,
LowerCaseFilter, and ICUNormalizer2Filter)
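
A minimal sketch of what such a field type could look like in the
schema.xml. The type name "text_icu" and the attribute values are only
assumptions for illustration, not the current Contenthub configuration.
Both factories ship with the Solr analysis-extras contrib, so the ICU
jars would need to be on the classpath:

    <!-- hypothetical field type; the name "text_icu" is an assumption -->
    <fieldType name="text_icu" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer>
        <!-- splits on Unicode word boundaries, not only whitespace -->
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <!-- case folding, accent folding and Unicode normalization -->
        <filter class="solr.ICUFoldingFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- map the *_t fields to the new type (attribute values assumed) -->
    <dynamicField name="*_t" type="text_icu"
                  indexed="true" stored="true"/>

As no separate index and query analyzers are defined, the same chain
is applied at both index and query time, so queries on *_t fields
would no longer be case sensitive.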


That brings me to another question: How does the Contenthub currently
deal with internationalization?

best
Rupert


-- 
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
