On 03/13/2012 02:34 PM, Rupert Westenthaler wrote:
On Tue, Mar 13, 2012 at 1:13 PM, Suat Gonul <[email protected]> wrote:
(**) All *_t fields use string as the field type. This means that no
tokenizer is used AND queries are case sensitive. I do not think this
is a good decision and would rather use the already defined "text_ws"
type (whitespace tokenizer, word delimiter and lower case)
Ok, thanks for this suggestion. Indeed, it might be better to set it to
"text_general". WDYT?
I am not sure about text_general because it is specific to the English
language. So if you can ensure that such labels will all be English,
then it might still be ok, but otherwise I would prefer a non
language-specific field type such as "text_ws".
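For reference, a "text_ws"-style field type along the lines Rupert describes (whitespace tokenizer, word delimiter, lower case) would look roughly like this in schema.xml. This is only a sketch; the exact attributes in the Stanbol/Contenthub schema may differ:

```xml
<!-- Sketch of a language-neutral "text_ws" field type;
     the actual Contenthub schema definition may differ in detail. -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

None of these components carry language-specific resources (stopword lists, stemmers), which is what makes the type usable for labels in arbitrary languages.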
We might also want to consider using
* ICUTokenizerFactory instead of the WhitespaceTokenizerFactory to
also cover languages that do not use whitespace to separate words.
* ICUFoldingFilterFactory (combination of ASCIIFoldingFilter,
LowerCaseFilter, and ICUNormalizer2Filter)
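The ICU-based variant Rupert suggests could be sketched like this. Note this is a hypothetical field type (the name "text_icu" is illustrative); the ICU factories require the Solr analysis-extras contrib (the Lucene ICU analysis module) to be on the classpath:

```xml
<!-- Hypothetical ICU-based field type; requires the analysis-extras
     contrib (Lucene ICU analyzers) to be available to Solr. -->
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- ICU tokenizer handles scripts without whitespace word boundaries -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Folding combines ASCII folding, lower casing and
         Unicode normalization in a single filter -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```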
That brings me to another question: how does the Contenthub currently
deal with internationalization?
best
Rupert
The default Contenthub index uses "text_general". This field type applies
the following language-specific operations in addition to the generic ones:
* StopFilterFactory (default Contenthub comes with English and German
stopwords)
* SnowballPorterFilterFactory for English.
So, to remove the dependency on English, we can replace "text_general"
with "text_ws".
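For comparison, the language-specific parts of a "text_general"-style definition look roughly as follows. This is a sketch of the analyzer chain described above, not the literal Contenthub schema; the stopword file name is illustrative:

```xml
<!-- Sketch of the English-dependent "text_general" chain described
     above; file names and attributes are illustrative. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- language-specific: shipped stopword lists (English/German) -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- language-specific: English stemming -->
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```

Dropping the StopFilterFactory and SnowballPorterFilterFactory lines (i.e., switching to "text_ws") is what removes the English dependency.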
Furthermore, the LDPath integration of the Contenthub currently does not
consider the language tags inside LDPath programs. It only resolves
the default XSD types. We can add this to our todo list: considering the
language tags inside the programs (if they exist) while determining the
type of the Solr fields.
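To illustrate the point, an LDPath program can carry language information via a language selector, as in the following sketch (the field name and property are hypothetical; `[@en]` is LDPath's language test, which the current integration ignores when choosing the Solr field type):

```
@prefix rdfs : <http://www.w3.org/2000/01/rdf-schema#> ;
/* hypothetical field: restrict labels to English literals */
title = rdfs:label[@en] :: xsd:string ;
```

With language-tag support, such a field could be mapped to a language-specific Solr field type instead of the default one derived from `xsd:string`.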
Best,
Anil.