On 03/13/2012 02:34 PM, Rupert Westenthaler wrote:
On Tue, Mar 13, 2012 at 1:13 PM, Suat Gonul <[email protected]> wrote:
(**) All *_t fields use string as the field type. This means that no
tokenizer is used AND queries are case sensitive. I do not think this
is a good decision and would rather use the already defined "text_ws"
type (whitespace tokenizer, word delimiter and lower case)
Ok, thanks for this suggestion. Indeed, it might be better to set it to
"text_general". WDYT?
I am not sure about text_general because it is specific to the English
language. So if you can ensure that such labels will all be English,
then it might still be ok, but otherwise I would prefer a non
language-specific field type such as "text_ws".
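For reference, a "text_ws"-style field type along the lines Rupert describes (whitespace tokenizer, word delimiter, lower case) would look roughly like this in schema.xml. This is only a sketch; the exact attributes in the Stanbol/Contenthub schema may differ:

```xml
<!-- Sketch of a language-neutral "text_ws" field type;
     the actual Contenthub schema definition may differ in detail. -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

None of these components carry language-specific resources (stopword lists, stemmers), which is what makes the type usable for labels in arbitrary languages.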
We might also want to consider using
* ICUTokenizerFactory instead of the WhitespaceTokenizerFactory to
also cover languages that do not use whitespace to separate words.
* ICUFoldingFilterFactory (combination of ASCIIFoldingFilter,
LowerCaseFilter, and ICUNormalizer2Filter)
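The ICU-based variant Rupert suggests could be sketched like this. Note this is a hypothetical field type (the name "text_icu" is illustrative); the ICU factories require the Solr analysis-extras contrib (the Lucene ICU analysis module) to be on the classpath:

```xml
<!-- Hypothetical ICU-based field type; requires the analysis-extras
     contrib (Lucene ICU analyzers) to be available to Solr. -->
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- ICU tokenizer handles scripts without whitespace word boundaries -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Folding combines ASCII folding, lower casing and
         Unicode normalization in a single filter -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```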
That brings me to another question: how does the Contenthub currently
deal with internationalization?
best
Rupert
The default Contenthub index uses "text_general". This field type applies
the following language-specific operations in addition to the generic ones:
* StopFilterFactory (default Contenthub comes with English and German
stopwords)
* SnowballPorterFilterFactory for English.
So, to remove the dependency on English, we can replace "text_general"
with "text_ws".
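For comparison, the language-specific parts of a "text_general"-style definition look roughly as follows. This is a sketch of the analyzer chain described above, not the literal Contenthub schema; the stopword file name is illustrative:

```xml
<!-- Sketch of the English-dependent "text_general" chain described
     above; file names and attributes are illustrative. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- language-specific: shipped stopword lists (English/German) -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- language-specific: English stemming -->
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```

Dropping the StopFilterFactory and SnowballPorterFilterFactory lines (i.e., switching to "text_ws") is what removes the English dependency.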
Furthermore, the LDPath integration of the Contenthub currently does not
consider the language tags inside LDPath programs. It only resolves
the default XSD types. We can add this to our todo list: considering the
language tags inside the programs (if they exist) while determining the
type of the Solr fields.
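To illustrate the point, an LDPath program can carry language information via a language selector, as in the following sketch (the field name and property are hypothetical; `[@en]` is LDPath's language test, which the current integration ignores when choosing the Solr field type):

```
@prefix rdfs : <http://www.w3.org/2000/01/rdf-schema#> ;
/* hypothetical field: restrict labels to English literals */
title = rdfs:label[@en] :: xsd:string ;
```

With language-tag support, such a field could be mapped to a language-specific Solr field type instead of the default one derived from `xsd:string`.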
Best,
Anil.