Hi,

In the latest Oak versions I have noticed problems with indexing words that contain special characters. indexOriginalTerm was set on the index, yet I couldn't find terms like "xxx-yyy*" (a few versions ago this was working, I think). I have checked the sources, and it seems that "indexOriginalTerm" is used as a parameter of the WordDelimiterFilter; however, OakAnalyzer uses a StandardTokenizer, which splits words on almost every special character before the WordDelimiterFilter runs. Here is a fragment of the class description:

 * One use for {@link WordDelimiterFilter} is to help match words with different
 * subword delimiters. For example, if the source text contained "wi-fi" one may
 * want "wifi" "WiFi" "wi-fi" "wi+fi" queries to all match. One way of doing so
 * is to specify combinations="1" in the analyzer used for indexing, and
 * combinations="0" (the default) in the analyzer used for querying. Given that
 * the current {@link StandardTokenizer} immediately removes many intra-word
 * delimiters, it is recommended that this filter be used after a tokenizer that
 * does not do this (such as {@link WhitespaceTokenizer}).
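
To illustrate what I mean, here is a minimal standalone sketch using plain Lucene (not Oak's own code). I'm assuming a Lucene 4.7 classpath; the DelimiterDemo class, the dump() helper, and the flag combination are just my approximation of what indexOriginalTerm enables, while the tokenizer/filter classes are Lucene's:

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class DelimiterDemo {

    // Prints every token the stream emits, in order.
    static void dump(String label, TokenStream ts) throws Exception {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        System.out.print(label + ":");
        while (ts.incrementToken()) {
            System.out.print(" [" + term + "]");
        }
        ts.end();
        ts.close();
        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        // My approximation of what indexOriginalTerm turns on:
        int flags = WordDelimiterFilter.GENERATE_WORD_PARTS
                | WordDelimiterFilter.PRESERVE_ORIGINAL;

        // StandardTokenizer has already split "xxx-yyy" into "xxx" and
        // "yyy", so the filter never sees the original hyphenated term:
        Tokenizer standard = new StandardTokenizer(
                Version.LUCENE_47, new StringReader("xxx-yyy"));
        dump("standard  ", new WordDelimiterFilter(standard, flags, null));
        // -> standard  : [xxx] [yyy]

        // WhitespaceTokenizer keeps "xxx-yyy" intact, so PRESERVE_ORIGINAL
        // can actually emit the original term next to the word parts:
        Tokenizer whitespace = new WhitespaceTokenizer(
                Version.LUCENE_47, new StringReader("xxx-yyy"));
        dump("whitespace", new WordDelimiterFilter(whitespace, flags, null));
        // -> whitespace: [xxx-yyy] [xxx] [yyy]
    }
}

If I read this right, only the whitespace-tokenized stream can ever index the original "xxx-yyy" term, which would explain why the "xxx-yyy*" query finds nothing with the current OakAnalyzer.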

Is this the intended functionality?


Best regards,

Piotr



