Hi,

In the latest Oak versions I have noticed problems with indexing words that contain special characters. indexOriginalTerm was set on the index, yet I couldn't find terms like "xxx-yyy*" (a few versions ago this was working, I think). I have checked the sources, and it seems that "indexOriginalTerm" is used as a parameter of the WordDelimiterFilter; however, OakAnalyzer uses a StandardTokenizer, which splits words on almost every special character before the WordDelimiterFilter runs. Here is a fragment of the class description:

 * One use for {@link WordDelimiterFilter} is to help match words with different
 * subword delimiters. For example, if the source text contained "wi-fi" one may
 * want "wifi" "WiFi" "wi-fi" "wi+fi" queries to all match. One way of doing so
 * is to specify combinations="1" in the analyzer used for indexing, and
 * combinations="0" (the default) in the analyzer used for querying. Given that
 * the current {@link StandardTokenizer} immediately removes many intra-word
 * delimiters, it is recommended that this filter be used after a tokenizer that
 * does not do this (such as {@link WhitespaceTokenizer}).
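
To illustrate what I mean, here is a minimal standalone sketch using plain Lucene (not Oak's own code). I'm assuming a Lucene 4.7 classpath; the DelimiterDemo class, the dump() helper, and the flag combination are just my approximation of what indexOriginalTerm enables, while the tokenizer/filter classes are Lucene's:

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class DelimiterDemo {

    // Prints every token the stream emits, in order.
    static void dump(String label, TokenStream ts) throws Exception {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        System.out.print(label + ":");
        while (ts.incrementToken()) {
            System.out.print(" [" + term + "]");
        }
        ts.end();
        ts.close();
        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        // My approximation of what indexOriginalTerm turns on:
        int flags = WordDelimiterFilter.GENERATE_WORD_PARTS
                | WordDelimiterFilter.PRESERVE_ORIGINAL;

        // StandardTokenizer has already split "xxx-yyy" into "xxx" and
        // "yyy", so the filter never sees the original hyphenated term:
        Tokenizer standard = new StandardTokenizer(
                Version.LUCENE_47, new StringReader("xxx-yyy"));
        dump("standard  ", new WordDelimiterFilter(standard, flags, null));
        // -> standard  : [xxx] [yyy]

        // WhitespaceTokenizer keeps "xxx-yyy" intact, so PRESERVE_ORIGINAL
        // can actually emit the original term next to the word parts:
        Tokenizer whitespace = new WhitespaceTokenizer(
                Version.LUCENE_47, new StringReader("xxx-yyy"));
        dump("whitespace", new WordDelimiterFilter(whitespace, flags, null));
        // -> whitespace: [xxx-yyy] [xxx] [yyy]
    }
}

If I read this right, only the whitespace-tokenized stream can ever index the original "xxx-yyy" term, which would explain why the "xxx-yyy*" query finds nothing with the current OakAnalyzer.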

Is this the intended functionality?


Best regards,

Piotr



