Hello guys,

I have been using the edgeNGram tokenizer to enable partial prefix matching on a query. However, the tokenizer treats certain characters as punctuation and splits on them (e.g. C# => C, I/O => I and O), so I had to add the "punctuation" character class to the edgeNGram tokenizer and use the word_delimiter filter to drop the punctuation afterwards.

'tokenizer': {
    'prefix_tokenizer': {
        'type': 'edgeNGram',
        'min_gram': 1,
        'max_gram': 30,
        'token_chars': ['letter', 'digit', 'symbol', 'punctuation'],
    },
}
'filter': {
    'my_word_delimiter': {
        'type': 'word_delimiter',
        'type_table': ['# => ALPHANUM']
    }
}

Unfortunately, this causes the highlight snippets to contain duplicate tokens. For example, when the query is "U.S. pol" and the matching document contains "U.S. politics", the snippet comes back as follows:

<em>*U*</em><em>*U.S*</em>. <em>Pol</em>itics

(the letter U is highlighted twice).

I see how the word_delimiter filter creates the same token for different prefixes (a "U" token for both "U" and "U."), but the highlighting still seems strange to me because "U" and "U.S" have the same start offset.

Do you have any suggestions?
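To make the duplicate-token observation concrete, here is a toy Python model of the two steps. It is only a sketch under simplifying assumptions, not Lucene's actual implementation: the function names (edge_ngrams, word_parts, grams_by_first_part) are made up for illustration, the splitting rule is a plain regex, and each prefix is assumed to keep the offsets it got from the tokenizer.

```python
import re
from collections import defaultdict


def edge_ngrams(word, min_gram=1, max_gram=30):
    """Return (gram, start, end) for every prefix of `word` (toy edgeNGram)."""
    upper = min(len(word), max_gram)
    return [(word[:n], 0, n) for n in range(min_gram, upper + 1)]


def word_parts(token, extra_alnum="#"):
    """Toy stand-in for word_delimiter: split on every character that is not
    a letter, a digit, or one of `extra_alnum` (mirrors '# => ALPHANUM')."""
    pattern = "[^0-9A-Za-z" + re.escape(extra_alnum) + "]+"
    return [part for part in re.split(pattern, token) if part]


def grams_by_first_part(word):
    """Group the prefixes of `word` by the first term left after splitting."""
    groups = defaultdict(list)
    for gram, start, end in edge_ngrams(word):
        parts = word_parts(gram)
        if parts:
            groups[parts[0]].append((gram, start, end))
    return dict(groups)


# Every prefix of "U.S." starts at offset 0 and yields the term "U",
# but each one carries a different end offset.
print(grams_by_first_part("U.S."))
```

In this model the prefixes "U", "U.", "U.S" and "U.S." all produce a "U" term that starts at offset 0 but ends at a different position, which is at least consistent with a highlighter wrapping overlapping spans such as "U" and "U.S" separately. Whether the real filter assigns offsets this way is an assumption that would need to be checked with the analyze API.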