edgeNGram tokenizer with the word delimiter filter

Hieu Nguyen Sat, 26 Apr 2014 17:52:15 -0700

Hello guys,
I have been using the edgeNGram tokenizer to enable partial prefix matching 
on a query. However, the tokenizer treats certain characters as 
punctuation and splits on them (e.g. C# => C, I/O => I and O), so I had to 
add the "punctuation" character class to the edgeNGram tokenizer and use the 
word_delimiter filter to drop the punctuation:

    'tokenizer': {
        'prefix_tokenizer': {
            'type': 'edgeNGram',
            'min_gram': 1,
            'max_gram': 30,
            'token_chars': ['letter', 'digit', 'symbol', 'punctuation'],
        },
    }

    'filter': {
        'my_word_delimiter': {
            'type': 'word_delimiter',
            'type_table': [
                '# => ALPHANUM'
            ]
        }
    }
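
For context, here is a minimal sketch of how the two pieces above might be 
wired into one custom analyzer. The analyzer name 'prefix_analyzer' is a 
placeholder chosen for illustration; the tokenizer and filter definitions 
are copied from the snippets above.

```python
import json

# Index settings combining the tokenizer and filter shown above.
# 'prefix_analyzer' is a hypothetical name, not taken from the real index.
settings = {
    'analysis': {
        'tokenizer': {
            'prefix_tokenizer': {
                'type': 'edgeNGram',
                'min_gram': 1,
                'max_gram': 30,
                'token_chars': ['letter', 'digit', 'symbol', 'punctuation'],
            },
        },
        'filter': {
            'my_word_delimiter': {
                'type': 'word_delimiter',
                # Treat '#' as alphanumeric so that "C#" stays one token.
                'type_table': ['# => ALPHANUM'],
            },
        },
        'analyzer': {
            'prefix_analyzer': {
                'type': 'custom',
                'tokenizer': 'prefix_tokenizer',
                'filter': ['my_word_delimiter'],
            },
        },
    },
}

if __name__ == '__main__':
    # Print the JSON body that would be sent when creating the index.
    print(json.dumps({'settings': settings}, indent=2))
```

Any additional filters (lowercasing, for instance) would go in the same 
'filter' list; none are assumed here.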

Unfortunately, this causes the highlight snippets to contain duplicate 
tokens. For example, when the query is "U.S. pol" and the matching document 
contains "U.S. politics", the snippet comes out as follows: 
<em>*U*</em><em>*U.S*</em>. <em>Pol</em>itics (the letter U is highlighted 
twice). I can see how the word delimiter filter creates the same token for 
different prefixes (a "U" token for both "U" and "U."), but the highlighting 
seems strange to me because "U" and "U.S" start at the same offset.
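
To make the duplication concrete, here is a rough pure-Python model of the 
chain. It is only a sketch: the function names are made up, the tokenizer is 
approximated by splitting on whitespace, and the word delimiter is reduced 
to "keep runs of letters, digits and '#'" (the real filter has more rules, 
such as splitting on case changes). It lists which prefixes end up producing 
each term.

```python
import re
from collections import defaultdict


def edge_ngrams(word, min_gram=1, max_gram=30):
    """Return every prefix of `word` between min_gram and max_gram chars."""
    upper = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, upper + 1)]


def split_on_delimiters(token):
    """Simplified stand-in for word_delimiter: keep alphanumeric runs.

    '#' is kept as part of a run to mirror the '# => ALPHANUM' type_table.
    """
    return re.findall(r"[A-Za-z0-9#]+", token)


def sources_by_term(text):
    """Map each emitted term to the list of prefixes that produced it."""
    sources = defaultdict(list)
    for word in text.split():
        for prefix in edge_ngrams(word):
            for term in split_on_delimiters(prefix):
                sources[term].append(prefix)
    return dict(sources)


if __name__ == "__main__":
    for term, prefixes in sources_by_term("U.S. politics").items():
        print(term, "<-", prefixes)
```

In this model the term "U" is produced by every prefix of "U.S.", which 
matches the duplicate "U" tokens described above.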

Do you have any suggestions? 


