I think our solution now is to simply have Elasticsearch replace all the 
"non-letters" with an "_":

"char_filter" : {
        "replace" : {
                "type" : "mapping",
                                "mappings": ["\\.=>_", "\\u2010=>_", 
"'\''=>_", "\\:=>_", "\\u0020=>_", "\\u005C=>_", "\\u0028=>_", 
"\\u0029=>_", "\\u0026=>_", "\\u002F=>_", "\\u002D=>_", "\\u003F=>_", 
"\\u003D=>_"]
                   }
              },
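
In case the context helps: this is roughly the full index creation body we use 
to wire that char_filter into a custom analyzer and put it on the field. The 
type name "event", the field name "url" and the analyzer name "underscore" are 
just placeholders from our setup, and the mappings list is shortened here (the 
full list is above):

{
  "settings" : {
    "analysis" : {
      "char_filter" : {
        "replace" : {
          "type" : "mapping",
          "mappings" : ["\\.=>_", "\\u2010=>_", "\\u0020=>_"]
        }
      },
      "analyzer" : {
        "underscore" : {
          "type" : "custom",
          "char_filter" : ["replace"],
          "tokenizer" : "standard",
          "filter" : ["lowercase"]
        }
      }
    }
  },
  "mappings" : {
    "event" : {
      "properties" : {
        "url" : { "type" : "string", "analyzer" : "underscore" }
      }
    }
  }
}

With that in place, running a URL through the _analyze API with 
analyzer=underscore should come back as one single token.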

This means the terms won't get split into useless tokens by the standard 
analyzer.

The downside of this solution is that some URLs or Windows paths now look very 
ugly to the human eye, e.g.:

http://g.ceipmsn.com/8SE/411?MI=B9DC2E6D07184453A1EFC4E765A16D30-0&LV=3.0.131.0&OS=6.1.7601&AG=1217
 
 =>

http___g_ceipmsn_com_8se_411_mi_b9dc2e6d07184453a1efc4e765a16d30_0_lv_3_0_131_0_os_6_1_7601_ag_1217

The good thing compared to not_analyzed is that if I search for url:*8se*, 
the search will return the events that contain this URL.
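
For reference, url:*8se* is just query_string syntax (what Kibana sends); 
against the API directly the same search would be something like:

{
  "query" : {
    "query_string" : {
      "query" : "url:*8se*"
    }
  }
}

The leading wildcard is of course still expensive, but at least it finds the 
events now.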

I think this is not a perfect solution, but rather a good workaround until 
Lucene gives us better analyzer types to work with.

Thanks for sharing your experience so far!

cheers

On Thursday, 20 November 2014 20:07:27 UTC+1, Jörg Prante wrote:
>
> The whitespace tokenizer has the problem that punctuation is not ignored. 
> I find the word_delimiter filter not working at all with whitespace, only 
> with the keyword tokenizer, and then only with massive pattern matching, 
> which is complex and expensive :(
>
> Therefore I took the classic tokenizer and generalized the hyphen rules in 
> the grammar. The tokenizer "hyphen" and the filter "hyphen" are two separate 
> routines. The tokenizer "hyphen" keeps hyphenated words together and handles 
> punctuation correctly. The filter "hyphen" adds combinations to the original 
> form.
>
> Main point is to add combinations of dehyphenated forms so they can be 
> searched. 
>
> Single words are only taken into account when the word is positioned at 
> the edge.
>
> For example, the phrase "der-die-das" should be indexed in the following 
> forms:
>
> "der-die-das",  "derdiedas", "das", "derdie", derdie-das", "die-das", "der"
>
> Jörg
>
> On Thu, Nov 20, 2014 at 9:29 AM, horst knete <badun...@hotmail.de> wrote:
>
>>
>> So the term "this-is-a-test" gets tokenized into "this-is-a-test", which is 
>> nice behaviour, but in order to make a "full-text search" on this field it 
>> should get tokenized into "this-is-a-test", "this", "is", "a" and "test", as 
>> I wrote before.
>>
>>
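
PS: in case someone wants to try the hyphen tokenizer/filter Jörg describes 
above, I would expect the analysis settings to look roughly like this 
(assuming the plugin registers both the tokenizer and the filter under the 
name "hyphen" as described; the analyzer name is just a placeholder):

{
  "analysis" : {
    "analyzer" : {
      "hyphen_analyzer" : {
        "type" : "custom",
        "tokenizer" : "hyphen",
        "filter" : ["hyphen", "lowercase"]
      }
    }
  }
}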
