Hey guys,

after working with the ELK stack for a while now, we still have a very 
annoying problem with the behavior of the standard analyzer: it 
splits terms into tokens, using hyphens and dots as delimiters.

E.g. logsource:firewall-physical-management gets split into "firewall", 
"physical" and "management". On the one hand that's useful, because a 
search for logsource:firewall returns all events that contain "firewall" 
as a token in the logsource field.

The downside of this behaviour shows up when you do e.g. a "top 10" query 
on a field in Kibana: every token is counted as a whole term and ranked 
by its count:
top 10: 
1. firewall : 10
2. physical : 10
3. management: 10

instead of top 10:
1. firewall-physical-management: 10

In the standard mapping from Logstash this is solved with a .raw subfield 
that is "not_analyzed", but the downside is that you end up with two fields 
instead of one (even if it's a multi_field), and the usability for Kibana 
users is not that great.
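For reference, the multi_field workaround looks roughly like this (a sketch of the Logstash-style default using our "logsource" field as an example, not our exact template):

```
"logsource" : {
    "type" : "multi_field",
    "fields" : {
        "logsource" : { "type" : "string", "index" : "analyzed" },
        "raw"       : { "type" : "string", "index" : "not_analyzed" }
    }
}
```

So logsource is tokenized for searching, while logsource.raw keeps the whole value for aggregations — two fields for what is conceptually one.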

So what we need is that logsource:firewall-physical-management gets 
tokenized into "firewall-physical-management", "firewall", "physical" and 
"management".

I tried this using the word_delimiter token filter with the following 
mapping:

"analysis" : {
    "analyzer" : {
        "my_analyzer" : {
            "type" : "custom",
            "tokenizer" : "whitespace",
            "filter" : ["lowercase", "asciifolding", "my_worddelimiter"]
        }
    },
    "filter" : {
        "my_worddelimiter" : {
            "type" : "word_delimiter",
            "generate_word_parts" : false,
            "generate_number_parts" : false,
            "catenate_words" : false,
            "catenate_numbers" : false,
            "catenate_all" : false,
            "split_on_case_change" : false,
            "preserve_original" : true,
            "split_on_numerics" : false,
            "stem_english_possessive" : true
        }
    }
}
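In case anyone wants to reproduce this, the _analyze API shows which tokens the analyzer actually emits (the index name "test" is just a placeholder for an index created with these settings):

```
curl -XGET 'localhost:9200/test/_analyze?analyzer=my_analyzer&pretty' \
     -d 'firewall-physical-management'
```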

But unfortunately this didn't do the job.

While researching I saw that some others have a similar problem, but apart 
from some workaround suggestions, no real solution was found.

If anyone has ideas on how to start working on this, I would be very 
happy.

Thanks.

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/4094292c-057f-43d8-9af0-1ea83ad45a1c%40googlegroups.com.