Hi team - I just wanted to share the complete config file with which I can reproduce this problem with the word_delimiter filter (unless I got the config wrong). My config is below; if I analyze the string "650-454-2343", I get the following tokens:

1. 650-454-2343 [expected since we have "preserve_original": true]
2. 650 [unexpected]
3. 454 [unexpected]
4. 2343 [unexpected]
5. 6504542343 [expected since we have "catenate_all": true]

Thoughts?

{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "phoneAnalyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "word_delimiter_for_phone"
          ]
        }
      },
      "filter": {
        "word_delimiter_for_phone": {
          "type": "word_delimiter",
          "catenate_all": true,
          "generate_number_parts ": false,
          "split_on_case_change": false,
          "generate_word_parts": false,
          "split_on_numerics": false,
          "preserve_original": true
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "phone": {
          "type": "string",
          "index_analyzer": "phoneAnalyzer",
          "include_in_all": false
        }
      }
    }
  }
}

-Amit.

On Mon, Apr 21, 2014 at 8:46 PM, Amit Soni <amitson...@gmail.com> wrote:

> Hi everyone - I have changed the mapping so that it now looks like below.
> However, for a given input, say 123-456-8989, the generated tokens are:
>
> a) 123-456-8989 b) 123 c) 456 d) 8989 e) 1234568989
>
> I was expecting just two tokens: a) 123-456-8989 b) 1234568989
>
> Would you know what might be going wrong here?
>
> "default_index": {
>     "tokenizer": "keyword",
>     "filter": [
>         "lowercase"
>     ]
> },
>
> "phoneAnalyzer": {
>     "type": "custom",
>     "tokenizer": "keyword",
>     "filter": [
>         "word_delimiter_for_phone"
>     ]
> },
>
> "word_delimiter_for_phone": {
>     "type": "word_delimiter",
>     "catenate_all": true,
>     "generate_number_parts ": false,
>     "split_on_case_change": false,
>     "generate_word_parts": false,
>     "split_on_numerics": false,
>     "preserve_original": true
> },
>
> -Amit.
>
>
> On Fri, Nov 1, 2013 at 1:07 AM, David Pilato <da...@pilato.fr> wrote:
>
>> Sorry. Forget my answer. Useless here.
>>
>>
>> --
>> David ;-)
>> Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
>>
>> On Nov 1, 2013, at 08:05, David Pilato <da...@pilato.fr> wrote:
>>
>> Or disable analysis for this field.
>>
>> HTH
>>
>> --
>> David ;-)
>> Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
>>
>> On Nov 1, 2013, at 07:42, sina.tama...@gmail.com wrote:
>>
>> Analysis starts with the tokenizer, which in your case is "standard".
>> Therefore the input "345 678-1234" will be tokenized to "345", "678", and
>> "1234", and only then will the filters be applied. A solution to get the
>> original and the concatenated input would be to use the "keyword" tokenizer.
>>
>> On Thursday, October 31, 2013 8:10:55 PM UTC+1, amit.soni wrote:
>>>
>>> Hi all - I have a phone number field and I am trying to use the
>>> word_delimiter filter in order to break it up into tokens, preserve the
>>> original entry and concatenate all the numbers in the entry. I have the
>>> following entry:
>>>
>>> "phoneAnalyzer" : {
>>>     "type": "custom",
>>>     "tokenizer": "standard",
>>>     "filter": [
>>>         "word_delimiter_for_phone"
>>>     ]
>>> }
>>>
>>> "filter": {
>>>     "word_delimiter_for_phone": {
>>>         "type": "word_delimiter",
>>>         "catenate_numbers" : true,
>>>         "preserve_original" : true
>>>     },
>>> }
>>>
>>> Using this, when I run it on input "345 678-1234" I get the following:
>>>
>>> {
>>>   "tokens" : [ {
>>>     "token" : "345",
>>>     "start_offset" : 0,
>>>     "end_offset" : 3,
>>>     "type" : "<NUM>",
>>>     "position" : 1
>>>   }, {
>>>     "token" : "678",
>>>     "start_offset" : 4,
>>>     "end_offset" : 7,
>>>     "type" : "<NUM>",
>>>     "position" : 2
>>>   }, {
>>>     "token" : "1234",
>>>     "start_offset" : 8,
>>>     "end_offset" : 12,
>>>     "type" : "<NUM>",
>>>     "position" : 3
>>>   } ]
>>> }
>>>
>>> Question: Should this not also have generated a concatenated string of
>>> the form: 3456781234?
>>>
>>> Anything I am missing here?
>>>
>>> -Amit.
>>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to elasticsearch+unsubscr...@googlegroups.com.
>> For more options, visit https://groups.google.com/groups/opt_out.
>>
>

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAAOGaQKGSvbjGHFvRKtqhW0zPitD-DK%2B2%3DMVBrjS4THJUN4Duw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
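
One thing that may be worth checking in the configs quoted in this thread: the key "generate_number_parts " appears to carry a trailing space inside the quotes, in both the April 2014 versions. If that space is really in the file (and not an artifact of how the message was rendered), the filter would probably not see a generate_number_parts setting at all and would fall back to its default of true, which would be consistent with the extra 650 / 454 / 2343 tokens. Below is a minimal Python sketch that scans a settings document for such keys. The helper name find_padded_keys is hypothetical and not part of Elasticsearch or any client library; the JSON is the filter definition copied from the message, with the trailing space kept.

```python
import json


def find_padded_keys(node, path=""):
    """Return the paths of all dict keys with leading or trailing whitespace."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{path}/{key}"
            if key != key.strip():
                hits.append(here)
            hits.extend(find_padded_keys(value, here))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits.extend(find_padded_keys(item, f"{path}/{index}"))
    return hits


# Filter definition as quoted in the thread (assumption: the trailing space
# after generate_number_parts is really present in the original file).
FILTER_JSON = """
{
  "word_delimiter_for_phone": {
    "type": "word_delimiter",
    "catenate_all": true,
    "generate_number_parts ": false,
    "split_on_case_change": false,
    "generate_word_parts": false,
    "split_on_numerics": false,
    "preserve_original": true
  }
}
"""

if __name__ == "__main__":
    # repr() makes the stray whitespace visible in the printed path.
    for hit in find_padded_keys(json.loads(FILTER_JSON)):
        print(repr(hit))
```

If the check reports the key, removing the space and re-creating the index would be the obvious next experiment; whether that fully explains the extra tokens would still need to be confirmed against the running cluster.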