Issue with using word delimiter

Amit Soni Tue, 22 Apr 2014 19:23:27 -0700

Hi team - I wanted to share the complete config file with which I can
reproduce this problem with the word_delimiter filter (unless I got the
config wrong). My config is below; when I analyze the string
"650-454-2343", I get the following tokens:


   1. 650-454-2343 [expected since we have "preserve_original": true]
   2. 650 [unexpected]
   3. 454 [unexpected]
   4. 2343 [unexpected]
   5. 6504542343 [expected since we have "catenate_all": true]

thoughts?
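
To make the expectation concrete, here is a toy Python model of how I understand the three relevant flags. It is only a sketch, not Lucene's actual implementation: the function name `toy_word_delimiter` and its splitting rule (break on any non-alphanumeric character) are my own simplifications, and it assumes `generate_number_parts` defaults to true when it is not set. The model reproduces the five tokens only when that flag is left at its default, which suggests (but does not prove) that my `false` setting is not taking effect.

```python
import re


def toy_word_delimiter(token, preserve_original=False,
                       generate_number_parts=True, catenate_all=False):
    """Toy approximation of word_delimiter for purely numeric input.

    Hypothetical helper for illustration only; it is not Lucene's algorithm.
    """
    # Split on runs of non-alphanumeric characters, e.g. "-" in a phone number.
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", token) if p]
    out = []
    if preserve_original:
        out.append(token)
    if generate_number_parts:
        out.extend(p for p in parts if p.isdigit())
    if catenate_all and len(parts) > 1:
        out.append("".join(parts))
    return out


phone = "650-454-2343"
# What I expected with generate_number_parts disabled:
expected = toy_word_delimiter(phone, preserve_original=True,
                              generate_number_parts=False, catenate_all=True)
# What the model gives when the flag is left at its assumed default (true):
observed = toy_word_delimiter(phone, preserve_original=True, catenate_all=True)
print(expected)
print(observed)
```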

{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "phoneAnalyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "word_delimiter_for_phone"
          ]
        }
      },
      "filter": {
        "word_delimiter_for_phone": {
          "type": "word_delimiter",
          "catenate_all": true,
          "generate_number_parts ": false,
          "split_on_case_change": false,
          "generate_word_parts": false,
          "split_on_numerics": false,
          "preserve_original": true
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "phone": {
          "type": "string",
          "index_analyzer": "phoneAnalyzer",
          "include_in_all": false
        }
      }
    }
  }
}
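
One thing worth ruling out: in the config as posted above, the `generate_number_parts ` key appears to have a trailing space inside the quotes. I am only assuming, not asserting, that such a key would be silently ignored and the option left at its default. A small Python sketch (the helper name `find_padded_keys` is hypothetical) to flag such keys in a settings document:

```python
import json


def find_padded_keys(node, path=""):
    """Return paths of JSON object keys with leading/trailing whitespace."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{path}/{key}"
            if key != key.strip():
                found.append(here)
            found.extend(find_padded_keys(value, here))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            found.extend(find_padded_keys(item, f"{path}/{i}"))
    return found


# The filter definition exactly as posted, including the suspicious key.
settings = json.loads("""
{
  "word_delimiter_for_phone": {
    "type": "word_delimiter",
    "catenate_all": true,
    "generate_number_parts ": false,
    "split_on_case_change": false,
    "generate_word_parts": false,
    "split_on_numerics": false,
    "preserve_original": true
  }
}
""")

padded = find_padded_keys(settings)
print(padded)
```

If the list is non-empty, removing the stray whitespace from those keys and re-running the analysis would be the obvious next test.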

-Amit.


On Mon, Apr 21, 2014 at 8:46 PM, Amit Soni <amitson...@gmail.com> wrote:

> Hi everyone - I have changed the mapping so that it now looks like the one
> below. However, for a given input, say 123-456-8989, the generated tokens are:
>
> a) 123-456-8989 b) 123 c) 456 d) 8989 e) 1234568989
>
> I was expecting just two tokens: a) 123-456-8989 b) 1234568989
>
> Would you know what might be going wrong here?
>
> "default_index": {
>           "tokenizer": "keyword",
>           "filter": [
>             "lowercase"
>           ]
> },
>
> "phoneAnalyzer": {
>           "type": "custom",
>           "tokenizer": "keyword",
>           "filter": [
>             "word_delimiter_for_phone"
>           ]
> },
>
> "word_delimiter_for_phone": {
>           "type": "word_delimiter",
>           "catenate_all": true,
>           "generate_number_parts ": false,
>           "split_on_case_change": false,
>           "generate_word_parts": false,
>           "split_on_numerics": false,
>           "preserve_original": true
> },
>
> -Amit.
>
>
> On Fri, Nov 1, 2013 at 1:07 AM, David Pilato <da...@pilato.fr> wrote:
>
>> Sorry. Forget my answer. Useless here.
>>
>>
>> --
>> David ;-)
>> Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
>>
>> On 1 Nov 2013, at 08:05, David Pilato <da...@pilato.fr> wrote:
>>
>> Or disable analysis for this field.
>>
>> HTH
>>
>> --
>> David ;-)
>> Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
>>
>> On 1 Nov 2013, at 07:42, sina.tama...@gmail.com wrote:
>>
>> Analysis starts with the tokenizer, which in your case is "standard".
>> Therefore the input "345 678-1234" is first tokenized into "345", "678", and
>> "1234", and only then are the filters applied. One way to get both the
>> original and the concatenated input would be to use the "keyword" tokenizer.
>>
>> On Thursday, October 31, 2013 8:10:55 PM UTC+1, amit.soni wrote:
>>>
>>> Hi all - I have a phone number field and I am trying to use the
>>> word_delimiter filter in order to break it up into tokens, preserve the
>>> original entry and concatenate all the numbers in the entry. I have the
>>> following configuration:
>>>
>>> "phoneAnalyzer" :  {
>>>                     "type": "custom",
>>>                     "tokenizer": "standard",
>>>                     "filter": [
>>>                         "word_delimiter_for_phone"
>>>                     ]
>>>                 }
>>>
>>> "filter": {
>>>                 "word_delimiter_for_phone": {
>>>                     "type": "word_delimiter",
>>>                     "catenate_numbers" : true,
>>>                     "preserve_original" : true
>>>                 },
>>> }
>>>
>>> Using this, when I run it on input "345 678-1234" I get the following:
>>>
>>> {
>>>   "tokens" : [ {
>>>     "token" : "345",
>>>     "start_offset" : 0,
>>>     "end_offset" : 3,
>>>     "type" : "<NUM>",
>>>     "position" : 1
>>>   }, {
>>>     "token" : "678",
>>>     "start_offset" : 4,
>>>     "end_offset" : 7,
>>>     "type" : "<NUM>",
>>>     "position" : 2
>>>   }, {
>>>     "token" : "1234",
>>>     "start_offset" : 8,
>>>     "end_offset" : 12,
>>>     "type" : "<NUM>",
>>>     "position" : 3
>>>   } ]
>>> }
>>>
>>> Question: Should this not also have generated a concatenated token of
>>> the form 3456781234?
>>>
>>> Anything I am missing here?
>>>
>>> -Amit.
>>>
>>  --
>> You received this message because you are subscribed to the Google Groups
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to elasticsearch+unsubscr...@googlegroups.com.
>> For more options, visit https://groups.google.com/groups/opt_out.
>>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAAOGaQKGSvbjGHFvRKtqhW0zPitD-DK%2B2%3DMVBrjS4THJUN4Duw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
