Re: Russian words not work with synonym token filter

Ilya Kantor Mon, 22 Dec 2014 02:20:07 -0800

This topic is fairly old, but I'll answer it anyway to help others who run 
into a similar problem.


The original poster used this request:
curl -XGET 'http://localhost:9200/test_index/_analyze?text=продажа&analyzer=search&pretty=true'

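The mistake is that the Russian text in the query string was not URL-encoded. 
Because of that, Elasticsearch misread the bytes as Japanese and Korean 
characters, as the <KATAKANA> and <HANGUL> token types in the response show.

Always URL-encode Russian letters (and any other non-ASCII text) in the URL.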
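
For illustration, here is a minimal sketch of how the same request URL could be built with the text percent-encoded as UTF-8. The urlencode helper is hypothetical (it is not part of curl or Elasticsearch) and simply encodes every byte, which is verbose for ASCII but still valid. The host, index and analyzer names are the ones from the original post.

```shell
#!/bin/sh
# Hypothetical helper: percent-encode every byte of the argument.
# od dumps the bytes as hex, sed puts a '%' in front of each pair.
urlencode() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n' | sed 's/../%&/g' | tr 'abcdef' 'ABCDEF'
}

text=$(urlencode 'продажа')
url="http://localhost:9200/test_index/_analyze?text=${text}&analyzer=search&pretty=true"

# Print the encoded URL; to send it, run against your own node:
#   curl -XGET "$url"
echo "$url"
```

Alternatively, curl can do the encoding itself if every parameter is passed as data together with -G, for example: curl -G 'http://localhost:9200/test_index/_analyze' --data-urlencode 'text=продажа' -d 'analyzer=search' -d 'pretty=true'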

Cheers.

On Thursday, March 6, 2014 at 20:31:41 UTC+3, Ivan Brusic wrote:
>
> Despite my name, I do not speak Russian. :) Please excuse my ignorance of 
> the Russian language while I attempt to debug.
>
> Currently, the synonym token filter is being applied after the other three 
> token filters: "snowball_text", "lowercase", and "russian_morphology". In 
> this case, the synonym mapping will be executing key lookups on terms 
> that have been stemmed and lowercased (I do not know what russian_morphology 
> provides). Try moving your synonym filter before any stemming. Placing it 
> after lowercasing is fine, as long as your synonym map has lowercased values 
> (or you set ignore_case to true). In your example, foo/bar/baz undergo no 
> further stemming, so they work as is.
>
> Cheers,
>
> Ivan
>
>
> On Thu, Mar 6, 2014 at 2:39 AM, Владимир Руденко <[email protected]> wrote:
>
>> Hi.
>> I have a test index with these settings:
>> curl -XPOST 'http://localhost:9200/test_index' -d '
>> {
>>     "settings" : {
>>           "number_of_shards" : 5,
>>           "language":"javascript",
>>           "analysis": {
>>                      "filter": {
>>                           "snowball_text" : {
>>                                 "type": "snowball",
>>                                 "language": "Russian"
>>                             },
>>                           "synonym" : {
>>                                 "type" : "synonym",
>>                                 "synonyms_path" : "synonym.txt"
>>                           }
>>                      },
>>                      "analyzer": {
>>                         "search" : {
>>                             "type" :"custom",
>>                             "tokenizer": "standard",
>>                             "filter": ["snowball_text", "lowercase", "russian_morphology", "synonym"]
>>                         }
>>                 }
>>           }
>>     },
>>     "mappings" : {
>>         "test_type" : {
>>             "properties" : {
>>                 "test" : {
>>                     "type" : "string",
>>                     "analyzer" : "search"
>>                 },
>>                 "description" : {
>>                     "type" : "string",
>>                     "analyzer" : "search"
>>                 }
>>             }
>>         }
>>     }
>> }'
>>
>> File synonym.txt:
>> продажа => купить
>> аренда => арендовать, сниму, снять
>> foo => foo bar, baz
>>
>> English words work fine:
>> curl -XGET 'http://localhost:9200/test_index/_analyze?text=foo&analyzer=search&pretty=true'
>> {
>>   "tokens" : [ {
>>     "token" : "foo",
>>     "start_offset" : 0,
>>     "end_offset" : 3,
>>     "type" : "SYNONYM",
>>     "position" : 1
>>   }, {
>>     "token" : "baz",
>>     "start_offset" : 0,
>>     "end_offset" : 3,
>>     "type" : "SYNONYM",
>>     "position" : 1
>>   }, {
>>     "token" : "bar",
>>     "start_offset" : 0,
>>     "end_offset" : 3,
>>     "type" : "SYNONYM",
>>     "position" : 2
>>   } ]
>> }
>>
>> But Russian words do not:
>> curl -XGET 'http://localhost:9200/test_index/_analyze?text=продажа&analyzer=search&pretty=true'
>> {
>>   "tokens" : [ {
>>     "token" : "ﾀ",
>>     "start_offset" : 3,
>>     "end_offset" : 4,
>>     "type" : "<KATAKANA>",
>>     "position" : 1
>>   }, {
>>     "token" : "ﾾ",
>>     "start_offset" : 5,
>>     "end_offset" : 6,
>>     "type" : "<HANGUL>",
>>     "position" : 2
>>   }, {
>>     "token" : "ﾴ",
>>     "start_offset" : 7,
>>     "end_offset" : 8,
>>     "type" : "<HANGUL>",
>>     "position" : 3
>>   }, {
>>     "token" : "ﾰ",
>>     "start_offset" : 9,
>>     "end_offset" : 10,
>>     "type" : "<HANGUL>",
>>     "position" : 4
>>   }, {
>>     "token" : "ﾶ",
>>     "start_offset" : 11,
>>     "end_offset" : 12,
>>     "type" : "<HANGUL>",
>>     "position" : 5
>>   }, {
>>     "token" : "ﾰ",
>>     "start_offset" : 13,
>>     "end_offset" : 14,
>>     "type" : "<HANGUL>",
>>     "position" : 6
>>   } ]
>> }
>>
>> I can't understand what I'm doing wrong.
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/481aacc3-d892-43e3-9024-65d84dcffe56%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/elasticsearch/481aacc3-d892-43e3-9024-65d84dcffe56%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>> For more options, visit https://groups.google.com/groups/opt_out.
>>
>
>

