Re: Using shingle

Petr Janský Tue, 17 Mar 2015 07:02:29 -0700

No one? :-(

Petr


On Friday, February 20, 2015 at 15:29:15 UTC+1, Petr Janský wrote:
>
> Hi there,
>
> I've tried to use the shingle filter to get bigrams and trigrams:
>
> curl -X POST 'localhost:9200/idnes/' -d '{
>   "settings" : {
>     "analysis" : {
>       "filter": {
>         "czech_stop": {
>           "type":       "stop",
>           "stopwords":  "_czech_",
>           "ignore_case": "true",
>           "remove_trailing": "false"
>         },
>         "czech_stop_ngram": {
>           "type":       "stop",
>           "stopwords" : ["a", "i", "k", "o", "s", "u", "v", "z", "do", 
> "co", "by", "do", "je", "mu", "mi", "mě", "mně", "mne", "na", "ne", "ní", 
> "si", "se", "ta", "to", "té", "ti", "ty", "už", "ve", "za", "že", "aby", 
> "ani", "ale", "byl", "jak", "jen", "jde", "kdo", "kdy", "kde", "něm", 
> "nich",  "něj", "než", "pro", "tak", "ten", "tam", "tady", "těch", "jsou", 
> "jsem", "není", "nyní", "nimi", "jako", "jaká", "jaké", "jaká", "právě", 
> "který", "která", "které", "jeho", "její", "nebo", "jako", "toho", "kdyby", 
> "takový", "taková", "takové", "_czech_" ],
>           "ignore_case": "true",
>           "remove_trailing": "false"
>         },
>         "czech_keywords": {
>           "type":       "keyword_marker",
>           "keywords":   ["že"] 
>         },
>         "czech_stemmer": {
>           "type":       "stemmer",
>           "language":   "czech"
>         },
>         "shingle2_filter": {
>             "type":             "shingle",
>             "min_shingle_size": 2, 
>             "max_shingle_size": 2, 
>             "output_unigrams":  false   
>         },
>         "shingle3_filter": {
>             "type":             "shingle",
>             "min_shingle_size": 3, 
>             "max_shingle_size": 3, 
>             "output_unigrams":  false
>         }
>       },
>       "analyzer": {
>         ....
>         "shingle2s_analyzer": {
>             "type": "custom",
>             "tokenizer": "standard",
>             "filter": ["standard", "lowercase", "czech_stop_ngram", 
> "shingle2_filter"]
>         },
>         "shingle3s_analyzer": {
>             "type": "custom",
>             "tokenizer": "standard",
>             "filter": ["czech_stop_ngram", "shingle3_filter" ]
>         }
>       }
>     }
>  },
>
>   "mappings" : {
>     "article" : {
>         "_id" : {
>             "path" : "reference"
>         },
>
>     "properties" : {
>         .....
>         "content2"   : { "type":"string", "analyzer": "shingle2_analyzer"},
>         "content3"   : { "type":"string", "analyzer": "shingle3_analyzer"},
>         "content4"   : { "type":"string", "analyzer": 
> "shingle2s_analyzer"},
>         "content5"   : { "type":"string", "analyzer": 
> "shingle3s_analyzer"},
>         ......
>
> If I test my analyzers by calling:
>
> curl -X GET 
> 'localhost:9200/idnes/_analyze?analyzer=shingle3s_analyzer&pretty' -d 'a e 
> i o u s k z na ke ze nad pod za před Norská strana zatím dostatečně 
> nevyhodnotila, jak citlivou otázkou je pro Česko případ synů Evy 
> Michalákové. Tak popisuje současnou situaci premiér Bohuslav Sobotka. Ten 
> již dostal odpověď na dopis od premiérky Norska Erny Solbergové. S obecnými 
> odpověďmi není spokojen a zvažuje do Norska další psaní.' | grep "token"
>
> It works fine. The results contain only trigrams:
>    "tokens" : [ {
>     "token" : "_ e _",
>     "token" : "e _ _",
>     "token" : "_ _ Norská",
>     "token" : "_ Norská _",
>     "token" : "Norská _ zatím",
>     "token" : "_ zatím dostatečně",
>     "token" : "zatím dostatečně nevyhodnotila",
>     "token" : "dostatečně nevyhodnotila _",
>     "token" : "nevyhodnotila _ citlivou",
>     "token" : "_ citlivou otázkou",
>     "token" : "citlivou otázkou _",
>     "token" : "otázkou _ _",
>     ....
>
> But there is an issue when I use it on indexed data:
> POST idnes/_search?pretty=true 
> {
>     "query": {
>         "match": {
>            "content_type": "Article"
>         }
>     }, 
>     "facets" : {
>         "tag" : {
>             "terms" : {
>                 "fields" : ["content5"],
>                 "size" : 20
>             }
>         }
>     }
> }
>
> The response also contains unigrams:
>    "facets": {
>       "tag": {
>          "_type": "terms",
>          "missing": 452,
>          "total": 926077,
>          "other": 762645,
>          "terms": [
>             {
>                "term": "a",
>                "count": 18150
>             },
>             {
>                "term": "to",
>                "count": 17131
>             },
>             {
>                "term": "je",
>                "count": 14090
>             },
>             {
>                "term": "se",
>                "count": 13621
>             },
>             {
>                "term": "na",
>                "count": 12285
>             },
>         ......
>             {
>                "term": "korun _ _",
>                "count": 551
>             },
>             {
>                "term": "_ _ případě",
>                "count": 499
>             },
>             {
>                "term": "zobrazení videa musíte",
>                "count": 449
>             }
>         .....
>
>
>    1. Why does this happen?
>    2. Is there any way to keep the "_" filler left by removed stopwords out 
>    of the shingles, other than the approach described in 
>    http://www.elasticsearch.org/blog/searching-with-shingles/, which 
>    doesn't work with Lucene 4.4+?
>
> Thanks
> Petr
>
>
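
For anyone reading the `_analyze` output quoted above: here is a minimal Python sketch (a simplified model, not Elasticsearch or Lucene code) of why `_` appears in the trigrams. It assumes the stop filter drops a word but keeps its position gap, and that the shingle filter pads each gap with at most `size - 1` fillers. The names `remove_stopwords`, `shingles` and `STOPWORDS` are made up for illustration, and the three stopwords are just a subset of the `czech_stop_ngram` list.

```python
def remove_stopwords(tokens, stopwords):
    """Drop stopwords, but record how many positions were skipped before each kept token."""
    kept = []
    gap = 0
    for tok in tokens:
        if tok.lower() in stopwords:
            gap += 1  # the word is removed, its position is not
        else:
            kept.append((tok, gap))
            gap = 0
    return kept


def shingles(kept, size, filler="_"):
    """Build fixed-size shingles, padding each gap with at most size - 1 fillers."""
    stream = []
    for tok, gap in kept:
        stream.extend([filler] * min(gap, size - 1))
        stream.append(tok)
    return [" ".join(stream[i:i + size]) for i in range(len(stream) - size + 1)]


# Assumption: a tiny subset of the stopword list from the post.
STOPWORDS = {"jak", "je", "pro"}
TEXT = "zatím dostatečně nevyhodnotila jak citlivou otázkou je pro Česko"

result = shingles(remove_stopwords(TEXT.split(), STOPWORDS), 3)
for shingle in result:
    print(shingle)
```

With this model, every removed stopword still occupies a slot in the shingle, which matches tokens such as `nevyhodnotila _ citlivou` and `otázkou _ _` in the quoted output. The shingle filter documentation for later Elasticsearch releases lists a `filler_token` setting (default `_`); whether it is available in the version used here is something I have not verified.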
