Hi there,

I'm trying to use the shingle token filter to get bigrams and trigrams. This is how I create the index:

    curl -X POST 'localhost:9200/idnes/' -d '{
      "settings" : {
        "analysis" : {
          "filter": {
            "czech_stop": {
              "type": "stop",
              "stopwords": "_czech_",
              "ignore_case": "true",
              "remove_trailing": "false"
            },
            "czech_stop_ngram": {
              "type": "stop",
              "stopwords" : ["a", "i", "k", "o", "s", "u", "v", "z", "do", "co", "by", "do",
                             "je", "mu", "mi", "mě", "mně", "mne", "na", "ne", "ní", "si", "se",
                             "ta", "to", "té", "ti", "ty", "už", "ve", "za", "že", "aby", "ani",
                             "ale", "byl", "jak", "jen", "jde", "kdo", "kdy", "kde", "něm", "nich",
                             "něj", "než", "pro", "tak", "ten", "tam", "tady", "těch", "jsou",
                             "jsem", "není", "nyní", "nimi", "jako", "jaká", "jaké", "jaká",
                             "právě", "který", "která", "které", "jeho", "její", "nebo", "jako",
                             "toho", "kdyby", "takový", "taková", "takové", "_czech_" ],
              "ignore_case": "true",
              "remove_trailing": "false"
            },
            "czech_keywords": {
              "type": "keyword_marker",
              "keywords": ["že"]
            },
            "czech_stemmer": {
              "type": "stemmer",
              "language": "czech"
            },
            "shingle2_filter": {
              "type": "shingle",
              "min_shingle_size": 2,
              "max_shingle_size": 2,
              "output_unigrams": false
            },
            "shingle3_filter": {
              "type": "shingle",
              "min_shingle_size": 3,
              "max_shingle_size": 3,
              "output_unigrams": false
            }
          },
          "analyzer": {
            ....
            "shingle2s_analyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": ["standard", "lowercase", "czech_stop_ngram", "shingle2_filter"]
            },
            "shingle3s_analyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": ["czech_stop_ngram", "shingle3_filter" ]
            }
          }
        }
      },
      "mappings" : {
        "article" : {
          "_id" : { "path" : "reference" },
          "properties" : {
            .....
            "content2" : { "type":"string", "analyzer": "shingle2_analyzer"},
            "content3" : { "type":"string", "analyzer": "shingle3_analyzer"},
            "content4" : { "type":"string", "analyzer": "shingle2s_analyzer"},
            "content5" : { "type":"string", "analyzer": "shingle3s_analyzer"},
            ......
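
For reference, what the chain czech_stop_ngram followed by shingle3_filter produces can be modelled with a small Python sketch. It is only a simplified model, not Lucene's actual implementation. It assumes that every removed stopword leaves a position gap, that the shingle filter fills a gap with "_" and emits at most max_shingle_size - 1 fillers per gap, and it ignores stopwords at the very end of the input. The function name `shingles` and the stopword set passed to it are made up for illustration.

```python
def shingles(tokens, stopwords, size=3, filler="_"):
    """Rough model of a stop filter followed by a fixed-size shingle filter."""
    stream = []
    gap = 0  # stopwords removed since the last kept token
    for tok in tokens:
        # ignore_case is "true" in the filter settings, so compare lowercased
        if tok.lower() in stopwords:
            gap += 1
            continue
        # Assumption: a gap is filled with at most size - 1 filler tokens
        stream.extend([filler] * min(gap, size - 1))
        stream.append(tok)
        gap = 0
    # Trailing stopwords are dropped here (a simplification of this sketch).
    # Emit every window of exactly `size` consecutive tokens, no unigrams.
    return [" ".join(stream[i:i + size]) for i in range(len(stream) - size + 1)]
```

With the tokens of "a e i o u Norská strana zatím" and the hypothetical stopword set {"a", "i", "o", "u", "strana"}, this model yields shingles such as "_ e _" and "Norská _ zatím", and never a single-word term.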

If I test my analyzers by calling:

    curl -X GET 'localhost:9200/idnes/_analyze?analyzer=shingle3s_analyzer&pretty' -d 'a e i o u s k z na ke ze nad pod za před Norská strana zatím dostatečně nevyhodnotila, jak citlivou otázkou je pro Česko případ synů Evy Michalákové. Tak popisuje současnou situaci premiér Bohuslav Sobotka. Ten již dostal odpověď na dopis od premiérky Norska Erny Solbergové. S obecnými odpověďmi není spokojen a zvažuje do Norska další psaní.' | grep "token"

it works fine. The result contains only trigrams:

    "tokens" : [ {
    "token" : "_ e _",
    "token" : "e _ _",
    "token" : "_ _ Norská",
    "token" : "_ Norská _",
    "token" : "Norská _ zatím",
    "token" : "_ zatím dostatečně",
    "token" : "zatím dostatečně nevyhodnotila",
    "token" : "dostatečně nevyhodnotila _",
    "token" : "nevyhodnotila _ citlivou",
    "token" : "_ citlivou otázkou",
    "token" : "citlivou otázkou _",
    "token" : "otázkou _ _",
    ....

But there is a problem when I use the same analyzer on indexed data:

    POST idnes/_search?pretty=true
    {
      "query": {
        "match": { "content_type": "Article" }
      },
      "facets" : {
        "tag" : {
          "terms" : {
            "fields" : ["content5"],
            "size" : 20
          }
        }
      }
    }

The response also contains unigrams:

    "facets": {
      "tag": {
        "_type": "terms",
        "missing": 452,
        "total": 926077,
        "other": 762645,
        "terms": [
          { "term": "a", "count": 18150 },
          { "term": "to", "count": 17131 },
          { "term": "je", "count": 14090 },
          { "term": "se", "count": 13621 },
          { "term": "na", "count": 12285 },
          ......
          { "term": "korun _ _", "count": 551 },
          { "term": "_ _ případě", "count": 499 },
          { "term": "zobrazení videa musíte", "count": 449 }
          .....

1. Why does this happen?
2. Is there any other way to get rid of the "_" that stands in for removed stopwords? The approach described in http://www.elasticsearch.org/blog/searching-with-shingles/ doesn't work with Lucene 4.4+.

Thanks
Petr