No one? :-( Petr
On Friday, 20 February 2015 at 15:29:15 UTC+1, Petr Janský wrote:
>
> Hi there,
>
> I've tried to use the shingle filter to get bigrams and trigrams:
>
> curl -X POST 'localhost:9200/idnes/' -d '{
>   "settings" : {
>     "analysis" : {
>       "filter": {
>         "czech_stop": {
>           "type": "stop",
>           "stopwords": "_czech_",
>           "ignore_case": "true",
>           "remove_trailing": "false"
>         },
>         "czech_stop_ngram": {
>           "type": "stop",
>           "stopwords" : ["a", "i", "k", "o", "s", "u", "v", "z", "do",
>             "co", "by", "do", "je", "mu", "mi", "mě", "mně", "mne", "na",
>             "ne", "ní", "si", "se", "ta", "to", "té", "ti", "ty", "už",
>             "ve", "za", "že", "aby", "ani", "ale", "byl", "jak", "jen",
>             "jde", "kdo", "kdy", "kde", "něm", "nich", "něj", "než", "pro",
>             "tak", "ten", "tam", "tady", "těch", "jsou", "jsem", "není",
>             "nyní", "nimi", "jako", "jaká", "jaké", "jaká", "právě",
>             "který", "která", "které", "jeho", "její", "nebo", "jako",
>             "toho", "kdyby", "takový", "taková", "takové", "_czech_" ],
>           "ignore_case": "true",
>           "remove_trailing": "false"
>         },
>         "czech_keywords": {
>           "type": "keyword_marker",
>           "keywords": ["že"]
>         },
>         "czech_stemmer": {
>           "type": "stemmer",
>           "language": "czech"
>         },
>         "shingle2_filter": {
>           "type": "shingle",
>           "min_shingle_size": 2,
>           "max_shingle_size": 2,
>           "output_unigrams": false
>         },
>         "shingle3_filter": {
>           "type": "shingle",
>           "min_shingle_size": 3,
>           "max_shingle_size": 3,
>           "output_unigrams": false
>         }
>       },
>       "analyzer": {
>         ....
>         "shingle2s_analyzer": {
>           "type": "custom",
>           "tokenizer": "standard",
>           "filter": ["standard", "lowercase", "czech_stop_ngram",
>             "shingle2_filter"]
>         },
>         "shingle3s_analyzer": {
>           "type": "custom",
>           "tokenizer": "standard",
>           "filter": ["czech_stop_ngram", "shingle3_filter" ]
>         }
>       }
>     }
>   },
>
>   "mappings" : {
>     "article" : {
>       "_id" : {
>         "path" : "reference"
>       },
>
>       "properties" : {
>         .....
>         "content2" : { "type":"string", "analyzer": "shingle2_analyzer"},
>         "content3" : { "type":"string", "analyzer": "shingle3_analyzer"},
>         "content4" : { "type":"string", "analyzer": "shingle2s_analyzer"},
>         "content5" : { "type":"string", "analyzer": "shingle3s_analyzer"},
>         ......
>
> If I test my analyzers by calling:
>
> curl -X GET
> 'localhost:9200/idnes/_analyze?analyzer=shingle3s_analyzer&pretty' -d 'a e
> i o u s k z na ke ze nad pod za před Norská strana zatím dostatečně
> nevyhodnotila, jak citlivou otázkou je pro Česko případ synů Evy
> Michalákové. Tak popisuje současnou situaci premiér Bohuslav Sobotka. Ten
> již dostal odpověď na dopis od premiérky Norska Erny Solbergové. S obecnými
> odpověďmi není spokojen a zvažuje do Norska další psaní.' | grep "token"
>
> it works fine. The result contains only trigrams:
>
> "tokens" : [ {
>   "token" : "_ e _",
>   "token" : "e _ _",
>   "token" : "_ _ Norská",
>   "token" : "_ Norská _",
>   "token" : "Norská _ zatím",
>   "token" : "_ zatím dostatečně",
>   "token" : "zatím dostatečně nevyhodnotila",
>   "token" : "dostatečně nevyhodnotila _",
>   "token" : "nevyhodnotila _ citlivou",
>   "token" : "_ citlivou otázkou",
>   "token" : "citlivou otázkou _",
>   "token" : "otázkou _ _",
>   ....
>
> But there is an issue when I use it on indexed data:
>
> POST idnes/_search?pretty=true
> {
>   "query": {
>     "match": {
>       "content_type": "Article"
>     }
>   },
>   "facets" : {
>     "tag" : {
>       "terms" : {
>         "fields" : ["content5"],
>         "size" : 20
>       }
>     }
>   }
> }
>
> The response also contains unigrams:
>
> "facets": {
>   "tag": {
>     "_type": "terms",
>     "missing": 452,
>     "total": 926077,
>     "other": 762645,
>     "terms": [
>       {
>         "term": "a",
>         "count": 18150
>       },
>       {
>         "term": "to",
>         "count": 17131
>       },
>       {
>         "term": "je",
>         "count": 14090
>       },
>       {
>         "term": "se",
>         "count": 13621
>       },
>       {
>         "term": "na",
>         "count": 12285
>       },
>       ......
>       {
>         "term": "korun _ _",
>         "count": 551
>       },
>       {
>         "term": "_ _ případě",
>         "count": 499
>       },
>       {
>         "term": "zobrazení videa musíte",
>         "count": 449
>       }
>       .....
>
> 1. Why does this happen?
> 2. Is there any other way to drop the "_" filler tokens left by removed
>    stopwords than the approach described in
>    http://www.elasticsearch.org/blog/searching-with-shingles/ which
>    doesn't work with Lucene 4.4+?
>
> Thanks
> Petr
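
For readers wondering where the "_" entries in the quoted _analyze output come from, here is a rough Python sketch of the idea: the stop filter removes a word but leaves a position gap, and the shingle filter fills that gap with a filler token before building its n-grams. This is a simplified model, not Lucene's actual implementation. The function names, the cap of (shingle size - 1) fillers per gap, and the tiny stopword subset are assumptions chosen for illustration so that the result lines up with the first tokens of the quoted output.

```python
FILLER = "_"  # placeholder the shingle filter puts where a stopword was removed


def remove_stopwords(tokens, stopwords):
    """Drop stopwords but remember the gap they leave.

    Returns (token, position_increment) pairs; an increment of 1 means the
    token directly follows the previously kept token.
    """
    kept = []
    increment = 1
    for token in tokens:
        if token.lower() in stopwords:  # mimics "ignore_case": "true"
            increment += 1
        else:
            kept.append((token, increment))
            increment = 1
    return kept


def shingles(positioned_tokens, size, filler=FILLER):
    """Build fixed-size shingles, filling position gaps with filler tokens.

    Assumption: at most (size - 1) fillers are inserted per gap.
    """
    stream = []
    for token, increment in positioned_tokens:
        gap = min(increment - 1, size - 1)
        stream.extend([filler] * gap)
        stream.append(token)
    # Slide a window of `size` tokens over the padded stream.
    return [" ".join(stream[i:i + size]) for i in range(len(stream) - size + 1)]


# Hypothetical, shortened input and stopword subset for illustration only.
text = "a e i o u Norská strana zatím dostatečně"
stopwords = {"a", "i", "o", "u", "strana"}

trigrams = shingles(remove_stopwords(text.split(), stopwords), 3)
for trigram in trigrams:
    print(trigram)
```

In this model every emitted term always has three slots, some of them fillers, so it only illustrates the "_" tokens; it does not explain the unigrams seen in the facet response.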