
Re: Using shingle

2015-03-17 Thread Petr Janský

No one? :-(

Petr





Using shingle

2015-02-20 Thread Petr Janský

Hi there,

I've tried to use the shingle token filter to get bigrams and trigrams:

curl -X POST 'localhost:9200/idnes/' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "czech_stop": {
          "type": "stop",
          "stopwords": "_czech_",
          "ignore_case": true,
          "remove_trailing": false
        },
        "czech_stop_ngram": {
          "type": "stop",
          "stopwords": ["a", "i", "k", "o", "s", "u", "v", "z", "do",
            "co", "by", "do", "je", "mu", "mi", "mě", "mně", "mne", "na", "ne", "ní",
            "si", "se", "ta", "to", "té", "ti", "ty", "už", "ve", "za", "že", "aby",
            "ani", "ale", "byl", "jak", "jen", "jde", "kdo", "kdy", "kde", "něm",
            "nich", "něj", "než", "pro", "tak", "ten", "tam", "tady", "těch", "jsou",
            "jsem", "není", "nyní", "nimi", "jako", "jaká", "jaké", "jaká", "právě",
            "který", "která", "které", "jeho", "její", "nebo", "jako", "toho", "kdyby",
            "takový", "taková", "takové", "_czech_"],
          "ignore_case": true,
          "remove_trailing": false
        },
        "czech_keywords": {
          "type": "keyword_marker",
          "keywords": ["že"]
        },
        "czech_stemmer": {
          "type": "stemmer",
          "language": "czech"
        },
        "shingle2_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": false
        },
        "shingle3_filter": {
          "type": "shingle",
          "min_shingle_size": 3,
          "max_shingle_size": 3,
          "output_unigrams": false
        }
      },
      "analyzer": {
        "shingle2s_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "czech_stop_ngram", "shingle2_filter"]
        },
        "shingle3s_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["czech_stop_ngram", "shingle3_filter"]
        }
      }
    }
  },

  "mappings": {
    "article": {
      "_id": {
        "path": "reference"
      },

      "properties": {
        .
        "content2": { "type": "string", "analyzer": "shingle2_analyzer" },
        "content3": { "type": "string", "analyzer": "shingle3_analyzer" },
        "content4": { "type": "string", "analyzer": "shingle2s_analyzer" },
        "content5": { "type": "string", "analyzer": "shingle3s_analyzer" },
        ..

If I test my analyzers by calling:

curl -X GET
'localhost:9200/idnes/_analyze?analyzer=shingle3s_analyzer&pretty' -d 'a e
i o u s k z na ke ze nad pod za před Norská strana zatím dostatečně
nevyhodnotila, jak citlivou otázkou je pro Česko případ synů Evy
Michalákové. Tak popisuje současnou situaci premiér Bohuslav Sobotka. Ten
již dostal odpověď na dopis od premiérky Norska Erny Solbergové. S obecnými
odpověďmi není spokojen a zvažuje do Norska další psaní.' | grep token

It works fine. The results contain only trigrams:

"tokens" : [ {
"token" : "_ e _",
"token" : "e _ _",
"token" : "_ _ Norská",
"token" : "_ Norská _",
"token" : "Norská _ zatím",
"token" : "_ zatím dostatečně",
"token" : "zatím dostatečně nevyhodnotila",
"token" : "dostatečně nevyhodnotila _",
"token" : "nevyhodnotila _ citlivou",
"token" : "_ citlivou otázkou",
"token" : "citlivou otázkou _",
"token" : "otázkou _ _",
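
The _ entries in the tokens above are filler tokens: the stop filter leaves a position gap for every word it removes, and the shingle filter fills those gaps with _. The Python sketch below is only a simplified model of that behaviour, not the Lucene implementation. It assumes whitespace tokenization, caps the number of fillers per gap at size - 1 (which is what the output above suggests), ignores gaps at the very end of the text, and uses a hand-picked stop word set, because the contents of _czech_ are not listed in this post. The names stop_filter, shingles and ASSUMED_STOPWORDS are made up for the illustration.

```python
# Assumption: a small stand-in for czech_stop_ngram; the real _czech_ list is larger.
ASSUMED_STOPWORDS = {
    "a", "i", "o", "u", "s", "k", "z", "na", "ke", "ze", "nad", "pod",
    "za", "před", "strana", "jak", "je", "pro",
}


def stop_filter(tokens, stopwords):
    """Yield (term, position_increment) pairs with stop words removed.

    Each removed token adds 1 to the position increment of the next
    token that is kept, which is how the gap stays visible downstream.
    """
    skipped = 0
    for term in tokens:
        if term.lower() in stopwords:  # mirrors ignore_case: true
            skipped += 1
            continue
        yield term, skipped + 1
        skipped = 0


def shingles(stream, size, filler="_"):
    """Build word n-grams of exactly `size` terms from (term, increment) pairs.

    Every position gap is padded with filler tokens, at most size - 1 per gap,
    and a sliding window of `size` terms is then taken over the padded sequence.
    """
    max_fillers = size - 1
    padded = []
    for term, increment in stream:
        padded.extend([filler] * min(increment - 1, max_fillers))
        padded.append(term)
    return [" ".join(padded[i:i + size]) for i in range(len(padded) - size + 1)]


text = ("a e i o u s k z na ke ze nad pod za před Norská strana zatím "
        "dostatečně nevyhodnotila jak citlivou otázkou je pro Česko")
trigrams = shingles(stop_filter(text.split(), ASSUMED_STOPWORDS), size=3)
for trigram in trigrams:
    print(trigram)
```

With the assumed stop word set, the first twelve trigrams printed are the same as the tokens listed above.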


But there is an issue when I use it on indexed data:

POST idnes/_search?pretty=true
{
  "query": {
    "match": {
      "content_type": "Article"
    }
  },
  "facets": {
    "tag": {
      "terms": {
        "fields": ["content5"],
        "size": 20
      }
    }
  }
}

In the response there are also unigrams:

"facets": {
  "tag": {
    "_type": "terms",
    "missing": 452,
    "total": 926077,
    "other": 762645,
    "terms": [
      {
        "term": "a",
        "count": 18150
      },
      {
        "term": "to",
        "count": 17131
      },
      {
        "term": "je",
        "count": 14090
      },
      {
        "term": "se",
        "count": 13621
      },
      {
        "term": "na",
        "count": 12285
      },
      ..
      {
        "term": "korun _ _",
        "count": 551
      },
      {
        "term": "_ _ případě",
        "count": 499
      },
      {
        "term": "zobrazení videa musíte",
        "count": 449
      }
      .


   1. Why does this happen?
   2. Is there any way to get rid of the _ placeholders left by removed
   stop words, other than the approach described in
   http://www.elasticsearch.org/blog/searching-with-shingles/,
   which doesn't work with Lucene 4.4+?
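
To make the symptom concrete, here is a minimal client-side sketch in Python that sorts facet entries into clean trigrams, trigrams containing the _ filler, and terms that are not trigrams at all. It does not explain why the unigrams were indexed and does not change what Elasticsearch stores; it only post-processes a response shaped like the one above. The function name split_facet_terms and the sample list (a few entries copied from the response) are assumptions for the illustration.

```python
def split_facet_terms(terms, size, filler="_"):
    """Partition facet entries by how well they match an n-gram of `size` words.

    Returns three lists: clean n-grams, n-grams that contain the filler
    token, and entries whose word count differs from `size`.
    """
    clean, with_filler, wrong_size = [], [], []
    for entry in terms:
        words = entry["term"].split(" ")
        if len(words) != size:
            wrong_size.append(entry)
        elif filler in words:
            with_filler.append(entry)
        else:
            clean.append(entry)
    return clean, with_filler, wrong_size


# A few entries copied from the facet response above.
sample_terms = [
    {"term": "a", "count": 18150},
    {"term": "korun _ _", "count": 551},
    {"term": "_ _ případě", "count": 499},
    {"term": "zobrazení videa musíte", "count": 449},
]

clean, with_filler, wrong_size = split_facet_terms(sample_terms, size=3)
print("clean trigrams:", [e["term"] for e in clean])
print("with filler:   ", [e["term"] for e in with_filler])
print("not trigrams:  ", [e["term"] for e in wrong_size])
```

Because a terms facet returns only the top entries, filtering afterwards shrinks the list, so a larger size would have to be requested to end up with 20 clean trigrams.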

Thanks
Petr

-- 
You received this message because you are subscribed to the Google Groups 
elasticsearch group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/0d2aa0fb-2a12-404d-bdf4-bb09b970cb5c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
