Stop words and Keyword tokenizer

German Carrillo Thu, 28 Aug 2014 11:48:48 -0700

Hi all,


I'm looking for a way to remove stop words from the single token returned by 
a keyword tokenizer, i.e., I'd like to end up with the original text, minus 
the stop words, after the analysis process. 

Sample data looks like:
    "El corregimiento de Mulaló, jurisdicción del municipio de Yumbo (Valle del Cauca)"
After the lowercase token filter:
    "el corregimiento de mulaló, jurisdicción del municipio de yumbo (valle del cauca)"
After the ascii folding token filter:
    "el corregimiento de mulalo, jurisdiccion del municipio de yumbo (valle del cauca)"
After removing stop words:
    "corregimiento mulalo, municipio yumbo (valle cauca)"

The stop words (currently) are: ["la", "el", "de", "del", "los", "las", "jurisdiccion"]

Is the pattern replace token filter the only (or best) way to go for such a 
task? 

I'd much rather specify a stop word list than write custom regular 
expressions. I know a stop word list would work perfectly fine with other 
tokenizers.
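
To make the target concrete, here is a minimal Python sketch of the behaviour 
I'm after, run outside Elasticsearch and purely for illustration. The helper 
names (fold_ascii, build_stop_pattern, strip_stop_words) are made up for this 
example, and the folding step only approximates what the ascii folding token 
filter does. It builds the regular expression from the stop word list, which 
is the kind of pattern I assume would have to go into the pattern replace 
token filter (with an empty replacement) if there is no better way.

```python
import re
import unicodedata

# Stop word list from the message above.
STOP_WORDS = ["la", "el", "de", "del", "los", "las", "jurisdiccion"]


def fold_ascii(text):
    # Rough stand-in for ascii folding: decompose accented characters
    # and drop the combining marks (e.g. "ó" becomes "o").
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_stop_pattern(stop_words):
    # One alternation over the escaped stop words, anchored on word
    # boundaries so "la" does not match inside "mulalo". Trailing
    # whitespace is consumed so no double spaces are left behind.
    alternation = "|".join(re.escape(word) for word in stop_words)
    return re.compile(r"\b(?:" + alternation + r")\b\s*")


def strip_stop_words(text, stop_words=STOP_WORDS):
    # Lowercase, fold accents, then delete every stop word match.
    folded = fold_ascii(text.lower())
    return build_stop_pattern(stop_words).sub("", folded).strip()


sample = ("El corregimiento de Mulaló, jurisdicción del municipio "
          "de Yumbo (Valle del Cauca)")
print(strip_stop_words(sample))
print(build_stop_pattern(STOP_WORDS).pattern)
```

The second print shows the generated pattern string, so the list stays the 
single source of truth and the regular expression never has to be written by 
hand.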


Regards, 

Germán

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/038ff037-ccf3-4aca-b0c0-bb421531c495%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
