Hi all,
I'm looking for a way to remove stop words from tokens returned by a
keyword tokenizer, i.e., I'd like to obtain the original text without stop
words after the analysis process.
Sample data looks like:
    El corregimiento de Mulaló, jurisdicción del municipio de Yumbo (Valle del Cauca)
After the lowercase token filter:
    el corregimiento de mulaló, jurisdicción del municipio de yumbo (valle del cauca)
After the ascii folding token filter:
    el corregimiento de mulalo, jurisdiccion del municipio de yumbo (valle del cauca)
After removing stop words:
    corregimiento mulalo, municipio yumbo (valle cauca)
The stop words (currently) are:
    [la, el, de, del, los, las, jurisdiccion]
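To make the expected result concrete, here is a rough sketch of the same steps in plain Python, outside Elasticsearch. It is only an illustration of the output I'm after, not an analyzer configuration: the names `fold_ascii` and `strip_stop_words` are made up for this example, and the NFKD-based folding only approximates what the ascii folding token filter does.

```python
import re
import unicodedata

# Stop words from the list above (already lowercased and ASCII-folded).
STOP_WORDS = ["la", "el", "de", "del", "los", "las", "jurisdiccion"]


def fold_ascii(text):
    """Approximate ASCII folding: decompose accented characters and
    drop the combining marks (assumption: enough for Spanish text)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase, fold, then remove stop words as whole words while
    leaving the rest of the string (punctuation included) untouched."""
    normalized = fold_ascii(text.lower())
    # \b keeps "la" from matching inside "mulalo"; the trailing \s*
    # swallows the whitespace that followed the removed word.
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in stop_words) + r")\b\s*"
    return re.sub(pattern, "", normalized).strip()


sample = ("El corregimiento de Mulaló, jurisdicción del municipio "
          "de Yumbo (Valle del Cauca)")
print(strip_stop_words(sample))
# corregimiento mulalo, municipio yumbo (valle cauca)
```

Note that the regular expression here is generated from the stop words list, which is the part I would like to avoid maintaining by hand.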
Is the pattern replace token filter the only (or best) way to go for such a
task?
I'd much rather specify a stop words list, which I know works perfectly
well with other tokenizers, than write custom regular expressions.
Regards,
Germán