Hi all,

Recently I've been working on an extension to Lucene's Standard Tokenizer that lets the user customize/override the default word-boundary break rules for Unicode characters. The Standard Tokenizer implements the word-break rules from the Unicode Text Segmentation algorithm <http://www.unicode.org/reports/tr29/>, where most punctuation symbols (except the underscore '_') are treated as hard word breaks; for example, "@foo" and "#foo" are both tokenized to "foo". While the Standard Tokenizer works great in most cases, I found that being unable to override the default word-break rules was quite limiting, especially since many of these punctuation symbols now carry important meaning on the web ('@' for mentions, '#' for hashtags, etc.).
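To illustrate the idea (not the plugin's actual implementation, which extends Lucene's StandardTokenizer in Java), here's a minimal Python sketch of a tokenizer whose word-break behavior can be customized so that characters like '@' and '#' count as word characters instead of hard breaks. The function name and regex approach are my own invention for illustration only:

```python
import re

def tokenize(text, extra_word_chars=""):
    """Split text on word boundaries, optionally treating the given
    extra characters (e.g. '@', '#') as part of a word rather than
    as hard word breaks."""
    keep = re.escape(extra_word_chars)
    pattern = rf"[\w{keep}]+" if keep else r"\w+"
    return re.findall(pattern, text)

# Default behavior mirrors the Standard Tokenizer: '@' and '#' break words.
tokenize("email @foo and tag #bar")
# -> ['email', 'foo', 'and', 'tag', 'bar']

# With '@' and '#' promoted to word characters, mentions and hashtags survive.
tokenize("email @foo and tag #bar", "@#")
# -> ['email', '@foo', 'and', 'tag', '#bar']
```

The real plugin does this properly at the Unicode level, by overriding the UAX #29 word-break property assigned to individual characters, rather than with a regex.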
I've wrapped this extension to the Standard Tokenizer in an Elasticsearch plugin, which can be found at https://github.com/bbguitar77/elasticsearch-analysis-standardext. I'm definitely looking for feedback, as this is my first go at an Elasticsearch plugin! I hope other Elasticsearch/Lucene users find it helpful.

Cheers!
Bryan
