Hi all,

Recently, I've been working on an extension to Lucene's Standard Tokenizer
that allows the user to customize or override the default word-boundary
break rules for Unicode characters. The Standard Tokenizer implements the
word break rules from the Unicode Text Segmentation algorithm (UAX #29,
http://www.unicode.org/reports/tr29/), where most punctuation symbols
(except for underscore '_') are treated as hard word breaks (e.g. "@foo"
and "#foo" are both tokenized to "foo"). While the Standard Tokenizer works
great in most cases, I found that being unable to override the default word
break rules was quite limiting, especially since many of these punctuation
symbols now carry important meaning on the web (@ for mentions, # for
hashtags, etc.).
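To illustrate the difference this makes, here's a rough Python sketch of the
two behaviors. This is not a real UAX #29 implementation and not the plugin's
actual code -- just a toy tokenizer showing what happens when '@' and '#' are
treated as word breaks versus word characters:

```python
import re

def tokenize(text, extra_word_chars=""):
    # Word characters: letters, digits, underscore (like the default rules),
    # plus any user-supplied override characters.
    pattern = r"[\w" + re.escape(extra_word_chars) + r"]+"
    return re.findall(pattern, text)

# Default-style behavior: '@' and '#' act as hard breaks and are dropped.
print(tokenize("@foo #bar"))                         # ['foo', 'bar']

# Overridden behavior: '@' and '#' are kept as part of the token.
print(tokenize("@foo #bar", extra_word_chars="@#"))  # ['@foo', '#bar']
```

With the extension, the analogous override happens inside the tokenizer's
word-break rules themselves rather than via a regex.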

I've wrapped this extension to the Standard Tokenizer in an Elasticsearch
plugin, which can be found at
https://github.com/bbguitar77/elasticsearch-analysis-standardext --
definitely looking for feedback, as this is my first go at an Elasticsearch
plugin!

I'm hoping other Elasticsearch / Lucene users find this helpful.

Cheers!
Bryan
