[
https://issues.apache.org/jira/browse/LUCENE-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13701843#comment-13701843
]
Uwe Schindler commented on LUCENE-5096:
---------------------------------------
Hi,
Lucene is flexible enough to make this configurable: just subclass
CharTokenizer and provide your own definition of "whitespace".
Several people have asked me for a WhitespaceTokenizer that also treats things
like "-" as whitespace, and this was the way to go. A fast and flexible
approach when there are many such chars is to use a java.util.BitSet: mark
all chars that count as "whitespace", then query the bitset in
isTokenChar(int). Alternatively, use a chain of ifs.
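To illustrate the BitSet idea, here is a minimal sketch in plain Java with no
Lucene dependency. The class name SeparatorSet and its constructor are
hypothetical; only the isTokenChar(int) contract (true means "part of a token")
is taken from CharTokenizer. In a real CharTokenizer subclass, the
isTokenChar(int) override would presumably just delegate to a lookup like this.

```java
import java.util.BitSet;

/** Hypothetical helper: remembers which code points should split tokens. */
final class SeparatorSet {
    private final BitSet separators = new BitSet();

    /** Marks everything Java considers whitespace, plus the given extra code points. */
    SeparatorSet(int... extraSeparators) {
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.isWhitespace(cp)) {
                separators.set(cp);
            }
        }
        for (int cp : extraSeparators) {
            separators.set(cp);
        }
    }

    /** Mirrors the isTokenChar(int) contract: true if c belongs inside a token. */
    boolean isTokenChar(int c) {
        return !separators.get(c);
    }
}
```

For example, new SeparatorSet('-', 0x00A0) would treat the hyphen and the
no-break space as separators in addition to the Java whitespace set. The bitset
is built once, so each isTokenChar call is a single bit lookup.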
An alternative (if you are on Solr) is to inject a CharFilter before the
tokenizer that maps any "special" whitespace to one of the standard characters
that WhitespaceTokenizer detects.
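The mapping such a CharFilter would apply can be sketched in plain Java as
follows. This is not a Lucene CharFilter, only the character mapping itself;
the class WhitespaceNormalizer is hypothetical. The four code points listed
are, as far as I can tell, the Unicode White_Space characters that
Character.isWhitespace rejects (NEXT LINE and the three non-breaking spaces);
treat that list as an assumption and check it against PropList.txt. In Lucene
or Solr, the same mapping would more likely be configured through a mapping
char filter such as MappingCharFilter.

```java
/** Hypothetical helper: rewrites "special" whitespace to a plain space. */
final class WhitespaceNormalizer {

    /** Assumed list: Unicode White_Space code points that are not Java whitespace. */
    static boolean isSpecialWhitespace(int cp) {
        return cp == 0x0085   // NEXT LINE (NEL)
            || cp == 0x00A0   // NO-BREAK SPACE
            || cp == 0x2007   // FIGURE SPACE
            || cp == 0x202F;  // NARROW NO-BREAK SPACE
    }

    /** Replaces each special whitespace code point with U+0020, one for one. */
    static String normalize(CharSequence input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(
            cp -> out.appendCodePoint(isSpecialWhitespace(cp) ? ' ' : cp));
        return out.toString();
    }
}
```

Because every replacement is one char for one char, the text length does not
change, so token offsets would still line up with the original input.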
> WhitespaceTokenizer supports Java whitespace, should also support Unicode
> whitespace
> ------------------------------------------------------------------------------------
>
> Key: LUCENE-5096
> URL: https://issues.apache.org/jira/browse/LUCENE-5096
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Affects Versions: 4.3.1
> Environment: all
> Reporter: Jörg Prante
> Priority: Minor
>
> The whitespace tokenizer supports only Java whitespace as defined in
> http://docs.oracle.com/javase/6/docs/api/java/lang/Character.html#isWhitespace(char)
> A useful improvement would be to also support Unicode whitespace as defined
> in the Unicode property list
> http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt