[jira] [Commented] (LUCENE-5096) WhitespaceTokenizer supports Java whitespace, should also support Unicode whitespace

Uwe Schindler (JIRA) Mon, 08 Jul 2013 00:33:53 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13701843#comment-13701843
 ]


Uwe Schindler commented on LUCENE-5096:
---------------------------------------

Hi,
Lucene is flexible enough to make this configurable. Just subclass 
CharTokenizer and provide your own list of "whitespace".

Several people have asked me for a WhitespaceTokenizer that also treats 
characters like "-" as whitespace, and this was the way to go. When there are 
many such characters, a fast and flexible approach is to use a 
java.util.BitSet: mark all chars that are "whitespace" and then query the 
bitset in isTokenChar(int). Alternatively, use a chain of ifs.
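
A minimal sketch of the BitSet idea, written in plain Java so it runs without 
Lucene on the classpath. DelimiterSet and its split() method are hypothetical 
names for illustration, not Lucene API. In a real tokenizer only the 
isTokenChar(int) body would move into your CharTokenizer subclass, whose 
constructor arguments differ between Lucene versions.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical helper: remembers which code points should split tokens. */
final class DelimiterSet {
    private final BitSet delimiters = new BitSet();

    /** Marks every given code point as "whitespace". */
    DelimiterSet(int... codePoints) {
        for (int cp : codePoints) {
            delimiters.set(cp);
        }
    }

    /** Same idea as CharTokenizer.isTokenChar(int): true keeps the char in the token. */
    boolean isTokenChar(int codePoint) {
        return !delimiters.get(codePoint);
    }

    /** Plain-Java stand-in for the tokenizer loop, for demonstration only. */
    List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isTokenChar(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // A delimiter ends the current token; runs of delimiters emit nothing.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

For example, new DelimiterSet(0x20, 0x2D, 0x00A0) treats space, "-" and the 
no-break space as delimiters. The lookup is a single bit test per code point, 
which is why it scales better than a long chain of ifs.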

An alternative (if you are on Solr) is to inject a CharFilter before the 
tokenizer that maps any "special" whitespace to one of the standard ones 
that WhitespaceTokenizer detects.
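
The mapping step can be sketched as a plain java.io.FilterReader, again 
without Lucene on the classpath. UnicodeWhitespaceReader is a hypothetical 
name. The four characters it maps are the Unicode White_Space code points that 
Character.isWhitespace rejects (U+0085 and the three no-break spaces). In 
Lucene or Solr you would more likely configure the existing MappingCharFilter 
than write a reader like this yourself.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

/** Hypothetical reader that rewrites "special" whitespace to a plain space. */
final class UnicodeWhitespaceReader extends FilterReader {

    UnicodeWhitespaceReader(Reader in) {
        super(in);
    }

    /** One-to-one mapping, so character offsets are unchanged. */
    static char normalize(char c) {
        switch (c) {
            case 0x0085: // NEXT LINE
            case 0x00A0: // NO-BREAK SPACE
            case 0x2007: // FIGURE SPACE
            case 0x202F: // NARROW NO-BREAK SPACE
                return ' ';
            default:
                return c;
        }
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c < 0 ? c : normalize((char) c);
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // When n is -1 (end of stream) the loop body never runs.
        for (int i = off; i < off + n; i++) {
            buf[i] = normalize(buf[i]);
        }
        return n;
    }
}
```

Because each special character is replaced by exactly one space, token 
offsets still line up with the original text, which matters for highlighting.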
                
> WhitespaceTokenizer supports Java whitespace, should also support Unicode 
> whitespace
> ------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5096
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5096
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 4.3.1
>         Environment: all
>            Reporter: Jörg Prante
>            Priority: Minor
>
> The whitespace tokenizer supports only Java whitespace as defined in 
> http://docs.oracle.com/javase/6/docs/api/java/lang/Character.html#isWhitespace(char)
> A useful improvement would be to support also Unicode whitespace as defined 
> in the Unicode property list 
> http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt

