[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14985076#comment-14985076 ]
Uwe Schindler commented on LUCENE-6874:
---------------------------------------

My personal opinion on this:
- The thing is called WhitespaceTokenizer, so it should do what the name says (split on isWhitespace).
- If we want something else, maybe provide a separate CharTokenizer implementation that also splits on NBSP.

In general, WhitespaceTokenizer is not used for "classical" full-text. For that type of text one would better use StandardTokenizer, ICU's tokenizers, or the language-specific ones for Chinese or Japanese. People using WhitespaceTokenizer are more likely to have very special types of fields, such as a list of whitespace-separated tokens used for faceting, or a list of product numbers. These types of tokens were always good to handle with WhitespaceTokenizer. If you wanted to keep your facet tokens together, you were able to use NBSP! So a change here would be a break for those apps :-)

So I would just update the documentation to explain what this thing does (splitting on characters for which Character.isWhitespace returns true, not on space characters in general).

> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
>                 Key: LUCENE-6874
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6874
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Priority: Minor
>
> WhitespaceTokenizer uses [Character.isWhitespace|http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-] to decide what is whitespace. Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this. I am aware it's easy to work around but why leave this trap in by default?
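
For readers who want to see the difference discussed above, here is a minimal sketch in plain Java, with no Lucene dependency. The class NbspSplitDemo and its split helper are hypothetical names made up for this illustration; they only mimic "split wherever a predicate says so" and are not Lucene's implementation. Only Character.isWhitespace and Character.isSpaceChar are real JDK methods. The first predicate keeps NBSP inside a token, and the second, an assumed alternative that also accepts Character.isSpaceChar, splits on it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class NbspSplitDemo {

    /** Hypothetical helper: splits text wherever isSeparator matches a code point. */
    static List<String> split(String text, IntPredicate isSeparator) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isSeparator.test(cp)) {
                // A separator ends the current token, if there is one.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Assumed alternative rule: Java whitespace plus all Unicode space separators, NBSP included. */
    static boolean isWhitespaceOrSpaceChar(int cp) {
        return Character.isWhitespace(cp) || Character.isSpaceChar(cp);
    }

    public static void main(String[] args) {
        // Facet-style values glued together with NBSP (U+00A0).
        String facets = "red\u00A0wine white\u00A0wine beer";

        // Character.isWhitespace excludes NBSP, so each facet stays in one piece.
        System.out.println(split(facets, Character::isWhitespace).size());

        // The broader predicate breaks on NBSP as well.
        System.out.println(split(facets, NbspSplitDemo::isWhitespaceOrSpaceChar).size());
    }
}
```

With the first predicate the NBSP-joined facets survive as single tokens, which is the behavior existing applications may rely on. In Lucene itself, the separate tokenizer suggested in the comment would presumably be a CharTokenizer subclass overriding isTokenChar(int); that part is not shown here.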