[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

Jack Krupansky (JIRA) Tue, 03 Nov 2015 14:47:57 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14988338#comment-14988338 ]


Jack Krupansky commented on LUCENE-6874:
----------------------------------------

Certainly Solr can update its example schemas to use whatever alternative 
tokenizer or option is decided on, so that Solr users, many of whom are not 
Java developers, no longer fall into this NBSP trap. But that still feels 
like a less-than-desirable resolution.

[~thetaphi], could you elaborate on the existing use case that you are 
trying to preserve, ideally with a real-world example? Where do some of 
your NBSPs actually live in the wild?

It seems to me that the vast majority of normal users would not be negatively 
impacted by having "white space" be defined using the Unicode model. I never 
objected to using the Java model, but that's because I had overlooked this 
nuance of NBSP. My concern for Solr users is that NBSP occurs somewhat commonly 
in HTML web pages - as a formatting technique more than an attempt at 
influencing tokenization.
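
To make the difference concrete, here is a minimal sketch in plain Java, with 
no Lucene dependency. The class NbspDemo, the split helper, and the sample 
string are all made up for illustration; this is not how WhitespaceTokenizer 
is implemented. The "Unicode-like" predicate is only an approximation that 
unions Character.isWhitespace with Character.isSpaceChar, so it is not an 
exact match for the Unicode White_Space property.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical demo class, not part of Lucene or Solr.
class NbspDemo {

    // "Java model": Character.isWhitespace excludes the non-breaking
    // spaces U+00A0, U+2007 and U+202F.
    static boolean javaWhitespace(int cp) {
        return Character.isWhitespace(cp);
    }

    // Approximation of a "Unicode model" (an assumption of this sketch):
    // also accept anything Character.isSpaceChar reports, which covers
    // the non-breaking spaces.
    static boolean unicodeLikeWhitespace(int cp) {
        return Character.isWhitespace(cp) || Character.isSpaceChar(cp);
    }

    // Naive splitter: cuts the text wherever the predicate matches and
    // drops empty tokens.
    static List<String> split(String text, IntPredicate isSeparator) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isSeparator.test(cp)) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // NBSP between "price:" and "10", as HTML-derived text might have.
        String text = "price:\u00A010 USD";
        System.out.println(split(text, NbspDemo::javaWhitespace).size());
        System.out.println(split(text, NbspDemo::unicodeLikeWhitespace).size());
    }
}
```

With the Java model, "price:" and "10" stay glued together by the NBSP and 
come out as a single token; with the broader predicate they are separated. 
That glued token is the kind of surprise I would expect a Solr user indexing 
HTML-derived text to hit.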


> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
>                 Key: LUCENE-6874
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6874
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Priority: Minor
>         Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace|http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
> to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
