Alexandre - I am sorry if I was not clear: this is about queries, and it all happens at query time. Yes, we can do the substitution with the regex replace filter, but I would propose adding this odd exception to WhitespaceTokenizer so that Lucene deals with it by itself.
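As a minimal sketch of the query-time substitution mentioned above (plain Java regex standing in for the analysis-chain filter; the class and method names here are hypothetical, not part of Lucene or Solr):

```java
import java.util.regex.Pattern;

public class NbspNormalize {
    // U+00A0 (non-breaking space); on the wire it shows up URL-encoded as %C2%A0
    private static final Pattern NBSP = Pattern.compile("\u00A0");

    // Replace every non-breaking space with an ordinary space before tokenizing
    public static String normalize(String query) {
        return NBSP.matcher(query).replaceAll(" ");
    }

    public static void main(String[] args) {
        String raw = "een\u00A0abonnement";
        System.out.println(normalize(raw)); // prints "een abonnement"
    }
}
```

In a Solr analysis chain the same substitution would sit in a char filter before the tokenizer, so WhitespaceTokenizer only ever sees ordinary spaces.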
Markus

-----Original message-----
> From: Alexandre Rafalovitch <arafa...@gmail.com>
> Sent: Wednesday 8th October 2014 16:12
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: WhitespaceTokenizer to consider incorrectly encoded c2a0?
>
> Is this a suggestion for a JIRA ticket? Or a question on how to solve
> it? If the latter, you could probably stick a RegEx replacement in the
> UpdateRequestProcessor chain and be done with it.
>
> As to why? I would look for the rest of the MS Word-generated
> artifacts, such as "smart" quotes, extra-long dashes, etc.
>
> Regards,
>    Alex.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>
>
> On 8 October 2014 09:59, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > Hi,
> >
> > For some crazy reason, some users somehow manage to substitute a perfectly
> > normal space with a badly encoded non-breaking space. Properly URL-encoded
> > this becomes %C2%A0, and depending on the encoding you use to view it you
> > probably see Â followed by a space. For example:
> >
> > Because the non-breaking space U+00A0 (the character behind the C2 A0 byte
> > sequence) is not considered whitespace by the Java Character class, the
> > WhitespaceTokenizer won't split on it, but the WordDelimiterFilter still
> > does, somewhat mitigating the problem, as it becomes:
> >
> > HTMLSCF  een abonnement
> > WT       een abonnement
> > WDF      een eenabonnement abonnement
> >
> > Should the WhitespaceTokenizer not cover this odd edge case?
> >
> > Cheers,
> > Markus
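The behaviour described in the quoted message is easy to confirm: Java classifies U+00A0 as a space character in the Unicode sense, but not as whitespace, which is why WhitespaceTokenizer does not split on it. A quick check:

```java
public class NbspWhitespaceCheck {
    public static void main(String[] args) {
        // An ordinary space is whitespace as far as Java is concerned...
        System.out.println(Character.isWhitespace(' '));      // true
        // ...but the non-breaking space U+00A0 is not, so the
        // WhitespaceTokenizer's split test never fires on it
        System.out.println(Character.isWhitespace('\u00A0')); // false
        // It is still a space character in the Unicode sense (category Zs)
        System.out.println(Character.isSpaceChar('\u00A0'));  // true
    }
}
```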