Alexandre - I am sorry if I was not clear: this is about queries, and it all happens at query time. Yes, we can do the substitution with the regex replace filter, but I would propose adding this odd exception to WhitespaceTokenizer so that Lucene deals with it by itself.
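As a minimal sketch of the query-time substitution mentioned above (plain Java regex standing in for the analysis-chain filter; the class and method names here are hypothetical, not part of Lucene or Solr):

```java
import java.util.regex.Pattern;

public class NbspNormalize {
    // U+00A0 (non-breaking space); on the wire it shows up URL-encoded as %C2%A0
    private static final Pattern NBSP = Pattern.compile("\u00A0");

    // Replace every non-breaking space with an ordinary space before tokenizing
    public static String normalize(String query) {
        return NBSP.matcher(query).replaceAll(" ");
    }

    public static void main(String[] args) {
        String raw = "een\u00A0abonnement";
        System.out.println(normalize(raw)); // prints "een abonnement"
    }
}
```

In a Solr analysis chain the same substitution would sit in a char filter before the tokenizer, so WhitespaceTokenizer only ever sees ordinary spaces.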
Markus

-----Original message-----
> From: Alexandre Rafalovitch <arafa...@gmail.com>
> Sent: Wednesday 8th October 2014 16:12
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: WhitespaceTokenizer to consider incorrectly encoded c2a0?
>
> Is this a suggestion for a JIRA ticket? Or a question on how to solve
> it? If the latter, you could probably stick a RegEx replacement in the
> UpdateRequestProcessor chain and be done with it.
>
> As to why? I would look for the rest of the MS Word-generated
> artifacts, such as "smart" quotes, extra-long dashes, etc.
>
> Regards,
>    Alex.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>
>
> On 8 October 2014 09:59, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > Hi,
> >
> > For some crazy reason, some users somehow manage to substitute a perfectly
> > normal space with a badly encoded non-breaking space. Properly URL-encoded
> > this becomes %C2%A0, and depending on the encoding you use to view it you
> > probably see Â followed by a space. For example:
> >
> > Because the non-breaking space U+00A0 (the character behind the C2 A0 byte
> > sequence) is not considered whitespace by the Java Character class, the
> > WhitespaceTokenizer won't split on it, but the WordDelimiterFilter still
> > does, somewhat mitigating the problem, as it becomes:
> >
> > HTMLSCF  een abonnement
> > WT       een abonnement
> > WDF      een eenabonnement abonnement
> >
> > Should the WhitespaceTokenizer not cover this odd edge case?
> >
> > Cheers,
> > Markus
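The behaviour described in the quoted message is easy to confirm: Java classifies U+00A0 as a space character in the Unicode sense, but not as whitespace, which is why WhitespaceTokenizer does not split on it. A quick check:

```java
public class NbspWhitespaceCheck {
    public static void main(String[] args) {
        // An ordinary space is whitespace as far as Java is concerned...
        System.out.println(Character.isWhitespace(' '));      // true
        // ...but the non-breaking space U+00A0 is not, so the
        // WhitespaceTokenizer's split test never fires on it
        System.out.println(Character.isWhitespace('\u00A0')); // false
        // It is still a space character in the Unicode sense (category Zs)
        System.out.println(Character.isSpaceChar('\u00A0'));  // true
    }
}
```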