WhitespaceTokenizer to consider incorrectly encoded c2a0?

Markus Jelsma Wed, 08 Oct 2014 07:00:48 -0700

Hi,

For some crazy reason, some users manage to replace a perfectly normal space 
with a badly encoded non-breaking space. Properly URL encoded this becomes 
%C2%A0, and depending on the encoding you use to view it you will probably see 
Â followed by a space. For example:


The Java Character class does not consider c2a0 whitespace (indeed, it is not 
real whitespace; the non-breaking space itself is 00a0), so the 
WhitespaceTokenizer won't split on it. The WordDelimiterFilter still does, 
which somewhat mitigates the problem, as the analysis becomes:

HTMLSCF een abonnement
WT een abonnement
WDF een eenabonnement abonnement

Should the WhitespaceTokenizer not include this weird edge case? 
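
To make the question concrete, here is a minimal standalone sketch of the 
Character behaviour described above. It does not use Lucene at all: the class 
name NbspCheck and the split helper are hypothetical, and it assumes the bytes 
C2 A0 were decoded as UTF-8, so they reach the tokenizer as U+00A0. The second 
predicate is only one possible way of treating that character as a separator, 
not how WhitespaceTokenizer is actually implemented.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class NbspCheck {

    // Hypothetical helper: splits text into tokens wherever the given
    // predicate marks a code point as a separator.
    static List<String> split(String text, IntPredicate isSeparator) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isSeparator.test(cp)) {
                // Close the current token, if any, at every separator.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Assumed alternative rule: Java whitespace plus the non-breaking space.
    static boolean isWhitespaceOrNbsp(int cp) {
        return Character.isWhitespace(cp) || cp == 0x00A0;
    }

    public static void main(String[] args) {
        // The bytes C2 A0, decoded as UTF-8, give the single char U+00A0.
        String nbsp = new String(new byte[] {(byte) 0xC2, (byte) 0xA0},
                StandardCharsets.UTF_8);
        String text = "een" + nbsp + "abonnement";

        // isWhitespace excludes non-breaking spaces; isSpaceChar does not.
        System.out.println(Character.isWhitespace(0x00A0));
        System.out.println(Character.isSpaceChar(0x00A0));

        // Number of tokens under each rule.
        System.out.println(split(text, Character::isWhitespace).size());
        System.out.println(split(text, NbspCheck::isWhitespaceOrNbsp).size());
    }
}
```

With the plain Character.isWhitespace rule the text stays a single token, 
while the extended rule yields "een" and "abonnement" separately.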

Cheers,
Markus
