Hi,

For some crazy reason, some users manage to replace a perfectly 
normal space with a badly encoded non-breaking space. Properly URL encoded this 
becomes %C2%A0, and depending on the encoding you use to view it you will 
probably see Â followed by a space. For example:

Because this character is not considered whitespace by the Java Character class 
(C2 A0 is the UTF-8 byte sequence for U+00A0, the non-breaking space, which 
Character.isWhitespace explicitly excludes), the WhitespaceTokenizer won't split 
on it. The WordDelimiterFilter still does, which somewhat mitigates the problem, 
as the analysis output becomes:

HTMLSCF een abonnement
WT een abonnement
WDF een eenabonnement abonnement

Should the WhitespaceTokenizer not include this weird edge case? 
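
A minimal sketch of the behaviour in question, using only the JDK and no Lucene classes. It shows that Character.isWhitespace rejects U+00A0 while Character.isSpaceChar accepts it. The class NbspCheck and its helpers isSplitChar and normalize are hypothetical names made up for illustration; they are not part of Lucene or Solr, and this is just one possible way to treat the character as a split point or to replace it before tokenizing.

```java
public class NbspCheck {
    // U+00A0 NO-BREAK SPACE; its UTF-8 encoding is the byte pair C2 A0.
    public static final char NBSP = '\u00A0';

    // Hypothetical predicate: Java whitespace plus the non-breaking space.
    public static boolean isSplitChar(int c) {
        return Character.isWhitespace(c) || c == NBSP;
    }

    // Hypothetical pre-processing step: replace each NBSP with a plain
    // space so that a whitespace-based tokenizer can split on it.
    public static String normalize(String s) {
        return s.replace(NBSP, ' ');
    }

    public static void main(String[] args) {
        System.out.println(Character.isWhitespace(NBSP)); // false
        System.out.println(Character.isSpaceChar(NBSP));  // true
        System.out.println(isSplitChar(NBSP));            // true
        System.out.println(normalize("een" + NBSP + "abonnement"));
    }
}
```

The normalize variant would have to run before the tokenizer sees the text, whereas the isSplitChar variant describes what a modified tokenizer would need to test for.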

Cheers,
Markus
