Re: [Pharo-users] Is a non-break space whitespace?

stephan Sun, 24 Sep 2017 06:06:18 -0700

On 24-09-17 13:53, PBKResearch wrote:

I have a little puzzle to disturb your Sunday lunch, maybe. I have been scraping text data from web pages, which often comes with redundant space before or after. I routinely use ‘trim’ on the final string output, but I have found cases where there are still redundant spaces. Inspecting the results, I find that the characters are non-break spaces (codepoint 160, Unicode U+00A0). Looking at the code, String>>#trim depends on Character>>#isSeparator, which does not answer true for a non-break space. I can use trimBoth: [:char| char asInteger = 160] to remove the redundant spaces if I know where to expect them, so it is not a major problem. But the question remains: should non-break space be included in the list of separators in Character>>#isSeparator?

In Unicode, there are many more 'characters' that could be considered whitespace. You are collecting data from web pages, so you have no influence on what you'll get as input. I don't think this should be solved in #isSeparator.
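
One way to handle it at the call site instead is sketched below: a single trimBoth: block that accepts both the ordinary separators and the non-break space. The sample string and the variable names are made up for illustration, and the block only covers codepoint 160, not the other Unicode space characters.

```smalltalk
"Sketch: trim ordinary separators and non-break spaces (U+00A0) in one pass.
 The sample input below is hypothetical."
| nbsp scraped trimmed |
nbsp := Character value: 160.
scraped := (String with: nbsp with: Character space) , 'price: 42' , (String with: nbsp).
trimmed := scraped trimBoth: [ :char |
	char isSeparator or: [ char asInteger = 160 ] ].
trimmed
```

If more of these characters turn up in scraped pages, the block is the only place that needs to change, and #isSeparator stays as it is.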


Stephan
