> On 25 Sep 2017, at 09:53, Richard Sargent <richard.sarg...@gemtalksystems.com> wrote:
>
> Ben Coman wrote
>> On Sun, Sep 24, 2017 at 7:53 PM, PBKResearch <peter@.co> wrote:
>>
>>> Hello All
>>>
>>> I have a little puzzle to disturb your Sunday lunch, maybe. I have been
>>> scraping text data from web pages, which often comes with redundant space
>>> before or after. I routinely use ‘trim’ on the final string output, but I
>>> have found cases where there are still redundant spaces. Inspecting the
>>> results, I find that the characters are non-break spaces (codepoint 160,
>>> Unicode U+00A0). Looking at the code, String>>#trim depends on
>>> Character>>#isSeparator, which does not answer true for a non-break
>>> space.
>>> I can use trimBoth: [:char| char asInteger = 160] to remove the redundant
>>> spaces if I know where to expect them, so it is not a major problem. But
>>> the question remains: should non-break space be included in the list of
>>> separators in Character>>#isSeparator.
>>>
>>> Peter Kenny
>>>
>>
>> Off the cuff.. intuitively "by definition" a Non-Break space seems to be
>> Not-a-Separator.
>> If the web pages you are scraping are misusing non-break space to munge
>> formatting, that is not something to be solved by modifying semantics of
>> #isSeparator.
>> Stef's suggestion for a selector that takes a list of separators seems
>> appropriate.
>
> Rather than off-the-cuffing anything, please honour the Unicode Character
> Properties. Refer to
> https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace, among
> others.

The Pharo Unicode Project (http://www.smalltalkhub.com/#!/~Pharo/Unicode) offers support for these properties. There is a performance cost, though: it is a big database to load and use. I would not change #isSeparator right away.

>> cheers -ben
>
> --
> Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html
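
For anyone who wants to leave #isSeparator alone in the meantime, here is a minimal sketch that builds on the trimBoth: workaround quoted above. It assumes String>>#trimBoth: accepts a one-argument block, as in Peter's message. The names nbsp and scraped are made up for illustration, and it handles only U+00A0, not the full set of Unicode whitespace characters.

```smalltalk
"Playground sketch; nbsp and scraped are illustrative names, not part of Pharo."
| nbsp scraped |
nbsp := Character value: 160.    "U+00A0 NO-BREAK SPACE"
scraped := (String with: nbsp), ' some text ', (String with: nbsp).

"Trim whatever #isSeparator already accepts, plus the non-break space."
scraped trimBoth: [ :char | char isSeparator or: [ char = nbsp ] ]
```

Keeping the extra test in the caller's block leaves the semantics of #isSeparator unchanged for everyone else, which fits the suggestion of a selector that takes its own notion of separators.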