Hello

One way of dealing with this in a general way is to introduce a new predicate #isWhitespace for Character, maybe following the Wikipedia definition as Richard suggests, and then either (a) recode String>>#trim and friends to use #isWhitespace rather than #isSeparator or (b) introduce a new operation String>>#trimWhitespace which uses #isWhitespace.

@stef. There is already an operation which allows us to specify the characters to be trimmed, which I mentioned in my original post. We write trimBoth: aBlock, where aBlock value: char answers true if char is to be trimmed. I knew I could solve my immediate problem like this; I just wondered if there should be something more general.

Peter Kenny

From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of Sven Van Caekenberghe
Sent: 25 September 2017 09:10
To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
Subject: Re: [Pharo-users] Is a non-break space whitespace?

On 25 Sep 2017, at 09:53, Richard Sargent <richard.sarg...@gemtalksystems.com> wrote:

Ben Coman wrote

On Sun, Sep 24, 2017 at 7:53 PM, PBKResearch <peter@.co> wrote:

Hello All

I have a little puzzle to disturb your Sunday lunch, maybe. I have been scraping text data from web pages, which often comes with redundant space before or after. I routinely use ‘trim’ on the final string output, but I have found cases where there are still redundant spaces. Inspecting the results, I find that the characters are non-break spaces (codepoint 160, Unicode U+00A0). Looking at the code, String>>#trim depends on Character>>#isSeparator, which does not answer true for a non-break space.

I can use trimBoth: [:char| char asInteger = 160] to remove the redundant spaces if I know where to expect them, so it is not a major problem. But the question remains: should non-break space be included in the list of separators in Character>>#isSeparator.

Peter Kenny

Off the cuff.. intuitively "by definition" a Non-Break space seems to be Not-a-Separator. If the web pages you are scraping are misusing non-break space to munge formatting, that is not something to be solved by modifying semantics of #isSeparator. Stef's suggestion for a selector that takes a list of separators seems appropriate.

Rather than off-the-cuffing anything, please honour the Unicode Character Properties. Refer to https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace, among others.

The Pharo Unicode Project (http://www.smalltalkhub.com/#!/~Pharo/Unicode) offers support for these properties. There is a performance cost though (it is a big database to load/use).

I would not immediately change #isSeparator though.

cheers -ben

--
Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html
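
For anyone finding this thread later, the trimBoth: workaround discussed above can be sketched in a Playground roughly as follows. This is only an illustration under stated assumptions: isWhitespace here is a local block invented for the example, not an existing Character method, and it covers only the usual separators plus U+00A0, not the full Unicode White_Space property.

```smalltalk
"Playground sketch. isWhitespace is a hypothetical local block, not part of Pharo."
| nbsp isWhitespace raw |

"U+00A0 NO-BREAK SPACE, the character that String>>#trim leaves in place."
nbsp := Character value: 160.

"Answer true for the ordinary separators and, in addition, for the non-break space."
isWhitespace := [ :char | char isSeparator or: [ char = nbsp ] ].

"A sample string padded with both ordinary spaces and non-break spaces."
raw := (String with: nbsp) , '  scraped text ' , (String with: nbsp).

"trimBoth: removes leading and trailing characters for which the block answers true."
raw trimBoth: isWhitespace.
```

Keeping the test in a block leaves Character>>#isSeparator untouched, which matches the caution expressed in the thread; the same block could later become the body of a Character>>#isWhitespace extension if that idea is adopted.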