Ben Coman wrote
> On Sun, Sep 24, 2017 at 7:53 PM, PBKResearch <peter@.co> wrote:
>
>> Hello All
>>
>> I have a little puzzle to disturb your Sunday lunch, maybe. I have been
>> scraping text data from web pages, which often comes with redundant space
>> before or after. I routinely use ‘trim’ on the final string output, but I
>> have found cases where there are still redundant spaces. Inspecting the
>> results, I find that the characters are non-break spaces (codepoint 160,
>> Unicode U+00A0). Looking at the code, String>>#trim depends on
>> Character>>#isSeparator, which does not answer true for a non-break
>> space.
>> I can use trimBoth: [:char| char asInteger = 160] to remove the redundant
>> spaces if I know where to expect them, so it is not a major problem. But
>> the question remains: should non-break space be included in the list of
>> separators in Character>>#isSeparator?
>>
>> Peter Kenny
>
> Off the cuff... intuitively "by definition" a Non-Break space seems to be
> Not-a-Separator.
> If the web pages you are scraping are misusing non-break space to munge
> formatting, that is not something to be solved by modifying semantics of
> #isSeparator.
> Stef's suggestion for a selector that takes a list of separators seems
> appropriate.

Rather than off-the-cuffing anything, please honour the Unicode Character Properties. Refer to https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace, among others.

> cheers -ben

--
Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html
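
For anyone wanting to try the workaround quoted above without touching #isSeparator, here is a minimal sketch. It assumes a Pharo image where String>>#trimBoth: takes a one-argument block, as in Peter's snippet; the variable names `nbsp` and `raw` and the sample text are made up for illustration.

```smalltalk
| nbsp raw |
"U+00A0 NO-BREAK SPACE, built from its code point"
nbsp := Character value: 160.

"Hypothetical scraped text, padded with a mix of ordinary and non-break spaces"
raw := (String with: nbsp with: Character space) , 'some text' , (String with: nbsp).

"Plain trimming relies on Character>>#isSeparator which, per the thread, does not answer true for U+00A0"
raw trim.

"Trim ordinary separators and non-break spaces from both ends"
raw trimBoth: [ :char | char isSeparator or: [ char asInteger = 160 ] ].
```

The block leaves the semantics of #isSeparator alone and only widens the trimming condition at the call site, which is roughly what a selector taking an explicit list of separators would do.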