On Mon, Sep 25, 2017 at 3:53 PM, Richard Sargent <richard.sarg...@gemtalksystems.com> wrote:

> Ben Coman wrote
> > On Sun, Sep 24, 2017 at 7:53 PM, PBKResearch <peter@.co> wrote:
> >
> >> Hello All
> >>
> >> I have a little puzzle to disturb your Sunday lunch, maybe. I have been
> >> scraping text data from web pages, which often comes with redundant space
> >> before or after. I routinely use ‘trim’ on the final string output, but I
> >> have found cases where there are still redundant spaces. Inspecting the
> >> results, I find that the characters are non-break spaces (codepoint 160,
> >> Unicode U+00A0). Looking at the code, String>>#trim depends on
> >> Character>>#isSeparator, which does not answer true for a non-break
> >> space.
> >> I can use trimBoth: [:char| char asInteger = 160] to remove the redundant
> >> spaces if I know where to expect them, so it is not a major problem. But
> >> the question remains: should non-break space be included in the list of
> >> separators in Character>>#isSeparator.
> >>
> >> Peter Kenny
> >
> > Off the cuff.. intuitively "by definition" a Non-Break space seems to be
> > Not-a-Separator.
> > If the web pages you are scraping are misusing non-break space to munge
> > formatting, that is not something to be solved by modifying semantics of
> > #isSeparator.
> > Stef's suggestion for a selector that takes a list of separators seems
> > appropriate.
>
> Rather than off-the-cuffing anything, please honour the Unicode Character
> Properties. Refer to
> https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace, among
> others.

I hope it was clear I wasn't speaking from a position of authoritative
knowledge.. Nice to learn something new. Thanks for the correction.

cheers -ben