On Mon, Sep 25, 2017 at 3:53 PM, Richard Sargent <
richard.sarg...@gemtalksystems.com> wrote:

> Ben Coman wrote
> > On Sun, Sep 24, 2017 at 7:53 PM, PBKResearch <peter@.co> wrote:
> >
> >> Hello All
> >>
> >>
> >>
> >> I have a little puzzle to disturb your Sunday lunch, maybe. I have been
> >> scraping text data from web pages, which often comes with redundant space
> >> before or after. I routinely use ‘trim’ on the final string output, but I
> >> have found cases where there are still redundant spaces. Inspecting the
> >> results, I find that the characters are non-break spaces (codepoint 160,
> >> Unicode U+00A0). Looking at the code, String>>#trim depends on
> >> Character>>#isSeparator, which does not answer true for a non-break
> >> space.
> >> I can use trimBoth: [:char| char asInteger = 160] to remove the redundant
> >> spaces if I know where to expect them, so it is not a major problem. But
> >> the question remains: should non-break space be included in the list of
> >> separators in Character>>#isSeparator?
> >>
> >>
> >>
> >> Peter Kenny
> >>
> >>
> >>
> >
> > Off the cuff.. intuitively "by definition" a Non-Break space seems to be
> > Not-a-Separator.
> > If the web pages you are scraping are misusing non-break space to munge
> > formatting, that is not something to be solved by modifying semantics of
> > #isSeparator.
> > Stef's suggestion for a selector that takes a list of separators seems
> > appropriate.
>
>
> Rather than off-the-cuffing anything, please honour the Unicode Character
> Properties. Refer to
> https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace, among
> others.
>

I hope it was clear I wasn't speaking from a position of authoritative
knowledge..
Nice to learn something new.  Thanks for the correction.
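
For anyone hitting the same problem, here is a minimal sketch of Peter's
trimBoth: workaround, widened so it also strips ordinary separators. It
assumes String>>#trimBoth: and Character>>#isSeparator behave as described
in the quoted message; the variable names and the sample string are made up
for illustration. U+00A0 does carry the Unicode White_Space property, which
is the point Richard makes above.

```smalltalk
| nbsp raw trimmed |
"U+00A0 NO-BREAK SPACE, the character #isSeparator does not recognise"
nbsp := Character value: 160.

"Hypothetical scraped string padded with both kinds of space"
raw := (String with: nbsp), ' scraped text ', (String with: nbsp).

"Trim from both ends while the character is a separator or a no-break space"
trimmed := raw trimBoth: [ :char | char isSeparator or: [ char = nbsp ] ].
trimmed
```

Inspecting trimmed should show the text with neither the plain spaces nor the
no-break spaces at either end. This leaves #isSeparator itself untouched, so
it works whichever way the question about its semantics is settled.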
cheers -ben
