On Sun, Sep 24, 2017 at 7:53 PM, PBKResearch <pe...@pbkresearch.co.uk>
wrote:

> Hello All
>
> I have a little puzzle to disturb your Sunday lunch, maybe. I have been
> scraping text data from web pages, which often comes with redundant space
> before or after. I routinely use ‘trim’ on the final string output, but I
> have found cases where there are still redundant spaces. Inspecting the
> results, I find that the characters are non-break spaces (codepoint 160,
> Unicode U+00A0). Looking at the code, String>>#trim depends on
> Character>>#isSeparator, which does not answer true for a non-break space.
> I can use trimBoth: [:char| char asInteger = 160] to remove the redundant
> spaces if I know where to expect them, so it is not a major problem. But
> the question remains: should non-break space be included in the list of
> separators in Character>>#isSeparator?
>
> Peter Kenny

Off the cuff: intuitively, a non-break space is "by definition" not a
separator.
If the web pages you are scraping misuse non-break spaces to munge
formatting, that is not something to solve by changing the semantics of
#isSeparator.
Stef's suggestion for a selector that takes a list of separators seems
appropriate.
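
A minimal sketch of that idea, built only on the trimBoth: call Peter
already uses. The variable names (nbsp, extraSeparators, raw) are made up
for illustration, and this is not an existing Pharo selector, just one way
to supply the extra separators explicitly at the call site:

```smalltalk
"Sketch only: trim the usual separators plus a caller-supplied list."
| nbsp extraSeparators raw |
nbsp := Character value: 160.	"non-break space, U+00A0"
extraSeparators := { nbsp }.	"hypothetical list; add others as needed"
raw := (String with: nbsp) , ' hello ' , (String with: nbsp).

"Trim a character if it is an ordinary separator or appears in the list."
raw trimBoth: [ :char |
	char isSeparator or: [ extraSeparators includes: char ] ]
```

This leaves #isSeparator untouched, so only code that asks for the wider
definition is affected.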

cheers -ben
