Ben Coman wrote
> On Sun, Sep 24, 2017 at 7:53 PM, PBKResearch <peter@.co> wrote:
> 
>> Hello All
>>
>>
>>
>> I have a little puzzle to disturb your Sunday lunch, maybe. I have been
>> scraping text data from web pages, which often comes with redundant space
>> before or after. I routinely use ‘trim’ on the final string output, but I
>> have found cases where there are still redundant spaces. Inspecting the
>> results, I find that the characters are non-break spaces (codepoint 160,
>> Unicode U+00A0). Looking at the code, String>>#trim depends on
>> Character>>#isSeparator, which does not answer true for a non-break
>> space.
>> I can use trimBoth: [:char| char asInteger = 160] to remove the redundant
>> spaces if I know where to expect them, so it is not a major problem. But
>> the question remains: should non-break space be included in the list of
>> separators in Character>>#isSeparator?
>>
>>
>>
>> Peter Kenny
>>
>>
>>
> 
> Off the cuff.. intuitively "by definition" a Non-Break space seems to be
> Not-a-Separator.
> If the web pages you are scraping are misusing non-break space to munge
> formatting, that is not something to be solved by modifying semantics of
> #isSeparator.
> Stef's suggestion for a selector that takes a list of separators seems
> appropriate.


Rather than off-the-cuffing anything, please honour the Unicode Character
Properties. Refer to
https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace, among
others.
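
For what it is worth, here is a minimal sketch of what trimming by the
Unicode White_Space property could look like, using the trimBoth: selector
Peter already mentioned. The block name isUnicodeWhitespace and the variable
padded are hypothetical and not part of Pharo; the code points are hard-coded
from the White_Space list rather than read from any Unicode table in the
image, so treat it as an illustration, not a proposed implementation.

```smalltalk
"Sketch only: isUnicodeWhitespace is a hypothetical helper block, not Pharo API.
 It answers true for code points that carry the Unicode White_Space property."
| isUnicodeWhitespace padded |
isUnicodeWhitespace := [ :char |
	| cp |
	cp := char asInteger.
	"U+0009..U+000D (tab, line feed, vertical tab, form feed, carriage return)"
	(cp between: 9 and: 13) or: [
		"U+2000..U+200A (en quad through hair space)"
		(cp between: 8192 and: 8202) or: [
			"U+0020, U+0085, U+00A0, U+1680, U+2028, U+2029, U+202F, U+205F, U+3000"
			#(32 133 160 5760 8232 8233 8239 8287 12288) includes: cp ] ] ].

"Build a test string wrapped in non-break spaces (code point 160)."
padded := (String with: (Character value: 160)) , 'text'
	, (String with: (Character value: 160)).

"Trim on both sides with the block instead of Character>>#isSeparator."
padded trimBoth: isUnicodeWhitespace
```

This leaves Character>>#isSeparator untouched and keeps the choice of what
counts as whitespace with the caller, which is close to the selector that
takes a list of separators.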


> cheers -ben





--
Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html
