> On 25 Sep 2017, at 09:53, Richard Sargent 
> <richard.sarg...@gemtalksystems.com> wrote:
> 
> Ben Coman wrote
>> On Sun, Sep 24, 2017 at 7:53 PM, PBKResearch <peter@.co> wrote:
>> 
>>> Hello All
>>>
>>> I have a little puzzle to disturb your Sunday lunch, maybe. I have been
>>> scraping text data from web pages, which often comes with redundant space
>>> before or after. I routinely use ‘trim’ on the final string output, but I
>>> have found cases where there are still redundant spaces. Inspecting the
>>> results, I find that the characters are non-break spaces (codepoint 160,
>>> Unicode U+00A0). Looking at the code, String>>#trim depends on
>>> Character>>#isSeparator, which does not answer true for a non-break space.
>>> I can use trimBoth: [:char | char asInteger = 160] to remove the redundant
>>> spaces if I know where to expect them, so it is not a major problem. But
>>> the question remains: should non-break space be included in the list of
>>> separators in Character>>#isSeparator?
>>>
>>> Peter Kenny
>>
>> Off the cuff: intuitively, "by definition", a non-break space seems to be
>> not a separator.
>> If the web pages you are scraping are misusing non-break space to munge
>> formatting, that is not something to be solved by modifying semantics of
>> #isSeparator.
>> Stef's suggestion for a selector that takes a list of separators seems
>> appropriate.
> 
> 
> Rather than off-the-cuffing anything, please honour the Unicode Character
> Properties. Refer to
> https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace, among
> others.

The Pharo Unicode Project (http://www.smalltalkhub.com/#!/~Pharo/Unicode) 
offers support for these properties.

There is a performance cost, though: it is a big database to load and use.

I would not immediately change #isSeparator though.
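
If the goal is only to strip non-break spaces while scraping, a local predicate avoids touching #isSeparator at all. A minimal sketch, assuming String>>#trimBoth: as used earlier in this thread; the variable names and the sample string are made up for illustration:

```smalltalk
| nbsp raw trimmed |
"U+00A0 NO-BREAK SPACE; Character>>#isSeparator does not answer true for it"
nbsp := Character value: 160.

"Hypothetical scraped string with mixed ordinary and non-break spaces at both ends"
raw := (String with: nbsp with: Character space), 'scraped text',
	(String with: Character space with: nbsp).

"Trim anything that is a regular separator OR a non-break space"
trimmed := raw trimBoth: [ :char | char isSeparator or: [ char = nbsp ] ].
trimmed
```

Because the block tests for ordinary separators and U+00A0 together, mixed runs at either end are removed in one pass, which a block testing for codepoint 160 alone would not do.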

>> cheers -ben
>
> --
> Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html
