Hello

One general way of dealing with this is to introduce a new predicate 
#isWhitespace for Character, perhaps following the Wikipedia definition as 
Richard suggests, and then either (a) recode String>>#trim and friends to use 
#isWhitespace rather than #isSeparator, or (b) introduce a new operation 
String>>#trimWhitespace which uses #isWhitespace.
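
A rough sketch of what option (b) might look like, written as Playground code rather than as real methods. The names isWhitespace and trimWhitespace are only the hypothetical selectors from this discussion, modelled here as blocks; the code point list is my reading of the Unicode White_Space property and should be checked against the Unicode data before anyone relies on it.

```smalltalk
"Sketch only: isWhitespace and trimWhitespace are hypothetical,
 modelled as blocks instead of methods on Character and String."
| isWhitespace trimWhitespace sample |

"Answer true for code points assumed to carry the Unicode White_Space property."
isWhitespace := [ :char | | cp |
    cp := char asInteger.
    (cp between: 9 and: 13)
        or: [ (cp between: 8192 and: 8202)
            or: [ #(32 133 160 5760 8232 8233 8239 8287 12288) includes: cp ] ] ].

"Option (b): trim with the broader predicate instead of #isSeparator."
trimWhitespace := [ :aString | aString trimBoth: isWhitespace ].

"A string padded with non-break spaces (code point 160) and ordinary spaces."
sample := (String with: (Character value: 160)) , ' hello ' , (String with: (Character value: 160)).
trimWhitespace value: sample
```

A hard-coded list like this avoids loading the full Unicode database, at the price of having to be kept in step with the standard by hand.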

@stef. There is already an operation which allows us to specify the characters 
to be trimmed, which I mentioned in my original post. We write trimBoth: 
aBlock, where aBlock value: char answers true if char is to be trimmed. I knew 
I could solve my immediate problem like this; I just wondered if there should 
be something more general.

Peter Kenny

From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of 
Sven Van Caekenberghe
Sent: 25 September 2017 09:10
To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
Subject: Re: [Pharo-users] Is a non-break space whitespace?

On 25 Sep 2017, at 09:53, Richard Sargent <richard.sarg...@gemtalksystems.com> wrote:

Ben Coman wrote



On Sun, Sep 24, 2017 at 7:53 PM, PBKResearch <peter@.co> wrote:




Hello All



I have a little puzzle to disturb your Sunday lunch, maybe. I have been
scraping text data from web pages, which often comes with redundant space
before or after. I routinely use ‘trim’ on the final string output, but I
have found cases where there are still redundant spaces. Inspecting the
results, I find that the characters are non-break spaces (codepoint 160,
Unicode U+00A0). Looking at the code, String>>#trim depends on
Character>>#isSeparator, which does not answer true for a non-break
space.
I can use trimBoth: [:char| char asInteger = 160] to remove the redundant
spaces if I know where to expect them, so it is not a major problem. But
the question remains: should non-break space be included in the list of
separators in Character>>#isSeparator?



Peter Kenny





Off the cuff.. intuitively "by definition" a Non-Break space seems to be
Not-a-Separator.
If the web pages you are scraping are misusing non-break space to munge
formatting, that is not something to be solved by modifying semantics of
#isSeparator.
Stef's suggestion for a selector that takes a list of separators seems
appropriate.



Rather than off-the-cuffing anything, please honour the Unicode Character
Properties. Refer to
https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace, among
others.

The Pharo Unicode Project (http://www.smalltalkhub.com/#!/~Pharo/Unicode) 
offers support for these properties.

There is a performance cost though (it is a big database to load/use).

I would not immediately change #isSeparator though.





cheers -ben






--
Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html