I fully agree with Richard's explanation. Moreover, if the empty string is
not a prefix and a suffix of every string, string concatenation no longer
behaves like the free monoid that describes it and that forms the basis of
regular expressions. In practice, this leads to subtle inconsistencies.
Hence, I consider this a bug rather than a feature.


Kind regards,
Steffen





Richard O'Keefe wrote on Saturday, 23 April 2022 02:37:48 (+02:00):


Dan Ingalls is of course a big NAME in the
world of Smalltalk, but the stated reason
for changing the behaviour of #beginsWith:
and #endsWith: makes no sense.




We have many ways to define a partial order
on strings.
x <= y iff y beginsWith: x
x <= y iff y endsWith: x
x <= y iff y includesSubCollection: x
x <= y iff y includesSubSequence: x
These things are supposed to obey laws:
if a beginsWith: b , c
then a beginsWith: b
if a endsWith: b , c
then a endsWith: c
if a includesSubCollection: b , c
then a includesSubCollection: b
and a includesSubCollection: c
if a includesSubSequence: b , c
then a includesSubSequence: b
and a includesSubSequence: c.


We also expect the usual rules of equality
to hold.  So
(1) a beginsWith: a
(2) a = '' , a
(3) THEREFORE a beginsWith: ''


(1) a endsWith: a
(2) a = a , ''
(3) THEREFORE a endsWith: ''


(1) a includesSubCollection: a
(2) a = '' , a
(3) THEREFORE a includesSubCollection: ''


Reasoning about strings (as values) gets
enormously more complicated if the operations
do not follow simple sensible rules, and
having '' be the least string under these
orderings and having '' be a prefix and a
suffix of any string is essential if the
rules are going to be simple and coherent.
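
A minimal sketch of a prefix test that keeps these rules simple, assuming
a hypothetical selector #hasPrefix: defined on sequenceable collections
(the name is illustrative and not taken from any dialect discussed here):

    hasPrefix: aSequence
      "Hypothetical selector.  Answer true iff the receiver equals
       aSequence , c for some c.  When aSequence is empty the loop
       body never runs, so the empty sequence is a prefix of
       every receiver, including the empty one."
      aSequence size > self size ifTrue: [^false].
      1 to: aSequence size do: [:i |
        (self at: i) = (aSequence at: i) ifFalse: [^false]].
      ^true

No special case for '' is needed; the empty prefix falls out of the
definition.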


'' is to strings (and more generally
empty sequences are to sequences) pretty
much what 0 is to integers.   Denying that
'abc' beginsWith: '' is *structurally*
just like denying that 0 <= 123.


Now as it happens I *can* see a use for
versions of #beginsWith: and #endsWith:
that diverge from the ones we have, but
*this* is not where they need to diverge.
a beginsWithGraphemesOf: b
iff a asGraphemes = b asGraphemes , c asGraphemes
for some c, where s asGraphemes returns a
sequence of strings each of which is a maximally
long grapheme cluster, such that concatenating
s asGraphemes recovers s.  That is,
#beginsWithGraphemesOf: and
#endsWithGraphemesOf: would respect the
Unicode Text Segmentation boundaries.
But s beginsWithGraphemesOf: ''
would still need to be true.


The thing is, in practice you often DON'T
KNOW whether a potential affix is empty or
not.  Here are some of my test cases.


    testTestData
      "Ensure that the sample string has no duplicates."
      [(Set withAll: string) size = stringSize] assert.

    testBeginsWith
      "Test that every prefix of the sample IS a prefix of it."
      0 to: stringSize do: [:n |
        [string beginsWith: (string copyFrom: 1 to: n)] assert].

    testEndsWith
      "Test that every suffix of the sample IS a suffix of it."
      0 to: stringSize do: [:n |
        [string endsWith:
          (string copyFrom: stringSize - n + 1 to: stringSize)] assert].

    testIndexOfSubCollectionAtBeginning
      "Test that every prefix of the sample is found at the beginning."
      0 to: stringSize do: [:n | |s i t|
        s := string copyFrom: 1 to: n.
        i := string indexOfSubCollection: s startingAt: 1.
        [1 = i] assert.
        t := string copyFrom: i to: i - 1 + n.
        [t = s] assert].

    testIndexOfSubCollectionAtEnd
      "Test that every proper suffix of the sample is found at the end."
      1 to: stringSize do: [:n | |s i t|
        s := string copyFrom: stringSize - n + 1 to: stringSize.
        i := string indexOfSubCollection: s startingAt: 1.
        [stringSize + 1 - n = i] assert.
        t := string copyFrom: i to: i - 1 + n.
        [t = s] assert].

    testLastIndexOfSubCollectionAtBeginning
      "Test that every proper prefix of the sample is found at the
       beginning."
      1 to: stringSize do: [:n | |s i t|
        s := string copyFrom: 1 to: n.
        i := string lastIndexOfSubCollection: s startingAt: stringSize.
        [1 = i] assert.
        t := string copyFrom: i to: i - 1 + n.
        [t = s] assert].

    testLastIndexOfSubCollectionAtEnd
      "Test that every suffix of the sample is found at the end."
      0 to: stringSize do: [:n | |s i t|
        s := string copyFrom: stringSize - n + 1 to: stringSize.
        i := string lastIndexOfSubCollection: s startingAt: stringSize.
        [stringSize + 1 - n = i] assert.
        t := string copyFrom: i to: i - 1 + n.
        [t = s] assert].

    testOccurrencesOfEmptyCollection
      "Test that the empty string occurs at the beginning,
       at the end, and in between every pair of adjacent characters."
      [(string occurrencesOfSubCollection: '') = (stringSize + 1)] assert.

    testOccurrencesOfUniqueParts
      "Test that unique parts occur as many times as they should."
      |repeated|
      repeated := string , string , string.
      1 to: stringSize do: [:start |
        start to: stringSize do: [:finish | |s n|
          s := string copyFrom: start to: finish.
          n := string occurrencesOfSubCollection: s.
          [n = 1] assert.
          n := repeated occurrencesOfSubCollection: s.
          [n = 3] assert]].
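
To illustrate the point about not knowing whether an affix is empty,
here is a small workspace sketch.  The variable names and sample data
are made up for illustration; prefix stands for a value that comes from
configuration and may be empty.

    | prefix names rejected |
    prefix := ''.
    names := #('alpha' 'beta').
    "Collect the names that do NOT start with the configured prefix."
    rejected := names reject: [:each | each beginsWith: prefix].

Where every sequence begins with the empty sequence, rejected is empty.
Under the Squeak and Pharo behaviour described in this thread, both
names are rejected, so the caller has to special-case an empty prefix.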



On Fri, 22 Apr 2022 at 13:57, David T. Lewis <le...@mail.msen.com> wrote:

Hi Richard,

(CC squeak-dev list, replies to the relevant list please)

On Thu, Apr 21, 2022 at 12:07:32AM +1200, Richard O'Keefe wrote:
> I've just tracked down a nasty little problem
> porting some code to Pharo.  As a result, I
> have added to the comments in my own versions
> of these methods.
>
>     beginsWith: aSequence
>       "Answer true if aSequence is a prefix of the receiver.
>        This makes sense for all sequences.
>        There is a compatibility issue concerning 'abc' beginsWith: ''.
>        + VisualWorks, Dolphin, astc, GNU ST (where the method is
>          called #startsWith:) and VisualAge (where the method
>          is called #wbBeginsWith:)
>          agree that EVERY sequence begins with an empty prefix.
>        - Squeak and Pharo
>          agree that NO sequence begins with an empty sequence.
>        # ST/X chooses compatibility with Squeak, heaving a big unhappy
>          sigh, and adds #startsWith: to have something sensible to use.
>        Now ST/X *thinks* it is compatible with VW, though it isn't, so
>        I wonder if this was a bug that VW fixed and Squeak didn't?
>        astc goes with the majority here.  This is also compatible with
>        Haskell, ML, and with StartsWith in C# and startsWith in Java."
>       ^self beginsWith: aSequence ignoringCase: false
>
>     endsWith: aSequence
>       "Answer true if aSequence is a suffix of the receiver.
>        This makes sense for all sequences.
>        There is a compatibility issue concerning 'abc' endsWith: ''.
>        + VisualWorks, Dolphin, astc, GNU ST, and VisualAge (where
>          the method is called #wbEndsWith:)
>          agree that EVERY sequence ends with an empty suffix.
>        - Squeak and Pharo
>          agree that NO sequence ends with an empty suffix.
>        # ST/X chooses compatibility with the majority, apparently
>          unaware that this makes #beginsWith: and #endsWith:
>          inconsistent.
>        astc goes with the majority here.  This is also compatible with
>        Haskell, ML, C#, and Java."
>       ^self endsWith: aSequence ignoringCase: false
>
> Does anyone have any idea
>  - why Squeak and Pharo are the odd ones out?
>  - why anyone thought making #beginsWith: and #endsWith:, um, "quirky"
>    was a good idea (it's pretty standard in books on the theory of
>    strings to define "x is a prefix of y iff there is a z such that
>    y = x concatenated with z")

The Squeak behavior was introduced in November 1998 by Dan Ingalls
in conjunction with some VM improvements that he was doing, and was
included in the Squeak2.2 release at that time.

Prior to that update, Squeak did this:
  'abc' beginsWith: '' ==> true
  'abc' endsWith: '' ==> true

For Squeak2.2 and later it is:
  'abc' beginsWith: '' ==> false
  'abc' endsWith: '' ==> false

Pharo presumably inherits this from Squeak.

I am attaching Dan's original change set from the early Squeak
update stream so you can see the context.

The change set comment says that "endsWith:, beginsWith:, and match
have been rewritten to take advantage of this considerably faster method"
(referring to a new primitive that Dan had added to the VM in this
change set).

>
> I was about to try to file a bug report for the first time,
> then realised that maybe other people don't think this IS a bug.

I don't think it is a bug. I can't think of a case where it makes
sense to say that a string of characters "begins with" or "ends with"
a string that contains nothing.

In any case, that's the history of the change in Squeak and Pharo as
best as I can reconstruct it.

HTH,
Dave


