Re: [swift-evolution] [Review] SE-0180: String Index Overhaul

2017-06-14 Thread Dave Abrahams via swift-evolution

on Wed Jun 14 2017, Xiaodi Wu  wrote:

> On Wed, Jun 14, 2017 at 12:01 PM, Dave Abrahams  wrote:
>
>>
>> on Wed Jun 14 2017, Xiaodi Wu  wrote:
>>
>> > On Wed, Jun 14, 2017 at 09:26 Xiaodi Wu  wrote:
>> >
>> >> If we leave aside for a moment the nomenclature issue where everything
>> in
>> >> Foundation referring to a character is really referring to a Unicode
>> >> scalar, Kevin’s example illustrates the whole problem in a nutshell,
>> >> doesn’t it? In that example, we have a straightforward attempt to slice
>> >> with a misaligned index. The totality of options here are:
>> >>
>> >> * return nil, an option the rejection of which is the premise of your
>> >> proposal
>> >> * return a partial character (i.e., \u{301}), an option which we haven’t
>> >> yet talked about in this thread–seems like this could have simpler
>> >> semantics, potentially yields garbage if the index is garbage but in the
>> >> case of Kevin’s example actually behaves as the user might expect
>>
>> I think that's exactly what I was proposing in
>> https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20170612/037466.html
>>
>> >> * return a whole character after “rounding down”–difficult semantics
>> >> to define and explain, always results in a whole character but in the
>> >> case of Kevin’s example gives an unexpected answer
>> >> * returns a whole character after “rounding up”–difficult semantics to
>> >> define and explain, always results in a whole character but when the
>> >> index is misaligned would result in a character or range of characters
>> >> in which the index is not found
>> >> * trap–simple semantics, never returns garbage, obvious disadvantage
>> >> that execution will not proceed
>> >>
>> >> No clearly perfect answer here. However, _if_ we hew strictly to the
>> >> stated premise of your proposal that failable APIs are awkward enough to
>> >> justify a change, and moreover that the awkwardness is truly “needless”
>> >> because of the rarity of misaligned index usage, then at face value
>> >> trapping should be a perfectly acceptable solution.
>> >>
>> >> That Kevin’s example raises the specter of trapping being a realistic
>> >> occurrence in currently working code actually suggests a challenge to
>> your
>> >> stated premise. If we accept that this challenge is a substantial one,
>> then
>> >> it’s not clear to me that abandoning failable APIs should be ruled out
>> from
>> >> the outset.
>> >>
>> >> However, if this desire to remove failable APIs remains strong then I
>> >> wonder if the undiscussed second option above is worth at least some
>> >> consideration.
>> >>
>> >
>> > Having digested your revised proposed behavior a little better I see
>> you’re
>> > kind of getting at this exact issue, but I’m uncomfortable with how it’s
>> so
>> > tied to the underlying encoding, which is not guaranteed to be UTF-16 but
>> > is assumed to be for the purposes of slicing.
>>
>> I think there's some confusion here; probably I have failed to explain
>> myself.  Today a String happens to always be UTF-16, but there's no
>> intention to assume that it is UTF-16 for the purposes of slicing in the
>> future.  Any place you see something like s.utf16 in an example I've
>> used to illustrate semantics should be interpreted as s.codeUnits,
>> where codeUnits is a collection of code units for whatever the
>> underlying encoding is.
>>
>> Tying this to underlying encoding actually reflects the true nature of
>> String, which is exposed by the semantics of concatenation and range
>> replacement (where multiple elements may merge into one element).  As
>> stated in
>> https://github.com/apple/swift/blob/master/docs/StringManifesto.md#string-should-be-a-collection-of-characters-again
>> the elements of a String (or any of its views other than native code
>> units) are an emergent property.  To anyone operating at Unicode scalar
>> granularity (which can result in misalignment with respect to
>> characters) or at the finer granularity of code units (native or
>> transcoded, which can result in misalignment with all other views), I
>> think this is actually unsurprising.
>>
>
> That's fair. If this is critical to the semantics, though, and you expect
> that some people will operate at that granularity, it seems incongruous
> that s.codeUnits isn't actually exposed to the user even if it'd be as a
> type-erased AnyCollection.

I agree.  Exposing .codeUnits is part of the longer-term plan, but I'm
trying to keep mostly-orthogonal issues out of this proposal.

>> > I’d like to propose an alternative that attempts to deliver on what
>> > I’ve called the second option above–somewhat similar:
>> >
>> > A string index will notionally or actually keep track of the view
>> > in which it was originally aligned, be it utf8, utf16,
>> > unicodeScalars, or characters. A slicing operation str.xxx[idx]
>> > will behave as expected if idx is not misaligned with respect to
>> > str.xxx. If it is 

Re: [swift-evolution] [Review] SE-0180: String Index Overhaul

2017-06-14 Thread Xiaodi Wu via swift-evolution
On Wed, Jun 14, 2017 at 12:01 PM, Dave Abrahams  wrote:

>
> on Wed Jun 14 2017, Xiaodi Wu  wrote:
>
> > On Wed, Jun 14, 2017 at 09:26 Xiaodi Wu  wrote:
> >
> >> If we leave aside for a moment the nomenclature issue where everything
> in
> >> Foundation referring to a character is really referring to a Unicode
> >> scalar, Kevin’s example illustrates the whole problem in a nutshell,
> >> doesn’t it? In that example, we have a straightforward attempt to slice
> >> with a misaligned index. The totality of options here are:
> >>
> >> * return nil, an option the rejection of which is the premise of your
> >> proposal
> >> * return a partial character (i.e., \u{301}), an option which we haven’t
> >> yet talked about in this thread–seems like this could have simpler
> >> semantics, potentially yields garbage if the index is garbage but in the
> >> case of Kevin’s example actually behaves as the user might expect
>
> I think that's exactly what I was proposing in
> https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20170612/037466.html
>
> >> * return a whole character after “rounding down”–difficult semantics
> >> to define and explain, always results in a whole character but in the
> >> case of Kevin’s example gives an unexpected answer
> >> * returns a whole character after “rounding up”–difficult semantics to
> >> define and explain, always results in a whole character but when the
> >> index is misaligned would result in a character or range of characters
> >> in which the index is not found
> >> * trap–simple semantics, never returns garbage, obvious disadvantage
> >> that execution will not proceed
> >>
> >> No clearly perfect answer here. However, _if_ we hew strictly to the
> >> stated premise of your proposal that failable APIs are awkward enough to
> >> justify a change, and moreover that the awkwardness is truly “needless”
> >> because of the rarity of misaligned index usage, then at face value
> >> trapping should be a perfectly acceptable solution.
> >>
> >> That Kevin’s example raises the specter of trapping being a realistic
> >> occurrence in currently working code actually suggests a challenge to
> your
> >> stated premise. If we accept that this challenge is a substantial one,
> then
> >> it’s not clear to me that abandoning failable APIs should be ruled out
> from
> >> the outset.
> >>
> >> However, if this desire to remove failable APIs remains strong then I
> >> wonder if the undiscussed second option above is worth at least some
> >> consideration.
> >>
> >
> > Having digested your revised proposed behavior a little better I see
> you’re
> > kind of getting at this exact issue, but I’m uncomfortable with how it’s
> so
> > tied to the underlying encoding, which is not guaranteed to be UTF-16 but
> > is assumed to be for the purposes of slicing.
>
> I think there's some confusion here; probably I have failed to explain
> myself.  Today a String happens to always be UTF-16, but there's no
> intention to assume that it is UTF-16 for the purposes of slicing in the
> future.  Any place you see something like s.utf16 in an example I've
> used to illustrate semantics should be interpreted as s.codeUnits,
> where codeUnits is a collection of code units for whatever the
> underlying encoding is.
>
> Tying this to underlying encoding actually reflects the true nature of
> String, which is exposed by the semantics of concatenation and range
> replacement (where multiple elements may merge into one element).  As
> stated in
> https://github.com/apple/swift/blob/master/docs/StringManifesto.md#string-should-be-a-collection-of-characters-again
> the elements of a String (or any of its views other than native code
> units) are an emergent property.  To anyone operating at Unicode scalar
> granularity (which can result in misalignment with respect to
> characters) or at the finer granularity of code units (native or
> transcoded, which can result in misalignment with all other views), I
> think this is actually unsurprising.
>

That's fair. If this is critical to the semantics, though, and you expect
that some people will operate at that granularity, it seems incongruous
that s.codeUnits isn't actually exposed to the user even if it'd be as a
type-erased AnyCollection.

> I’d like to propose an alternative that attempts to deliver on what
> > I’ve called the second option above–somewhat similar:
> >
> > A string index will notionally or actually keep track of the view in
> which
> > it was originally aligned, be it utf8, utf16, unicodeScalars, or
> > characters. A slicing operation str.xxx[idx] will behave as expected if
> idx
> > is not misaligned with respect to str.xxx. If it is misaligned, the
> > operation would instead be notionally String(str.yyy[idx...]).xxx.
> first!,
> > where yyy is the original view in which idx was known aligned–if idx is
> not
> > also misaligned with respect to str.yyy (as might be the case if idx was
> > returned from 

Re: [swift-evolution] [Review] SE-0180: String Index Overhaul

2017-06-14 Thread Dave Abrahams via swift-evolution

on Wed Jun 14 2017, Xiaodi Wu  wrote:

> On Wed, Jun 14, 2017 at 09:26 Xiaodi Wu  wrote:
>
>> If we leave aside for a moment the nomenclature issue where everything in
>> Foundation referring to a character is really referring to a Unicode
>> scalar, Kevin’s example illustrates the whole problem in a nutshell,
>> doesn’t it? In that example, we have a straightforward attempt to slice
>> with a misaligned index. The totality of options here are:
>>
>> * return nil, an option the rejection of which is the premise of your
>> proposal
>> * return a partial character (i.e., \u{301}), an option which we haven’t
>> yet talked about in this thread–seems like this could have simpler
>> semantics, potentially yields garbage if the index is garbage but in the
>> case of Kevin’s example actually behaves as the user might expect

I think that's exactly what I was proposing in
https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20170612/037466.html

>> * return a whole character after “rounding down”–difficult semantics
>> to define and explain, always results in a whole character but in the
>> case of Kevin’s example gives an unexpected answer
>> * returns a whole character after “rounding up”–difficult semantics to
>> define and explain, always results in a whole character but when the
>> index is misaligned would result in a character or range of characters
>> in which the index is not found
>> * trap–simple semantics, never returns garbage, obvious disadvantage
>> that execution will not proceed
>>
>> No clearly perfect answer here. However, _if_ we hew strictly to the
>> stated premise of your proposal that failable APIs are awkward enough to
>> justify a change, and moreover that the awkwardness is truly “needless”
>> because of the rarity of misaligned index usage, then at face value
>> trapping should be a perfectly acceptable solution.
>>
>> That Kevin’s example raises the specter of trapping being a realistic
>> occurrence in currently working code actually suggests a challenge to your
>> stated premise. If we accept that this challenge is a substantial one, then
>> it’s not clear to me that abandoning failable APIs should be ruled out from
>> the outset.
>>
>> However, if this desire to remove failable APIs remains strong then I
>> wonder if the undiscussed second option above is worth at least some
>> consideration.
>>
>
> Having digested your revised proposed behavior a little better I see you’re
> kind of getting at this exact issue, but I’m uncomfortable with how it’s so
> tied to the underlying encoding, which is not guaranteed to be UTF-16 but
> is assumed to be for the purposes of slicing. 

I think there's some confusion here; probably I have failed to explain
myself.  Today a String happens to always be UTF-16, but there's no
intention to assume that it is UTF-16 for the purposes of slicing in the
future.  Any place you see something like s.utf16 in an example I've
used to illustrate semantics should be interpreted as s.codeUnits,
where codeUnits is a collection of code units for whatever the
underlying encoding is.

Tying this to underlying encoding actually reflects the true nature of
String, which is exposed by the semantics of concatenation and range
replacement (where multiple elements may merge into one element).  As
stated in
https://github.com/apple/swift/blob/master/docs/StringManifesto.md#string-should-be-a-collection-of-characters-again
the elements of a String (or any of its views other than native code
units) are an emergent property.  To anyone operating at Unicode scalar
granularity (which can result in misalignment with respect to
characters) or at the finer granularity of code units (native or
transcoded, which can result in misalignment with all other views), I
think this is actually unsurprising.
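
To make the point about concatenation concrete, here is a minimal sketch.
It assumes a current Swift toolchain rather than the 2017 one, and the
variable names are only illustrative:

```swift
// Concatenation can merge elements: two one-Character strings
// combine into a single Character.
let base = "e"
let accent = "\u{301}"          // COMBINING ACUTE ACCENT on its own
let combined = base + accent    // one grapheme cluster, "e" plus accent

// Character counts are not additive across concatenation...
print(base.count, accent.count, combined.count)

// ...while the scalar and code unit views simply grow.
print(combined.unicodeScalars.count)
print(combined.utf16.count)
print(combined.utf8.count)
```

The Character count of the result is 1 even though each operand also has a
count of 1, which is the sense in which a String's elements are emergent.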

> I’d like to propose an alternative that attempts to deliver on what
> I’ve called the second option above–somewhat similar:
>
> A string index will notionally or actually keep track of the view in which
> it was originally aligned, be it utf8, utf16, unicodeScalars, or
> characters. A slicing operation str.xxx[idx] will behave as expected if idx
> is not misaligned with respect to str.xxx. If it is misaligned, the
> operation would instead be notionally String(str.yyy[idx...]).xxx.first!,
> where yyy is the original view in which idx was known aligned–if idx is not
> also misaligned with respect to str.yyy (as might be the case if idx was
> returned from an operation on a different string). If it is still
> misaligned, trap.

That seems much more complicated than what I'm proposing, but maybe
that's because I haven't yet explained myself clearly enough.

-- 
-Dave
___
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


Re: [swift-evolution] [Review] SE-0180: String Index Overhaul

2017-06-14 Thread Dave Abrahams via swift-evolution

on Wed Jun 14 2017, Xiaodi Wu  wrote:

> On Wed, Jun 14, 2017 at 11:13 AM, Dave Abrahams  wrote:
>
>>
>> on Wed Jun 14 2017, Xiaodi Wu  wrote:
>>
>> > However, if this desire to remove failable APIs remains strong then I
>> > wonder if the undiscussed second option above is worth at least some
>> > consideration.
>>
>> I think you're misunderstanding the motivation here.  It's not so much
>> that I want to remove failable APIs as that I want to reduce overall API
>> surface area.  The current index conversion APIs contribute 16
>> initializers and 16 methods to the overall size of the library.
>>
>
> Ah, and presumably, having only failable APIs once these different index
> types are collapsed into one would be too cumbersome.

Well, yeah, and impossible.  Collection conformance requires that
subscript return a non-optional Element.
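
A small sketch of that constraint, under the assumption that a failable
lookup would have to be a separate member. The `checked:` subscript below
is hypothetical and not a standard library API:

```swift
// Collection requires: subscript(position: Index) -> Element { get }
// The result is not Optional, so a conforming type cannot signal
// failure by returning nil from its main subscript.

extension Collection {
    // Hypothetical helper (not a standard library API): a failable
    // lookup has to live alongside the required subscript.
    subscript(checked position: Index) -> Element? {
        return indices.contains(position) ? self[position] : nil
    }
}

let digits = [10, 20, 30]
let inRange = digits[checked: 1]      // an in-bounds position
let outOfRange = digits[checked: 7]   // a position that does not exist
```

With a single index type shared by all views, the required subscript is
the one every generic algorithm calls, so it has to produce some Element
or trap.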

-- 
-Dave


Re: [swift-evolution] [Review] SE-0180: String Index Overhaul

2017-06-14 Thread David Waite via swift-evolution


> On Jun 13, 2017, at 3:21 PM, Dave Abrahams via swift-evolution wrote:
> 
> on Mon Jun 12 2017, David Waite wrote:
> 
>> So is the idea of the Index struct that the encodedOffset is an
>> offset in the native representation of the string (byte offset, word
>> offset, etc) to the start of a grapheme, and transcodedOffset is data
>> for Unicode Scalar, UTF-16 and UTF-8 views to represent an offset
>> within a grapheme to a code point or code unit?
> 
> Almost.  First, remember that transcodedOffset is currently just a
> conceptual thing and not part of the proposed API.  But if we exposed
> it, the following would be true:
> 
>  s.indices.index(where: { $0.transcodedOffset != 0 }) == nil
>  s.unicodeScalars.indices.index(where: { $0.transcodedOffset != 0 }) == nil
> 
> and, because the native encoding of Strings is currently always
> UTF-16-compatible
> 
>  s.utf16.indices.index(where: { $0.transcodedOffset != 0 }) == nil
> 
> In other words, a non-zero transcodedOffset can only occur in indices
> from views that represent the string as code units in something other
> than its native encoding, and only if that view is not UTF-32.

My main misconception appears to be that the implementation would track the 
beginning of a grapheme as an offset of code units, with additional tracking of 
the offset within a grapheme to a code unit or of state during transcoding. 
This would allow an index to track if it is misaligned with regard to the 
string, to make translations of indexes safer.

Thinking about this more, it would cause creating an index from an 
encodedOffset or incrementing an index to be a potentially O(n) operation as it 
walks the string tracking grapheme clusters.

> 
>> or to specify that an index to the same character in two normalized
>> strings may be different if one is backed by UTF-8 and the other
>> UTF-16. “encodedCharacterOffset” may be better.
> 
> In what way does bringing the word “Character” into this improve things?

It doesn’t; it is based on my misconception above :-)

>> or strings using a stateful character encoding like ISO/IEC 2022.
> 
> I don't believe it prevents that either.  The index already has state to
> avoid repeating work when in a loop such as:
> 
>   var i = someView.startIndex
>   while i != someView.endIndex {
>     somethingWith(someView[i])   // 1
>     i = someView.index(after: i) // 2
>   }
> 
> where lines 1 and 2 both require determining the extent of the element
> in underlying code units.  There's no reason it couldn't acquire
> additional state.
> 
> The most efficient way to deal with a String in a particular encoding is
> to make a new instance of StringProtocol (say ISO_IEC_2022String), which
> would not have to use this index type.
> 
> It is planned that eventually String could actually use something like
> ISO_IEC_2022String as its backing store.  At that point, we'd have a
> choice:
> 
> 1. Allow String.Index to store arbitrary state, burdening it with the
>   cost of potential ARC traffic, or
> 
> 2. Create a limited “scratch space” using fundamental types (e.g., one
>   UInt) that every instance of StringProtocol would have to be able to
>   use to represent its state.

Yes, this is what I was thinking: the Index becomes more complex as the
number of types that rely on the Index to hold their state grows.
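
To picture the two choices, here is a rough sketch. Every type and field
name in it is hypothetical and chosen only for illustration; nothing here
is proposed API:

```swift
// Option 1 (hypothetical): arbitrary state behind a class reference.
// Copying the index retains and releases `state`, i.e. ARC traffic.
final class OpaqueIndexState {}
struct BoxedIndex {
    var encodedOffset: Int
    var state: OpaqueIndexState?
}

// Option 2 (hypothetical): a fixed "scratch space" built from
// fundamental types. Copies are plain bitwise copies, but every
// StringProtocol type must fit its state into this one UInt.
struct ScratchIndex {
    var encodedOffset: Int
    var scratch: UInt = 0
}

let plain = ScratchIndex(encodedOffset: 4)
let withState = ScratchIndex(encodedOffset: 4, scratch: 1)
print(MemoryLayout<ScratchIndex>.size)
```

The trade-off is flexibility against copy cost: option 2 keeps the index
trivially copyable at the price of a fixed budget for state.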

-DW


Re: [swift-evolution] [Review] SE-0180: String Index Overhaul

2017-06-14 Thread Xiaodi Wu via swift-evolution
On Wed, Jun 14, 2017 at 11:13 AM, Dave Abrahams  wrote:

>
> on Wed Jun 14 2017, Xiaodi Wu  wrote:
>
> > If we leave aside for a moment the nomenclature issue where everything in
> > Foundation referring to a character is really referring to a Unicode
> > scalar, Kevin’s example illustrates the whole problem in a nutshell,
> > doesn’t it? In that example, we have a straightforward attempt to slice
> > with a misaligned index. The totality of options here are:
> >
> > * return nil, an option the rejection of which is the premise of your
> > proposal
> > * return a partial character (i.e., \u{301}), an option which we haven’t
> > yet talked about in this thread–seems like this could have simpler
> > semantics, potentially yields garbage if the index is garbage but in the
> > case of Kevin’s example actually behaves as the user might expect
> > * return a whole character after “rounding down”–difficult semantics to
> > define and explain, always results in a whole character but in the case
> of
> > Kevin’s example gives an unexpected answer
> > * returns a whole character after “rounding up”–difficult semantics to
> > define and explain, always results in a whole character but when the
> index
> > is misaligned would result in a character or range of characters in which
> > the index is not found
> > * trap–simple semantics, never returns garbage, obvious disadvantage that
> > execution will not proceed
> >
> > No clearly perfect answer here. However, _if_ we hew strictly to the
> stated
> > premise of your proposal that failable APIs are awkward enough to
> justify a
> > change, and moreover that the awkwardness is truly “needless” because of
> > the rarity of misaligned index usage, then at face value trapping should
> be
> > a perfectly acceptable solution.
> >
> > That Kevin’s example raises the specter of trapping being a realistic
> > occurrence in currently working code actually suggests a challenge to
> your
> > stated premise. If we accept that this challenge is a substantial one,
> then
> > it’s not clear to me that abandoning failable APIs should be ruled out
> from
> > the outset.
> >
> > However, if this desire to remove failable APIs remains strong then I
> > wonder if the undiscussed second option above is worth at least some
> > consideration.
>
> I think you're misunderstanding the motivation here.  It's not so much
> that I want to remove failable APIs as that I want to reduce overall API
> surface area.  The current index conversion APIs contribute 16
> initializers and 16 methods to the overall size of the library.
>

Ah, and presumably, having only failable APIs once these different index
types are collapsed into one would be too cumbersome.


Re: [swift-evolution] [Review] SE-0180: String Index Overhaul

2017-06-14 Thread Dave Abrahams via swift-evolution

on Wed Jun 14 2017, Xiaodi Wu  wrote:

> If we leave aside for a moment the nomenclature issue where everything in
> Foundation referring to a character is really referring to a Unicode
> scalar, Kevin’s example illustrates the whole problem in a nutshell,
> doesn’t it? In that example, we have a straightforward attempt to slice
> with a misaligned index. The totality of options here are:
>
> * return nil, an option the rejection of which is the premise of your
> proposal
> * return a partial character (i.e., \u{301}), an option which we haven’t
> yet talked about in this thread–seems like this could have simpler
> semantics, potentially yields garbage if the index is garbage but in the
> case of Kevin’s example actually behaves as the user might expect
> * return a whole character after “rounding down”–difficult semantics to
> define and explain, always results in a whole character but in the case of
> Kevin’s example gives an unexpected answer
> * returns a whole character after “rounding up”–difficult semantics to
> define and explain, always results in a whole character but when the index
> is misaligned would result in a character or range of characters in which
> the index is not found
> * trap–simple semantics, never returns garbage, obvious disadvantage that
> execution will not proceed
>
> No clearly perfect answer here. However, _if_ we hew strictly to the stated
> premise of your proposal that failable APIs are awkward enough to justify a
> change, and moreover that the awkwardness is truly “needless” because of
> the rarity of misaligned index usage, then at face value trapping should be
> a perfectly acceptable solution.
>
> That Kevin’s example raises the specter of trapping being a realistic
> occurrence in currently working code actually suggests a challenge to your
> stated premise. If we accept that this challenge is a substantial one, then
> it’s not clear to me that abandoning failable APIs should be ruled out from
> the outset.
>
> However, if this desire to remove failable APIs remains strong then I
> wonder if the undiscussed second option above is worth at least some
> consideration.

I think you're misunderstanding the motivation here.  It's not so much
that I want to remove failable APIs as that I want to reduce overall API
surface area.  The current index conversion APIs contribute 16
initializers and 16 methods to the overall size of the library.

-- 
-Dave


Re: [swift-evolution] [Review] SE-0180: String Index Overhaul

2017-06-14 Thread Xiaodi Wu via swift-evolution
On Wed, Jun 14, 2017 at 09:26 Xiaodi Wu  wrote:

> If we leave aside for a moment the nomenclature issue where everything in
> Foundation referring to a character is really referring to a Unicode
> scalar, Kevin’s example illustrates the whole problem in a nutshell,
> doesn’t it? In that example, we have a straightforward attempt to slice
> with a misaligned index. The totality of options here are:
>
> * return nil, an option the rejection of which is the premise of your
> proposal
> * return a partial character (i.e., \u{301}), an option which we haven’t
> yet talked about in this thread–seems like this could have simpler
> semantics, potentially yields garbage if the index is garbage but in the
> case of Kevin’s example actually behaves as the user might expect
> * return a whole character after “rounding down”–difficult semantics to
> define and explain, always results in a whole character but in the case of
> Kevin’s example gives an unexpected answer
> * returns a whole character after “rounding up”–difficult semantics to
> define and explain, always results in a whole character but when the index
> is misaligned would result in a character or range of characters in which
> the index is not found
> * trap–simple semantics, never returns garbage, obvious disadvantage that
> execution will not proceed
>
> No clearly perfect answer here. However, _if_ we hew strictly to the
> stated premise of your proposal that failable APIs are awkward enough to
> justify a change, and moreover that the awkwardness is truly “needless”
> because of the rarity of misaligned index usage, then at face value
> trapping should be a perfectly acceptable solution.
>
> That Kevin’s example raises the specter of trapping being a realistic
> occurrence in currently working code actually suggests a challenge to your
> stated premise. If we accept that this challenge is a substantial one, then
> it’s not clear to me that abandoning failable APIs should be ruled out from
> the outset.
>
> However, if this desire to remove failable APIs remains strong then I
> wonder if the undiscussed second option above is worth at least some
> consideration.
>

Having digested your revised proposed behavior a little better I see you’re
kind of getting at this exact issue, but I’m uncomfortable with how it’s so
tied to the underlying encoding, which is not guaranteed to be UTF-16 but
is assumed to be for the purposes of slicing. I’d like to propose an
alternative that attempts to deliver on what I’ve called the second option
above–somewhat similar:

A string index will notionally or actually keep track of the view in which
it was originally aligned, be it utf8, utf16, unicodeScalars, or
characters. A slicing operation str.xxx[idx] will behave as expected if idx
is not misaligned with respect to str.xxx. If it is misaligned, the
operation would instead be notionally String(str.yyy[idx...]).xxx.first!,
where yyy is the original view in which idx was known aligned–if idx is not
also misaligned with respect to str.yyy (as might be the case if idx was
returned from an operation on a different string). If it is still
misaligned, trap.
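
A sketch of the notional fallback, taking yyy to be unicodeScalars and xxx
to be the character view. It is written against a current Swift toolchain,
where String itself is the collection of Characters, and the names are
illustrative only:

```swift
let str = "e\u{301}f"                 // 2 Characters, 3 Unicode scalars
let scalars = str.unicodeScalars

// idx is aligned in the unicodeScalars view (yyy) but falls in the
// middle of the first Character of str.
let idx = scalars.index(after: scalars.startIndex)

// Notionally String(str.yyy[idx...]).xxx.first!
let partial = String(scalars[idx...]).first!

// The result is the partial character, i.e. the lone combining accent.
let partialScalars = String(partial).unicodeScalars.map { $0.value }
print(partialScalars == [0x301])
```

This only covers the case where idx is aligned in its original view; an
index taken from a different string would still need the trap.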


On Wed, Jun 14, 2017 at 08:49 Dave Abrahams  wrote:
>
>>
>> > On Jun 13, 2017, at 6:16 PM, Xiaodi Wu  wrote:
>> >
>> > I’m coming to this conversation rather late, so forgive the naive
>> question:
>> >
>> > Your proposal claims that current code with failable APIs is needlessly
>> awkward and that most code only interchanges indices that are known to
>> succeed. So, why is it not simply a precondition of string slicing that the
>> index be correctly aligned? It seems like this would simplify the behavior
>> greatly.
>>
>> Well, consider the case raised by Kevin Ballard if nothing else: that
>> code would start trapping.
>>
>> -Dave
>
>


Re: [swift-evolution] [Review] SE-0180: String Index Overhaul

2017-06-14 Thread Xiaodi Wu via swift-evolution
If we leave aside for a moment the nomenclature issue where everything in
Foundation referring to a character is really referring to a Unicode
scalar, Kevin’s example illustrates the whole problem in a nutshell,
doesn’t it? In that example, we have a straightforward attempt to slice
with a misaligned index. The totality of options here are:

* return nil, an option the rejection of which is the premise of your
proposal
* return a partial character (i.e., \u{301}), an option which we haven’t
yet talked about in this thread–seems like this could have simpler
semantics, potentially yields garbage if the index is garbage but in the
case of Kevin’s example actually behaves as the user might expect
* return a whole character after “rounding down”–difficult semantics to
define and explain, always results in a whole character but in the case of
Kevin’s example gives an unexpected answer
* returns a whole character after “rounding up”–difficult semantics to
define and explain, always results in a whole character but when the index
is misaligned would result in a character or range of characters in which
the index is not found
* trap–simple semantics, never returns garbage, obvious disadvantage that
execution will not proceed

No clearly perfect answer here. However, _if_ we hew strictly to the stated
premise of your proposal that failable APIs are awkward enough to justify a
change, and moreover that the awkwardness is truly “needless” because of
the rarity of misaligned index usage, then at face value trapping should be
a perfectly acceptable solution.

That Kevin’s example raises the specter of trapping being a realistic
occurrence in currently working code actually suggests a challenge to your
stated premise. If we accept that this challenge is a substantial one, then
it’s not clear to me that abandoning failable APIs should be ruled out from
the outset.

However, if this desire to remove failable APIs remains strong then I
wonder if the undiscussed second option above is worth at least some
consideration.


On Wed, Jun 14, 2017 at 08:49 Dave Abrahams  wrote:

>
> > On Jun 13, 2017, at 6:16 PM, Xiaodi Wu  wrote:
> >
> > I’m coming to this conversation rather late, so forgive the naive
> question:
> >
> > Your proposal claims that current code with failable APIs is needlessly
> awkward and that most code only interchanges indices that are known to
> succeed. So, why is it not simply a precondition of string slicing that the
> index be correctly aligned? It seems like this would simplify the behavior
> greatly.
>
> Well, consider the case raised by Kevin Ballard if nothing else: that code
> would start trapping.
>
> -Dave
___
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


Re: [swift-evolution] [Review] SE-0180: String Index Overhaul

2017-06-14 Thread Dave Abrahams via swift-evolution

> On Jun 13, 2017, at 6:16 PM, Xiaodi Wu  wrote:
> 
> I’m coming to this conversation rather late, so forgive the naive question:
> 
> Your proposal claims that current code with failable APIs is needlessly 
> awkward and that most code only interchanges indices that are known to 
> succeed. So, why is it not simply a precondition of string slicing that the 
> index be correctly aligned? It seems like this would simplify the behavior 
> greatly.

Well, consider the case raised by Kevin Ballard if nothing else: that code 
would start trapping. 

-Dave


Re: [swift-evolution] [Review] SE-0180: String Index Overhaul

2017-06-13 Thread Xiaodi Wu via swift-evolution
I’m coming to this conversation rather late, so forgive the naive question:

Your proposal claims that current code with failable APIs is needlessly
awkward and that most code only interchanges indices that are known to
succeed. So, why is it not simply a precondition of string slicing that the
index be correctly aligned? It seems like this would simplify the behavior
greatly.


On Tue, Jun 13, 2017 at 19:04 Dave Abrahams via swift-evolution <
swift-evolution@swift.org> wrote:

>
> on Tue Jun 06 2017, Dave Abrahams  wrote:
>
> >> Overall it looks pretty good. But unfortunately the answer to "Will
> >> applications still compile but produce different behavior than they
> >> used to?" is actually "Yes", when using APIs provided by
> >> Foundation. This is because Foundation is currently able to return
> >> String.Index values that don't point to Character boundaries.
> >>
> >> Specifically, in Swift 3, the following code:
> >>
> >> import Foundation
> >>
> >> let str = "e\u{301}galite\u{301}"
> >> let r = str.rangeOfCharacter(from: ["\u{301}"])!
> >> print(str[r] == "\u{301}")
> >>
> >> will print “true”, because the returned range identifies the combining
> >> acute accent only. But with the proposed String.Index revisions, the
> >> `str[r]` subscript will return the whole "e\u{301}" combined
> >> character.
> >
> > Hmm, true.
> >
> > This doesn't totally invalidate the concern, but...
> >
> > The existing behavior is a bug in the way Foundation interfaces with the
> > 3.0 standard library.  str.rangeOfCharacter (which should be
> > str.rangeOfUnicodeScalar) should be returning
> > Range<String.UnicodeScalarView.Index> but is returning a misaligned
> > Range<String.Index>.  Everything in the 3.0 standard library design is
> > engineered to ensure that misaligned String indices don't happen at all
> > (although they still can—just use an index from string1 in string2),
> > thus the rigorous failable index conversion APIs.
> >
> > It's easy to produce results with this API that don't make sense in
> > Swift 3:
> >
> >   let str = "e\u{301}\u{302}galite\u{301}"
> >   let r = str.rangeOfCharacter(from: ["\u{301}"])!
> >   print(str[r.lowerBound] == "\u{301}") // false
> >
> >> This is, of course, an edge case, but we need to consider the
> >> implications of this and determine if it actually affects anything
> >> that’s likely to be a problem in practice.
> >
> > I agree.  It would also be reasonable to pick a different behavior for
> > misaligned indices, for example:
> >
> >   Indices *that don't fall on a code unit boundary* are “rounded down”
> >   before use.
> >
> > The existing behaviors for these cases are a cluster of coincidences,
> > and were never designed.  I doubt that preserving them in their current
> > form makes sense and will lead to a usable string semantics for the long
> > term, but if they do in fact happen to make sense, we'd still need to
> > codify the rules so we can keep future behaviors consistent.
>
> Having considered this further, I'd like to propose these revised
> semantics for
> misaligned indices, to preserve the behavior of rangeOfCharacter and its
> ilk:
>
> * Definition: an index i is aligned with respect to a string view v iff
>
>  v.indices.contains(i) || v.endIndex == i
>
>   If i is not aligned with respect to v it is *misaligned* with respect
>   to v.
>
> * When i is misaligned with respect to a String/Substring view s.xxx
>   (imagining s itself could also be spelled as s.xxx), combining s.xxx
>   and i is done in terms of underlying code units and i.encodedOffset.
>
>   It's very hard to write these semantics down precisely in terms of
>   existing constructs, but this should give you a sense of what I have
>   in mind:
>
>   1. the suffix beginning at i is formed by slicing the underlying
> codeUnits at i.encodedOffset, forming a new Substring around that
> slice, and getting its corresponding xxx view
>
>  s.xxx[i...]
>
>   is roughly equivalent to:
>
> Substring(s.utf16[String.Index(encodedOffset: i.encodedOffset)...]).xxx
>
>   (given that we currently have UTF-16 code units)
>
>   2. similarly
>
>  s.xxx[..<i]
>
>   is equivalent to something like:
>
> Substring(s.utf16[..<String.Index(encodedOffset: i.encodedOffset)]).xxx
>
>   3. s.xxx[i] is equivalent to s.xxx[i...].first!
>
>   4. s.xxx.index(after: i) is equivalent to
> s.xxx[i...].indices.dropFirst().first!
>
>   5. s.xxx.index(before: i) is equivalent to s.xxx[..<i].indices.last!
>
> I'm concerned that we have no precise way to specify the semantics of #1
> and #2, to the point where it might be better to implement them that way
> but leave the semantics unspecified.  Another alternative would be to
> add the APIs needed to make it possible to express a precise equivalence
> instead of a rough equivalence.  If anyone has better ideas, I'm all ears.
>
> --
> -Dave
>

Re: [swift-evolution] [Review] SE-0180: String Index Overhaul

2017-06-13 Thread Dave Abrahams via swift-evolution

on Tue Jun 06 2017, Dave Abrahams  wrote:

>> Overall it looks pretty good. But unfortunately the answer to "Will
>> applications still compile but produce different behavior than they
>> used to?" is actually "Yes", when using APIs provided by
>> Foundation. This is because Foundation is currently able to return
>> String.Index values that don't point to Character boundaries.
>>
>> Specifically, in Swift 3, the following code:
>>
>> import Foundation
>>
>> let str = "e\u{301}galite\u{301}"
>> let r = str.rangeOfCharacter(from: ["\u{301}"])!
>> print(str[r] == "\u{301}")
>>
>> will print “true”, because the returned range identifies the combining
>> acute accent only. But with the proposed String.Index revisions, the
>> `str[r]` subscript will return the whole "e\u{301}" combined
>> character.
>
> Hmm, true.
>
> This doesn't totally invalidate the concern, but...
>
> The existing behavior is a bug in the way Foundation interfaces with the
> 3.0 standard library.  str.rangeOfCharacter (which should be
> str.rangeOfUnicodeScalar) should be returning
> Range<String.UnicodeScalarView.Index> but is returning a misaligned
> Range<String.Index>.  Everything in the 3.0 standard library design is
> engineered to ensure that misaligned String indices don't happen at all
> (although they still can—just use an index from string1 in string2),
> thus the rigorous failable index conversion APIs.
>
> It's easy to produce results with this API that don't make sense in
> Swift 3:
>
>   let str = "e\u{301}\u{302}galite\u{301}"
>   let r = str.rangeOfCharacter(from: ["\u{301}"])!
>   print(str[r.lowerBound] == "\u{301}") // false
>
>> This is, of course, an edge case, but we need to consider the
>> implications of this and determine if it actually affects anything
>> that’s likely to be a problem in practice.
>
> I agree.  It would also be reasonable to pick a different behavior for
> misaligned indices, for example:
>
>   Indices *that don't fall on a code unit boundary* are “rounded down”
>   before use.
>
> The existing behaviors for these cases are a cluster of coincidences,
> and were never designed.  I doubt that preserving them in their current
> form makes sense and will lead to a usable string semantics for the long
> term, but if they do in fact happen to make sense, we'd still need to
> codify the rules so we can keep future behaviors consistent.

Having considered this further, I'd like to propose these revised semantics for
misaligned indices, to preserve the behavior of rangeOfCharacter and its
ilk:

* Definition: an index i is aligned with respect to a string view v iff 

 v.indices.contains(i) || v.endIndex == i

  If i is not aligned with respect to v it is *misaligned* with respect
  to v.
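
The alignment definition above can be transcribed almost literally. This is only a sketch: `isAligned` is a hypothetical helper name, it runs in linear time, and it assumes a current Swift toolchain in which all String views share the `String.Index` type.

```swift
// Direct transcription of the definition in the text (hypothetical helper,
// linear time, for illustration only).
func isAligned<V: Collection>(_ i: V.Index, in v: V) -> Bool {
    return v.indices.contains(i) || v.endIndex == i
}

let sample = "e\u{301}galite\u{301}"
// Index of the combining accent: a scalar boundary inside the first Character.
let accentIndex = sample.unicodeScalars.index(after: sample.unicodeScalars.startIndex)

print(isAligned(accentIndex, in: sample.unicodeScalars))     // aligned for the scalar view
print(isAligned(accentIndex, in: sample))                    // misaligned for the Character view
print(isAligned(sample.endIndex, in: sample.unicodeScalars)) // endIndex counts as aligned
```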

* When i is misaligned with respect to a String/Substring view s.xxx
  (imagining s itself could also be spelled as s.xxx), combining s.xxx
  and i is done in terms of underlying code units and i.encodedOffset.

  It's very hard to write these semantics down precisely in terms of
  existing constructs, but this should give you a sense of what I have
  in mind:

  1. the suffix beginning at i is formed by slicing the underlying
codeUnits at i.encodedOffset, forming a new Substring around that
slice, and getting its corresponding xxx view

 s.xxx[i...] 

  is roughly equivalent to:

Substring(s.utf16[String.Index(encodedOffset: i.encodedOffset)...]).xxx

  (given that we currently have UTF-16 code units)

  2. similarly

 s.xxx[..<i]

Re: [swift-evolution] [Review] SE-0180: String Index Overhaul

2017-06-13 Thread Dave Abrahams via swift-evolution

on Mon Jun 12 2017, David Waite  wrote:

>> On Jun 9, 2017, at 9:24 PM, Dave Abrahams via swift-evolution wrote:
>> on Fri Jun 09 2017, Kevin Ballard wrote:
>>> On Tue, Jun 6, 2017, at 10:57 AM, Dave Abrahams via swift-evolution wrote:
>>>
>>> Ah, right. So a String.Index is actually something similar to
>>>
>>> public struct Index {
>>>   public var encodedOffset: Int
>>>   private var byteOffset: Int // UTF-8 offset into the UTF-8 code unit
>>> }
>> 
>> Similar.  I'd write it this way:
>> 
>> public struct Index {
>>   public var encodedOffset: Int
>> 
>>   // Offset into a UnicodeScalar represented in an encoding other
>>   // than the String's underlying encoding
>>   private var transcodedOffset: Int 
>> }
>
> I *think* the following is what the proposal is saying, but let me
> walk through it:

OK. I'm going to be extremely nitpicky about terminology just to ensure
complete clarity; please don't take it as criticism.

> My understanding would be:
> - An index manipulated at the string level points to the start of a
> grapheme cluster which is also a particular code point 

* A grapheme cluster is not a code point

* Probably you mean that it also points to the start of a code point

* We try not to say “code point” because

  a) despite its loose and liberal use in the Unicode standard,
 according to Unicode experts that term technically means something
 having specifically to do with UTF-16 (IIRC the space of code
 points includes surrogate values), and while it was the same thing
 as a Unicode scalar value in the days of UCS-2, is mostly not a
 useful concept today.

  b) the potential for confusion between “code unit” and “code point” is
 huge; people mix them up all the time.

  c) Instead we use “Unicode scalar value” or “Unicode scalar” for
 short; my advice is to banish the term “code point” from your
 vocabulary as I have—except when picking nits ;-)

> and to a code unit of the underlying string backing data 

Yes.  If String indices were Hashable, then these would all be true:

Set(s.indices).isSubset(of: s.unicodeScalars.indices)
Set(s.unicodeScalars.indices).isSubset(of: s.utf16.indices)
Set(s.unicodeScalars.indices).isSubset(of: s.utf8.indices)

(the views also all have the same endIndex)

Today, the code units are utf16.  If we lift that restriction and add a
codeUnits view, then

Set(s.indices).isSubset(of: s.codeUnits.indices)
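
As a hedged aside: in current Swift toolchains `String.Index` does conform to `Hashable`, so the subset relationships above can be tried directly. The sketch below assumes such a toolchain; the sample string is arbitrary, and the hypothetical `codeUnits` view mentioned above is not used because it does not exist in the standard library.

```swift
let s = "e\u{301}galite\u{301} \u{1F600}"

let characterIndices = Set(s.indices)
let scalarIndices = Set(s.unicodeScalars.indices)

// Every Character boundary is also a Unicode scalar boundary...
let charsInScalars = characterIndices.isSubset(of: s.unicodeScalars.indices)
// ...and every scalar boundary is also a UTF-16 and a UTF-8 code unit boundary.
let scalarsInUTF16 = scalarIndices.isSubset(of: s.utf16.indices)
let scalarsInUTF8 = scalarIndices.isSubset(of: s.utf8.indices)
// The converse fails: the combining accents start inside a Character.
let scalarsInChars = scalarIndices.isSubset(of: s.indices)

print(charsInScalars, scalarsInUTF16, scalarsInUTF8, scalarsInChars)
```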

> - The unicodeScalar view can be intra-grapheme cluster, pointing at a
> code point 

I don't follow, sorry.  I think the unicodeScalar view doesn't point at
anything.

> - The utf-16 index can be intra-codepoint, since some code points are
> represented by two code units
> - The UTF-8 index can be intra-codepoint as well, since code points are
> represented by up to four code units

if we s/codepoint/unicode scalar/, then yes.
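
A small sketch of an index that falls inside a Unicode scalar, using the failable conversion `samePosition(in:)` that the thread keeps returning to. It assumes a current Swift toolchain; the string and variable names are made up for illustration.

```swift
let text = "a\u{1F600}"   // "a" followed by an emoji outside the BMP
let units = text.utf16

// The emoji needs two UTF-16 code units (a surrogate pair).
let lead = units.index(after: units.startIndex)  // start of the emoji
let trail = units.index(after: lead)             // between the two surrogates

// Converting to the unicodeScalars view succeeds only on a scalar boundary.
let leadAsScalar = lead.samePosition(in: text.unicodeScalars)
let trailAsScalar = trail.samePosition(in: text.unicodeScalars)

print(units.count)
print(leadAsScalar != nil)
print(trailAsScalar == nil)
```

The optional result is the "failable API" style under discussion: the caller has to decide what to do when the index does not correspond to a position in the target view.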

> So is the idea of the Index struct that the encodedOffset is an
> offset in the native representation of the string (byte offset, word
> offset, etc) to the start of a grapheme, and transcodedOffset is data
> for Unicode Scalar, UTF-16 and UTF-8 views to represent an offset
> within a grapheme to a code point or code unit?

Almost.  First, remember that transcodedOffset is currently just a
conceptual thing and not part of the proposed API.  But if we exposed
it, the following would be true:

  s.indices.index(where: { $0.transcodedOffset != 0 }) == nil
  s.unicodeScalars.indices.index(where: { $0.transcodedOffset != 0 }) == nil

and, because the native encoding of Strings is currently always UTF-16 
compatible

  s.utf16.indices.index(where: { $0.transcodedOffset != 0 }) == nil

In other words, a non-zero transcodedOffset can only occur in indices
from views that represent the string as code units in something other
than its native encoding, and only if that view is not UTF-32.

> My feeling is that ‘encoded’ is not enough to distinguish whether
> encodedOffset is meant to indicate an offset in graphemes, code
> points, or code units, 

IMO if you know Unicode, it does, because **Unicode encoding** is
specifically about *representation* in terms of code units.  The
question, then, is whether it's confusing for people who know Unicode
less well, and whether that actually matters.  My supposition has been
that, when all the right high-level APIs are in place, most people will
never touch encodedOffset(s).  But I could be wrong.

The best alternative I can come up with is “nativeCodeUnitOffset,” which
is a mouthful.  We can't just use “codeUnitOffset” because, for example,
in the utf8 view of today's UTF-16-encoded string, this is not about
counting UTF-8 code units; it's still about UTF-16 code units.

> or to specify that an index to the same character in two normalized
> strings may be different if one is backed by UTF-8 and the other
> UTF-16. 

Re: [swift-evolution] [Review] SE-0180: String Index Overhaul

2017-06-12 Thread David Waite via swift-evolution
> On Jun 9, 2017, at 9:24 PM, Dave Abrahams via swift-evolution wrote:
> on Fri Jun 09 2017, Kevin Ballard wrote:
>> On Tue, Jun 6, 2017, at 10:57 AM, Dave Abrahams via swift-evolution wrote:
>>
>> Ah, right. So a String.Index is actually something similar to
>>
>> public struct Index {
>>   public var encodedOffset: Int
>>   private var byteOffset: Int // UTF-8 offset into the UTF-8 code unit
>> }
> 
> Similar.  I'd write it this way:
> 
> public struct Index {
>   public var encodedOffset: Int
> 
>   // Offset into a UnicodeScalar represented in an encoding other
>   // than the String's underlying encoding
>   private var transcodedOffset: Int 
> }

I *think* the following is what the proposal is saying, but let me walk through 
it:

My understanding would be:
- An index manipulated at the string level points to the start of a grapheme 
cluster which is also a particular code point and to a code unit of the 
underlying string backing data
- The unicodeScalar view can be intra-grapheme cluster, pointing at a code point
- The utf-16 index can be intra-codepoint, since some code points are 
represented by two code units
- The UTF-8 index can be intra-codepoint as well, since code points are 
represented by up to four code units

So is the idea of the Index struct that the encodedOffset is an offset in 
the native representation of the string (byte offset, word offset, etc) to the 
start of a grapheme, and transcodedOffset is data for Unicode Scalar, UTF-16 
and UTF-8 views to represent an offset within a grapheme to a code point or 
code unit?

My feeling is that ‘encoded’ is not enough to distinguish whether encodedOffset 
is meant to indicate an offset in graphemes, code points, or code units, or to 
specify that an index to the same character in two normalized strings may be 
different if one is backed by UTF-8 and the other UTF-16. 
“encodedCharacterOffset” may be better.

This index struct does limit some sorts of imagined string implementations, 
such as a string maintained piecewise across multiple allocation units or 
strings using a stateful character encoding like ISO/IEC 2022.

-DW

P.S. I’m also curious why the methods are optional-failing vs. retaining the 
current API and having them fatal-error.


Re: [swift-evolution] [Review] SE-0180: String Index Overhaul

2017-06-11 Thread T.J. Usiyan via swift-evolution
+1

I only gave it a quick read though.

On Sun, Jun 11, 2017 at 3:01 PM, Hooman Mehr via swift-evolution <
swift-evolution@swift.org> wrote:

> Overall, I am strong +1 on this, but I don’t have time to go through a
> detailed analysis of how it will affect my own use cases.
>
> On Jun 4, 2017, at 4:29 PM, Ted Kremenek via swift-evolution <
> swift-evolution@swift.org> wrote:
>
> Hello Swift community,
>
> The review of SE-0180 "String Index Overhaul" begins now and runs through
> *June 8, 2017*.
>
> The proposal is available here:
>
> https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md
>
> Reviews are an important part of the Swift evolution process. All reviews
> should be sent to the swift-evolution mailing list at:
>
> https://lists.swift.org/mailman/listinfo/swift-evolution
>
> or, if you would like to keep your feedback private, directly to the
> review manager. When replying, please try to keep the proposal link at the
> top of the message:
>
> Proposal link:
>
> https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md
>
> Reply text
>
> Other replies
>
> What goes into a review?
>
> The goal of the review process is to improve the proposal under review
> through constructive criticism and, eventually, determine the direction of
> Swift. When writing your review, here are some questions you might want to
> answer in your review:
>
>- What is your evaluation of the proposal?
>- Is the problem being addressed significant enough to warrant a
>change to Swift?
>- Does this proposal fit well with the feel and direction of Swift?
>- If you have used other languages or libraries with a similar
>feature, how do you feel that this proposal compares to those?
>- How much effort did you put into your review? A glance, a quick
>reading, or an in-depth study?
>
> More information about the Swift evolution process is available at:
>
> https://github.com/apple/swift-evolution/blob/master/process.md
>
> Thank you,
> Ted (Review Manager)


Re: [swift-evolution] [Review] SE-0180: String Index Overhaul

2017-06-11 Thread Hooman Mehr via swift-evolution
Overall, I am strong +1 on this, but I don’t have time to go through a detailed 
analysis of how it will affect my own use cases. 

> On Jun 4, 2017, at 4:29 PM, Ted Kremenek via swift-evolution wrote:
> 
> Hello Swift community,
> 
> The review of SE-0180 "String Index Overhaul" begins now and runs through 
> June 8, 2017.
> 
> The proposal is available here:
> 
> https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md
>  
> 
> Reviews are an important part of the Swift evolution process. All reviews 
> should be sent to the swift-evolution mailing list at:
> 
> https://lists.swift.org/mailman/listinfo/swift-evolution 
> 
> or, if you would like to keep your feedback private, directly to the review 
> manager. When replying, please try to keep the proposal link at the top of 
> the message:
> 
> Proposal link:
> 
> https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md
>  
> 
> Reply text
> 
> Other replies
> 
> What goes into a review?
> 
> The goal of the review process is to improve the proposal under review 
> through constructive criticism and, eventually, determine the direction of 
> Swift. When writing your review, here are some questions you might want to 
> answer in your review:
> 
> - What is your evaluation of the proposal?
> - Is the problem being addressed significant enough to warrant a change to 
>   Swift?
> - Does this proposal fit well with the feel and direction of Swift?
> - If you have used other languages or libraries with a similar feature, how do 
>   you feel that this proposal compares to those?
> - How much effort did you put into your review? A glance, a quick reading, or 
>   an in-depth study?
> 
> More information about the Swift evolution process is available at:
> 
> https://github.com/apple/swift-evolution/blob/master/process.md 
> 
> Thank you,
> Ted (Review Manager)
> 


Re: [swift-evolution] [Review] SE-0180: String Index Overhaul

2017-06-09 Thread Dave Abrahams via swift-evolution

on Fri Jun 09 2017, Kevin Ballard  wrote:

> On Tue, Jun 6, 2017, at 10:57 AM, Dave Abrahams via swift-evolution wrote:
>> 
>> on Mon Jun 05 2017, Kevin Ballard  wrote:
>> 
>> > There’s also the curious case where I can have two String.Index values
>> > that compare unequal but actually return the same value when used in a
>> > subscript.
>> > For example, with the above string, if I have a
>> > String.Index(encodedOffset: 0) and a String.Index(encodedOffset:
>> > 1). This may not be a problem in practice, but it’s something to be
>> > aware of.
>> 
>> I don't think this one even rises to that level.
>> 
>> let s = "aaa"
>> var si = s.indices.makeIterator()
>> let i0 = si.next()!
>> let i1 = si.next()!
>> print(i0 == i1)   // false
>> print(s[i0] == s[i1]) // true.  Surprised?
>
> Good point.
>
>> > I’m also confused by the paragraph about index comparison. It talks
>> > about if two indices are valid in a single String view, comparison
>> > semantics are according to Collection, and otherwise indexes are
>> > compared using encodedOffsets, and this means indexes aren’t totally
>> > ordered. But I’m not sure what the first part is supposed to mean. How
>> > is comparing indices that are valid within a single view any different
>> > than comparing the encodedOffsets?
>> 
>> In today's String, encodedOffset is an offset in UTF-16.  Two indices
>> into a UTF-8 view may be unequal yet have the same encodedOffset.
>
> Ah, right. So a String.Index is actually something similar to
>
> public struct Index {
>   public var encodedOffset: Int
>   private var byteOffset: Int // UTF-8 offset into the UTF-8 code unit
> }

Similar.  I'd write it this way:

public struct Index {
   public var encodedOffset: Int

   // Offset into a UnicodeScalar represented in an encoding other
   // than the String's underlying encoding
   private var transcodedOffset: Int 
}

> In this case, can't we still define String.Index comparison as merely
> being the lexicographical comparison of (encodedOffset, byteOffset)?

Yes, and that's how it's implemented in the PR.  But byteOffset is not
part of the user model, so we can't specify it that way.

> Also, as a side note, the proposal implies that encodedOffset is
> mutable. Is this actually the case? If so, I assume that mutating it
> would also reset the byteOffset?

Yes, 

 i.encodedOffset = n

is equivalent to

 i = String.Index(encodedOffset: n)
 
-- 
-Dave



Re: [swift-evolution] [Review] SE-0180: String Index Overhaul

2017-06-09 Thread Kevin Ballard via swift-evolution
On Tue, Jun 6, 2017, at 10:57 AM, Dave Abrahams via swift-evolution wrote:
> 
> on Mon Jun 05 2017, Kevin Ballard  wrote:
> 
> > There’s also the curious case where I can have two String.Index values
> > that compare unequal but actually return the same value when used in a
> > subscript. 
> > For example, with the above string, if I have a
> > String.Index(encodedOffset: 0) and a String.Index(encodedOffset:
> > 1). This may not be a problem in practice, but it’s something to be
> > aware of.
> 
> I don't think this one even rises to that level.
> 
> let s = "aaa"
> var si = s.indices.makeIterator()
> let i0 = si.next()!
> let i1 = si.next()!
> print(i0 == i1)   // false
> print(s[i0] == s[i1]) // true.  Surprised?

Good point.

> > I’m also confused by the paragraph about index comparison. It talks
> > about if two indices are valid in a single String view, comparison
> > semantics are according to Collection, and otherwise indexes are
> > compared using encodedOffsets, and this means indexes aren’t totally
> > ordered. But I’m not sure what the first part is supposed to mean. How
> > is comparing indices that are valid within a single view any different
> > than comparing the encodedOffsets?
> 
> In today's String, encodedOffset is an offset in UTF-16.  Two indices
> into a UTF-8 view may be unequal yet have the same encodedOffset.

Ah, right. So a String.Index is actually something similar to

public struct Index {
  public var encodedOffset: Int
  private var byteOffset: Int // UTF-8 offset into the UTF-8 code unit
}

In this case, can't we still define String.Index comparison as merely being the 
lexicographical comparison of (encodedOffset, byteOffset)?

Also, as a side note, the proposal implies that encodedOffset is mutable. Is 
this actually the case? If so, I assume that mutating it would also reset the 
byteOffset?

-Kevin Ballard


Re: [swift-evolution] [Review] SE-0180: String Index Overhaul

2017-06-06 Thread Dave Abrahams via swift-evolution

on Mon Jun 05 2017, Kevin Ballard  wrote:

> https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md
> 
>
> Overall it looks pretty good. But unfortunately the answer to "Will
> applications still compile but produce different behavior than they
> used to?" is actually "Yes", when using APIs provided by
> Foundation. This is because Foundation is currently able to return
> String.Index values that don't point to Character boundaries.
>
> Specifically, in Swift 3, the following code:
>
> import Foundation
>
> let str = "e\u{301}galite\u{301}"
> let r = str.rangeOfCharacter(from: ["\u{301}"])!
> print(str[r] == "\u{301}")
>
> will print “true”, because the returned range identifies the combining
> acute accent only. But with the proposed String.Index revisions, the
> `str[r]` subscript will return the whole "e\u{301}" combined
> character.

Hmm, true.

This doesn't totally invalidate the concern, but...

The existing behavior is a bug in the way Foundation interfaces with the
3.0 standard library.  str.rangeOfCharacter (which should be
str.rangeOfUnicodeScalar) should be returning
Range<String.UnicodeScalarView.Index> but is returning a misaligned
Range<String.Index>.  Everything in the 3.0 standard library design is
engineered to ensure that misaligned String indices don't happen at all
(although they still can—just use an index from string1 in string2),
thus the rigorous failable index conversion APIs.
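
For comparison, here is a sketch of the same search done directly in the `unicodeScalars` view, where the resulting range is aligned by construction. It uses only standard library calls and assumes a current Swift toolchain; it is not the Foundation API under discussion, and the variable names are illustrative.

```swift
let str = "e\u{301}galite\u{301}"
let scalars = str.unicodeScalars
let accent: Unicode.Scalar = "\u{301}"

// Find the first combining acute accent among the Unicode scalars.
let start = scalars.firstIndex(of: accent)!
let r = start..<scalars.index(after: start)

// Slicing the scalar view with a scalar-aligned range yields just the accent.
let hit = scalars[r]
print(Array(hit) == [accent])
print(str.count, scalars.count)
```

Because the range is produced and consumed by the same view, no rounding, failure, or trapping question arises; the difficulty in the thread comes only from carrying such a range over to the Character view.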

It's easy to produce results with this API that don't make sense in
Swift 3:

  let str = "e\u{301}\u{302}galite\u{301}"
  let r = str.rangeOfCharacter(from: ["\u{301}"])!
  print(str[r.lowerBound] == "\u{301}") // false

> This is, of course, an edge case, but we need to consider the
> implications of this and determine if it actually affects anything
> that’s likely to be a problem in practice.

I agree.  It would also be reasonable to pick a different behavior for
misaligned indices, for example:

  Indices *that don't fall on a code unit boundary* are “rounded down”
  before use.

The existing behaviors for these cases are a cluster of coincidences,
and were never designed.  I doubt that preserving them in their current
form makes sense and will lead to a usable string semantics for the long
term, but if they do in fact happen to make sense, we'd still need to
codify the rules so we can keep future behaviors consistent.

> There’s also the curious case where I can have two String.Index values
> that compare unequal but actually return the same value when used in a
> subscript. 
> For example, with the above string, if I have a
> String.Index(encodedOffset: 0) and a String.Index(encodedOffset:
> 1). This may not be a problem in practice, but it’s something to be
> aware of.

I don't think this one even rises to that level.

let s = "aaa"
var si = s.indices.makeIterator()
let i0 = si.next()!
let i1 = si.next()!
print(i0 == i1)   // false
print(s[i0] == s[i1]) // true.  Surprised?

> I’m also confused by the paragraph about index comparison. It talks
> about if two indices are valid in a single String view, comparison
> semantics are according to Collection, and otherwise indexes are
> compared using encodedOffsets, and this means indexes aren’t totally
> ordered. But I’m not sure what the first part is supposed to mean. How
> is comparing indices that are valid within a single view any different
> than comparing the encodedOffsets?

In today's String, encodedOffset is an offset in UTF-16.  Two indices
into a UTF-8 view may be unequal yet have the same encodedOffset.

Regards,

-- 
-Dave



Re: [swift-evolution] [Review] SE-0180: String Index Overhaul

2017-06-06 Thread Kevin Ballard via swift-evolution
https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md
 


Overall it looks pretty good. But unfortunately the answer to "Will 
applications still compile but produce different behavior than they used to?" 
is actually "Yes", when using APIs provided by Foundation. This is because 
Foundation is currently able to return String.Index values that don't point to 
Character boundaries.

Specifically, in Swift 3, the following code:

import Foundation

let str = "e\u{301}galite\u{301}"
let r = str.rangeOfCharacter(from: ["\u{301}"])!
print(str[r] == "\u{301}")

will print “true”, because the returned range identifies the combining acute 
accent only. But with the proposed String.Index revisions, the `str[r]` 
subscript will return the whole "e\u{301}" combined character.

This is, of course, an edge case, but we need to consider the implications of 
this and determine if it actually affects anything that’s likely to be a 
problem in practice.

There’s also the curious case where I can have two String.Index values that 
compare unequal but actually return the same value when used in a subscript. 
For example, with the above string, if I have a String.Index(encodedOffset: 0) 
and a String.Index(encodedOffset: 1). This may not be a problem in practice, 
but it’s something to be aware of.

I’m also confused by the paragraph about index comparison. It talks about if 
two indices are valid in a single String view, comparison semantics are 
according to Collection, and otherwise indexes are compared using 
encodedOffsets, and this means indexes aren’t totally ordered. But I’m not sure 
what the first part is supposed to mean. How is comparing indices that are 
valid within a single view any different than comparing the encodedOffsets?

-Kevin Ballard