Re: [swift-evolution] [swift-evolution-announce] [Revised and review extended] SE-0180 - String Index Overhaul

2017-07-02 Thread Dave Abrahams via swift-evolution

Hi Karl,

It was pointed out to me that I never answered this thoughtful post of
yours...

on Mon Jun 26 2017, Karl Wagner  wrote:

>> On 23. Jun 2017, at 02:59, Kevin Ballard via swift-evolution
>>  wrote:
>> 
>> https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md
>> 
>> 
>> Given the discussion in the original thread about potentially having
>> Strings backed by something other than utf16 code units, I'm
>> somewhat concerned about having this kind of vague `encodedOffset`
>> that happens to be UTF16 code units. If this is supposed to
>> represent an offset into whatever code units the String is backed
>> by, then it's going to be a problem because the user isn't supposed
>> to know or care what the underlying storage for the String is.
>
> Is that true? The String manifesto shows a design where the underlying
> Encoding and code-units are exposed.

That is the eventual goal.  Note that with this proposal we are making
progress towards that goal, but getting all the way there is out of
scope for this release.

> From the talk about String’s being backed by something that isn’t
> UTF-16, I took that to mean that String might one-day become
> generic. Defaults for generic parameters have been mentioned on the
> list before, so “String” could still refer to “String”
> on OSX and maybe “String” on Linux.

I think you may have misunderstood.  String currently supports a few
different underlying representations (ASCII, UTF-16, NSString), all of
which happen to use a UTF-16-compatible encoding.  The eventual goal is
to expand the possible underlying representations of String to
accomodate other encodings.

That said, the underlying representation of String is *not* part of
String's type, and we don't intend to change that.  When String APIs
access the underlying representation, that access is dynamically
dispatched.  If the encoding were a generic parameter, then it would be
statically dispatched (at least in part), but it would also become part
of String's type, and, for example, you would get an error when trying
to pass a String where a String was
expected.  It's important that code passing Strings around remain
smoothly interoperable, so we don't want to introduce this sort of type
mismatch.

Instead, the intention is that someone could make a UTF8String type that
conformed to StringProtocol, and that String itself could be constructed
from any instance of StringProtocol to be used as its underlying
representation.  That way, if you need the performance that comes with
knowing and manipulating the underlying encoding, you can use UTF8String
directly, and if you need to interoperate with code that uses the
lingua-franca String type, you can wrap String around your UTF8String
and pass that.

> I would support a definition of encodedOffset that removed mention of
> UTF-16 and phrased things in terms of String.Encoding and
> code-units. 

Well, a few points about this:

I support removing the text “(UTF-16)” from the initial documentation
comments on these APIs, which is, AFAICT, the only source of the concern
you and others have expressed. That said, Strings are in fact currently
encoded as UTF-16 and as long as Cocoa interop is important, that too is
important and useful information, so it should be documented somewhere.

I don't support describing anything in terms of String.Encoding at this
time.  That enum was added to String by the Foundation overlay, and is not
part of the plan for String except insofar as it is required for source
compatibility and Cocoa interop.  A more appropriate way to describe the
encoding in terms of the language would be as something like
Unicode.UTF16 (at compile-time) or an instance of Unicode.Encoding.Type
(at runtime).  But I see no need to describe it in language terms until
we are ready to add APIs to String that can support multiple encodings
and/or report the underlying encoding, and we are not ready to do that
yet.

> For example, I would like to be able to construct new String indices
> from a known index plus a quantity of code-units known to represent a
> sequence of characters:
>
> var stringOne = “Hello,“
> let stringTwo = “ world"
>
> var idx = stringOne.endIndex
> stringOne.append(contentsOf: stringTwo)
> idx = String.Index(encodedOffset: idx.encodedOffset + 
> stringTwo.codeUnits.count)
> assert(idx == stringOne.endIndex)

I'm not sure what you mean by “represent a sequence of characters” in
this context.  Don't a sequence of code units always represent a
sequence of characters?

The code you wrote above would (almost) work as written under this
proposal, given that Strings always have an encoding that's compatible
with some default.  In other words, making it work *depends* on the fact
that the encoding of stringTwo is compatible with (has a non-strict
sub/superset relation 

Re: [swift-evolution] [swift-evolution-announce] [Revised and review extended] SE-0180 - String Index Overhaul

2017-06-27 Thread Dave Abrahams via swift-evolution

on Thu Jun 22 2017, Kevin Ballard  wrote:

>> https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md
> Given the discussion in the original thread about potentially having
> Strings backed by something other than utf16 code units, I'm somewhat
> concerned about having this kind of vague `encodedOffset` that happens
> to be UTF16 code units. 

What's vague about it?

> If this is supposed to represent an offset into whatever code units
> the String is backed by, then it's going to be a problem because the
> user isn't supposed to know or care what the underlying storage for
> the String is. 

In the long run a user that cares about performance may very well care
about the underlying encoding.  However, that, like access to the
index's encodedOffset, is an expert-level concern.

> And I can imagine potential issues with archiving/unarchiving where
> the unarchived String has a different storage type than the archived
> one, and therefore `encodedOffset` would gain a new meaning that
> screws up unarchived String.Index values.  

Yes.  If you round-trip serialize the index and you don't also preserve
the encoding of the string, you will get nonsense.  As far as I know
that is a feature of any universe in which multiple arbitrary encodings
exist... like the universe we live in.

> The other problem with using this as utf16 is how am I supposed to
> archive/unarchive a String.Index that comes from String.UTF8View?

You can serialize its encodedOffset, which will work as long as the
index is on a Unicode scalar boundary.  If it is not, you can compute
the notional transcodedOffset and serialize that too, by measuring the
distance in the UTF8 view to your index from the previous unicode scalar
boundary.  Then you can reconstruct that position.  Obviously this is
inconvenient.  The proposal is not trying to solve the problem of doing
that conveniently today; it's only trying to lay the necessary
groundwork.

> AFAICT the only way to do that is to ignore encodedOffset entirely and
> instead calculate the distance between s.utf8.startIndex and my index
> (and then recreate the index later on by advancing from
> startIndex). 
>
> But this RFC explicitly says that archiving/unarchiving indices is one
> of the goals of this overhaul.  --
>
> The section on comparison still talks about how this is a weak
> ordering.  In the other thread it was explained as being done so
> because the internal transcodedOffset isn't public, but that still
> makes this explanation very odd. String.Index comparison should not be
> weak ordering, because all indices can be expressed in the utf8View if
> nothing else, and in that view they have a total order. So it should
> just be defined as a total order, based on the position in the
> utf8View that the index corresponds with.  --

An index between the halves of a UTF16 surrogate pair has no
corresponding position in the UTF8 view.  You could arbitrarily choose
one (e.g. two UTF8 code units past the start of the unicode scalar), but
I'm not sure that would produce better results.

> The detailed design of the index has encodedOffset being mutable (and
> this was confirmed in the other thread as intentional). I don't think
> this is a good idea, because it makes the following code behave oddly:
>   let x = index.encodedOffset
>   index.encodedOffset = x
>
> Specifically, this resets the private transcodedOffset, so if you do
> this with an intra-code-unit Index taken from the utf8View, the
> modified Index may point to a different byte. I'm also not sure why
> you'd ever want to do this operation anyway. If you want to change the
> encodedOffset, you can just say `index = String.Index(encodedOffset:
> x)`.

I can take or leave the mutability of encodedOffset, personally.

-- 
-Dave

___
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


Re: [swift-evolution] [swift-evolution-announce] [Revised and review extended] SE-0180 - String Index Overhaul

2017-06-27 Thread Drew Crawford via swift-evolution



On June 26, 2017 at 5:43:42 PM, Karl Wagner via swift-evolution 
(swift-evolution@swift.org) wrote:

I would support a definition of encodedOffset that removed mention of UTF-16 
and phrased things in terms of String.Encoding and code-units. For example, I 
would like to be able to construct new String indices from a known index plus a 
quantity of code-units known to represent a sequence of characters:

var stringOne = “Hello,“
let stringTwo = “ world"

var idx = stringOne.endIndex
stringOne.append(contentsOf: stringTwo)
idx = String.Index(encodedOffset: idx.encodedOffset + stringTwo.codeUnits.count)
assert(idx == stringOne.endIndex)


I second this concern.  We currently use a non-Foundation library that prefers 
UTF8 encoding, I think UTF8-backed strings are important.

The choice of UTF16 as string storage in Swift makes historical sense (e.g. 
runtime interop with ObjC-backed strings) but as Swift moves forward it makes 
less sense.  We need a string system that behaves more like a lightweight 
accessor for the underlying storage (e.g. if you like your input's encoding you 
can keep it) unless you do something (like peruse a view) that requires 
promotion to a new format.  That's a different proposal, but that's the 
direction I'd like to see us head.

This proposal is in many ways the opposite of that, it specifies that we 
standardize on UTF16, and in particular we have in view the problem of file 
archiving (where we would have long-term unarchival guarantees) that complicate 
backing this out later.  This feels like a kludge to support Foundation.  In 
the archive context the offset should either be "whatever the string is" (which 
you would have to know anyway to archive/unarchive that string) or a 
full-fledged offset type that specifies the encoding such as

let i = String.Index (
    encoding: .utf16
    offset: 36
)

the latter of which would be used to port an Index between string 
representations if that's a useful feature.

More broadly though, I disagree with the motivation of the proposal, 
specifically

The result is a great deal of API surface area for apparently little gain in 
ordinary code

In ordinary code, we work with a single string representation (e.g. in Cocoa 
it's UTF16), and there is a correspondence between our UTF16 offset and our 
UTF16 string such that index lookups will succeed.  When we collapse indexes, 
we lose the information to make this correspondence, which were previously 
encoded into the typesystem.  So the "gain in ordinary code" is that 
programmers do not have to sprinkle `!` in the common case of string index 
lookups because we can infer at compile time from the type correspondence it is 
unnecessary.

Under this proposal, they will have to sprinkle the `!`, which adds friction 
and performance impact (`!` is a runtime check, and UTF16 promotions are 
expensive).  I don't believe the simplicity of implementing archival (which one 
has to only write once) is worth the hassle of complicating all string index 
lookups.

Does this proposal fit well with the feel and direction of Swift?

To me, one of Swift's greatest strengths is the type system.  We can encode 
information into the type system and find our bugs at compile time instead of 
runtime.

Here, we are proposing to partially erase a type because it's annoying to write 
code that deals with string encodings.  But our code will deal with string 
encodings somehow whether `utf16` appears in our sourcecode or not.

When we erase the type of our offset, we lose a powerful tool to prove the 
correctness of our string encodings, that is, the compiler can check our utf16 
offset is used with a utf16 string.  Without that tool, we either have to check 
that dynamically, or, worst case, there are bugs in our program.

Under this proposal we would encourage the use of a bare-integer offsets for 
string lookup.  That does not seem Swifty to me.  A Swifty solution would be to 
add a dynamically-checked type-erased String.Index alongside the existing 
statically-checked fully-typed String.UTF8/16View.Index so that the programmer 
can choose the abstraction with the performance/simplicity behavior appropriate 
for their problem.

___
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


Re: [swift-evolution] [swift-evolution-announce] [Revised and review extended] SE-0180 - String Index Overhaul

2017-06-26 Thread Karl Wagner via swift-evolution

> On 23. Jun 2017, at 02:59, Kevin Ballard via swift-evolution 
>  wrote:
> 
> https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md
>  
> 
> 
> Given the discussion in the original thread about potentially having Strings 
> backed by something other than utf16 code units, I'm somewhat concerned about 
> having this kind of vague `encodedOffset` that happens to be UTF16 code 
> units. If this is supposed to represent an offset into whatever code units 
> the String is backed by, then it's going to be a problem because the user 
> isn't supposed to know or care what the underlying storage for the String is.

Is that true? The String manifesto shows a design where the underlying Encoding 
and code-units are exposed.

>From the talk about String’s being backed by something that isn’t UTF-16, I 
>took that to mean that String might one-day become generic. Defaults for 
>generic parameters have been mentioned on the list before, so “String” could 
>still refer to “String” on OSX and maybe “String” 
>on Linux.

I would support a definition of encodedOffset that removed mention of UTF-16 
and phrased things in terms of String.Encoding and code-units. For example, I 
would like to be able to construct new String indices from a known index plus a 
quantity of code-units known to represent a sequence of characters:

var stringOne = “Hello,“
let stringTwo = “ world"

var idx = stringOne.endIndex
stringOne.append(contentsOf: stringTwo)
idx = String.Index(encodedOffset: idx.encodedOffset + stringTwo.codeUnits.count)
assert(idx == stringOne.endIndex)

- Karl___
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


Re: [swift-evolution] [swift-evolution-announce] [Revised and review extended] SE-0180 - String Index Overhaul

2017-06-26 Thread Jordan Rose via swift-evolution



> On Jun 22, 2017, at 17:10, Ted Kremenek  wrote:
> 
> Proposal Link: 
> https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md
>  
> 
> 
> The review of “SE-0180 - String Index Overhaul” ran from June 4…8, 2017.
> 
> Feedback on the proposal and further discussion in the Core Team has resulted 
> in a slightly revised proposal.  A revised review of that proposal will run 
> from now until June 28.
> 
> The revised proposal focuses on feedback concerning cases where indices fall 
> between Unicode scalar boundaries in views having distinct encodings.  Please 
> see the revised review (link above) for the specific technical revisions.
> 
> Thanks to everyone who participated in providing feedback for the review, and 
> please take a look at the revised proposal!
> 
>   - Ted
>   Review Manager
> 
> — REVISED REVIEW —
> 
> When replying, please try to keep the proposal link at the top of the message:
> 
> Proposal link:
> 
>  
> https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md
>  
> 
> Reply text
> 
> Other replies
> 
> 
> What goes into a review?
> 
> The goal of the review process is to improve the proposal under review 
> through constructive criticism and, eventually, determine the direction of 
> Swift. When writing your review, here are some questions you might want to 
> answer in your review:
> 
> What is your evaluation of the proposal?
> Is the problem being addressed significant enough to warrant a change to 
> Swift?
> Does this proposal fit well with the feel and direction of Swift?
> If you have used other languages or libraries with a similar feature, how do 
> you feel that this proposal compares to those?
> How much effort did you put into your review? A glance, a quick reading, or 
> an in-depth study?
> More information about the Swift evolution process is available at:
> 
> https://github.com/apple/swift-evolution/blob/master/process.md 
> ___
> swift-evolution-announce mailing list
> swift-evolution-annou...@swift.org
> https://lists.swift.org/mailman/listinfo/swift-evolution-announce

___
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


Re: [swift-evolution] [swift-evolution-announce] [Revised and review extended] SE-0180 - String Index Overhaul

2017-06-26 Thread Jordan Rose via swift-evolution
Accidentally hit send! Sorry everyone.

Here's what I was originally looking for: a diff between the old and new 
revisions of the proposal: 
https://github.com/apple/swift-evolution/commit/0d163c941261b6d16d1adfdc3212e8d3a15a3a73#diff-6034851c9ada9528665f7555a93537a8
 


Jordan


> On Jun 26, 2017, at 10:47, Jordan Rose  wrote:
> 
> 
> 
> 
>> On Jun 22, 2017, at 17:10, Ted Kremenek > > wrote:
>> 
>> Proposal Link: 
>> https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md
>>  
>> 
>> 
>> The review of “SE-0180 - String Index Overhaul” ran from June 4…8, 2017.
>> 
>> Feedback on the proposal and further discussion in the Core Team has 
>> resulted in a slightly revised proposal.  A revised review of that proposal 
>> will run from now until June 28.
>> 
>> The revised proposal focuses on feedback concerning cases where indices fall 
>> between Unicode scalar boundaries in views having distinct encodings.  
>> Please see the revised review (link above) for the specific technical 
>> revisions.
>> 
>> Thanks to everyone who participated in providing feedback for the review, 
>> and please take a look at the revised proposal!
>> 
>>  - Ted
>>  Review Manager
>> 
>> — REVISED REVIEW —
>> 
>> When replying, please try to keep the proposal link at the top of the 
>> message:
>> 
>> Proposal link:
>> 
>>  
>> https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md
>>  
>> 
>> Reply text
>> 
>> Other replies
>> 
>> 
>> What goes into a review?
>> 
>> The goal of the review process is to improve the proposal under review 
>> through constructive criticism and, eventually, determine the direction of 
>> Swift. When writing your review, here are some questions you might want to 
>> answer in your review:
>> 
>> What is your evaluation of the proposal?
>> Is the problem being addressed significant enough to warrant a change to 
>> Swift?
>> Does this proposal fit well with the feel and direction of Swift?
>> If you have used other languages or libraries with a similar feature, how do 
>> you feel that this proposal compares to those?
>> How much effort did you put into your review? A glance, a quick reading, or 
>> an in-depth study?
>> More information about the Swift evolution process is available at:
>> 
>> https://github.com/apple/swift-evolution/blob/master/process.md 
>> ___
>> swift-evolution-announce mailing list
>> swift-evolution-annou...@swift.org 
>> 
>> https://lists.swift.org/mailman/listinfo/swift-evolution-announce
> 

___
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


Re: [swift-evolution] [swift-evolution-announce] [Revised and review extended] SE-0180 - String Index Overhaul

2017-06-22 Thread Kevin Ballard via swift-evolution
> https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md
Given the discussion in the original thread about potentially having
Strings backed by something other than utf16 code units, I'm somewhat
concerned about having this kind of vague `encodedOffset` that happens
to be UTF16 code units. If this is supposed to represent an offset into
whatever code units the String is backed by, then it's going to be a
problem because the user isn't supposed to know or care what the
underlying storage for the String is. And I can imagine potential issues
with archiving/unarchiving where the unarchived String has a different
storage type than the archived one, and therefore `encodedOffset` would
gain a new meaning that screws up unarchived String.Index values.
The other problem with using this as utf16 is how am I supposed to
archive/unarchive a String.Index that comes from String.UTF8View? AFAICT
the only way to do that is to ignore encodedOffset entirely and instead
calculate the distance between s.utf8.startIndex and my index (and then
recreate the index later on by advancing from startIndex). But this RFC
explicitly says that archiving/unarchiving indices is one of the goals
of this overhaul.
--

The section on comparison still talks about how this is a weak ordering.
In the other thread it was explained as being done so because the
internal transcodedOffset isn't public, but that still makes this
explanation very odd. String.Index comparison should not be weak
ordering, because all indices can be expressed in the utf8View if
nothing else, and in that view they have a total order. So it should
just be defined as a total order, based on the position in the utf8View
that the index corresponds with.
--

The detailed design of the index has encodedOffset being mutable (and
this was confirmed in the other thread as intentional). I don't think
this is a good idea, because it makes the following code behave oddly:
  let x = index.encodedOffset
  index.encodedOffset = x

Specifically, this resets the private transcodedOffset, so if you do
this with an intra-code-unit Index taken from the utf8View, the modified
Index may point to a different byte.
I'm also not sure why you'd ever want to do this operation anyway. If
you want to change the encodedOffset, you can just say `index =
String.Index(encodedOffset: x)`.
-Kevin Ballard
___
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution