Hi Karl,

It was pointed out to me that I never answered this thoughtful post of
yours...

on Mon Jun 26 2017, Karl Wagner <swift-evolution@swift.org> wrote:

>> On 23. Jun 2017, at 02:59, Kevin Ballard via swift-evolution
>> <swift-evolution@swift.org> wrote:
>>
>> https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md
>>
>> Given the discussion in the original thread about potentially having
>> Strings backed by something other than utf16 code units, I'm
>> somewhat concerned about having this kind of vague `encodedOffset`
>> that happens to be UTF16 code units. If this is supposed to
>> represent an offset into whatever code units the String is backed
>> by, then it's going to be a problem because the user isn't supposed
>> to know or care what the underlying storage for the String is.
>
> Is that true? The String manifesto shows a design where the underlying
> Encoding and code-units are exposed.

That is the eventual goal. Note that with this proposal we are making
progress towards that goal, but getting all the way there is out of
scope for this release.

> From the talk about String’s being backed by something that isn’t
> UTF-16, I took that to mean that String might one-day become
> generic. Defaults for generic parameters have been mentioned on the
> list before, so “String” could still refer to “String<UTF16Encoding>”
> on OSX and maybe “String<UTF8Encoding>” on Linux.

I think you may have misunderstood. String currently supports a few
different underlying representations (ASCII, UTF-16, NSString), all of
which happen to use a UTF-16-compatible encoding. The eventual goal is
to expand the possible underlying representations of String to
accommodate other encodings. That said, the underlying representation
of String is *not* part of String's type, and we don't intend to change
that. When String APIs access the underlying representation, that
access is dynamically dispatched.
If the encoding were a generic parameter, then it would be statically
dispatched (at least in part), but it would also become part of
String's type, and, for example, you would get an error when trying to
pass a String<Unicode.ASCII> where a String<Unicode.UTF16> was
expected. It's important that code passing Strings around remain
smoothly interoperable, so we don't want to introduce this sort of type
mismatch.

Instead, the intention is that someone could make a UTF8String type
that conformed to StringProtocol, and that String itself could be
constructed from any instance of StringProtocol to be used as its
underlying representation. That way, if you need the performance that
comes with knowing and manipulating the underlying encoding, you can
use UTF8String directly, and if you need to interoperate with code that
uses the lingua-franca String type, you can wrap String around your
UTF8String and pass that.

> I would support a definition of encodedOffset that removed mention of
> UTF-16 and phrased things in terms of String.Encoding and
> code-units.

Well, a few points about this:

I support removing the text “(UTF-16)” from the initial documentation
comments on these APIs, which is, AFAICT, the only source of the
concern you and others have expressed. That said, Strings are in fact
currently encoded as UTF-16 and as long as Cocoa interop is important,
that too is important and useful information, so it should be
documented somewhere.

I don't support describing anything in terms of String.Encoding at this
time. That enum was added to String by the Foundation overlay, and is
not part of the plan for String except insofar as it is required for
source compatibility and Cocoa interop. A more appropriate way to
describe the encoding in terms of the language would be as something
like Unicode.UTF16 (at compile-time) or an instance of
Unicode.Encoding.Type (at runtime).
But I see no need to describe it in language terms until we are ready
to add APIs to String that can support multiple encodings and/or report
the underlying encoding, and we are not ready to do that yet.

> For example, I would like to be able to construct new String indices
> from a known index plus a quantity of code-units known to represent a
> sequence of characters:
>
> var stringOne = "Hello,"
> let stringTwo = " world"
>
> var idx = stringOne.endIndex
> stringOne.append(contentsOf: stringTwo)
> idx = String.Index(encodedOffset: idx.encodedOffset +
>                    stringTwo.codeUnits.count)
> assert(idx == stringOne.endIndex)

I'm not sure what you mean by “represent a sequence of characters” in
this context. Doesn't a sequence of code units always represent a
sequence of characters?

The code you wrote above would (almost) work as written under this
proposal, given that Strings always have an encoding that's compatible
with some default. In other words, making it work *depends* on the fact
that the encoding of stringTwo is compatible with (has a non-strict
sub/superset relation with) that of stringOne. If stringOne were
encoded as today but stringTwo were encoded with some other encoding,
say, Shift-JIS, the code might not work. So making code like this work
depends on the very information that you have expressed a concern about
seeing in the documentation.

The reason I wrote “(almost)” above is that we are not yet proposing to
expose a “codeUnits” view on String, and shouldn't do so until we are
ready to introduce the more flexible encoding options discussed
earlier. Today, you'd use the utf16 view to get that information.

We are headed down the road in your vision, but we can't arrive there
in this release.

Hope this helps,

-- 
-Dave
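
For anyone who wants to try the index arithmetic discussed above, here
is a minimal sketch of Karl's example with the utf16 view substituted
for the not-yet-proposed codeUnits view. It assumes the encodedOffset
accessor and the String.Index(encodedOffset:) initializer behave as the
proposal describes; newer toolchains may flag both as deprecated. The
string contents are ASCII-only, so the offsets are the simple case.

```swift
// Sketch only: relies on encodedOffset counting code units in an
// encoding compatible with the utf16 view, as discussed in the thread.
var stringOne = "Hello,"
let stringTwo = " world"

// Remember where stringOne ended before the append.
var idx = stringOne.endIndex
stringOne.append(contentsOf: stringTwo)

// Advance the saved position by the number of code units appended.
// The contents are ASCII, so each character is exactly one code unit.
idx = String.Index(encodedOffset: idx.encodedOffset + stringTwo.utf16.count)

// The reconstructed index should now match the new end of stringOne.
assert(idx == stringOne.endIndex)
```

As noted in the reply, this only works because both strings share a
compatible encoding; if stringTwo were stored in an unrelated encoding,
its code-unit count would not be a valid offset into stringOne.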