I’m coming to this conversation rather late, so forgive the naive question:
Your proposal claims that current code with failable APIs is needlessly awkward and that most code only interchanges indices that are known to succeed. So, why is it not simply a precondition of string slicing that the index be correctly aligned? It seems like this would simplify the behavior greatly.

On Tue, Jun 13, 2017 at 19:04 Dave Abrahams via swift-evolution <swift-evolution@swift.org> wrote:

>
> on Tue Jun 06 2017, Dave Abrahams <swift-evolution@swift.org> wrote:
>
> >> Overall it looks pretty good. But unfortunately the answer to "Will
> >> applications still compile but produce different behavior than they
> >> used to?" is actually "Yes", when using APIs provided by
> >> Foundation. This is because Foundation is currently able to return
> >> String.Index values that don't point to Character boundaries.
> >>
> >> Specifically, in Swift 3, the following code:
> >>
> >>   import Foundation
> >>
> >>   let str = "e\u{301}galite\u{301}"
> >>   let r = str.rangeOfCharacter(from: ["\u{301}"])!
> >>   print(str[r] == "\u{301}")
> >>
> >> will print “true”, because the returned range identifies the combining
> >> acute accent only. But with the proposed String.Index revisions, the
> >> `str[r]` subscript will return the whole "e\u{301}” combined
> >> character.
> >
> > Hmm, true.
> >
> > This doesn't totally invalidate the concern, but...
> >
> > The existing behavior is a bug in the way Foundation interfaces with the
> > 3.0 standard library. str.rangeOfCharacter (which should be
> > str.rangeOfUnicodeScalar) should be returning
> > Range<String.UnicodeScalarView.Index> but is returning a misaligned
> > Range<String.Index>. Everything in the 3.0 standard library design is
> > engineered to ensure that misaligned String indices don't happen at all
> > (although they still can—just use an index from string1 in string2),
> > thus the rigorous failable index conversion APIs.
> >
> > It's easy to produce results with this API that don't make sense in
> > Swift 3:
> >
> >   let str = "e\u{301}\u{302}galite\u{301}"
> >   str.rangeOfCharacter(from: ["\u{301}"])!
> >   print(str[r.lowerBound] == "\u{301}") // false
> >
> >> This is, of course, an edge case, but we need to consider the
> >> implications of this and determine if it actually affects anything
> >> that’s likely to be a problem in practice.
> >
> > I agree. It would also be reasonable to pick a different behavior for
> > misaligned indices, for example:
> >
> >   Indices *that don't fall on a code unit boundary* are “rounded down”
> >   before use.
> >
> > The existing behaviors for these cases are a cluster of coincidences,
> > and were never designed. I doubt that preserving them in their current
> > form makes sense and will lead to a usable string semantics for the long
> > term, but if they do in fact happen to make sense, we'd still need to
> > codify the rules so we can keep future behaviors consistent.
>
> Having considered this further, I'd like to propose these revised
> semantics for misaligned indices, to preserve the behavior of
> rangeOfCharacter and its ilk:
>
> * Definition: an index i is aligned with respect to a string view v iff
>
>     v.indices.contains(i) || v.endIndex == i
>
>   If i is not aligned with respect to v it is *misaligned* with respect
>   to v.
>
> * When i is misaligned with respect to a String/Substring view s.xxx
>   (imagining s itself could also be spelled as s.xxx), combining s.xxx
>   and i is done in terms of underlying code units and i.encodedOffset.
>
>   It's very hard to write these semantics down precisely in terms of
>   existing constructs, but this should give you a sense of what I have
>   in mind:
>
>   1. the suffix beginning at i is formed by slicing the underlying
>      codeUnits at i.encodedOffset, forming a new Substring around that
>      slice, and getting its corresponding xxx view
>
>        s.xxx[i...]
>
>      is roughly equivalent to:
>
>        Substring(s.utf16[String.Index(encodedOffset: i.encodedOffset)...]).xxx
>
>      (given that we currently have UTF-16 code units)
>
>   2. similarly
>
>        s.xxx[..<i]
>
>      is equivalent to something like:
>
>        Substring(s.utf16[..<String.Index(encodedOffset: i.encodedOffset)]).xxx
>
>   3. s.xxx[i] is equivalent to s.xxx[i...].first!
>
>   4. s.xxx.index(after: i) is equivalent to
>      s.xxx[i...].indices.dropFirst().first!
>
>   5. s.xxx.index(before: i) is equivalent to s.xxx[..<i].indices.last!
>
> I'm concerned that we have no precise way to specify the semantics of #1
> and #2, to the point where it might be better to implement them that way
> but leave the semantics unspecified. Another alternative would be to
> add the APIs needed to make it possible to express a precise equivalence
> instead of a rough equivalence. If anyone has better ideas, I'm all ears.
>
> --
> -Dave
>
> _______________________________________________
> swift-evolution mailing list
> swift-evolution@swift.org
> https://lists.swift.org/mailman/listinfo/swift-evolution
>
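
For anyone following along, here is a minimal sketch of the alignment definition quoted above, together with what "alignment as a precondition" might look like. Both `isAligned` and `suffixRequiringAlignment` are hypothetical helpers made up for illustration; neither is a standard library or Foundation API, and the sketch assumes a current Swift toolchain rather than the Swift 3 behavior under discussion. The index is obtained from the unicode scalar view instead of `rangeOfCharacter` so the example does not depend on Foundation.

```swift
// Hypothetical helper (not a standard library API): the alignment test
// from the quoted definition, written generically over any collection view.
// Note that `indices.contains` is a linear search, so this is O(n).
func isAligned<V: Collection>(_ i: V.Index, in v: V) -> Bool {
    return v.indices.contains(i) || v.endIndex == i
}

// Hypothetical illustration of slicing with alignment as a precondition:
// a misaligned index traps instead of being rounded or reinterpreted.
func suffixRequiringAlignment(of s: String, from i: String.Index) -> Substring {
    precondition(isAligned(i, in: s), "index is not on a Character boundary")
    return s[i...]
}

let s = "e\u{301}galite\u{301}"

// Index of the first combining acute accent, taken from the unicode scalar view.
let accent = s.unicodeScalars.index(after: s.unicodeScalars.startIndex)

// The same index is aligned with respect to one view and misaligned
// with respect to another.
let alignedInScalars = isAligned(accent, in: s.unicodeScalars)
let alignedInCharacters = isAligned(accent, in: s)

// An index produced by the Character view itself satisfies the precondition.
let second = s.index(after: s.startIndex)
let tail = suffixRequiringAlignment(of: s, from: second)

print(alignedInScalars, alignedInCharacters, tail)
```

Under this sketch, passing `accent` to `suffixRequiringAlignment` would trap, which is the behavior the precondition question is asking about; the quoted proposal instead gives such an index defined semantics in terms of code units.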