On Thu, Mar 30, 2017 at 10:38 AM, Ben Cohen <ben_co...@apple.com> wrote:
> On Mar 29, 2017, at 6:59 PM, Xiaodi Wu <xiaodi...@gmail.com> wrote:
>
> This looks great. The restored conformances to *Collection will be huge.
>
> Is this to be the first of several or the only major part of the manifesto
> to be delivered in Swift 4?
>
>
> First of several. This lays the groundwork for the changes to the
> underlying implementation. Other changes will mostly be additive on top.
>
> Nits on naming: are we calling it Substring or SubString (à la
> SubSequence)?
>
>
> This is venturing into subjective territory, so these are just my feelings
> rather than something definitive (Dave may differ) but:
>
> It should definitely be Substring. My rule of thumb: if you might
> hyphenate it, you can capitalize it. I don’t think anyone spells it
> "sub-string". OTOH one *might* write "sub-sequence". Generally hyphens
> disappear in English as things come into common usage; for example, it
> used to be e-mail but now it’s mostly just email. Substring is enough of a
> term of art in programming that this has happened. Admittedly, Subsequence
> is a term of art too – unfortunately one that has a different meaning to
> ours ("a sequence that can be derived from another sequence by deleting
> some elements without changing the order of the remaining elements", e.g.
> <A,C,E> is a Subsequence of <A,B,C,D,E> – see
> https://en.wikipedia.org/wiki/Subsequence). Even worse, the mathematical
> term for what we are calling a subsequence is a Substring!
>
> If we were to change anything, my vote would be to lowercase Subsequence.
> We can typealias SubSequence = Subsequence to aid migration, with a slow
> burn on deprecating it since it’ll be quite a footling deprecation. I
> don’t know if it’s worth it though – the main use of “SubSequence” is
> currently in those pesky where clauses you have to put on all your
> Collection extensions if you want to use slicing, and many of these will
> be eliminated once we have the ability to put where clauses on associated
> types.
I regret bringing this up. `Substring` is totally fine. `SubSequence` is too. Just wanted to get some clarification that this was the proposed spelling. I doubt it's worth a whole migration to change the capitalization of `SubSequence`, which after all prevents the word from being read like "consequence."

> and shouldn't it be UnicodeParsedResult rather than UnicodeParseResult?
>
>
> I think Parse. As in, this is the result of a parse, not these are the
> parsed results (though it does contain parsed results in some cases, but
> not all).

Ah, then `UnicodeParsingResult`, maybe? Something about nouning that verb doesn't sit right. OK, done with bikeshedding.

> On Wed, Mar 29, 2017 at 19:32 Ben Cohen via swift-evolution
> <swift-evolution@swift.org> wrote:
>
> Hi Swift Evolution,
>
> Below is a pitch for the first part of the String revision. This covers a
> number of changes that would allow the basic internals to be overhauled.
>
> Online version here:
> https://github.com/airspeedswift/swift-evolution/blob/3a822c799011ace682712532cfabfe32e9203fbb/proposals/0161-StringRevision1.md
>
>
> String Revision: Collection Conformance, C Interop, Transcoding
>
> - Proposal: SE-0161
> - Authors: Ben Cohen <https://github.com/airspeedswift>, Dave Abrahams
>   <http://github.com/dabrahams/>
> - Review Manager: TBD
> - Status: *Awaiting review*
>
> Introduction
>
> This proposal is to implement a subset of the changes from the Swift 4
> String Manifesto
> <https://github.com/apple/swift/blob/master/docs/StringManifesto.md>.
>
> Specifically:
>
> - Make String conform to BidirectionalCollection
> - Make String conform to RangeReplaceableCollection
> - Create a Substring type for String.SubSequence
> - Create a Unicode protocol to allow for generic operations over both
>   types.
> - Consolidate on a concise set of C interop methods.
> - Revise the transcoding infrastructure.
>
> Other existing aspects of String remain unchanged for the purposes of
> this proposal.
>
> Motivation
>
> This proposal follows up on a number of recommendations found in the
> manifesto:
>
> Collection conformance was dropped from String in Swift 2. After
> reevaluation, the feeling is that the minor semantic discrepancies (mainly
> with RangeReplaceableCollection) are outweighed by the significant
> benefits of restoring these conformances. For more detail on the
> reasoning, see here
> <https://github.com/apple/swift/blob/master/docs/StringManifesto.md#string-should-be-a-collection-of-characters-again>.
>
> While it is not a collection, the Swift 3 string does have slicing
> operations. String is currently serving as its own subsequence, allowing
> substrings to share storage with their “owner”. This can lead to memory
> leaks when small substrings of larger strings are stored long-term (see
> here
> <https://github.com/apple/swift/blob/master/docs/StringManifesto.md#substrings>
> for more detail on this problem). Introducing a separate type of Substring
> to serve as String.SubSequence is recommended to resolve this issue, in a
> similar fashion to ArraySlice.
>
> As noted in the manifesto, support for interoperation with nul-terminated
> C strings in Swift 3 is scattered and incoherent, with six ways to
> transform a C string into a String and four ways to do the inverse. These
> APIs should be replaced with a simpler set of methods on String.
>
> Proposed solution
>
> A new type, Substring, will be introduced. Similar to ArraySlice it will
> be documented as only for short- to medium-term storage:
>
> *Important*
> Long-term storage of Substring instances is discouraged. A substring
> holds a reference to the entire storage of a larger string, not just to
> the portion it presents, even after the original string’s lifetime ends.
> Long-term storage of a substring may therefore prolong the lifetime of
> elements that are no longer otherwise accessible, which can appear to be
> memory leakage.
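To make the storage note above concrete for myself, here is a minimal sketch. It assumes a toolchain where the pitched Substring type exists; the variable names and the 10,000-character string are made up for illustration and are not from the proposal.

```swift
// Sketch only: the size and the names here are invented for illustration.
let bigLog = String(repeating: "x", count: 10_000) + " status=ok"

// Slicing gives a Substring that shares storage with `bigLog`,
// so holding on to `tail` keeps the whole 10,000+ characters alive.
let tail: Substring = bigLog.suffix(9)

// Copying into a String gives the value storage of its own,
// which is the thing to do before keeping it around long-term.
let stored = String(tail)
print(stored)
```

The explicit `String(tail)` is the same conversion the pitch asks for whenever a slice is passed to an API that takes a concrete String.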
>
> Aside from minor differences, such as having a SubSequence of Self and a
> larger size to describe the range of the subsequence, Substring will be
> near-identical to String from a user perspective.
>
> In order to be able to write extensions across both String and Substring,
> a new Unicode protocol will be introduced, to which both types will
> conform. For the purposes of this proposal, Unicode will be defined as a
> protocol to be used whenever you would previously extend String. It
> should be possible to substitute extension Unicode { ... } in Swift 4
> wherever extension String { ... } was written in Swift 3, with one
> exception: any passing of self into an API that takes a concrete String
> will need to be rewritten as String(self). If Self is a String then this
> should effectively optimize to a no-op, whereas if Self is a Substring
> then this will force a copy, helping to avoid the “memory leak” problems
> described above.
>
> The exact nature of the protocol, such as which methods should be
> protocol requirements and which can be implemented as protocol
> extensions, is considered an implementation detail and so is not covered
> in this proposal.
>
> Unicode will conform to BidirectionalCollection.
> RangeReplaceableCollection conformance will be added directly onto the
> String and Substring types, as it is possible future Unicode-conforming
> types might not be range-replaceable (e.g. an immutable type that wraps a
> const char *).
>
> The C string interop methods will be updated to those described here
> <https://github.com/apple/swift/blob/master/docs/StringManifesto.md#c-string-interop>:
> a single withCString operation and two init(cString:) constructors, one
> for UTF8 and one for arbitrary encodings. The primary change is to remove
> “non-repairing” variants of construction from nul-terminated C strings.
> In both of the construction APIs, any invalid encoding sequence detected
> will have its longest valid prefix replaced by U+FFFD, the Unicode
> replacement character, per the Unicode specification. This covers the
> common case. The replacement is done physically in the underlying storage
> and the validity of the result is recorded in the String’s encoding such
> that future accesses need not be slowed down by possible error repair
> separately. Construction that is aborted when encoding errors are
> detected can be accomplished using APIs on the encoding.
>
> The current transcoding support will be updated to improve usability and
> performance. The primary changes will be:
>
> - to allow transcoding directly from one encoding to another without
>   having to triangulate through an intermediate scalar value
> - to add the ability to transcode an input collection in reverse,
>   allowing the different views on String to be made bi-directional
> - to have decoding take a collection rather than an iterator, and
>   return an index of its progress into the source, allowing that method
>   to be static
>
> The standard library currently lacks a Latin1 codec, so an enum Latin1:
> UnicodeEncoding type will be added.
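For anyone wondering what the "repairing" construction described above looks like in practice, here is a rough sketch. It assumes a current Swift toolchain: `String(decoding:as:)` is not named anywhere in the pitch text, and I am using it, along with the UTF-8 `String(cString:)` initializer, only to illustrate the U+FFFD behavior. The byte values are made up.

```swift
// 0xFF can never appear in well-formed UTF-8.
let bytes: [UInt8] = [0x61, 0xFF, 0x62]   // "a", an invalid byte, "b"

// Repairing construction: the invalid byte becomes U+FFFD
// instead of causing the whole construction to fail.
let repaired = String(decoding: bytes, as: UTF8.self)

// The same repair applies when reading a nul-terminated C string.
let cBytes: [UInt8] = bytes + [0]
let fromC = cBytes.withUnsafeBufferPointer { buffer in
    String(cString: buffer.baseAddress!)
}

print(repaired == "a\u{FFFD}b", fromC == repaired)
```

Both constructions always produce a String; code that would rather abort on bad input needs a separate validating path, which the pitch leaves to APIs on the encoding.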
>
> Detailed design
>
> The following additions will be made to the standard library:
>
>     protocol Unicode: BidirectionalCollection {
>       // Implementation detail as described above
>     }
>
>     extension String: Unicode, RangeReplaceableCollection {
>       typealias SubSequence = Substring
>     }
>
>     struct Substring: Unicode, RangeReplaceableCollection {
>       typealias SubSequence = Substring
>       // near-identical API surface area to String
>     }
>
> The subscript operations on String will be amended to return Substring:
>
>     struct String {
>       subscript(bounds: Range<String.Index>) -> Substring { get }
>       subscript(bounds: ClosedRange<String.Index>) -> Substring { get }
>     }
>
> Note that properties or methods that due to their nature create new
> String storage (such as lowercased()) will *not* change.
>
> C string interop will be consolidated on the following methods:
>
>     extension String {
>       /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
>       ///
>       /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded
>       ///   bytes ending just before the first zero byte (NUL character).
>       init(cString nulTerminatedUTF8: UnsafePointer<CChar>)
>
>       /// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`.
>       ///
>       /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in
>       ///   the given `encoding`, ending just before the first zero code unit.
>       /// - Parameter encoding: describes the encoding in which the code units
>       ///   should be interpreted.
>       init<Encoding: UnicodeEncoding>(
>         cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
>         encoding: Encoding)
>
>       /// Invokes the given closure on the contents of the string, represented as a
>       /// pointer to a null-terminated sequence of UTF-8 code units.
>       func withCString<Result>(
>         _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
>     }
>
> Additionally, the current ability to pass a Swift String into C methods
> that take a C string will remain as-is.
>
> A new protocol, UnicodeEncoding, will be added to replace the current
> UnicodeCodec protocol:
>
>     public enum UnicodeParseResult<T, Index> {
>       /// Indicates valid input was recognized.
>       ///
>       /// `resumptionPoint` is the end of the parsed region
>       case valid(T, resumptionPoint: Index)  // FIXME: should these be reordered?
>
>       /// Indicates invalid input was recognized.
>       ///
>       /// `resumptionPoint` is the next position at which to continue parsing after
>       /// the invalid input is repaired.
>       case error(resumptionPoint: Index)
>
>       /// Indicates that there was no more input to consume.
>       case emptyInput
>
>       /// If any input was consumed, the point from which to continue parsing.
>       var resumptionPoint: Index? {
>         switch self {
>         case .valid(_, let r): return r
>         case .error(let r): return r
>         case .emptyInput: return nil
>         }
>       }
>     }
>
>     /// An encoding for text with UnicodeScalar as a common currency type
>     public protocol UnicodeEncoding {
>       /// The maximum number of code units in an encoded unicode scalar value
>       static var maxLengthOfEncodedScalar: Int { get }
>
>       /// A type that can represent a single UnicodeScalar as it is encoded in this
>       /// encoding.
>       associatedtype EncodedScalar : EncodedScalarProtocol
>
>       /// Produces a scalar of this encoding if possible; returns `nil` otherwise.
>       static func encode<Scalar: EncodedScalarProtocol>(
>         _: Scalar) -> Self.EncodedScalar?
>
>       /// Parse a single unicode scalar forward from `input`.
>       ///
>       /// - Parameter knownCount: a number of code units known to exist in `input`.
>       ///   **Note:** passing a known compile-time constant is strongly advised,
>       ///   even if it's zero.
>       static func parseScalarForward<C: Collection>(
>         _ input: C, knownCount: Int /* = 0, via extension */
>       ) -> ParseResult<EncodedScalar, C.Index>
>       where C.Iterator.Element == EncodedScalar.Iterator.Element
>
>       /// Parse a single unicode scalar in reverse from `input`.
>       ///
>       /// - Parameter knownCount: a number of code units known to exist in `input`.
>       ///   **Note:** passing a known compile-time constant is strongly advised,
>       ///   even if it's zero.
>       static func parseScalarReverse<C: BidirectionalCollection>(
>         _ input: C, knownCount: Int /* = 0, via extension */
>       ) -> ParseResult<EncodedScalar, C.Index>
>       where C.Iterator.Element == EncodedScalar.Iterator.Element
>     }
>
>     /// Parsing multiple unicode scalar values
>     extension UnicodeEncoding {
>       @discardableResult
>       public static func parseForward<C: Collection>(
>         _ input: C,
>         repairingIllFormedSequences makeRepairs: Bool = true,
>         into output: (EncodedScalar) throws -> Void
>       ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
>
>       @discardableResult
>       public static func parseReverse<C: BidirectionalCollection>(
>         _ input: C,
>         repairingIllFormedSequences makeRepairs: Bool = true,
>         into output: (EncodedScalar) throws -> Void
>       ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
>       where C.SubSequence : BidirectionalCollection,
>             C.SubSequence.SubSequence == C.SubSequence,
>             C.SubSequence.Iterator.Element == EncodedScalar.Iterator.Element
>     }
>
> UnicodeCodec will be updated to refine UnicodeEncoding, and all existing
> codecs will conform to it.
>
> Note, depending on whether this change lands before or after some of the
> generics features, generic where clauses may need to be added temporarily.
>
> Source compatibility
>
> Adding collection conformance to String should not materially impact
> source stability as it is purely additive: Swift 3’s String interface
> currently fulfills all of the requirements for a bidirectional range
> replaceable collection.
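As a sanity check on the "purely additive" claim above, this is the kind of code the restored conformances should allow. It is only a sketch, assuming a toolchain where String conforms to both protocols: `isPalindrome` is a hypothetical helper of mine, and `firstIndex(of:)` is the spelling from later Swift releases rather than anything in the pitch.

```swift
// Hypothetical helper: works for any bidirectional collection,
// which includes String once it conforms.
func isPalindrome<C: BidirectionalCollection>(_ items: C) -> Bool
    where C.Element: Equatable {
    return items.elementsEqual(items.reversed())
}

var greeting = "Hello, world"

// RangeReplaceableCollection: replace everything before the comma.
if let comma = greeting.firstIndex(of: ",") {
    greeting.replaceSubrange(greeting.startIndex..<comma, with: "Goodbye")
}

print(isPalindrome("level"), greeting)
```

Nothing here goes through `.characters`; generic algorithms written against the Collection protocols apply to the string directly.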
>
> Altering String’s slicing operations to return a different type is source
> breaking. The following mitigating steps are proposed:
>
> - Add a deprecated subscript operator that will run in Swift 3
>   compatibility mode and which will return a String not a Substring.
> - Add deprecated versions of all current slicing methods to similarly
>   return a String.
>
> i.e.:
>
>     extension String {
>       @available(swift, obsoleted: 4)
>       subscript(bounds: Range<Index>) -> String {
>         return String(characters[bounds])
>       }
>
>       @available(swift, obsoleted: 4)
>       subscript(bounds: ClosedRange<Index>) -> String {
>         return String(characters[bounds])
>       }
>     }
>
> In a review of 77 popular Swift projects found on GitHub, these changes
> resolved any build issues in the 12 projects that assumed an explicit
> String type returned from slicing operations.
>
> Due to the change in internal implementation, this means that these
> operations will be *O(n)* rather than *O(1)*. This is not expected to be
> a major concern, based on experiences from a similar change made to Java,
> but projects will be able to work around performance issues without
> upgrading to Swift 4 by explicitly typing slices as Substring, which will
> call the Swift 4 variant, and which will be available but not invoked by
> default in Swift 3 mode.
>
> The C string interoperability methods outside the ones described in the
> detailed design will remain in Swift 3 mode, be deprecated in Swift 4
> mode, and be removed in a subsequent release. UnicodeCodec will be
> similarly deprecated.
>
> Effect on ABI stability
>
> As a fundamental currency type for Swift, it is essential that the String
> type (and its associated subsequence) is in a good long-term state before
> being locked down when Swift declares ABI stability. Shrinking the size of
> String to be 64 bits is an important part of this.
>
> Effect on API resilience
>
> Decisions about the API resilience of the String type are still to be
> determined, but are not adversely affected by this proposal.
>
> Alternatives considered
>
> For a more in-depth discussion of some of the trade-offs in string design,
> see the manifesto and associated evolution thread
> <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20170116/thread.html#30497>.
>
> This proposal does not yet introduce an implicit conversion from Substring
> to String. The decision on whether to add this will be deferred pending
> feedback on the initial implementation. The intention is to make a preview
> toolchain available for feedback, including on whether this implicit
> conversion is necessary, prior to the release of Swift 4.
>
> Several of the types related to String, such as the encodings, would
> ideally reside inside a namespace rather than live at the top level of the
> standard library. The best namespace for this is probably Unicode, but
> this is also the name of the protocol. At some point, if we gain the
> ability to nest enums and types inside protocols, they should be moved
> there. Putting them inside String or some other enum namespace is probably
> not worthwhile in the meantime.
> _______________________________________________
> swift-evolution mailing list
> swift-evolution@swift.org
> https://lists.swift.org/mailman/listinfo/swift-evolution