on Thu Jan 19 2017, Saagar Jha <swift-evolution@swift.org> wrote:

> Looks pretty good in general from my quick glance–at least, it’s much
> better than the current situation. I do have a couple of comments and
> questions, which I’ve inlined below.
>
> Saagar Jha
>
>> On Jan 19, 2017, at 6:56 PM, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:
>> 
>> Hi all,
>> 
>> Below is our take on a design manifesto for Strings in Swift 4 and beyond.
>> 
>> Probably best read in rendered markdown on GitHub:
>> https://github.com/apple/swift/blob/master/docs/StringManifesto.md
>> 
>> We’re eager to hear everyone’s thoughts.
>> 
>> Regards,
>> Ben and Dave
>> 
>> 
>> # String Processing For Swift 4
>> 
>> * Authors: [Dave Abrahams](https://github.com/dabrahams), [Ben Cohen](https://github.com/airspeedswift)
>> 
>> The goal of re-evaluating Strings for Swift 4 has been fairly ill-defined 
>> thus
>> far, with just this short blurb in the
>> [list of goals](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html):
>> 
>>> **String re-evaluation**: String is one of the most important fundamental
>>> types in the language.  The standard library leads have numerous ideas of 
>>> how
>>> to improve the programming model for it, without jeopardizing the goals of
>>> providing a unicode-correct-by-default model.  Our goal is to be better at
>>> string processing than Perl!
>> 
>> For Swift 4 and beyond we want to improve three dimensions of text 
>> processing:
>> 
>>  1. Ergonomics
>>  2. Correctness
>>  3. Performance
>> 
>> This document is meant to both provide a sense of the long-term vision 
>> (including undecided issues and possible approaches), and to define the 
>> scope of
>> work that could be done in the Swift 4 timeframe.
>> 
>> ## General Principles
>> 
>> ### Ergonomics
>> 
>> It's worth noting that ergonomics and correctness are mutually-reinforcing.  
>> An
>> API that is easy to use—but incorrectly—cannot be considered an ergonomic
>> success.  Conversely, an API that's simply hard to use is also hard to use
>> correctly.  Acheiving optimal performance without compromising ergonomics or
>> correctness is a greater challenge.
>
> Minor typo: acheiving->achieving
>
>> Consistency with the Swift language and idioms is also important for
>> ergonomics. There are several places both in the standard library and in the
>> Foundation additions to `String` where patterns and practices found elsewhere
>> could be applied to improve usability and familiarity.
>> 
>> ### API Surface Area
>> 
>> Primary data types such as `String` should have APIs that are easily 
>> understood
>> given a signature and a one-line summary.  Today, `String` fails that test.  
>> As
>> you can see, the Standard Library and Foundation both contribute 
>> significantly to
>> its overall complexity.
>> 
>> **Method Arity** | **Standard Library** | **Foundation**
>> ---|:---:|:---:
>> 0: `ƒ()` | 5 | 7
>> 1: `ƒ(:)` | 19 | 48
>> 2: `ƒ(::)` | 13 | 19
>> 3: `ƒ(:::)` | 5 | 11
>> 4: `ƒ(::::)` | 1 | 7
>> 5: `ƒ(:::::)` | - | 2
>> 6: `ƒ(::::::)` | - | 1
>> 
>> **API Kind** | **Standard Library** | **Foundation**
>> ---|:---:|:---:
>> `init` | 41 | 18
>> `func` | 42 | 55
>> `subscript` | 9 | 0
>> `var` | 26 | 14
>> 
>> **Total: 205 APIs**
>> 
>> By contrast, `Int` has 80 APIs, none with more than two parameters.[0]
>> String processing is complex enough; users shouldn't have
>> to press through physical API sprawl just to get started.
>> 
>> Many of the choices detailed below contribute to solving this problem,
>> including:
>> 
>>  * Restoring `Collection` conformance and dropping the `.characters` view.
>>  * Providing a more general, composable slicing syntax.
>>  * Altering `Comparable` so that parameterized
>>    (e.g. case-insensitive) comparison fits smoothly into the basic syntax.
>>  * Clearly separating language-dependent operations on text produced 
>>    by and for humans from language-independent
>>    operations on text produced by and for machine processing.
>>  * Relocating APIs that fall outside the domain of basic string processing and
>>    discouraging the proliferation of ad-hoc extensions.
>> 
>> 
>> ### Batteries Included
>> 
>> While `String` is available to all programs out-of-the-box, crucial APIs for
>> basic string processing tasks are still inaccessible until `Foundation` is
>> imported.  While it makes sense that `Foundation` is needed for 
>> domain-specific
>> jobs such as
>> [linguistic 
>> tagging](https://developer.apple.com/reference/foundation/nslinguistictagger),
>> one should not need to import anything to, for example, do case-insensitive
>> comparison.
>> 
>> ### Unicode Compliance and Platform Support
>> 
>> The Unicode standard provides a crucial objective reference point for what
>> constitutes correct behavior in an extremely complex domain, so
>> Unicode-correctness is, and will remain, a fundamental design principle 
>> behind
>> Swift's `String`.  That said, the Unicode standard is an evolving document, 
>> so
>> this objective reference-point is not fixed.[1] While
>> many of the most important operations—e.g. string hashing, equality, and
>> non-localized comparison—will be stable, the semantics
>> of others, such as grapheme breaking and localized comparison and case
>> conversion, are expected to change as platforms are updated, so programs 
>> should
>> be written so their correctness does not depend on precise stability of these
>> semantics across OS versions or platforms.  Although it may be possible to
>> imagine static and/or dynamic analysis tools that will help users find such
>> errors, the only sure way to deal with this fact of life is to educate users.
>> 
>> ## Design Points
>> 
>> ### Internationalization
>> 
>> There is strong evidence that developers cannot determine how to use
>> internationalization APIs correctly.  Although documentation could and 
>> should be
>> improved, the sheer size, complexity, and diversity of these APIs is a major
>> contributor to the problem, causing novices to tune out, and more experienced
>> programmers to make avoidable mistakes.
>> 
>> The first step in improving this situation is to regularize all localized
>> operations as invocations of normal string operations with extra
>> parameters. Among other things, this means:
>> 
>> 1. Doing away with `localizedXXX` methods 
>> 2. Providing a terse way to name the current locale as a parameter
>> 3. Automatically adjusting defaults for options such
>>   as case sensitivity based on whether the operation is localized.
>> 4. Removing correctness traps like `localizedCaseInsensitiveCompare` (see
>>    guidance in the
>>    [Internationalization and Localization Guide](https://developer.apple.com/library/content/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html)).
>> 
>> Along with appropriate documentation updates, these changes will make 
>> localized
>> operations more teachable, comprehensible, and approachable, thereby 
>> lowering a
>> barrier that currently leads some developers to ignore localization issues
>> altogether.
>> 
>> ####  The Default Behavior of `String`
>> 
>> Although this isn't well-known, the most accessible forms of many operations on
>> Swift `String` (and `NSString`) are really only appropriate for text that is
>> intended to be processed for, and consumed by, machines.  The semantics of 
>> the
>> operations with the simplest spellings are always non-localized and
>> language-agnostic.
>> 
>> Two major factors play into this design choice:
>> 
>> 1. Machine processing of text is important, so we should have first-class,
>>   accessible functions appropriate to that use case.
>> 
>> 2. The most general localized operations require a locale parameter not required
>>   by their un-localized counterparts.  This naturally skews complexity towards
>>   localized operations.
>> 
>> Reaffirming that `String`'s simplest APIs have
>> language-independent/machine-processed semantics has the benefit of 
>> clarifying
>> the proper default behavior of operations such as comparison, and allows us 
>> to
>> make [significant optimizations](#collation-semantics) that were previously
>> thought to conflict with Unicode.
>> 
>> #### Future Directions
>> 
>> One of the most common internationalization errors is the unintentional
>> presentation to users of text that has not been localized, but regularizing 
>> APIs
>> and improving documentation can go only so far in preventing this error.
>> Combined with the fact that `String` operations are non-localized by default,
>> the environment for processing human-readable text may still be somewhat
>> error-prone in Swift 4.
>> 
>> For an audience of mostly non-experts, it is especially important that naïve
>> code is very likely to be correct if it compiles, and that more sophisticated
>> issues can be revealed progressively.  For this reason, we intend to
>> specifically and separately target localization and internationalization
>> problems in the Swift 5 timeframe.
>> 
>> ### Operations With Options
>> 
>> There are three categories of common string operations that often need to be
>> tuned in various dimensions:
>> 
>> **Operation**|**Applicable Options**
>> ---|---
>> sort ordering | locale, case/diacritic/width-insensitivity
>> case conversion | locale
>> pattern matching | locale, case/diacritic/width-insensitivity
>> 
>> The defaults for case-, diacritic-, and width-insensitivity are different for
>> localized operations than for non-localized operations, so for example a
>> localized sort should be case-insensitive by default, and a non-localized 
>> sort
>> should be case-sensitive by default.  We propose a standard “language” of
>> defaulted parameters to be used for these purposes, with usage roughly like 
>> this:
>> 
>> ```swift
>>  x.compared(to: y, case: .sensitive, in: swissGerman)
>> 
>>  x.lowercased(in: .currentLocale)
>> 
>>  x.allMatches(
>>    somePattern, case: .insensitive, diacritic: .insensitive)
>> ```
>> 
>> This usage might be supported by code like this:
>> 
>> ```swift
>> enum StringSensitivity {
>>  case sensitive
>>  case insensitive
>> }
>> 
>> extension Locale {
>>  static var currentLocale: Locale { ... }
>> }
>> 
>> extension Unicode {
>>  // An example of the option language in declaration context,
>>  // with nil defaults indicating unspecified, so defaults can be
>>  // driven by the presence/absence of a specific Locale
>>  func frobnicated(
>>    case caseSensitivity: StringSensitivity? = nil,
>>    diacritic diacriticSensitivity: StringSensitivity? = nil,
>>    width widthSensitivity: StringSensitivity? = nil,
>>    in locale: Locale? = nil
>>  ) -> Self { ... }
>> }
>> ```
>
> Any reason why Locale is defaulted to nil, instead of currentLocale?
> It seems more useful to me.

We're establishing a repeating pattern: string (and Unicode) operations
are locale-insensitive by default, meaning the string is treated as
machine-readable rather than human-readable text.
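
A minimal sketch of how a nil locale could drive the other defaults. The
`resolvedCaseSensitivity` helper is hypothetical and exists only for
illustration; `StringSensitivity` mirrors the enum in the quoted text, and
none of this is settled API:

```swift
import Foundation

// Mirrors the option enum sketched in the manifesto.
enum StringSensitivity { case sensitive, insensitive }

// Hypothetical helper: resolve an unspecified (nil) case option from the
// presence or absence of a locale.
func resolvedCaseSensitivity(
    _ explicit: StringSensitivity?, in locale: Locale?
) -> StringSensitivity {
    // An explicit choice by the caller always wins.
    if let explicit = explicit { return explicit }
    // No locale: machine-readable text, so stay case-sensitive.
    // Locale given: human-readable text, so default to case-insensitive.
    return locale == nil ? .sensitive : .insensitive
}

print(resolvedCaseSensitivity(nil, in: nil))                    // sensitive
print(resolvedCaseSensitivity(nil, in: Locale.current))         // insensitive
print(resolvedCaseSensitivity(.sensitive, in: Locale.current))  // sensitive
```

Defaulting the locale to the current one would make the first call above
behave like the second, which is exactly the trap we want to avoid for
machine-processed text.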

>> ### Comparing and Hashing Strings
>> 
>> #### Collation Semantics
>> 
>> What Unicode says about collation—which is used in `<`, `==`, and hashing— 
>> turns
>> out to be quite interesting, once you pick it apart.  The full Unicode 
>> Collation
>> Algorithm (UCA) works like this:
>> 
>> 1. Fully normalize both strings
>> 2. Convert each string to a sequence of numeric triples to form a collation 
>> key
>> 3. “Flatten” the key by concatenating the sequence of first elements to the
>>   sequence of second elements to the sequence of third elements
>> 4. Lexicographically compare the flattened keys 
>> 
>> While step 1 can usually
>> be [done quickly](http://unicode.org/reports/tr15/#Description_Norm) and
>> incrementally, step 2 uses a collation table that maps matching *sequences* 
>> of
>> unicode scalars in the normalized string to *sequences* of triples, which get
>> accumulated into a collation key.  Predictably, this is where the real costs
>> lie.
>> 
>> *However*, there are some bright spots to this story.  First, as it turns 
>> out,
>> string sorting (localized or not) should be done down to what's called
>> the
>> [“identical” level](http://unicode.org/reports/tr10/#Multi_Level_Comparison),
>> which adds a step 3a: append the string's normalized form to the flattened
>> collation key.  At first blush this just adds work, but consider what it does
>> for equality: two strings that normalize the same, naturally, will collate 
>> the
>> same.  But also, *strings that normalize differently will always collate
>> differently*.  In other words, for equality, it is sufficient to compare the
>> strings' normalized forms and see if they are the same.  We can therefore
>> entirely skip the expensive part of collation for equality comparison.
>> 
>> Next, naturally, anything that applies to equality also applies to hashing: 
>> it
>> is sufficient to hash the string's normalized form, bypassing collation keys.
>> This should provide significant speedups over the current implementation.
>> Perhaps more importantly, since comparison down to the “identical” level 
>> applies
>> even to localized strings, it means that hashing and equality can be 
>> implemented
>> exactly the same way for localized and non-localized text, and hash tables 
>> with
>> localized keys will remain valid across current-locale changes.
>> 
>> Finally, once it is agreed that the *default* role for `String` is to handle
>> machine-generated and machine-readable text, the default ordering of 
>> `String`s
>> need no longer use the UCA at all.  It is sufficient to order them in any way
>> that's consistent with equality, so `String` ordering can simply be a
>> lexicographical comparison of normalized forms,[4]
>> (which is equivalent to lexicographically comparing the sequences of grapheme
>> clusters), again bypassing step 2 and offering another speedup.
>> 
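
The equality, hashing, and default-ordering points above can be checked with a
short sketch. It assumes Swift 5 or later, where `String` comparison is already
based on canonical equivalence and the default ordering is not locale-aware;
the sample strings are ours:

```swift
let precomposed = "caf\u{E9}"    // é as a single scalar (U+00E9)
let decomposed  = "cafe\u{301}"  // e followed by U+0301 COMBINING ACUTE ACCENT

// Different scalar sequences...
print(precomposed.unicodeScalars.count, decomposed.unicodeScalars.count)  // 4 5

// ...but equal, because they normalize to the same form.
print(precomposed == decomposed)                      // true

// Equal values must hash equally, so both land on one Set entry.
print(precomposed.hashValue == decomposed.hashValue)  // true
let keys: Set<String> = [precomposed, decomposed]
print(keys.count)                                     // 1

// Default ordering is machine-oriented: uppercase ASCII sorts before lowercase.
print(["b", "B", "a", "A"].sorted())                  // ["A", "B", "a", "b"]
```

A localized, UCA-based sort would typically interleave the cases instead, which
is why that behavior belongs behind an explicit locale parameter.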
>> This leaves us executing the full UCA *only* for localized sorting, and ICU's
>> implementation has apparently been very well optimized.
>> 
>> Following this scheme everywhere would also allow us to make sorting behavior
>> consistent across platforms.  Currently, we sort `String` according to the 
>> UCA,
>> except that—*only on Apple platforms*—pairs of ASCII characters are ordered 
>> by
>> unicode scalar value.
>> 
>> #### Syntax
>> 
>> Because the current `Comparable` protocol expresses all comparisons with 
>> binary
>> operators, string comparisons—which may require
>> additional [options](#operations-with-options)—do not fit smoothly into the
>> existing syntax.  At the same time, we'd like to solve other problems with
>> comparison, as outlined
>> in
>> [this 
>> proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e)
>> (implemented by changes at the head
>> of
>> [this 
>> branch](https://github.com/CodaFi/swift/commits/space-the-final-frontier)).
>> We should adopt a modification of that proposal that uses a method rather 
>> than
>> an operator `<=>`:
>
> Why not both? Have the “UFO” operator, with the methods as support for
> more complicated use cases where the sugar doesn’t hold up.

Two reasons:

1. It's more API surface area for very little benefit

2. <, <=, ==, >=, and > offer more than enough sugar.  We don't see many
  circumstances where <=> would actually get used, and those few cases
  can live with the weight of x.compared(to:y).
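
To illustrate the second point, here is a rough sketch of how the usual
operators could all fall out of a single `compared(to:)` requirement. The
protocol name `ThreeWayComparable` and the `Version` type are hypothetical,
chosen only so the example does not clash with the standard library's
`Comparable`:

```swift
// SortOrder as spelled in the manifesto.
enum SortOrder { case before, same, after }

// Hypothetical stand-in for the revised Comparable protocol.
protocol ThreeWayComparable {
    func compared(to other: Self) -> SortOrder
}

// The familiar operators are derived once from the single requirement.
extension ThreeWayComparable {
    static func < (lhs: Self, rhs: Self) -> Bool {
        return lhs.compared(to: rhs) == .before
    }
    static func == (lhs: Self, rhs: Self) -> Bool {
        return lhs.compared(to: rhs) == .same
    }
    static func > (lhs: Self, rhs: Self) -> Bool {
        return lhs.compared(to: rhs) == .after
    }
}

// Example conformer: only compared(to:) has to be written by hand.
struct Version: ThreeWayComparable {
    var major: Int
    var minor: Int

    func compared(to other: Version) -> SortOrder {
        if major != other.major { return major < other.major ? .before : .after }
        if minor != other.minor { return minor < other.minor ? .before : .after }
        return .same
    }
}

print(Version(major: 1, minor: 2) < Version(major: 1, minor: 10))  // true
print(Version(major: 2, minor: 0).compared(to: Version(major: 1, minor: 9)))  // after
```

With the operators covering the common cases, the method spelling is only
needed where the three-way result itself is wanted.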

>> ```swift
>> enum SortOrder { case before, same, after }
>> 
>> protocol Comparable : Equatable {
>>  func compared(to: Self) -> SortOrder
>>  ...
>> }
>> ```
>> 
>> This change will give us a syntactic platform on which to implement methods 
>> with
>> additional, defaulted arguments, thereby unifying and regularizing comparison
>> across the library.
>> 
>> ```swift
>> extension String {
>> func compared(to: Self) -> SortOrder
>> 
>> }
>> ```
>> 
>> **Note:** `SortOrder` should bridge to `NSComparisonResult`.  It's also 
>> possible
>> that the standard library simply adopts Foundation's `ComparisonResult` as 
>> is,
>> but we believe the community should at least consider alternate naming before
>> that happens.  There will be an opportunity to discuss the choices in detail
>> when the modified
>> [Comparison 
>> Proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e) 
>> comes
>> up for review.
>> 
>> ### `String` should be a `Collection` of `Character`s Again
>> 
>> In Swift 2.0, `String`'s `Collection` conformance was dropped, because we
>> convinced ourselves that its semantics differed from those of `Collection` 
>> too
>> significantly.
>> 
>> It was always well understood that if strings were treated as sequences of
>> `UnicodeScalar`s, algorithms such as `lexicographicalCompare`, 
>> `elementsEqual`,
>> and `reversed` would produce nonsense results. Thus, in Swift 1.0, `String` 
>> was
>> a collection of `Character` (extended grapheme clusters). During 2.0
>> development, though, we realized that correct string concatenation could
>> occasionally merge distinct grapheme clusters at the start and end of 
>> combined
>> strings.
>> 
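
For anyone who wants to see the concatenation quirk concretely, a small sketch
(assuming Swift 4 or later, where `count` is available directly on `String`):

```swift
let base = "e"
let accent = "\u{301}"   // an isolated U+0301 COMBINING ACUTE ACCENT

// Each piece is one Character on its own...
print(base.count, accent.count)   // 1 1

// ...but concatenation merges them into a single grapheme cluster.
let combined = base + accent
print(combined.count)             // 1

// The result is canonically equivalent to precomposed é (U+00E9).
print(combined == "\u{E9}")       // true
```

So `(a + b).count` is not always `a.count + b.count`, which is the property
that differs from an ordinary `Collection`.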
>> This quirk aside, every aspect of strings-as-collections-of-graphemes 
>> appears to
>> comport perfectly with Unicode. We think the concatenation problem is 
>> tolerable,
>> because the cases where it occurs all represent partially-formed constructs. 
>> The
>> largest class—isolated combining characters such as ◌́ (U+0301 COMBINING 
>> ACUTE
>> ACCENT)—are explicitly called out in the Unicode standard as
>> “[degenerate](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)” 
>> or
>> “[defective](http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf)”. The 
>> other
>> cases—such as a string ending in a zero-width joiner or half of a regional
>> indicator—appear to be equally transient and unlikely outside of a text 
>> editor.
>> 
>> Admitting these cases encourages exploration of grapheme composition and is
>> consistent with what appears to be an overall Unicode philosophy that “no
>> special provisions are made to get marginally better behavior for… cases that
>> never occur in practice.”[2] Furthermore, it seems
>> unlikely to disturb the semantics of any plausible algorithms. We can handle
>> these cases by documenting them, explicitly stating that the elements of a
>> `String` are an emergent property based on Unicode rules.
>> 
>> The benefits of restoring `Collection` conformance are substantial: 
>> 
>>  * Collection-like operations encourage experimentation with strings to
>>    investigate and understand their behavior. This is useful for teaching new
>>    programmers, but also good for experienced programmers who want to
>>    understand more about strings/unicode.
>> 
>>  * Extended grapheme clusters form a natural element boundary for Unicode
>>    strings.  For example, searching and matching operations will always 
>> produce
>>    results that line up on grapheme cluster boundaries.
>> 
>>  * Character-by-character processing is a legitimate thing to do in many real
>>    use-cases, including parsing, pattern matching, and language-specific
>>    transformations such as transliteration.
>> 
>>  * `Collection` conformance makes a wide variety of powerful operations
>>    available that are appropriate to `String`'s default role as the vehicle 
>> for
>>    machine processed text.
>> 
>>    The methods `String` would inherit from `Collection`, where similar to
>>    higher-level string algorithms, have the right semantics.  For example,
>>    grapheme-wise `lexicographicalCompare`, `elementsEqual`, and application 
>> of
>>    `flatMap` with case-conversion, produce the same results one would expect
>>    from whole-string ordering comparison, equality comparison, and
>>    case-conversion, respectively.  `reversed` operates correctly on graphemes,
>>    keeping diacritics moored to their base characters and leaving emoji intact.
>>    Other methods such as `indexOf` and `contains` make obvious sense. A few
>>    `Collection` methods, like `min` and `max`, may not be particularly useful
>>    on `String`, but we don't consider that to be a problem worth solving, in
>>    the same way that we wouldn't try to suppress `min` and `max` on a
>>    `Set([UInt8])` that was used to store IP addresses.
>> 
>>  * Many of the higher-level operations that we want to provide for `String`s,
>>    such as parsing and pattern matching, should apply to any `Collection`, 
>> and
>>    many of the benefits we want for `Collections`, such
>>    as unified slicing, should accrue
>>    equally to `String`.  Making `String` part of the same protocol hierarchy
>>    allows us to write these operations once and not worry about keeping the
>>    benefits in sync.
>> 
>>  * Slicing strings into substrings is a crucial part of the vocabulary of
>>    string processing, and all other sliceable things are `Collection`s.
>>    Because of its collection-like behavior, users naturally think of `String`
>>    in collection terms, but run into frustrating limitations where it fails 
>> to
>>    conform and are left to wonder where all the differences lie.  Many simply
>>    “correct” this limitation by declaring a trivial conformance:
>> 
>>    ```swift
>>  extension String : BidirectionalCollection {}
>>    ```
>> 
>>    Even if we removed indexing-by-element from `String`, users could still do
>>    this:
>> 
>>    ```swift
>>      extension String : BidirectionalCollection {
>>        subscript(i: Index) -> Character { return characters[i] }
>>      }
>>    ```
>> 
>>    It would be much better to legitimize the conformance to `Collection` and
>>    simply document the oddity of any concatenation corner-cases, than to deny
>>    users the benefits on the grounds that a few cases are confusing.
>> 
>
> Will String also conform to SequenceType? 

You mean Sequence, I presume (SequenceType is the old name).  Every
Collection is-a Sequence, so yes.
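
A sketch of what that looks like in practice, assuming the conformance is
restored as proposed (this is how it behaves in Swift 4 and later; in Swift 3
the same loop has to go through `word.characters`):

```swift
let word = "cafe\u{301}"   // "café" spelled with a combining acute accent

// Iterating a String yields Characters (extended grapheme clusters).
for ch in word {
    print(ch)
}

print(word.count)                 // 4 Characters
print(word.unicodeScalars.count)  // 5 Unicode scalars

// reversed() works on graphemes, so the accent stays attached to its base.
print(String(word.reversed()) == "\u{E9}fac")  // true
```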

> I’ve seen many users (coming from other languages) confused that they
> can’t “just” loop over a String’s characters.
>
>> Note that the fact that `String` is a collection of graphemes does *not* mean
>> that string operations will necessarily have to do grapheme boundary
>> recognition.  See the Unicode protocol section for details.
>> 
>> ### `Character` and `CharacterSet`
>> 
>> `Character`, which represents a
>> Unicode
>> [extended grapheme 
>> cluster](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries),
>> is a bit of a black box, requiring conversion to `String` in order to
>> do any introspection, including interoperation with ASCII.  To fix this, we 
>> should:
>> 
>> - Add a `unicodeScalars` view much like `String`'s, so that the sub-structure
>>   of grapheme clusters is discoverable.
>> - Add a failable `init` from sequences of scalars (returning nil for 
>> sequences
>>   that contain 0 or 2+ graphemes).
>> - (Lower priority) expose some operations, such as `func uppercase() ->
>>   String`, `var isASCII: Bool`, and, to the extent they can be sensibly
>>   generalized, queries of unicode properties that should also be exposed on
>>   `UnicodeScalar` such as `isAlphabetic` and `isGraphemeBase`.
>> 
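
A sketch of the first two bullets. `unicodeScalars` and `isASCII` on
`Character` are assumed to be available (they are in Swift 5 and later); the
free function `character(from:)` is a hypothetical stand-in for the proposed
failable initializer, not an existing API:

```swift
let accented: Character = "e\u{301}"

// The sub-structure of the grapheme cluster is visible through its scalars.
print(accented.unicodeScalars.count)   // 2
print(accented.isASCII)                // false

// Hypothetical stand-in for the proposed failable init: returns nil unless
// the scalars form exactly one extended grapheme cluster.
func character<S: Sequence>(from scalars: S) -> Character?
    where S.Element == Unicode.Scalar
{
    var view = String.UnicodeScalarView()
    view.append(contentsOf: scalars)
    let text = String(view)
    return text.count == 1 ? text.first : nil
}

print(character(from: "e\u{301}".unicodeScalars) == accented)  // true
print(character(from: "ab".unicodeScalars) == nil)             // true (2 graphemes)
print(character(from: "".unicodeScalars) == nil)               // true (0 graphemes)
```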
>> Despite its name, `CharacterSet` currently operates on the Swift 
>> `UnicodeScalar`
>> type. This means it is usable on `String`, but only by going through the 
>> unicode
>> scalar view. To deal with this clash in the short term, `CharacterSet` 
>> should be
>> renamed to `UnicodeScalarSet`.  In the longer term, it may be appropriate to
>> introduce a `CharacterSet` that provides similar functionality for extended
>> grapheme clusters.[5]
>> 
>> ### Unification of Slicing Operations
>> 
>> Creating substrings is a basic part of String processing, but the slicing
>> operations that we have in Swift are inconsistent in both their spelling and
>> their naming: 
>> 
>>  * Slices with two explicit endpoints are done with subscript, and support
>>    in-place mutation:
>> 
>>    ```swift
>>        s[i..<j].mutate()
>>    ```
>> 
>>  * Slicing from an index to the end, or from the start to an index, is done
>>    with a method and does not support in-place mutation:
>>    ```swift
>>        s.prefix(upTo: i).readOnly()
>>    ```
>> 
>> Prefix and suffix operations should be migrated to be subscripting operations
>> with one-sided ranges i.e. `s.prefix(upTo: i)` should become `s[..<i]`, as
>> in
>> [this proposal](https://github.com/apple/swift-evolution/blob/9cf2685293108ea3efcbebb7ee6a8618b83d4a90/proposals/0132-sequence-end-ops.md).
>> With generic subscripting in the language, that will allow us to collapse a 
>> wide
>> variety of methods and subscript overloads into a single implementation, and
>> give users an easy-to-use and composable way to describe subranges.
>> 
>> Further extending this EDSL to integrate use-cases like `s.prefix(maxLength: 
>> 5)`
>> is an ongoing research project that can be considered part of the potential
>> long-term vision of text (and collection) processing.
>> 
>> ### Substrings
>> 
>> When implementing substring slicing, languages are faced with three options:
>> 
>> 1. Make the substrings the same type as string, and share storage.
>> 2. Make the substrings the same type as string, and copy storage when making the substring.
>> 3. Make substrings a different type, with a storage copy on conversion to string.
>> 
>> We think number 3 is the best choice. A walk-through of the tradeoffs 
>> follows.
>> 
>> #### Same type, shared storage
>> 
>> In Swift 3.0, slicing a `String` produces a new `String` that is a view into 
>> a
>> subrange of the original `String`'s storage. This is why `String` is 3 words 
>> in
>> size (the start, length and buffer owner), unlike the similar `Array` type
>> which is only one.
>> 
>> This is a simple model with big efficiency gains when chopping up strings 
>> into
>> multiple smaller strings. But it does mean that a stored substring keeps the
>> entire original string buffer alive even after it would normally have been
>> released.
>> 
>> This arrangement has proven to be problematic in other programming languages,
>> because applications sometimes extract small strings from large ones and keep
>> those small strings long-term. That is considered a memory leak and was 
>> enough
>> of a problem in Java that they changed from substrings sharing storage to
>> making a copy in 1.7.
>> 
>> #### Same type, copied storage
>> 
>> Copying of substrings is also the choice made in C#, and in the default
>> `NSString` implementation. This approach avoids the memory leak issue, but 
>> has
>> obvious performance overhead in performing the copies.
>> 
>> This in turn encourages trafficking in string/range pairs instead of in
>> substrings, for performance reasons, leading to API challenges. For example:
>> 
>> ```swift
>> foo.compare(bar, range: start..<end)
>> ```
>> 
>> Here, it is not clear whether `range` applies to `foo` or `bar`. This
>> relationship is better expressed in Swift as a slicing operation:
>> 
>> ```swift
>> foo[start..<end].compare(bar)
>> ```
>> 
>> Not only does this clarify to which string the range applies, it also brings
>> this sub-range capability to any API that operates on `String` "for free". So
>> these other combinations also work equally well:
>> 
>> ```swift
>> // apply range on argument rather than target
>> foo.compare(bar[start..<end])
>> // apply range on both
>> foo[start..<end].compare(bar[start1..<end1])
>> // compare two strings ignoring first character
>> foo.dropFirst().compare(bar.dropFirst())
>> ```
>> 
>> In all three cases, an explicit range argument need not appear on the 
>> `compare`
>> method itself. The implementation of `compare` does not need to know anything
>> about ranges. Methods need only take range arguments when that is an
>> integral part of their purpose (for example, setting the start and end of a
>> user's current selection in a text box).
>> 
>> #### Different type, shared storage
>> 
>> The desire to share underlying storage while preventing accidental memory 
>> leaks
>> occurs with slices of `Array`. For this reason we have an `ArraySlice` type.
>> The inconvenience of a separate type is mitigated by most operations used on
>> `Array` from the standard library being generic over `Sequence` or 
>> `Collection`.
>> 
>> We should apply the same approach for `String` by introducing a distinct
>> `SubSequence` type, `Substring`. Advice similar to that given for `ArraySlice`
>> would apply to `Substring`:
>> 
>>> Important: Long-term storage of `Substring` instances is discouraged. A
>>> substring holds a reference to the entire storage of a larger string, not
>>> just to the portion it presents, even after the original string's lifetime
>>> ends. Long-term storage of a `Substring` may therefore prolong the lifetime
>>> of large strings that are no longer otherwise accessible, which can appear
>>> to be memory leakage.
>> 
>> When assigning a `Substring` to a longer-lived variable (usually a stored
>> property) explicitly of type `String`, a type conversion will be performed, 
>> and
>> at this point the substring buffer is copied and the original string's 
>> storage
>> can be released.
>> 
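
A sketch of the slice-then-store pattern described above, written against
Swift as it later shipped (4.2 or newer for `firstIndex(of:)`). There the
conversion to `String` is spelled explicitly rather than performed implicitly
as the manifesto proposes; the sample data is ours:

```swift
let line = "name=Swift"

// The literal above contains "=", so force-unwrapping is safe in this sketch.
let separator = line.firstIndex(of: "=")!

// One-sided ranges produce Substrings that share storage with `line`.
let key: Substring = line[..<separator]
let value: Substring = line[line.index(after: separator)...]

// Converting to String copies just the slice, so the original buffer is no
// longer kept alive by the stored value.
let storedKey = String(key)

print(storedKey, value)   // name Swift
```

Short-lived work such as parsing can stay in `Substring`; only the values
that outlive the original string need the copy.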
>> A `String` that was not its own `Substring` could be one word—a single tagged
>> pointer—without requiring additional allocations. `Substring`s would be a 
>> view
>> onto a `String`, so are 3 words - pointer to owner, pointer to start, and a
>> length. The small string optimization for `Substring` would take advantage of
>> the larger size, probably with a less compressed encoding for speed.
>> 
>> The downside of having two types is the inconvenience of sometimes having a
>> `Substring` when you need a `String`, and vice-versa. It is likely this would
>> be a significantly bigger problem than with `Array` and `ArraySlice`, as
>> slicing of `String` is such a common operation. It is especially relevant to
>> existing code that assumes `String` is the currency type. To ease the pain of
>> type mismatches, `Substring` should be a subtype of `String` in the same way
>> that `Int` is a subtype of `Optional<Int>`. This would give users an implicit
>> conversion from `Substring` to `String`, as well as the usual implicit
>> conversions such as `[Substring]` to `[String]` that other subtype
>> relationships receive.
>> 
>> In most cases, type inference combined with the subtype relationship should
>> make the type difference a non-issue and users will not care which type they
>> are using. For flexibility and optimizability, most operations from the
>> standard library will traffic in generic models of
>> [`Unicode`](#the--code-unicode--code--protocol).
>> 
>> ##### Guidance for API Designers
>> 
>> In this model, **if a user is unsure about which type to use, `String` is 
>> always
>> a reasonable default**. A `Substring` passed where `String` is expected will 
>> be
>> implicitly copied. When compared to the “same type, copied storage” model, we
>> have effectively deferred the cost of copying from the point where a 
>> substring
>> is created until it must be converted to `String` for use with an API.
>> 
>> A user who needs to optimize away copies altogether should use this 
>> guideline:
>> if for performance reasons you are tempted to add a `Range` argument to your
>> method as well as a `String` to avoid unnecessary copies, you should instead
>> use `Substring`.
>> 
>> ##### The “Empty Subscript”
>> 
>> To make it easy to call such an optimized API when you only have a `String` 
>> (or
>> to call any API that takes a `Collection`'s `SubSequence` when all you have 
>> is
>> the `Collection`), we propose the following “empty subscript” operation,
>> 
>> ```swift
>> extension Collection {
>>  subscript() -> SubSequence { 
>>    return self[startIndex..<endIndex] 
>>  }
>> }
>> ```
>> 
>> which allows the following usage:
>> 
>> ```swift
>> funcThatIsJustLooking(at: person.name[]) // pass person.name as Substring
>> ```
>> 
>> The `[]` syntax can be offered as a fixit when needed, similar to `&` for an
>> `inout` argument. While it doesn't help a user to convert `[String]` to
>> `[Substring]`, the need for such conversions is extremely rare, and they can
>> be done with a simple `map` (which could also be offered by a fixit):
>> 
>> ```swift
>> takesAnArrayOfSubstring(arrayOfString.map { $0[] })
>> ```
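
A minimal sketch of the empty subscript in use, assuming a current Swift toolchain. `countVowels(in:)` is a hypothetical API invented for illustration; only the subscript itself comes from the text above.

```swift
extension Collection {
    // The "empty subscript" proposed above: a slice covering the whole collection.
    subscript() -> SubSequence {
        return self[startIndex..<endIndex]
    }
}

// Hypothetical API that only inspects its argument, so it accepts a Substring.
func countVowels(in text: Substring) -> Int {
    return text.filter { "aeiouAEIOU".contains($0) }.count
}

let name = "Saagar Jha"
let inWholeName = countVowels(in: name[])          // whole string, passed as a Substring
let inFirstWord = countVowels(in: name.prefix(6))  // an ordinary slice works the same way
let slices = ["Ann", "Bob"].map { $0[] }           // [String] to [Substring] via map
```

The caller never builds a range by hand, and the callee never has to accept a separate `Range` argument.
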
>> 
>> #### Other Options Considered
>> 
>> As we have seen, all three options above have downsides, but it's possible
>> these downsides could be eliminated/mitigated by the compiler. We are
>> proposing one such mitigation, implicit conversion, as part of the
>> "different type, shared storage" option, to help avoid the cognitive load on
>> developers of having to deal with a separate `Substring` type.
>> 
>> To avoid the memory leak issues of a "same type, shared storage" substring
>> option, we considered whether the compiler could perform an implicit copy of
>> the underlying storage when it detects the string is being "stored" for long
>> term usage, say when it is assigned to a stored property. The trouble with 
>> this
>> approach is it is very difficult for the compiler to distinguish between
>> long-term storage versus short-term in the case of abstractions that rely on
>> stored properties. For example, should the storing of a substring inside an
>> `Optional` be considered long-term? Or the storing of multiple substrings
>> inside an array? The latter would not work well in the case of a
>> `components(separatedBy:)` implementation that intended to return an array of
>> substrings. It would also be difficult to distinguish intentional medium-term
>> storage of substrings, say by a lexer. There does not appear to be an 
>> effective
>> consistent rule that could be applied in the general case for detecting when 
>> a
>> substring is truly being stored long-term.
>> 
>> To avoid the cost of copying substrings under "same type, copied storage",
>> the optimizer could be enhanced to reduce the impact of some of those copies.
>> For example, this code could be optimized to pull the invariant substring out
>> of the loop:
>> 
>> ```swift
>> for _ in 0..<lots { 
>>  someFunc(takingString: bigString[bigRange]) 
>> }
>> ```
>> 
>> It's worth noting that a similar optimization is needed to avoid an 
>> equivalent
>> problem with implicit conversion in the "different type, shared storage" 
>> case:
>> 
>> ```swift
>> let substring = bigString[bigRange]
>> for _ in 0..<lots { someFunc(takingString: substring) }
>> ```
>> 
>> However, in the case of "same type, copied storage" there are many use cases
>> that cannot be optimized as easily. Consider the following simple definition 
>> of
>> a recursive `contains` algorithm, which when substring slicing is linear 
>> makes
>> the overall algorithm quadratic:
>> 
>> ```swift
>> extension String {
>>    func containsChar(_ x: Character) -> Bool {
>>        return !isEmpty && (first == x || dropFirst().containsChar(x))
>>    }
>> }
>> ```
>> 
>> It is unrealistic to expect the optimizer to eliminate this problem, so a
>> user who wants efficient code must remember to avoid string slicing and
>> rewrite the algorithm by hand:
>> 
>> ```swift
>> extension String {
>>    // add optional argument tracking progress through the string
>>    func containsCharacter(
>>        _ x: Character, atOrAfter idx: Index? = nil
>>    ) -> Bool {
>>        let idx = idx ?? startIndex
>>        return idx != endIndex
>>            && (self[idx] == x
>>                || containsCharacter(x, atOrAfter: index(after: idx)))
>>    }
>> }
>> ```
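
For contrast, here is a sketch of the same recursive search under the "different type, shared storage" model. It assumes a current toolchain where `Substring` exists and slicing it does not copy; the split into two extensions is my own arrangement, not something the manifesto prescribes.

```swift
extension Substring {
    // dropFirst() on a Substring yields another Substring over the same storage,
    // so no characters are copied at any level of the recursion.
    func containsChar(_ x: Character) -> Bool {
        return !isEmpty && (first == x || dropFirst().containsChar(x))
    }
}

extension String {
    // Entry point: view the whole string as a Substring, then recurse on slices.
    func containsChar(_ x: Character) -> Bool {
        return self[...].containsChar(x)
    }
}

let hasL = "hello".containsChar("l")
let hasZ = "hello".containsChar("z")
```

The simple formulation stays linear without the extra index-tracking argument.
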
>> 
>> #### Substrings, Ranges and Objective-C Interop
>> 
>> The pattern of passing a string/range pair is common in several Objective-C
>> APIs, and is made especially awkward in Swift by the non-interchangeability 
>> of
>> `Range<String.Index>` and `NSRange`.  
>> 
>> ```swift
>> s2.find(s2, sourceRange: NSRange(j..<s2.endIndex, in: s2))
>> ```
>> 
>> In general, however, the Swift idiom for operating on a sub-range of a
>> `Collection` is to *slice* the collection and operate on that:
>> 
>> ```swift
>> s2.find(s2[j..<s2.endIndex])
>> ```
>> 
>> Therefore, APIs that operate on an `NSString`/`NSRange` pair should be 
>> imported
>> without the `NSRange` argument.  The Objective-C importer should be changed 
>> to
>> give these APIs special treatment so that when a `Substring` is passed, 
>> instead
>> of being converted to a `String`, the full `NSString` and range are passed to
>> the Objective-C method, thereby avoiding a copy.
>> 
>> As a result, you would never need to pass an `NSRange` to these APIs, which
>> solves the impedance problem by eliminating the argument, resulting in more
>> idiomatic Swift code while retaining the performance benefit.  To help users
>> manually handle any cases that remain, Foundation should be augmented to 
>> allow
>> the following syntax for converting to and from `NSRange`:
>> 
>> ```swift
>> let nsr = NSRange(i..<j, in: s) // An NSRange corresponding to s[i..<j]
>> let iToJ = Range(nsr, in: s)    // Equivalent to i..<j
>> ```
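
A small worked example of that round trip, assuming a Foundation that provides these two initializers (current releases do). The sample string is arbitrary; it was chosen so that `Character` offsets and UTF-16 offsets differ.

```swift
import Foundation

let s = "\u{1F44D} ok"                      // thumbs-up emoji, a space, then "ok"
let i = s.index(s.startIndex, offsetBy: 2)  // position of "o", counted in Characters
let j = s.endIndex

let nsr = NSRange(i..<j, in: s)             // expressed in UTF-16 code units
let iToJ = Range(nsr, in: s)                // optional: nil if nsr does not fit s
```

The emoji occupies two UTF-16 code units, so the `NSRange` starts at location 3 even though "o" is the third `Character`.
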
>> 
>> ### The `Unicode` protocol
>> 
>> With `Substring` and `String` being distinct types and sharing almost all
>> interface and semantics, and with the highest-performance string processing
>> requiring knowledge of encoding and layout that the currency types can't
>> provide, it becomes important to capture the common “string API” in a 
>> protocol.
>> Since Unicode conformance is a key feature of string processing in swift, we
>> call that protocol `Unicode`:
>
> Another minor typo: capitalize “Swift"
>
>> 
>> **Note:** The following assumes several features that are planned but not 
>> yet implemented in
>>  Swift, and should be considered a sketch rather than a final design.
>> 
>> ```swift
>> protocol Unicode 
>>  : Comparable, BidirectionalCollection where Element == Character {
>> 
>>  associatedtype Encoding : UnicodeEncoding
>>  var encoding: Encoding { get }
>> 
>>  associatedtype CodeUnits 
>>    : RandomAccessCollection where Element == Encoding.CodeUnit
>>  var codeUnits: CodeUnits { get }
>> 
>>  associatedtype UnicodeScalars 
>>    : BidirectionalCollection  where Element == UnicodeScalar
>>  var unicodeScalars: UnicodeScalars { get }
>> 
>>  associatedtype ExtendedASCII 
>>    : BidirectionalCollection where Element == UInt32
>>  var extendedASCII: ExtendedASCII { get }
>> }
>> 
>> extension Unicode {
>>  // ... define high-level non-mutating string operations, e.g. search ...
>> 
>>  func compared<Other: Unicode>(
>>    to rhs: Other,
>>    case caseSensitivity: StringSensitivity? = nil,
>>    diacritic diacriticSensitivity: StringSensitivity? = nil,
>>    width widthSensitivity: StringSensitivity? = nil,
>>    in locale: Locale? = nil
>>  ) -> SortOrder { ... }
>> }
>> 
>> extension Unicode : RangeReplaceableCollection where CodeUnits :
>>  RangeReplaceableCollection {
>>    // Satisfy protocol requirement
>>    mutating func replaceSubrange<C : Collection>(_: Range<Index>, with: C) 
>>      where C.Element == Element
>> 
>>  // ... define high-level mutating string operations, e.g. replace ...
>> }
>> 
>> ```
>> 
>> The goal is that `Unicode` exposes the underlying encoding and code units in
>> such a way that for types with a known representation (e.g. a 
>> high-performance
>> `UTF8String`) that information can be known at compile-time and can be used 
>> to
>> generate a single path, while still allowing types like `String` that admit
>> multiple representations to use runtime queries and branches to fast path
>> specializations.
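
To make the "one generic API for both currency types" idea concrete, here is a sketch that uses `StringProtocol`, which exists in the standard library today, purely as a stand-in for the `Unicode` protocol sketched above. `wordCount(of:)` is a hypothetical helper.

```swift
// Written once against the protocol, usable with both String and Substring.
func wordCount<S: StringProtocol>(of text: S) -> Int {
    return text.split(separator: " ").count
}

let sentence = "strings are hard"
let inWhole = wordCount(of: sentence)               // called with a String
let inTail = wordCount(of: sentence.dropFirst(8))   // called with the Substring "are hard"
```
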
>> 
>> **Note:** `Unicode` would make a fantastic namespace for much of
>> what's in this proposal if we could get the ability to nest types and
>> protocols in protocols.
>> 
>> 
>> ### Scanning, Matching, and Tokenization
>> 
>> #### Low-Level Textual Analysis
>> 
>> We should provide convenient APIs for processing strings by character.  For
>> example,
>> it should be easy to cleanly express, “if this string starts with `"f"`, 
>> process
>> the rest of the string as follows…”  Swift is well-suited to expressing this
>> common pattern beautifully, but we need to add the APIs.  Here are two 
>> examples
>> of the sort of code that might be possible given such APIs:
>> 
>> ```swift
>> if let firstLetter = input.droppingPrefix(alphabeticCharacter) {
>>  somethingWith(input) // process the rest of input
>> }
>> 
>> if let (number, restOfInput) = input.parsingPrefix(Int.self) {
>>   ...
>> }
>> ```
>> 
>> The specific spelling and functionality of APIs like this are TBD.  The 
>> larger
>> point is to make sure matching-and-consuming jobs are well-supported.
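
One possible shape for such helpers, as a sketch only. The names follow the examples above, but the signatures and return types are my assumptions, and this integer parser handles unsigned ASCII digits only.

```swift
extension Substring {
    // Hypothetical: if the first character satisfies `isMatch`, return the rest.
    func droppingPrefix(where isMatch: (Character) -> Bool) -> Substring? {
        guard let c = first, isMatch(c) else { return nil }
        return dropFirst()
    }

    // Hypothetical: parse a leading run of ASCII digits as an Int.
    func parsingPrefix(_: Int.Type) -> (Int, Substring)? {
        let digits = prefix(while: { $0.isASCII && $0.isNumber })
        guard let value = Int(digits) else { return nil }   // nil if empty or overflowing
        return (value, self[digits.endIndex...])
    }
}

let input = "42 apples"[...]
let parsed = input.parsingPrefix(Int.self)
let afterF = "foo"[...].droppingPrefix(where: { $0 == "f" })
```

Because both helpers return slices, the remainder can be fed straight into the next matching step without copying.
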
>> 
>
> +100, this kind of work is currently quite painful in Swift. Looking forward 
> to seeing this
> implemented!
>
>> #### Unified Pattern Matcher Protocol
>> 
>> Many of the current methods that do matching are overloaded to do the same
>> logical operations in different ways, with the following axes:
>> 
>> - Logical Operation: `find`, `split`, `replace`, match at start
>> - Kind of pattern: `CharacterSet`, `String`, a regex, a closure
>> - Options, e.g. case/diacritic sensitivity, locale.  Sometimes a part of
>>  the method name, and sometimes an argument
>> - Whole string or subrange.
>> 
>> We should represent these aspects as orthogonal, composable components,
>> abstracting pattern matchers into a protocol like
>> [this 
>> one](https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33),
>> that can allow us to define logical operations once, without introducing
>> overloads, and massively reducing API surface area.
>> 
>> For example, using the strawman prefix `%` syntax to turn string literals 
>> into
>> patterns, the following pairs would all invoke the same generic methods:
>> 
>> ```swift
>> if let found = s.firstMatch(%"searchString") { ... }
>> if let found = s.firstMatch(someRegex) { ... }
>> 
>> for m in s.allMatches((%"searchString"), case: .insensitive) { ... }
>> for m in s.allMatches(someRegex) { ... }
>> 
>> let items = s.split(separatedBy: ", ")
>> let tokens = s.split(separatedBy: CharacterSet.whitespace)
>> ```
>> 
>> Note that, because Swift requires the indices of a slice to match the 
>> indices of
>> the range from which it was sliced, operations like `firstMatch` can return a
>> `Substring?` in lieu of a `Range<String.Index>?`: the indices of the match in
>> the string being searched, if needed, can easily be recovered as the
>> `startIndex` and `endIndex` of the `Substring`.
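
A sketch of why returning a `Substring` loses no information. This `firstMatch(_:)` is a hypothetical stand-in built on Foundation's `range(of:)`; it is not the generic pattern-based method the text has in mind.

```swift
import Foundation

extension String {
    // Hypothetical: return the first occurrence of `pattern` as a slice of self.
    func firstMatch(_ pattern: String) -> Substring? {
        guard let r = range(of: pattern) else { return nil }
        return self[r]
    }
}

let s = "one, two, three"
let found = s.firstMatch("two")!

// The slice's indices are indices into `s`, so the match position is recoverable.
let before = s[..<found.startIndex]
let after = s[found.endIndex...]
```
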
>> 
>> Note also that matching operations are useful for collections in general, and
>> would fall out of this proposal:
>> 
>> ```swift
>> // replace subsequences of contiguous NaNs with zero
>> forces.replace(oneOrMore([Float.nan]), [0.0])
>> ```
>> 
>> #### Regular Expressions
>> 
>> Addressing regular expressions is out of scope for this proposal.
>> That said, it is important to note that the pattern matching protocol
>> mentioned
>> above provides a suitable foundation for regular expressions, and types such 
>> as
>> `NSRegularExpression` can easily be retrofitted to conform to it.  In the
>> future, support for regular expression literals in the compiler could allow 
>> for
>> compile-time syntax checking and optimization.
>> 
>> ### String Indices
>> 
>> `String` currently has four views—`characters`, `unicodeScalars`, `utf8`, and
>> `utf16`—each with its own opaque index type.  The APIs used to translate 
>> indices
>> between views add needless complexity, and the opacity of indices makes them
>> difficult to serialize.
>> 
>> The index translation problem has two aspects:
>> 
>>  1. `String` views cannot consume one another's indices without a cumbersome
>>    conversion step.  An index into a `String`'s `characters` must be 
>> translated
>>    before it can be used as a position in its `unicodeScalars`.  Although 
>> these
>>    translations are rarely needed, they add conceptual and API complexity.
>>  2. Many APIs in the core libraries and other frameworks still expose 
>> `String`
>>    positions as `Int`s and regions as `NSRange`s, which can only reference a
>>    `utf16` view and interoperate poorly with `String` itself.
>> 
>> #### Index Interchange Among Views
>> 
>> String's need for flexible backing storage and reasonably-efficient indexing
>> (i.e. without dynamically allocating and reference-counting the indices
>> themselves) means indices need an efficient underlying storage type.  
>> Although
>> we do not wish to expose `String`'s indices *as* integers, `Int` offsets into
>> underlying code unit storage make a good underlying storage type, provided
>> `String`'s underlying storage supports random-access.  We think random-access
>> *code-unit storage* is a reasonable requirement to impose on all `String`
>> instances.
>> 
>> Making these `Int` code unit offsets conveniently accessible and 
>> constructible
>> solves the serialization problem:
>> 
>> ```swift
>> clipboard.write(s.endIndex.codeUnitOffset)
>> let offset = clipboard.read(Int.self)
>> let i = String.Index(codeUnitOffset: offset)
>> ```
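
For comparison, current Swift exposes a related facility based on UTF-16 offsets rather than the `codeUnitOffset` spelling sketched above. The following assumes Swift 5 or later and uses a plain `Int` in place of the clipboard.

```swift
let s = "h\u{E9}llo w\u{F6}rld"
let i = s.firstIndex(of: " ")!

// An index can be flattened to an Int for storage...
let offset = i.utf16Offset(in: s)

// ...and rebuilt later against the same string contents.
let restored = String.Index(utf16Offset: offset, in: s)
let tail = s[restored...]
```
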
>> 
>> Index interchange between `String` and its `unicodeScalars`, `codeUnits`,
>> and [`extendedASCII`](#parsing-ascii-structure) views can be made entirely
>> seamless by having them share an index type (semantics of indexing a `String`
>> between grapheme cluster boundaries are TBD—it can either trap or be 
>> forgiving).
>> Having a common index allows easy traversal into the interior of graphemes,
>> something that is often needed, without making it likely that someone will 
>> do it
>> by accident.
>> 
>> - `String.index(after:)` should advance to the next grapheme, even when the
>>   index points partway through a grapheme.
>> 
>> - `String.index(before:)` should move to the start of the grapheme before
>>   the current position.
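
A sketch of index sharing as it can be observed in current Swift, where `String` and its views use a single index type. It deliberately avoids calling `String.index(after:)` on an interior index, since the text leaves those semantics open.

```swift
let s = "e\u{301}clair"       // "e" followed by U+0301 COMBINING ACUTE ACCENT
let i = s.startIndex          // an index obtained from the String itself

let character = s[i]                  // the whole grapheme
let scalar = s.unicodeScalars[i]      // same index, used in the scalar view
let byte = s.utf8[i]                  // same index, used in the UTF-8 view

// Stepping within the scalar view reaches the interior of the grapheme.
let inside = s.unicodeScalars.index(after: i)
let combining = s.unicodeScalars[inside]
```
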
>> 
>> Seamless index interchange between `String` and its UTF-8 or UTF-16 views is 
>> not
>> crucial, as the specifics of encoding should not be a concern for most use
>> cases, and would impose needless costs on the indices of other views.  That
>> said, we can make translation much more straightforward by exposing simple
>> bidirectional converting `init`s on both index types:
>> 
>> ```swift
>> let u8Position = String.UTF8.Index(someStringIndex)
>> let originalPosition = String.Index(u8Position)
>> ```
>> 
>> #### Index Interchange with Cocoa
>> 
>> We intend to address `NSRange`s that denote substrings in Cocoa APIs as
>> described [earlier in this 
>> document](#substrings--ranges-and-objective-c-interop).
>> That leaves the interchange of bare indices with Cocoa APIs trafficking in
>> `Int`.  Hopefully such APIs will be rare, but when needed, the following
>> extension, which would be useful for all `Collection`s, can help:
>> 
>> ```swift
>> extension Collection {
>>  func index(offset: IndexDistance) -> Index {
>>    return index(startIndex, offsetBy: offset)
>>  }
>>  func offset(of i: Index) -> IndexDistance {
>>    return distance(from: startIndex, to: i)
>>  }
>> }
>> ```
>> 
>> Then integers can easily be translated into offsets into a `String`'s `utf16`
>> view for consumption by Cocoa:
>> 
>> ```swift
>> let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))
>> let swiftIndex = s.utf16.index(offset: cocoaIndex)
>> ```
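
The same extension, restated as a sketch for a current toolchain, where a collection's distance type is simply `Int` and `String` shares its index type with its `utf16` view. The sample string is arbitrary.

```swift
extension Collection {
    func index(offset: Int) -> Index {
        return index(startIndex, offsetBy: offset)
    }
    func offset(of i: Index) -> Int {
        return distance(from: startIndex, to: i)
    }
}

let s = "\u{1F44D} ok"                               // the emoji takes two UTF-16 units
let i = s.firstIndex(of: "o")!
let cocoaIndex = s.utf16.offset(of: i)               // Int suitable for an Int-based API
let swiftIndex = s.utf16.index(offset: cocoaIndex)   // and back again
```
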
>> 
>> ### Formatting
>> 
>> A full treatment of formatting is out of scope of this proposal, but
>> we believe it's crucial for completing the text processing picture.  This
>> section details some of the existing issues and thinking that may guide 
>> future
>> development.
>> 
>> #### Printf-Style Formatting
>> 
>> `String.format` is designed on the `printf` model: it takes a format string 
>> with
>> textual placeholders for substitution, and an arbitrary list of other 
>> arguments.
>> The syntax and meaning of these placeholders have a long history in
>> C, but for anyone who doesn't use them regularly they are cryptic and 
>> complex,
>> as the `printf (3)` man page attests.
>> 
>> Aside from complexity, this style of API has two major problems: First, the
>> spelling of these placeholders must match up to the types of the arguments, 
>> in
>> the right order, or the behavior is undefined.  Some limited support for
>> compile-time checking of this correspondence could be implemented, but only 
>> for
>> the cases where the format string is a literal. Second, there's no reasonable
>> way to extend the formatting vocabulary to cover the needs of new types: you 
>> are
>> stuck with what's in the box.
>> 
>> #### Foundation Formatters
>> 
>> The formatters supplied by Foundation are highly capable and versatile, 
>> offering
>> both formatting and parsing services.  When used for formatting, though, the
>> design pattern demands more from users than it should:
>> 
>>  * Matching the type of data being formatted to a formatter type
>>  * Creating an instance of that type
>>  * Setting stateful options (`currency`, `dateStyle`) on the type.  Note: the
>>    need for this step prevents the instance from being used and discarded in
>>    the same expression where it is created.
>>  * Overall, introduction of needless verbosity into source
>> 
>> These may seem like small issues, but the experience of Apple localization
>> experts is that the total drag of these factors on programmers is such that 
>> they
>> tend to reach for `String.format` instead.
>> 
>> #### String Interpolation
>> 
>> Swift string interpolation provides a user-friendly alternative to printf's
>> domain-specific language (just write ordinary Swift code!) and its type 
>> safety
>> problems (put the data right where it belongs!) but the following issues 
>> prevent
>> it from being useful for localized formatting (among other jobs):
>> 
>>  * [SR-2303](https://bugs.swift.org/browse/SR-2303) We are unable to restrict
>>    types used in string interpolation.
>>  * [SR-1260](https://bugs.swift.org/browse/SR-1260) String interpolation 
>> can't
>>    distinguish (fragments of) the base string from the string substitutions.
>> 
>> In the long run, we should improve Swift string interpolation to the point 
>> where
>> it can participate in most any formatting job.  Mostly this centers around
>> fixing the interpolation protocols per the previous item, and supporting
>> localization.
>> 
>> To be able to use formatting effectively inside interpolations, it needs to 
>> be
>> both lightweight (because it all happens in-situ) and discoverable.  One 
>> approach would be to standardize on `format` methods, e.g.:
>> 
>> ```swift
>> "Column 1: \(n.format(radix:16, width:8)) *** \(message)"
>> 
>> "Something with leading zeroes: \(x.format(fill: zero, width:8))"
>> ```
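
A sketch of what one such `format` method could look like for integers. The parameter names echo the example above, but the signature, the defaults, and the `Character` type for `fill` are assumptions; negative numbers are not padded sensibly here.

```swift
extension BinaryInteger {
    // Hypothetical: render in the given radix, left-padded with `fill` to `width`.
    func format(radix: Int = 10, width: Int = 0, fill: Character = " ") -> String {
        let digits = String(self, radix: radix)
        let padding = Swift.max(0, width - digits.count)
        return String(repeating: fill, count: padding) + digits
    }
}

let n = 255
let x = 7
let message = "ok"
let line = "Column 1: \(n.format(radix: 16, width: 8)) *** \(message)"
let padded = "Something with leading zeroes: \(x.format(width: 8, fill: "0"))"
```
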
>
> Another thing that might limit adoption is the verbosity of this
> format. It works fine if I need to print one or two things, but it
> gets unwieldy very quickly.

I'd like to see examples of the sorts of uses you're concerned about.

>> ### C String Interop
>> 
>> Our support for interoperation with nul-terminated C strings is scattered and
>> incoherent, with six ways to transform a C string into a `String` and four
>> ways to do the inverse.  These APIs should be replaced with the following:
>> 
>> ```swift
>> extension String {
>>  /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
>>  ///
>>  /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded 
>>  ///   bytes ending just before the first zero byte (NUL character).
>>  init(cString nulTerminatedUTF8: UnsafePointer<CChar>)
>> 
>>  /// Constructs a `String` having the same contents as 
>> `nulTerminatedCodeUnits`.
>>  ///
>>  /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units 
>> in
>>  ///   the given `encoding`, ending just before the first zero code unit.
>>  /// - Parameter encoding: describes the encoding in which the code units
>>  ///   should be interpreted.
>>  init<Encoding: UnicodeEncoding>(
>>    cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
>>    encoding: Encoding)
>> 
>>  /// Invokes the given closure on the contents of the string, represented as 
>> a
>>  /// pointer to a nul-terminated sequence of UTF-8 code units.
>>  func withCString<Result>(
>>    _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
>> }
>> ```
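
A usage sketch with the closest APIs in the current standard library: `String(cString:)`, `withCString`, and `String(decodingCString:as:)`. The last one is spelled differently from the `init(cString:encoding:)` proposed above.

```swift
let greeting = "h\u{E9}llo"

// Walk the nul-terminated UTF-8 bytes that C code would see.
let byteCount = greeting.withCString { pointer -> Int in
    var n = 0
    while pointer[n] != 0 { n += 1 }
    return n
}

// Round-trip through a C string.
let roundTripped = greeting.withCString { String(cString: $0) }

// Decode nul-terminated code units in an explicitly named encoding.
let utf16Units: [UInt16] = [0x68, 0x69, 0]
let fromUTF16 = utf16Units.withUnsafeBufferPointer { buffer in
    String(decodingCString: buffer.baseAddress!, as: UTF16.self)
}
```
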
>> 
>> In both of the construction APIs, any invalid encoding sequence detected will
>> have its longest valid prefix replaced by U+FFFD, the Unicode replacement
>> character, per Unicode specification.  This covers the common case.  The
>> replacement is done *physically* in the underlying storage and the validity 
>> of
>> the result is recorded in the `String`'s `encoding` such that future accesses
>> need not be slowed down by possible error repair separately.
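
The repair behavior can be seen with today's `String(decoding:as:)`, shown here as a sketch. It illustrates only the substitution of U+FFFD; it says nothing about how or whether validity is recorded in the storage.

```swift
// 0xFF can never appear in well-formed UTF-8.
let bytes: [UInt8] = [0x61, 0xFF, 0x62]

// Decoding repairs the invalid byte instead of failing.
let repaired = String(decoding: bytes, as: UTF8.self)
let wasRepaired = repaired.unicodeScalars.contains("\u{FFFD}")
```
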
>> 
>> Construction that is aborted when encoding errors are detected can be
>> accomplished using APIs on the `encoding`.  String types that retain their
>> physical encoding even in the presence of errors and are repaired on-the-fly 
>> can
>> be built as different instances of the `Unicode` protocol.
>> 
>> ### Unicode 9 Conformance
>> 
>> Unicode 9 (and MacOS 10.11) brought us support for family emoji, which 
>> changes
>> the process of properly identifying `Character` boundaries.  We need to 
>> update
>> `String` to account for this change.
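
What the updated behavior looks like, as a sketch that assumes a recent toolchain whose grapheme breaking already implements these rules. The family emoji is written with escapes so the zero-width joiners are visible.

```swift
// MAN + ZWJ + WOMAN + ZWJ + GIRL + ZWJ + BOY
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

let characters = family.count                  // one user-perceived character
let scalars = family.unicodeScalars.count      // four emoji plus three joiners
let utf16Units = family.utf16.count            // each emoji is a surrogate pair
let utf8Bytes = family.utf8.count              // 4 bytes per emoji, 3 per joiner
```
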
>> 
>> ### High-Performance String Processing
>> 
>> Many strings are short enough to store in 64 bits, many can be stored using 
>> only
>> 8 bits per Unicode scalar, others are best encoded in UTF-16, and some come 
>> to
>> us already in some other encoding, such as UTF-8, that would be costly to
>> translate.  Supporting these formats while maintaining usability for
>> general-purpose APIs demands that a single `String` type can be backed by 
>> many
>> different representations.
>> 
>> That said, the highest performance code always requires static knowledge of 
>> the
>> data structures on which it operates, and for this code, dynamic selection of
>> representation comes at too high a cost.  Heavy-duty text processing demands 
>> a
>> way to opt out of dynamism and directly use known encodings.  Having this
>> ability can also make it easy to cleanly specialize code that handles dynamic
>> cases for maximal efficiency on the most common representations.
>> 
>> To address this need, we can build models of the `Unicode` protocol that 
>> encode
>> representation information into the type, such as `NFCNormalizedUTF16String`.
>> 
>> ### Parsing ASCII Structure
>> 
>> Although many machine-readable formats support the inclusion of arbitrary
>> Unicode text, it is also common that their fundamental structure lies 
>> entirely
>> within the ASCII subset (JSON, YAML, many XML formats).  These formats are 
>> often
>> processed most efficiently by recognizing ASCII structural elements as ASCII,
>> and capturing the arbitrary sections between them in more-general strings.  
>> The
>> current String API offers no way to efficiently recognize ASCII and skip past
>> everything else without the overhead of full decoding into Unicode scalars.
>> 
>> For these purposes, strings should supply an `extendedASCII` view that is a
>> collection of `UInt32`, where values less than `0x80` represent the
>> corresponding ASCII character, and other values represent data that is 
>> specific
>> to the underlying encoding of the string.
>
> There are some things that are known to lie entirely within ASCII. Are
> there any plans to add a way to work with them in a simple manner
> (subscripting, looping, etc.), possibly through the use of an
> Array<ASCIIChar>? property or whatever?

Maybe I'm misunderstanding what you have in mind but it sounds like
that's exactly what extendedASCII is designed for.
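
As a rough illustration of the intended use, here is a sketch that scans for ASCII structure using the existing `utf8` view as a stand-in for `extendedASCII`, since values below `0x80` mean the same thing in both. It ignores whether a bracket sits inside a JSON string literal, which a real parser would have to track.

```swift
let json = "{\"name\": \"Zo\u{EB}\", \"tags\": [\"a\", \"b\"]}"

var depth = 0
var maxDepth = 0
var nonASCIIBytes = 0

for byte in json.utf8 {
    switch byte {
    case UInt8(ascii: "{"), UInt8(ascii: "["):
        depth += 1                       // structural opener, recognized as ASCII
        maxDepth = max(maxDepth, depth)
    case UInt8(ascii: "}"), UInt8(ascii: "]"):
        depth -= 1                       // structural closer
    case 0x80...:
        nonASCIIBytes += 1               // encoding-specific data, skipped undecoded
    default:
        break                            // other ASCII, not structural here
    }
}
```
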


-- 
-Dave

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution
