> On Jan 19, 2017, at 6:56 PM, Ben Cohen via swift-evolution 
> <swift-evolution@swift.org> wrote:
> 
> ### Formatting
> 
> A full treatment of formatting is out of scope of this proposal, but
> we believe it's crucial for completing the text processing picture.  This
> section details some of the existing issues and thinking that may guide future
> development.
> 

Filesystem paths are Strings on Apple platforms but not on Linux. How are we 
going to square that circle? What about Swift on the server, where 
distinguishing HTML from JavaScript is security-critical? String processing has 
huge security implications, often because platforms make it easy to do the 
wrong thing carelessly and promote ad-hoc formatting, serialization, and 
parsing. That’s a huge area to consider, of course, but it might be worth 
thinking about how an ergonomic API would work for a few example cases.

I guess my point is that formatting and interpolation are far more than “just 
formatting”: whether the API makes the right thing easy or difficult will 
directly determine whether we get exploitable security vulnerabilities. (To be 
clear, I’m not saying the follow-on proposals need to solve those problems; 
maybe just give them some consideration.)



> ## Open Questions
> 
> ### Must `String` be limited to storing UTF-16 subset encodings?
> 
> - The ability to handle `UTF-8`-encoded strings (models of `Unicode`) is not in
>  question here; this is about what encodings must be storable, without
>  transcoding, in the common currency type called “`String`”.
> - ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets.  UTF-8 is not.

Depending on whom you believe, UTF-8 is the encoding of ~65-88% of all text 
content transmitted over the web. JSON and XML represent the lion’s share of 
REST and non-REST APIs in use, and both are almost exclusively transmitted as 
UTF-8. As you point out with extendedASCII, a lot of markup and structure is 
ASCII even if the content is not, so UTF-8 represents a significant size savings 
even on Chinese/Japanese web pages that require 3 bytes to represent many 
characters (the savings on markup outweigh the loss on textual content).

Any model that makes UTF-8 backed Strings cumbersome to use can have a negative 
performance and memory impact. I don’t have a good idea of the actual cost, but 
it might be worth running some tests to determine it.
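
As a rough starting point for such a test, here is a minimal sketch that compares 
the storage needed by the two encodings. The sample strings are made up for 
illustration and this says nothing about processing speed; it only counts bytes 
using the standard `utf8` and `utf16` views.

```swift
// Hypothetical samples: pure Japanese text, and the same text inside markup.
let textOnly = "日本語"
let markup = "<p class=\"title\">日本語</p>"

// Bytes needed to store a string as UTF-8 (one byte per code unit).
func utf8Bytes(_ s: String) -> Int { return s.utf8.count }

// Bytes needed to store a string as UTF-16 (two bytes per code unit).
func utf16Bytes(_ s: String) -> Int { return s.utf16.count * 2 }

// Text alone favors UTF-16: 9 bytes as UTF-8 versus 6 bytes as UTF-16.
print(utf8Bytes(textOnly), utf16Bytes(textOnly))

// With ASCII markup around it, UTF-8 wins: 30 bytes versus 48 bytes.
print(utf8Bytes(markup), utf16Bytes(markup))
```

A real measurement would need representative documents rather than a one-line 
fragment, but the shape of the trade-off is the same.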

Is NSString interop the only reason to not just use UTF-8 as the default 
storage? If so, is that a solvable problem? Could one choose by typealias or a 
compiler flag which default storage they wanted?


> - If we have a way to get at a `String`'s code units, we need a concrete type in
>  which to express them in the API of `String`, which is a concrete type
> - If String needs to be able to represent UTF-32, presumably the code units need
>  to be `UInt32`.
> - Not supporting UTF-32-encoded text seems like one reasonable design choice.
> - Maybe we can allow UTF-8 storage in `String` and expose its code units as
>  `UInt16`, just as we would for Latin-1.
> - Supporting only UTF-16-subset encodings would imply that `String` indices can
>  be serialized without recording the `String`'s underlying encoding.

I suppose you could be clever on 64-bit platforms by stealing some bits to 
indicate the encoding… not that I recommend that :D
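
For the curious, a minimal sketch of what that bit-stealing could look like. 
`StorageEncoding` and `TaggedIndex` are hypothetical names, and the 2-bit/62-bit 
split is an arbitrary assumption; nothing like this is part of the proposal.

```swift
// Hypothetical set of encodings a String might store without transcoding.
enum StorageEncoding: UInt64 {
    case utf16 = 0, utf8 = 1, utf32 = 2
}

// Hypothetical index that steals the top 2 bits of a 64-bit word for the
// encoding tag and keeps the low 62 bits for the code unit offset.
struct TaggedIndex {
    let bits: UInt64
    static let offsetMask: UInt64 = (1 << 62) - 1

    init(encoding: StorageEncoding, offset: UInt64) {
        precondition(offset <= TaggedIndex.offsetMask, "offset needs more than 62 bits")
        bits = (encoding.rawValue << 62) | offset
    }

    // The tag is always valid here because `bits` is only set by the init above.
    var encoding: StorageEncoding { return StorageEncoding(rawValue: bits >> 62)! }
    var offset: UInt64 { return bits & TaggedIndex.offsetMask }
}

let index = TaggedIndex(encoding: .utf8, offset: 42)
print(index.encoding, index.offset)
```

The serialized form stays a single 64-bit word, at the cost of every consumer 
having to mask the tag off before using the offset.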

> 
> ### Do we need a type-erasable base protocol for UnicodeEncoding?
> 
> UnicodeEncoding has an associated type, but it may be important to be able to
> traffic in completely dynamic encoding values, e.g. for “tell me the most
> efficient encoding for this string.”

Generalized Existentials 
tis but happiness by another name
For we who live 
in The Land of Protocols and Faeries

> 
> ### Should there be a string “facade?”
> 
> One possible design alternative makes `Unicode` a vehicle for expressing
> the storage and encoding of code units, but does not attempt to give it an API
> appropriate for `String`.  Instead, string APIs would be provided by a generic
> wrapper around an instance of `Unicode`:
> 
> ```swift
> struct StringFacade<U: Unicode> : BidirectionalCollection {
> 
>  // ...APIs for high-level string processing here...
> 
>  var unicode: U // access to lower-level unicode details
> }
> 
> typealias String = StringFacade<StringStorage>
> typealias Substring = StringFacade<StringStorage.SubSequence>
> ```
> 
> This design would allow us to de-emphasize lower-level `String` APIs such as
> access to the specific encoding, by putting them behind a `.unicode` property.
> A similar effect in a facade-less design would require a new top-level
> `StringProtocol` playing the role of the facade with an `associatedtype
> Storage : Unicode`.
> 
> An interesting variation on this design is possible if defaulted generic
> parameters are introduced to the language:
> 
> ```swift
> struct String<U: Unicode = StringStorage> 
>  : BidirectionalCollection {
> 
>  // ...APIs for high-level string processing here...
> 
>  var unicode: U // access to lower-level unicode details
> }
> 
> typealias Substring = String<StringStorage.SubSequence>
> ```
> 
> One advantage of such a design is that naïve users will always extend “the right
> type” (`String`) without thinking, and the new APIs will show up on `Substring`,
> `MyUTF8String`, etc.  That said, it also has downsides that should not be
> overlooked, not least of which is the confusability of the meaning of the word
> “string.”  Is it referring to the generic or the concrete type?

Fair point, but I do like the idea of separating the two and encouraging people 
to extend String while automatically extending all the String-ish types. This 
would compose well with a hypothetical HTMLString, JavaScriptString, etc. 
(assuming one could design a model where those things compose well, e.g. 
appending MyUTF8String to HTMLString performs automatic HTML-escaping whereas 
appending HTMLString to HTMLString does not). 
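
To illustrate the kind of composition I mean, here is a minimal sketch written 
as a standalone type rather than on top of the proposed facade. `HTMLString`, 
`trustedMarkup`, and `escape` are all hypothetical names, and the escaping 
covers only the five usual HTML metacharacters.

```swift
// Hypothetical type: a string known to contain already-escaped HTML.
struct HTMLString {
    private(set) var markup: String

    // Wrap text that the caller asserts is already safe markup.
    init(trustedMarkup: String) { markup = trustedMarkup }

    // Plain text is escaped before it is appended.
    mutating func append(_ text: String) {
        markup += HTMLString.escape(text)
    }

    // Markup is appended verbatim; it is not escaped a second time.
    mutating func append(_ html: HTMLString) {
        markup += html.markup
    }

    // Replace the HTML metacharacters with their entity references.
    static func escape(_ text: String) -> String {
        var result = ""
        for ch in text {
            switch ch {
            case "&": result += "&amp;"
            case "<": result += "&lt;"
            case ">": result += "&gt;"
            case "\"": result += "&quot;"
            case "'": result += "&#39;"
            default: result.append(ch)
            }
        }
        return result
    }
}

var page = HTMLString(trustedMarkup: "<p>")
page.append("Tom & <Jerry>")                    // plain String: escaped
page.append(HTMLString(trustedMarkup: "</p>"))  // HTMLString: appended as-is
print(page.markup) // <p>Tom &amp; &lt;Jerry&gt;</p>
```

The overloads make the safe path the default: the only way to get unescaped 
text into the page is to say `trustedMarkup` explicitly.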

Anything that avoids forcing the average app or library author to stop and 
think about which String type to use is probably a net win if the performance 
isn’t horrible. Someone writing a web server pipeline will need to write their 
own String-ish type for performance reasons anyway, so a slight perf hit may be 
no great loss.


Thanks to you and Ben for the hard work so far; I can’t even imagine taking on 
such a task!

Russ
