Re: [swift-evolution] [Pitch] String revision proposal #1

Jean-Daniel via swift-evolution Fri, 31 Mar 2017 04:02:59 -0700

I’m with you on a C interop API that supports taking a non-null-terminated 
string. I often work with APIs that support efficient parsing by providing a 
pointer into a global buffer plus a length to report parsed strings.


Without a way to create a Swift string from a buffer + length, interop with 
such APIs will be difficult for no good reason, as Swift strings don’t even 
have to be null-terminated.
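
To make the request concrete, here is a minimal sketch of the two cases 
discussed in this thread: a pointer + length pair, and a fixed-size field that 
ends at the first NUL or at the end of the buffer, whichever comes first. It 
assumes a toolchain that provides `String(decoding:as:)`, which is not part of 
this pitch. Both `makeString` helpers and the sample data are hypothetical 
names used only for illustration.

```swift
// Hypothetical helper (not part of the proposal): builds a String from a
// pointer + length pair without requiring a terminating NUL.
func makeString(utf8 start: UnsafePointer<UInt8>, count: Int) -> String {
    let buffer = UnsafeBufferPointer(start: start, count: count)
    // Invalid UTF-8 is repaired with U+FFFD rather than causing a failure.
    return String(decoding: buffer, as: UTF8.self)
}

// Hypothetical helper for fixed-size C fields such as `char name[16]`:
// reads up to the first NUL or the end of the buffer, whichever comes first.
func makeString(fixedField bytes: UnsafeBufferPointer<UInt8>) -> String {
    let end = bytes.firstIndex(of: 0) ?? bytes.endIndex
    return String(decoding: bytes[..<end], as: UTF8.self)
}

// Sample data standing in for a 16-byte field: one NUL-padded, one full.
let padded: [UInt8] = Array("__TEXT".utf8) + [UInt8](repeating: 0, count: 10)
let full: [UInt8] = Array("0123456789abcdef".utf8)

let a = padded.withUnsafeBufferPointer { makeString(fixedField: $0) }
let b = full.withUnsafeBufferPointer { makeString(fixedField: $0) }
let c = full.withUnsafeBufferPointer {
    makeString(utf8: $0.baseAddress!, count: 4)
}
```

The second helper never reads past the 16 bytes it was given, which is the 
property a NUL-terminated `init(cString:)` cannot guarantee for such fields.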

> On 30 Mar 2017, at 18:35, Félix Cloutier via swift-evolution 
> <swift-evolution@swift.org> wrote:
> 
> I don't have many non-nitpick issues that I care greatly about; I'm in favor 
> of this.
> 
> My only request: it's currently painful to create a String from a fixed-size 
> C array. For instance, if I have a pointer to a `struct foo { char name[16]; 
> }` in Swift where the last character doesn't have to be a NUL, it's hard to 
> create a String from it. Real-world examples of this are Mach-O LC_SEGMENT 
> and LC_SEGMENT_64 commands.
> 
> The generally-accepted wisdom <http://stackoverflow.com/a/27456220/251153> is 
> that you take a pointer to the CChar tuple that represents the fixed-size 
> array, but this still requires the string to be NUL-terminated. What do we 
> think of an additional init(cString:) overload that takes an 
> UnsafeBufferPointer and reads up to the first NUL or the end of the buffer, 
> whichever comes first?
> 
>> On 30 Mar 2017, at 02:48, Brent Royal-Gordon via swift-evolution 
>> <swift-evolution@swift.org> wrote:
>> 
>>> On Mar 29, 2017, at 5:32 PM, Ben Cohen via swift-evolution 
>>> <swift-evolution@swift.org> wrote:
>>> 
>>> Hi Swift Evolution,
>>> 
>>> Below is a pitch for the first part of the String revision. This covers a 
>>> number of changes that would allow the basic internals to be overhauled.
>>> 
>>> Online version here: 
>>> https://github.com/airspeedswift/swift-evolution/blob/3a822c799011ace682712532cfabfe32e9203fbb/proposals/0161-StringRevision1.md
>> 
>> Really great stuff, guys. Thanks for your work on this!
>> 
>>> In order to be able to write extensions across both String and Substring, 
>>> a new Unicode protocol to which the two types will conform will be 
>>> introduced. For the purposes of this proposal, Unicode will be defined as a 
>>> protocol to be used whenever you would previously extend String. It should 
>>> be possible to substitute extension Unicode { ... } in Swift 4 wherever 
>>> extension String { ... } was written in Swift 3, with one exception: any 
>>> passing of self into an API that takes a concrete String will need to be 
>>> rewritten as String(self). If Self is a String then this should effectively 
>>> optimize to a no-op, whereas if Self is a Substring then this will force a 
>>> copy, helping to avoid the “memory leak” problems described above.
>> 
>> I continue to feel that `Unicode` is the wrong name for this protocol, 
>> essentially because it sounds like a protocol for, say, a version of Unicode 
>> or some kind of encoding machinery instead of a Unicode string. I won't 
>> rehash that argument since I made it already in the manifesto thread, but I 
>> would like to make a couple new suggestions in this area.
>> 
>> Later on, you note that it would be nice to namespace many of these types:
>> 
>>> Several of the types related to String, such as the encodings, would 
>>> ideally reside inside a namespace rather than live at the top level of the 
>>> standard library. The best namespace for this is probably Unicode, but this 
>>> is also the name of the protocol. At some point if we gain the ability to 
>>> nest enums and types inside protocols, they should be moved there. Putting 
>>> them inside String or some other enum namespace is probably not worthwhile 
>>> in the meantime.
>> 
>> Perhaps we should use an empty enum to create a `Unicode` namespace and then 
>> nest the protocol within it via typealias. If we do that, we can consider 
>> names like `Unicode.Collection` or even `Unicode.String` which would shadow 
>> existing types if they were top-level.
>> 
>> If not, then given this:
>> 
>>> The exact nature of the protocol (such as which methods should be protocol 
>>> requirements and which can be implemented as protocol extensions) is 
>>> considered an implementation detail and so is not covered in this proposal.
>> 
>> We may simply want to wait to choose a name. As the protocol develops, we 
>> may discover a theme in its requirements which would suggest a good name. 
>> For instance, we may realize that the core of what the protocol abstracts is 
>> grouping code units into characters, which might suggest a name like 
>> `Characters`, or `Unicode.Characters`, or `CharacterCollection`, or 
>> what-have-you.
>> 
>> (By the way, I hope that the eventual protocol requirements will be put 
>> through the review process, if only as an amendment, once they're 
>> determined.)
>> 
>>> Unicode will conform to BidirectionalCollection. RangeReplaceableCollection 
>>> conformance will be added directly onto the String and Substring types, as 
>>> it is possible future Unicode-conforming types might not be 
>>> range-replaceable (e.g. an immutable type that wraps a const char *).
>> 
>> I'm a little worried about this because it seems to imply that the protocol 
>> cannot include any mutation operations that aren't in 
>> `RangeReplaceableCollection`. For instance, it won't be possible to include 
>> an in-place `applyTransform` method in the protocol. Do you anticipate that 
>> being an issue? Might it be a good idea to define a parallel `Mutable` or 
>> `RangeReplaceable` protocol?
>> 
>>> The C string interop methods will be updated to those described here: a 
>>> single withCString operation and two init(cString:) constructors, one for 
>>> UTF8 and one for arbitrary encodings.
>> 
>> Sorry if I'm repeating something that was already discussed, but is there a 
>> reason you don't include a `withCString` variant for arbitrary encodings? It 
>> seems like an odd asymmetry.
>> 
>>> The standard library currently lacks a Latin1 codec, so an enum Latin1: 
>>> UnicodeEncoding type will be added.
>> 
>> Nice. I wrote one of those once; I'll enjoy deleting it.
>> 
>>> A new protocol, UnicodeEncoding, will be added to replace the current 
>>> UnicodeCodec protocol:
>>> 
>>> public enum UnicodeParseResult<T, Index> {
>> 
>> Either `T` should be given a more specific name, or the enum should be given 
>> a less specific one, becoming `ParseResult` and being oriented towards 
>> incremental parsing of anything from any kind of collection.
>> 
>>> /// Indicates valid input was recognized.
>>> ///
>>> /// `resumptionPoint` is the end of the parsed region
>>> case valid(T, resumptionPoint: Index)  // FIXME: should these be reordered?
>> 
>> No, I think this is the right order. The thing that's valid is the code 
>> point.
>> 
>>> /// Indicates invalid input was recognized.
>>> ///
>>> /// `resumptionPoint` is the next position at which to continue parsing
>>> /// after the invalid input is repaired.
>>> case error(resumptionPoint: Index)
>> 
>> I know this is abbreviated documentation, but I hope the full version 
>> includes a good usage example demonstrating, among other things, how to 
>> detect partial characters and defer processing of them instead of rejecting 
>> them as erroneous.
>> 
>>> /// An encoding for text with UnicodeScalar as a common currency type
>>> public protocol UnicodeEncoding {
>>>  /// The maximum number of code units in an encoded unicode scalar value
>>>  static var maxLengthOfEncodedScalar: Int { get }
>>> 
>>>  /// A type that can represent a single UnicodeScalar as it is encoded in
>>>  /// this encoding.
>>>  associatedtype EncodedScalar : EncodedScalarProtocol
>> 
>> There's an `EncodedScalarProtocol`-shaped hole in this proposal. What does 
>> it do? What are its semantics? How does `EncodedScalar` relate to the old 
>> `CodeUnit`?
>> 
>>>  @discardableResult
>>>  public static func parseForward<C: Collection>(
>>>    _ input: C,
>>>    repairingIllFormedSequences makeRepairs: Bool = true,
>>>    into output: (EncodedScalar) throws->Void
>>>  ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
>>> 
>>>  @discardableResult    
>>>  public static func parseReverse<C: BidirectionalCollection>(
>>>    _ input: C,
>>>    repairingIllFormedSequences makeRepairs: Bool = true,
>>>    into output: (EncodedScalar) throws->Void
>>>  ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
>>>  where C.SubSequence : BidirectionalCollection,
>>>        C.SubSequence.SubSequence == C.SubSequence,
>>>        C.SubSequence.Iterator.Element == EncodedScalar.Iterator.Element
>>> }
>> 
>> Are there constraints missing on `parseForward`?
>> 
>> What do these do if `makeRepairs` is false? Would it be clearer if we made 
>> an enum that described the behaviors and changed the label to something like 
>> `ifIllFormed:`?
>> 
>>> Due to the change in internal implementation, this means that these 
>>> operations will be O(n) rather than O(1). This is not expected to be a 
>>> major concern, based on experiences from a similar change made to Java, but 
>>> projects will be able to work around performance issues without upgrading 
>>> to Swift 4 by explicitly typing slices as Substring, which will call the 
>>> Swift 4 variant, and which will be available but not invoked by default in 
>>> Swift 3 mode.
>> 
>> Will there be a way to make this also work with a real Swift 3 compiler? For 
>> instance, can you define `typealias Substring = String` in such a way that 
>> real Swift 3 will parse and use it, but Swift 4 in Swift 3 mode will ignore 
>> it?
>> 
>>> This proposal does not yet introduce an implicit conversion from Substring 
>>> to String. The decision on whether to add this will be deferred pending 
>>> feedback on the initial implementation. The intention is to make a preview 
>>> toolchain available for feedback, including on whether this implicit 
>>> conversion is necessary, prior to the release of Swift 4.
>> 
>> This is a sensible approach.
>> 
>> Thank you for developing this into a full proposal. I discussed the plans 
>> for Swift 4 with a local group of programmers recently, and everyone was 
>> pleased to hear that `String` would get an overhaul, that the `characters` 
>> view would be integrated into the string, etc. We even talked a little about 
>> `Substring` and people thought it was a good idea. This proposal is shaping 
>> up to impact a lot of people, but in a good way!
>> 
>> -- 
>> Brent Royal-Gordon
>> Architechies
>> 
>> _______________________________________________
>> swift-evolution mailing list
>> swift-evolution@swift.org <mailto:swift-evolution@swift.org>
>> https://lists.swift.org/mailman/listinfo/swift-evolution