On Mon Jan 30 2017, Olivier Tardieu <tardieu-AT-us.ibm.com> wrote:

> Thanks for the clarifications. More comments below.
>
> dabrah...@apple.com wrote on 01/24/2017 05:50:59 PM:
>
>> Maybe it wasn't clear from the document, but the intention is that
>> String would be able to use any model of Unicode as a backing store,
>> and that you could easily build unsafe models of Unicode... but also
>> that you could use your unsafe model of Unicode directly, in
>> string-ish ways.
>
> I see. If I understand correctly, it will be possible, for instance,
> to implement an unsafe model of Unicode with a UInt8 code unit and a
> maxLengthOfEncodedScalar equal to 1 by keeping only the 8 lowest bits
> of Unicode scalars.
Eh... I think you'd just use an unsafe Latin-1 encoding for that; why
waste a bit? Here's an example (work very much in progress):

  https://github.com/apple/swift/blob/9defe9ded43c6f480f82a28d866ec73d803688db/test/Prototypes/Unicode.swift#L877

>> > A lot of machine processing of strings continues to deal with 8-bit
>> > quantities (even 7-bit quantities, not UTF-8). Swift strings are
>> > not very good at that. I see progress in the manifesto but nothing
>> > to really close the performance gap with C. That's where "unsafe"
>> > mechanisms could come into play.
>>
>> extendedASCII is supposed to address that. Given a smart enough
>> optimizer, it should be possible to become competitive with C even
>> without using unsafe constructs. However, we recognize the importance
>> of being able to squeeze out that last bit of performance by dropping
>> down to unsafe storage.
>
> I doubt a 32-bit encoding can bridge the performance gap with C, in
> particular because wire protocols will continue to favor compact
> encodings. Incoming strings will have to be expanded to the
> extendedASCII representation before processing and probably compacted
> afterwards. So while this may address the needs of computationally
> intensive string processing tasks, it does not help simple parsing
> tasks on simple strings.

I'm pretty sure it does; we're not going to change representations.
extendedASCII doesn't require anything to actually be expanded to 32
bits per code unit, except *maybe* in a register, and then only if the
optimizer isn't smart enough to eliminate a zero-extension followed by
a comparison with a known narrow value. You can always

  latin1.lazy.map { UInt32($0) }

to produce 32-bit code units. All the common encodings are ASCII
supersets, so this will “just work” for those. The only place where it
becomes more complicated is in encodings like Shift-JIS (which might
not even be important enough to support as a String backing-storage
format).
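To make the lazy-widening point concrete, here is a small runnable
sketch; the `latin1` array and the names below are just for
illustration, not part of any proposed API:

```swift
// Latin-1 bytes viewed as 32-bit code units, lazily: no widened copy is
// allocated, and each byte is zero-extended only on access.
let latin1: [UInt8] = [0x48, 0x69, 0x2C, 0x20, 0xE9] // "Hi, é" in Latin-1
let wide = latin1.lazy.map { UInt32($0) }

// Latin-1 is an ASCII superset, so comparisons against known ASCII
// values work directly on the widened view.
assert(wide.contains(0x2C))  // found "," without materializing a copy
assert(wide.last == 0xE9)    // the non-ASCII byte passes through unchanged
```

Since the map is lazy, a subsequent scan-and-compare loop over `wide`
touches each byte exactly once, which is what gives the optimizer a
chance to keep everything in registers.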
>> > To guarantee Unicode correctness, a C string must be validated or
>> > transformed to be considered a Swift string.
>>
>> Not really. You can do error-correction on the fly. However, I think
>> pre-validation is often worthwhile because once you know something is
>> valid it's much cheaper to decode correctly (especially for UTF-8).
>
> Sure. Eager vs. lazy validation is a valuable distinction, but what I
> am after here is side-stepping validation altogether. I understand now
> that user-defined encodings will make side-stepping validation
> possible.

Right.

>> > If I understand the C String interop section correctly, in Swift 4,
>> > this should not force a copy, but traversing the string is still
>> > required.
>>
>> *What* should not force a copy?
>
> I would like to have a constructor that takes a pointer to a
> null-terminated sequence of bytes (or a sequence of bytes and a
> length) and turns it into a Swift string without allocating a new
> backing store for the string and without copying the bytes in the
> sequence from one place in memory to another.

We probably won't expose this at the top level of String, but you
should be able to construct an UnsafeCString (which is-a Unicode) and
then, if you really need the String type, construct a String from that:

  String(UnsafeCString(ntbs))

That would not do any copying.

> I understand this may require the programmer to handle memory
> management for the backing store.
>
>> > I hope I am correct about the no-copy thing, and I would also like
>> > to permit promoting C strings to Swift strings without validation.
>> > This is obviously unsafe in general, but I know my strings... and I
>> > care about performance. ;)
>>
>> We intend to support that use-case.
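Since UnsafeCString doesn't exist yet, here is a hedged stand-in for
the no-copy idea: a view that borrows a caller-owned, null-terminated
buffer and iterates its bytes without allocating or copying. The type
`CStringView` and its shape are hypothetical, purely for illustration:

```swift
// Hypothetical no-copy view over caller-owned, null-terminated bytes.
// The view borrows the pointer; the caller manages the memory's lifetime,
// and no backing store is allocated or copied.
struct CStringView: Sequence {
    let base: UnsafePointer<CChar>
    func makeIterator() -> AnyIterator<UInt8> {
        var p = base
        return AnyIterator {
            let byte = UInt8(bitPattern: p.pointee)
            guard byte != 0 else { return nil } // stop at the terminator
            p += 1
            return byte
        }
    }
}

// Usage: wrap an existing C string in place.
let ntbs: [CChar] = Array("hello".utf8CString) // includes the trailing NUL
ntbs.withUnsafeBufferPointer { buf in
    let view = CStringView(base: buf.baseAddress!)
    assert(Array(view) == Array("hello".utf8))
}
```

No validation happens here, which is exactly the unsafe trade-off under
discussion: the caller asserts the bytes are well-formed.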
>> That's part of the reason for the ValidUTF8 and ValidUTF16 encodings
>> you see here:
>> https://github.com/apple/swift/blob/unicode-rethink/stdlib/public/core/Unicode2.swift#L598
>> and here:
>> https://github.com/apple/swift/blob/unicode-rethink/stdlib/public/core/Unicode2.swift#L862
>
> OK
>
>> > More importantly, it is not possible to mutate bytes in a Swift
>> > string at will. Again, this makes sense from the point of view of
>> > always-correct Unicode sequences, but it does not for machine
>> > processing of C strings with C-like performance. Today, I can cheat
>> > using a "_public" API for this, i.e., myString._core._baseAddress!.
>> > This should be doable from an official "unsafe" API.
>>
>> We intend to support that use-case.
>>
>> > Memory safety is also at play here, as well as ownership. A proper
>> > API could guarantee, for instance, that the backing store is
>> > writable and not shared. A memory-safe but not Unicode-safe API
>> > could do bounds checks.
>> >
>> > While low-level C string processing can be done with good
>> > performance using unsafe memory buffers, the lack of bridging with
>> > "real" Swift strings kills the deal: no literal syntax (or costly
>> > coercions), none of the many useful string APIs.
>> >
>> > To illustrate these points, here is a simple experiment: code
>> > written to synthesize an HTTP date string from a bunch of integers.
>> > There are four versions of the code, going from nice high-level
>> > Swift code to low-level C-like code. (Some of this code is also
>> > about avoiding ARC overheads and string-interpolation overheads,
>> > hence the four versions.)
>> >
>> > On my MacBook Pro (swiftc -O), the performance is as follows:
>> >
>> > interpolation + func:  2.303032365s
>> > interpolation + array: 1.224858418s
>> > append:                0.918512377s
>> > memcpy:                0.182104674s
>> >
>> > While the benchmarking could be done more carefully, I think the
>> > main observation is valid.
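In the meantime, the in-place mutation use-case above can be sketched
with a plain byte buffer standing in for a future official unsafe
String API; everything below (including the ASCII uppercasing loop) is
illustrative only, since myString._core._baseAddress! is private and
subject to change:

```swift
// Mutate "string" bytes in place through an unsafe mutable buffer:
// memory-safe access to the storage, with no Unicode-level checking.
var bytes: [UInt8] = Array("hello, world".utf8)
bytes.withUnsafeMutableBufferPointer { buf in
    // Uppercase ASCII letters in place, C-style.
    for i in buf.indices where buf[i] >= 0x61 && buf[i] <= 0x7A {
        buf[i] &-= 0x20
    }
}
assert(String(decoding: bytes, as: UTF8.self) == "HELLO, WORLD")
```

This is the "memory-safe but not Unicode-safe" tier: the buffer pointer
is bounds-checked once on creation, but nothing stops you from writing
bytes that are not valid UTF-8.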
>> > The nice code is more than 10x slower than the C-like code.
>> > Moreover, the ugly-but-still-valid-Swift code is still about 5x
>> > slower than the C-like code. For some applications, e.g., web
>> > servers, these kinds of numbers matter...
>> >
>> > Some of the proposed improvements would help with this, e.g., small
>> > string optimization, and maybe changes to the concatenation
>> > semantics. But it seems to me that a big performance gap will
>> > remain. (Concatenation even with strncat is significantly slower
>> > than memcpy for fixed-size strings.)
>> >
>> > I believe there is a need and an opportunity for a fast "less safe"
>> > String API. I hope it will be on the roadmap soon.
>>
>> I think it's already in the roadmap... the one that's in my head. If
>> you want to submit a PR with amendments to the manifesto, that'd be
>> great. Also, thanks very much for the example below; we'll definitely
>> be referring to it as we proceed.
>
> Here is a gist for the example code:
> https://gist.github.com/tardieu/b6a9c4d53d56d089c58089ba8f6274b5
>
> I can sketch key elements of an unsafe String API and some motivating
> arguments in a pull request. Is this what you are asking for?

That would be awesome, thanks!

-- 
-Dave
_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution