> On Jan 11, 2018, at 12:36 AM, Chris Lattner via swift-dev 
> <swift-dev@swift.org> wrote:
> 
>> On Jan 10, 2018, at 11:55 AM, Michael Ilseman via swift-dev 
>> <swift-dev@swift.org> wrote:
>> (A gist-formatted version of this email can be found at 
>> https://gist.github.com/milseman/bb39ef7f170641ae52c13600a512782f)
> 
> I’m very very excited for this, thank you for the detailed writeup and 
> consideration of the effects and tradeoffs involved.
> 
>> Given that ordering is not fit for human consumption, but rather machine 
>> processing, it might as well be fast. The current ordering differs on Darwin 
>> and Linux platforms, with Linux in particular suffering from poor 
>> performance due to choice of ordering (UCA with DUCET) and older versions of 
>> ICU. Instead, the [String Comparison 
>> Prototype](https://github.com/apple/swift/pull/12115) provides a simpler 
>> ordering that allows for many common-case fast-paths and optimizations. For 
>> all the Unicode enthusiasts out there, this is the lexicographical ordering 
>> of NFC-normalized UTF-16 code units.
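
(For anyone following along, the semantics described here can be sketched in a 
few lines using Foundation's NFC normalization; the function name is mine, and 
the real implementation fast-paths all of this.)

import Foundation

// Lexicographic ordering of NFC-normalized UTF-16 code units.
func proposedOrderingIsLess(_ lhs: String, _ rhs: String) -> Bool {
    let l = lhs.precomposedStringWithCanonicalMapping.utf16
    let r = rhs.precomposedStringWithCanonicalMapping.utf16
    return l.lexicographicallyPrecedes(r)
}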
> 
> Thank you for fixing this.  Your tradeoffs make perfect sense to me.
> 
>> ### Small String Optimization
> ..
>> For example, assuming a 16-byte String struct and 8 bits used for flags and 
>> discriminators (including discriminators for which small form a String is 
>> in), 120 bits are available for a small string payload. 120 bits can hold 7 
>> UTF-16 code units, which is sufficient for most graphemes and many common 
>> words and separators. 120 bits can also fit 15 ASCII/UTF-8 code units 
>> without any packing, which suffices for many programmer/system strings 
>> (which have a strong skew towards ASCII).
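
(A back-of-envelope check of the arithmetic above:)

let payloadBits = 128 - 8              // 16-byte struct minus 8 bits of flags
let utf16Capacity = payloadBits / 16   // 7 UTF-16 code units (112 of 120 bits)
let asciiCapacity = payloadBits / 8    // 15 ASCII/UTF-8 code units (all 120 bits)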
>> 
>> We may also want a compact 5-bit encoding for formatted numbers, such as 
>> 64-bit memory addresses in hex, `Int.max` in base-10, and `Double` in 
>> base-10, which would require 18, 19, and 24 characters respectively. 120 
>> bits with a 5-bit encoding would fit all of these. This would speed up the 
>> creation and interpolation of many strings containing numbers.
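
(A toy illustration of the 5-bit idea; the alphabet and packing scheme here 
are entirely hypothetical. 32 symbols are enough for digits, hex letters, and 
numeric punctuation, and 24 × 5 = 120 bits.)

let alphabet = Array("0123456789abcdefABCDEF-+.xX")  // 27 symbols, fits in 5 bits

// Pack up to 12 characters per 64-bit word (12 × 5 = 60 bits);
// two such words cover the 24-character budget.
func pack5(_ s: Substring) -> UInt64? {
    precondition(s.count <= 12)
    var word: UInt64 = 0
    for char in s {
        guard let code = alphabet.firstIndex(of: char) else { return nil }
        word = word << 5 | UInt64(code)
    }
    return word
}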
> 
> I think it is important to consider that having more special cases and 
> different representations slows down nearly *every* operation on String, 
> because they all have to check for and untangle the possible representations.  
> Given the ability to hold 15 digits of ASCII, I don’t see why it would be 
> worthwhile to worry about a 5-bit representation for digits.  String should 
> not be an Any!
> 
> The tradeoff here is that you’d be biasing the design to favor creation and 
> interpolation of many strings containing *large* numbers, at the cost of 
> general string performance everywhere.  This doesn’t sound like a good 
> tradeoff to me, particularly when people writing extremely 
> performance-sensitive code probably won’t find it good enough anyway.
> 
>> Final details are still in exploration. If the construction and 
>> interpretation of small bit patterns can remain behind a resilience barrier, 
>> new forms could be added in the future. However, the performance impact of 
>> this would need to be carefully evaluated.
> 
> I’d love to see performance numbers on this.  Have you considered a design 
> where you have exactly one small string representation: a sequence of 15 
> UTF-8 code units?  This holds your 15 bytes of ASCII, probably holds more 
> non-ASCII characters on average than 7 UTF-16 code units, and needs only one 
> discriminator branch at the entry point of hot functions that want to touch 
> the bytes.  If you have a lot of algorithms that are sped up by knowing they 
> have ASCII, you could go with two bits: “is small” and “is ASCII”, where the 
> ASCII bit is set in both the small string and normal string cases.
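
(A sketch of that two-bit scheme; the bit positions are hypothetical:)

struct StringDiscriminator {
    let bits: UInt8
    var isSmall: Bool { bits & 0b01 != 0 }  // payload is 15 inline UTF-8 bytes
    var isASCII: Bool { bits & 0b10 != 0 }  // set for both small and heap
                                            // forms, so ASCII fast paths
                                            // need just one test
}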
> 
> Finally, what tradeoffs do you see between a 1-word vs a 2-word String?  Are 
> we really destined to have 2 words?  That’s still much better than the 3 
> words we have now, but for some workloads it is significant bloat.
> 
> 
>> ### Unmanaged Strings
>> 
>> Such unmanaged strings can be used when the lifetime of the storage is known 
>> to outlive all uses.
> 
> Just like StringRef!  +1, this concept can be a huge performance win… but it 
> can also be a huge source of UB if used wrong… :-(
> 
>> As such, they don’t need to participate in reference counting. In the 
>> future, perhaps with ownership annotations or sophisticated static analysis, 
>> Swift could convert managed strings into unmanaged ones as an optimization 
>> when it knows the contents would not escape. Similarly in the future, a 
>> String constructed from a Substring that is known to not outlive the 
>> Substring’s owner can use an unmanaged form to skip copying the contents. 
>> This would benefit String’s status as the recommended common-currency type 
>> for API design.
> 
> This could also have implications for StaticString.
> 
>> Some kind of unmanaged string construct is an often-requested feature for 
>> highly performance-sensitive code. We haven’t thought too deeply about how 
>> best to surface this as API and it can be sliced and diced many ways. Some 
>> considerations:
> 
> Other questions/considerations:
> - here and now, could we get the vast majority of this benefit by having the 
> ABI pass string arguments as +0 and guarantee their lifetime across calls?  
> What is the tradeoff of that?
> - does this concept even matter if/when we can write an argument as “borrowed 
> String”?  I suppose this is a bit more general in that it should be usable as 
> properties and other things that the (initial) ownership design isn’t scoped 
> to support since it doesn’t have lifetimes.
> - array has exactly the same sort of concern, so why is this a 
> string-specific problem?
> - how does this work with mutation?  Wouldn’t we really have to have 
> something like MutableArrayRef, since it’s talking about the mutability of 
> the elements, not something that var/let can support?
> 
> When I think about this, it seems like a “non-owning *slice*” of some sort.  
> If you went that direction, then you wouldn’t have to build it into the 
> String currency type, just like it isn’t in LLVM.
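
(A sketch of such a non-owning slice, in the spirit of LLVM's StringRef; the 
type and its members are hypothetical, and the caller must guarantee that the 
buffer outlives the slice:)

struct UnmanagedStringSlice {
    let utf8: UnsafeBufferPointer<UInt8>  // not reference counted

    var isEmpty: Bool { utf8.isEmpty }

    func hasPrefix(_ prefix: UnmanagedStringSlice) -> Bool {
        utf8.prefix(prefix.utf8.count).elementsEqual(prefix.utf8)
    }
}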
> 
>> ### Performance Flags
> 
> Nice.
> 
>> ### Miscellaneous
>> 
>> There are many other tweaks and improvements possible under the new ABI 
>> prior to stabilization, such as:
>> 
>> * Guaranteed nul-termination for String’s storage classes for faster C 
>> bridging.
> 
> This is probably best as another perf flag.
> 
>> * Unification of String and Substring’s various Views.
>> * Some form of opaque string, such as an existential, which can be used as 
>> an extension point.
>> * More compact/performant indices.
> 
> What is the current theory on eager vs lazy bridging with Objective-C?  Can 
> we get to an eager bridging model?  That alone would dramatically simplify 
> and speed up every operation on a Swift string.
> 
>> ## String Ergonomics
> 
> I need to run now, but I’ll read and comment on this section when I have a 
> chance.
> 
> That said, just today I had to write the code below, and the ‘charToByte’ 
> part of it is absolutely tragic.  Are there any thoughts about how to improve 
> character literal processing?
> 
> -Chris
> 
> func decodeHex(_ string: String) -> [UInt8] {
>   func charToByte(_ c: String) -> UInt8 {
>     return c.utf8.first! // swift makes char processing grotty. :-(
>   }
> 
>   func hexToInt(_ c: UInt8) -> UInt8 {
>     switch c {
>     case charToByte("0")...charToByte("9"): return c - charToByte("0")
>     case charToByte("a")...charToByte("f"): return c - charToByte("a") + 10
>     case charToByte("A")...charToByte("F"): return c - charToByte("A") + 10
>     default: fatalError("invalid hexadecimal character")
>     }
>   }
> 
>   var result: [UInt8] = []
> 
>   assert(string.count & 1 == 0, "must get a pair of hexadecimal characters")
>   var it = string.utf8.makeIterator()
>   while let byte1 = it.next(),
>         let byte2 = it.next() {
>     result.append((hexToInt(byte1) << 4) | hexToInt(byte2))
>   }
> 
>   return result
> }
> 
> print(decodeHex("01FF"))
> print(decodeHex("8001"))
> print(decodeHex("80a1bcd3"))

RT. In my own code I’ve mostly taken to using [UInt8] or [Unicode.Scalar] and 
matching them against integer literals, because that’s still easier than 
dealing with Strings and Characters and grapheme-breaking functionality I 
don’t need for the use case.
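Concretely, something like:

// Scanning UTF-8 bytes against plain integer literals:
for byte in "<p>".utf8 {
    switch byte {
    case 0x3C: print("open angle")   // '<'
    case 0x3E: print("close angle")  // '>'
    default:   print("name byte:", byte)
    }
}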

If you ask me, though, String should really be split into two types with 
different purposes. String right now is very high-level, meant for complex, 
generalized, emojified user-facing and user-input text. Now we’re trying to 
make it good for low-level stuff without compromising on the high-level stuff, 
and it’s just making a mess of an API, with permutations on permutations of 
flags.

We need a StringArray<CodeUnit> type, which would be like [UInt8] or 
[Unicode.Scalar] except with an API for stringy operations. Two tools for two 
jobs. StringArrays could be written with single-quote literals: single quotes 
are completely unused in the syntax, and no one wants to use them for regexes, 
so why not use them for StringArrays? They’d use bitwise equality, so things 
like (e + combining accent) wouldn’t compare equal to (é) *by default*. After 
all, how many HTML tags have names with a combining character in them? When 
was the last time you read a file format specification where the markers to 
match weren’t a table of 4-character ASCII strings?
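
A rough sketch of the idea (everything here is hypothetical, including the 
name): a code-unit array with stringy operations and bitwise equality, and 
none of String's grapheme machinery.

struct StringArray<CodeUnit: FixedWidthInteger & UnsignedInteger>: Equatable {
    var units: [CodeUnit]

    // Bitwise, code-unit-wise equality: e + combining accent != é.
    static func == (lhs: StringArray, rhs: StringArray) -> Bool {
        lhs.units == rhs.units
    }

    func hasPrefix(_ other: StringArray) -> Bool {
        units.prefix(other.units.count).elementsEqual(other.units)
    }
}
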
_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev
