> On 11. Jan 2018, at 06:36, Chris Lattner via swift-dev <[email protected]>
> wrote:
>
> On Jan 10, 2018, at 11:55 AM, Michael Ilseman via swift-dev
> <[email protected]> wrote:
>> (A gist-formatted version of this email can be found at
>> https://gist.github.com/milseman/bb39ef7f170641ae52c13600a512782f)
>
> I’m very very excited for this, thank you for the detailed writeup and
> consideration of the effects and tradeoffs involved.
>
>> Given that ordering is not fit for human consumption, but rather machine
>> processing, it might as well be fast. The current ordering differs on Darwin
>> and Linux platforms, with Linux in particular suffering from poor
>> performance due to choice of ordering (UCA with DUCET) and older versions of
>> ICU. Instead, [String Comparison
>> Prototype](https://github.com/apple/swift/pull/12115) provides a simpler ordering
>> that allows for many common-case fast-paths and optimizations. For all the
>> Unicode enthusiasts out there, this is the lexicographical ordering of
>> NFC-normalized UTF-16 code units.
>
> Thank you for fixing this. Your tradeoffs make perfect sense to me.
>
>> ### Small String Optimization
> ..
>> For example, assuming a 16-byte String struct and 8 bits used for flags and
>> discriminators (including discriminators for which small form a String is
>> in), 120 bits are available for a small string payload. 120 bits can hold 7
>> UTF-16 code units, which is sufficient for most graphemes and many common
>> words and separators. 120 bits can also fit 15 ASCII/UTF-8 code units
>> without any packing, which suffices for many programmer/system strings
>> (which have a strong skew towards ASCII).
>>
>> We may also want a compact 5-bit encoding for formatted numbers, such as
>> 64-bit memory addresses in hex, `Int.max` in base-10, and `Double` in
>> base-10, which would require 18, 19, and 24 characters respectively. 120
>> bits with a 5-bit encoding would fit all of these. This would speed up the
>> creation and interpolation of many strings containing numbers.
>
> I think it is important to consider that having more special cases and
> different representations slows down nearly *every* operation on string
> because they have to check and detangle all of the possible representations.
> Given the ability to hold 15 digits of ascii, I don’t see why it would be
> worthwhile to worry about a 5-bit representation for digits. String should
> be an Any!
>
> The tradeoff here is that you’d be biasing the design to favor creation and
> interpolation of many strings containing *large* numbers, at the cost of
> general string performance everywhere. This doesn’t sound like a good
> tradeoff to me, particularly when people writing extremely
> performance-sensitive code probably won’t find it good enough anyway.
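(For anyone trying to picture the 5-bit idea concretely, below is a rough
sketch of how up to 24 characters could pack into 120 bits. The alphabet and
bit layout are just assumptions for illustration, not what the proposal
specifies.)

// Illustrative only: up to 24 symbols from a <=32-symbol alphabet, 5 bits
// each, packed into two UInt64 words (8 of the 128 bits left for flags).
let alphabet = Array("0123456789abcdefABCDEFx.+-eE")

func pack5Bit(_ s: String) -> (low: UInt64, high: UInt64)? {
  guard s.count <= 24 else { return nil }
  var low: UInt64 = 0, high: UInt64 = 0
  for (i, ch) in s.enumerated() {
    guard let code = alphabet.firstIndex(of: ch) else { return nil }
    let offset = i * 5
    if offset < 64 {
      low |= UInt64(code) << offset
      // The symbol starting at bit 60 straddles the word boundary.
      if offset > 59 { high |= UInt64(code) >> (64 - offset) }
    } else {
      high |= UInt64(code) << (offset - 64)
    }
  }
  return (low, high)
}

// Int.max in base-10 is 19 characters, so it fits in the 24-symbol budget.
assert(pack5Bit(String(Int.max)) != nil)
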
>
>> Final details are still in exploration. If the construction and
>> interpretation of small bit patterns can remain behind a resilience barrier,
>> new forms could be added in the future. However, the performance impact of
>> this would need to be carefully evaluated.
>
> I’d love to see performance numbers on this. Have you considered a design
> where you have exactly one small string representation: a sequence of 15
> UTF-8 bytes? This holds your 15 bytes of ASCII, probably covers more
> non-ASCII characters on average than 7 UTF-16 code units, and only needs one
> discriminator branch at the entry point of hot functions that want to touch
> the bytes. If you have a lot of algorithms that are sped up by knowing they
> have ASCII, you could go with two bits: “is small” and “is ASCII”, where
> “is ASCII” is set in both the small string and normal string cases.
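(To make the “one discriminator branch” shape concrete, here is a rough
sketch of what a hot entry point could look like under that scheme. The type,
field names, and bit assignments are assumptions for illustration only, not
the actual layout.)

// Sketch only: a 1-byte flag word plus either inline UTF-8 bytes or
// out-of-line storage, with a single isSmall check per hot entry point.
struct SketchString {
  var flags: UInt8            // bit 0: isSmall, bit 1: isASCII
  var smallBytes: [UInt8]     // stands in for the 15 inline UTF-8 bytes
  var largeStorage: [UInt8]   // stands in for the heap buffer

  var isSmall: Bool { return flags & 0x1 != 0 }
  var isASCII: Bool { return flags & 0x2 != 0 }

  // One branch on entry, then a single UTF-8 code path for both forms.
  func withUTF8<R>(_ body: (UnsafeBufferPointer<UInt8>) -> R) -> R {
    if isSmall {
      return smallBytes.withUnsafeBufferPointer(body)
    }
    return largeStorage.withUnsafeBufferPointer(body)
  }
}
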
>
> Finally, what tradeoffs do you see between a 1-word and a 2-word string? Are
> we really destined to have 2 words? That’s still much better than the 3 words
> we have now, but for some workloads it is significant bloat.
>
>
>> ### Unmanaged Strings
>>
>> Such unmanaged strings can be used when the lifetime of the storage is known
>> to outlive all uses.
>
> Just like StringRef! +1, this concept can be a huge performance win… but it
> can also be a huge source of UB if used wrong. :-(
>
>> As such, they don’t need to participate in reference counting. In the
>> future, perhaps with ownership annotations or sophisticated static analysis,
>> Swift could convert managed strings into unmanaged ones as an optimization
>> when it knows the contents would not escape. Similarly in the future, a
>> String constructed from a Substring that is known to not outlive the
>> Substring’s owner can use an unmanaged form to skip copying the contents.
>> This would benefit String’s status as the recommended common-currency type
>> for API design.
>
> This could also have implications for StaticString.
>
>> Some kind of unmanaged string construct is an often-requested feature for
>> highly performance-sensitive code. We haven’t thought too deeply about how
>> best to surface this as API and it can be sliced and diced many ways. Some
>> considerations:
>
> Other questions/considerations:
> - here and now, could we get the vast majority of this benefit by having the
> ABI pass string arguments as +0 and guarantee their lifetime across calls?
> What is the tradeoff of that?
> - does this concept even matter if/when we can write an argument as “borrowed
> String”? I suppose this is a bit more general in that it should be usable as
> properties and other things that the (initial) ownership design isn’t scoped
> to support since it doesn’t have lifetimes.
> - array has exactly the same sort of concern, why is this a string-specific
> problem?
> - how does this work with mutation? Wouldn’t we really have to have
> something like MutableArrayRef, since it’s talking about the mutability of
> the elements, not something that var/let can support?
>
> When I think about this, it seems like a “non-owning *slice*” of some sort.
> If you went that direction, then you wouldn’t have to build it into the
> String currency type, just like it isn’t in LLVM.
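(FWIW, such a non-owning slice could look something like the sketch below.
The type and its API are purely illustrative assumptions, in the spirit of
StringRef rather than a concrete proposal.)

// Sketch only: a non-owning view over UTF-8 bytes. Nothing is retained, so
// the caller must guarantee the storage outlives every use (the same UB
// hazard StringRef has).
struct UnmanagedUTF8View {
  let bytes: UnsafeBufferPointer<UInt8>

  // Example read-only operation that never touches reference counting.
  func hasPrefix(_ prefix: UnmanagedUTF8View) -> Bool {
    guard bytes.count >= prefix.bytes.count else { return false }
    for i in 0..<prefix.bytes.count where bytes[i] != prefix.bytes[i] {
      return false
    }
    return true
  }
}

// Only valid while `storage` is alive; escaping the view would be UB.
let storage = Array("hello, world".utf8)
storage.withUnsafeBufferPointer { buf in
  let view = UnmanagedUTF8View(bytes: buf)
  _ = view.hasPrefix(view)
}
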
>
>> ### Performance Flags
>
> Nice.
>
>> ### Miscellaneous
>>
>> There are many other tweaks and improvements possible under the new ABI
>> prior to stabilization, such as:
>>
>> * Guaranteed nul-termination for String’s storage classes for faster C
>> bridging.
>
> This is probably best as another perf flag.
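(For illustration, here is roughly what such a flag could buy on the bridging
path. The names and the model type are assumptions, not the actual storage
classes.)

// Sketch only: if the storage is known to end in a nul byte, C bridging can
// hand out the existing buffer; otherwise it falls back to copying.
struct BridgingSketch {
  var isKnownNulTerminated: Bool
  var utf8: [UInt8]                // stands in for the storage class's buffer

  func withCString<R>(_ body: (UnsafePointer<CChar>) -> R) -> R {
    if isKnownNulTerminated {
      // Fast path: no copy, the buffer already ends in 0.
      return utf8.withUnsafeBufferPointer { buf in
        buf.baseAddress!.withMemoryRebound(to: CChar.self, capacity: buf.count) {
          body($0)
        }
      }
    }
    // Slow path: copy into a temporary nul-terminated buffer.
    return String(decoding: utf8, as: UTF8.self).withCString(body)
  }
}
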
>
>> * Unification of String and Substring’s various Views.
>> * Some form of opaque string, such as an existential, which can be used as
>> an extension point.
>> * More compact/performant indices.
>
> What is the current theory on eager vs lazy bridging with Objective-C? Can
> we get to an eager bridging model? That alone would dramatically simplify
> and speed up every operation on a Swift string.
>
>> ## String Ergonomics
>
> I need to run now, but I’ll read and comment on this section when I have a
> chance.
>
> That said, just today I had to write the code below, and the ‘charToByte’
> part of it is absolutely tragic. Are there any thoughts about how to improve
> character literal processing?
>
> -Chris
>
> func decodeHex(_ string: String) -> [UInt8] {
>   func charToByte(_ c: String) -> UInt8 {
>     return c.utf8.first! // swift makes char processing grotty. :-(
>   }
>
>   func hexToInt(_ c: UInt8) -> UInt8 {
>     switch c {
>     case charToByte("0")...charToByte("9"): return c - charToByte("0")
>     case charToByte("a")...charToByte("f"): return c - charToByte("a") + 10
>     case charToByte("A")...charToByte("F"): return c - charToByte("A") + 10
>     default: fatalError("invalid hexadecimal character")
>     }
>   }
>
>   var result: [UInt8] = []
>
>   assert(string.count & 1 == 0, "must get a pair of hexadecimal characters")
>   var it = string.utf8.makeIterator()
>   while let byte1 = it.next(),
>         let byte2 = it.next() {
>     result.append((hexToInt(byte1) << 4) | hexToInt(byte2))
>   }
>
>   return result
> }
>
> print(decodeHex("01FF"))
> print(decodeHex("8001"))
> print(decodeHex("80a1bcd3"))
In this particular example, you can use UInt8.init(ascii:) to make it a little
nicer (no need for the charToByte function):
func hexToInt(_ c: UInt8) -> UInt8 {
  switch c {
  case UInt8(ascii: "0")...UInt8(ascii: "9"):
    return c - UInt8(ascii: "0")
  case UInt8(ascii: "a")...UInt8(ascii: "f"):
    return c - UInt8(ascii: "a") + 10
  case UInt8(ascii: "A")...UInt8(ascii: "F"):
    return c - UInt8(ascii: "A") + 10
  default: fatalError("invalid hexadecimal character")
  }
}