On Tue, Oct 2, 2018 at 3:04 PM Mark Davis ☕️ <m...@macchiato.com> wrote:
>> * The Python 3.3 model mentions the disadvantages of memory usage
>> cliffs but doesn't mention the associated performance cliffs. It would
>> be good to also mention that when a string manipulation causes the
>> storage to expand or contract, there's a performance impact that's not
>> apparent from the nature of the operation if the programmer's
>> intuition works on the assumption that the programmer is dealing with
>> UTF-32.
>
> The focus was on immutable string models, but I didn't make that clear.
> Added some text.

Thanks.

>> * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM
>> text node storage in Gecko, (I believe but am not 100% sure) V8 and,
>> optionally, HotSpot
>> (https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A).
>> That is, text has UTF-16 semantics, but if the high half of every code
>> unit in a string is zero, only the lower half is stored. This has
>> properties analogous to the Python 3.3 model, except non-BMP doesn't
>> expand to UTF-32 but uses UTF-16 surrogate pairs.
>
> Thanks, will add.

V8 source code shows it has a OneByteString storage option:
https://cs.chromium.org/chromium/src/v8/src/objects/string.h?sq=package:chromium&g=0&l=494
From hearsay, I'm convinced that it means Latin1, but I've failed to
find a clear quotable statement from a V8 developer to that effect.

>> 3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
>> have a different type in the type system than byte buffers. To go from
>> a byte buffer to a UTF-8 buffer, UTF-8 validity is checked. Once data
>> has been tagged as valid UTF-8, the validity is trusted completely so
>> that iteration by code point does not have "else" branches for
>> malformed sequences. If data that the type system indicates to be
>> valid UTF-8 wasn't actually valid, it would be nasal demon time. The
>> language has a default "safe" side and an opt-in "unsafe" side. The
>> unsafe side is for performing low-level operations in a way where the
>> responsibility of upholding invariants is moved from the compiler to
>> the programmer. It's impossible to violate the UTF-8 validity
>> invariant using the safe part of the language.
>
> Added a quote based on this; please check if it is ok.

Looks accurate. Thanks.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
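
For illustration, the type-system-tagged UTF-8 model from item 3 could be sketched roughly as follows in safe Rust. The function name `count_code_points` is hypothetical and chosen only for this example; `std::str::from_utf8` and `str::chars` are standard-library calls. This is a minimal sketch of the idea, not a statement about how any particular codebase does it.

```rust
use std::str;

/// Hypothetical helper (name invented for this sketch): counts the code
/// points in `bytes`, or returns `None` if the bytes are not valid UTF-8.
fn count_code_points(bytes: &[u8]) -> Option<usize> {
    // The only validity check happens here, at the boundary where an
    // untyped byte buffer (&[u8]) becomes a tagged UTF-8 buffer (&str).
    let text: &str = str::from_utf8(bytes).ok()?;

    // From here on, the type guarantees validity, so iterating by code
    // point needs no error branch for malformed sequences.
    Some(text.chars().count())
}
```

Calling `count_code_points("héllo".as_bytes())` yields `Some(5)`, while a buffer such as `[0x66, 0x6f, 0xff]` yields `None`, because `0xFF` can never occur in well-formed UTF-8. Skipping the check is possible only through the opt-in unsafe side (`str::from_utf8_unchecked`), where the programmer takes over responsibility for the invariant.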