On Sun, 07 Sep 2014 10:45:22 +0000 via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
> For western text strings utf-8 is much better due to cache
> efficiency. You can speed it up using SSE or dedicated
> datastructures.

that's what i call efficiency! using SIMD for string indexing!

> The point of having unique immutable strings is that they compare
> by reference only and that you can have auxiliary datastructures
> that classify them if needed.

and this will fail with a compacting gc. heh.

> I think the D approach to strings is unpleasant. You should not
> have slices of strings, only slices of ubyte arrays.

oh, no, thanks. casting strings back and forth for slicing is not fun.
and writing parsers using string slicing is fun.

> If you want real speedups for streams of symbols you have to move
> into the landscape of huffman-encoding, tries, dedicated
> datastructures…

or just ditch utf-8 and use ucs-4. this will speed up the most frequent
string operations: correct indexing and slicing.

> Having uniform string support in libraries (i.e. only supporting
> utf-8) is a clear advantage IMO, that will allow for APIs that
> are SSE backed and performant.

utf-8 was not invented as an encoding for internal string
representation. it's merely for data interchange.

i myself believe that a language should not do any encoding/decoding on
a given string without being explicitly asked. i.e. `foreach (dchar ch; s)`
must be the same as `foreach (char ch; s)` when s is `string`. for any
decoding i must use `foreach (ch; s.byUtf8Char)`.

the whole "let's use utf-8 as internal string representation" was a
mistake. and i'm not talking about D here.
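
to make the `foreach` and ucs-4 points concrete, here is a minimal sketch in D. it assumes a current Phobos, where `std.utf.byCodeUnit` and `std.utf.byDchar` are the explicit ranges closest to the `byUtf8Char` spelled above (that name is hypothetical, not an existing function). `countCodeUnits` and `countCodePoints` are helper names made up for this illustration.

```d
import std.conv : to;
import std.range : walkLength;
import std.stdio : writeln;
import std.utf : byCodeUnit, byDchar;

// hypothetical helper: iterates utf-8 code units, no decoding involved
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (char c; s)
        ++n;
    return n;
}

// hypothetical helper: asking for `dchar` makes the compiler decode
// utf-8 implicitly, which is the behaviour argued against above
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    string s = "na\u00EFve";         // U+00EF takes two bytes in utf-8

    writeln(countCodeUnits(s));      // 6
    writeln(countCodePoints(s));     // 5

    // the same two iterations, but requested explicitly
    writeln(s.byCodeUnit.walkLength);
    writeln(s.byDchar.walkLength);

    // ucs-4 (utf-32): one array element per code point, so plain
    // indexing and slicing never land in the middle of a character
    dstring d = s.to!dstring;
    writeln(d[2]);
    writeln(d[0 .. 3]);
}
```

with `string`, `s[2]` is just the first byte of the two-byte sequence, while `d[2]` is the whole code point. the price is four bytes per element, which is exactly the cache-efficiency trade-off the quoted text is about.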