> This will _not_ return a trailing surrogate of a Cyrillic
> letter. It will return the second code unit of the "ä"
> character (U+00E4).
True. It's UTF-8, not UTF-16.
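A quick check of that claim (a minimal sketch; the variable name is mine): U+00E4 encodes to the two UTF-8 code units 0xC3 0xA4, so indexing a D string yields those raw bytes, not UTF-16 surrogates:

```d
void main()
{
    string s = "ä";         // UTF-8: U+00E4 is encoded as 0xC3 0xA4
    assert(s.length == 2);  // .length counts code units (bytes), not characters
    assert(s[0] == 0xC3);   // lead byte
    assert(s[1] == 0xA4);   // trailing code unit, not a UTF-16 surrogate
}
```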
> However, it could also yield the first code unit of the umlaut
> diacritic, depending on how the string is represented.
This is not true for UTF-8, which has no "endianness" to worry about.
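As an aside, the "depending on how the string is represented" point is about Unicode normalization rather than byte order: "ä" can be stored precomposed (U+00E4) or as "a" plus the combining diaeresis (U+0308), and the two forms differ even in UTF-8. A sketch using literal escapes:

```d
void main()
{
    string nfc = "\u00E4";   // precomposed 'ä': 0xC3 0xA4
    string nfd = "a\u0308";  // 'a' + combining diaeresis: 0x61 0xCC 0x88
    assert(nfc.length == 2);
    assert(nfd.length == 3);
    assert(nfc != nfd);      // byte-wise different, though visually identical
}
```

Whether D (or any language) should normalize such strings is a separate question from endianness.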
> If the string were in UTF-32, [2] could yield either the
> Cyrillic character or the umlaut diacritic.
> The .length of the UTF-32 string could be either 3 or 4.
Neither is true for UTF-32. A code point has no alternative
interpretation (except for the "endianness", which a library or
the core could take care of).
> There are multiple reasons why .length and index access are
> based on code units rather than code points or any higher-level
> representation, but one is that the complexity would suddenly
> be O(n) instead of O(1).
See my last statement below.
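The O(n)-versus-O(1) distinction shows up when counting code points over a code-unit string, e.g. with std.range.walkLength, which has to decode the whole string (a sketch, assuming a reasonably recent Phobos):

```d
import std.range : walkLength;

void main()
{
    string s = "häuser";
    assert(s.length == 7);      // code units: O(1), since 'ä' takes two bytes
    assert(s.walkLength == 6);  // code points: O(n), must decode the UTF-8
}
```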
> In-place modifications of char[] arrays also wouldn't be
> possible anymore

They would be, but

> as the size of the underlying array might have to change.
Well, that's a point; on the other hand, D is constantly creating
and throwing away new strings anyway, so this isn't much of an argument.
The current solution puts the programmer in charge of dealing
with UTF-x, whereas a more consistent implementation would put the
burden on the implementors of the libraries/core, i.e. the ones
who usually have a better understanding of Unicode than the
average programmer.
Also, implementing such semantics would not per se rule out
byte-wise access, would it?
So, how do you guys handle UTF-8 strings in D? What are your
solutions to the problems described? Does it all come down to
converting "string"s and "wstring"s to "dstring"s, manipulating
them, and converting them back to "string"s? Btw, what would that
mean in terms of speed?
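For reference, the round trip I mean looks like this with std.conv.to; as far as I can tell it costs an O(n) transcode plus an allocation in each direction (a sketch):

```d
import std.conv : to;

void main()
{
    string s = "häuser";
    dstring d = to!dstring(s);   // O(n) decode + allocation
    assert(d.length == 6);       // now one element per code point
    assert(d[1] == 'ä');         // random access behaves as expected
    string back = to!string(d);  // O(n) encode + allocation
    assert(back == s);
}
```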
There is no irony in my questions. I'm really looking for
solutions...