== Quote from Andrei Alexandrescu (seewebsiteforem...@erdani.org)'s article
> It's no secret that string et al. are not a magic recipe for writing
> correct Unicode code. However, things are pretty good and could be
> further improved by operating the following changes in std.array and
> std.range:
> - make front() and back() for UTF-8 and UTF-16 automatically decode the
> first and last Unicode character
> - make popFront() and popBack() skip one entire Unicode character
> (instead of just one code unit)
> - alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings
> - change hasLength to return false for UTF-8 and UTF-16 strings
> These changes effectively make UTF-8 and UTF-16 bidirectional ranges,
> with the quirk that you still have a sort of a random-access operator.
> I'm very strongly in favor of this change. Bidirectional strings allow
> beautiful correct algorithms to be written that handle encoded strings
> without any additional effort; with these changes, everything applicable
> of std.algorithm works out of the box (with the appropriate fixes here
> and there), which is really remarkable.
> The remaining WTF is the length property. Traditionally, a range
> offering length also implies the expectation that a range of length n
> allows you to call popFront n times and then assert that the range is
> empty. However, if you check e.g. hasLength!string it will yield false,
> although the string does have an accessible member by that name and of
> the appropriate type.
> Although Phobos always checks its assumptions, people might occasionally
> write code that just uses .length without checking hasLength. Then,
> they'll be annoyed when the code fails with UTF-8 and UTF-16 strings.
> (The "real" length of the range is not stored, but can be computed by
> using str.walkLength() in std.range.)
> What can be done about that? I see a number of solutions:
> (a) Do not operate the change at all.
> (b) Operate the change and mention that in range algorithms you should
> check hasLength and only then use "length" under the assumption that it
> really means "elements count".
> (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define
> a different name for that. Any other name (codeUnits, codes etc.) would
> do. The entire point is to not make algorithms believe strings have a
> .length property.
> (d) Have std.range define a distinct property called e.g. "count" and
> then specialize it appropriately. Then change all references to .length
> in std.algorithm and elsewhere to .count.
> What would you do? Any ideas are welcome.
> Andrei
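To make the semantics in the quote concrete, here is a minimal sketch in D. It assumes a Phobos version in which the proposed changes are in place (which, as far as I know, is how current Phobos behaves); countByPopFront is a hypothetical helper written only for this illustration.

```d
import std.range; // walkLength, empty, front, back, popFront, and the range traits

// Traits as described in the quoted proposal (assumed to be adopted).
static assert(isBidirectionalRange!string);
static assert(!isRandomAccessRange!string);
static assert(!hasLength!string);

// Hypothetical helper: counts how many popFront calls empty the range,
// i.e. the "elements count" rather than the number of code units.
size_t countByPopFront(string s)
{
    size_t n;
    for (; !s.empty; s.popFront())
        ++n;
    return n;
}

void main()
{
    string s = "na\u00EFve";          // U+00EF occupies two UTF-8 code units
    assert(s.length == 6);            // .length still counts code units
    assert(s.walkLength == 5);        // walkLength counts decoded characters
    assert(countByPopFront(s) == 5);  // popFront skips whole characters
    assert(s.back == 'e');            // back decodes the last character

    s.popFront();
    s.popFront();
    assert(s.front == '\u00EF');      // front decodes to a dchar
    assert(s[0] == 0xC3);             // indexing still yields a raw code unit
}
```

The last two assertions show the quirk mentioned in the quote: the range primitives see characters, while the index operator and .length still see code units.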
I personally would find this extremely annoying, because most of the code I write that involves strings is scientific computing code that will never be internationalized, let alone released to the general public. I basically just use ASCII because it's all I need, and if a UTF-8 string contains only ASCII characters, every character is exactly one code unit, so the string can safely be treated as random-access. I don't know how many people out there are in similar situations, but I doubt they'll be too happy.

On the other hand, I guess it wouldn't be hard to write a simple wrapper struct on top of immutable(ubyte)[] and call it AsciiString. Once alias this gets fully debugged, I could even make it implicitly convert to immutable(char)[].
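A rough sketch of what such a wrapper might look like, assuming alias this works as documented. Only the name AsciiString comes from the idea above; the field name data, the property name str, and the choice to validate with enforce are all assumptions made for illustration.

```d
import std.exception : enforce;
import std.range : hasLength, isRandomAccessRange;

// Sketch of the AsciiString idea; member names are hypothetical.
struct AsciiString
{
    immutable(ubyte)[] data;

    this(string s)
    {
        // foreach with an explicit char type iterates code units, not characters.
        foreach (char c; s)
            enforce(c < 0x80, "AsciiString requires pure ASCII input");
        data = cast(immutable(ubyte)[]) s;
    }

    // ASCII is a subset of UTF-8, so viewing the bytes as a string is valid.
    @property string str() const
    {
        return cast(string) data;
    }

    alias str this; // implicit conversion to immutable(char)[]
}

// The underlying byte array keeps random access and a meaningful length.
static assert(isRandomAccessRange!(immutable(ubyte)[]));
static assert(hasLength!(immutable(ubyte)[]));

void main()
{
    auto seq = AsciiString("ACGT");
    assert(seq.data[2] == 'G');   // O(1) indexing on the raw bytes
    assert(seq.data.length == 4); // length equals the element count here

    string s = seq;               // implicit conversion via alias this
    assert(s == "ACGT");
}
```

Range algorithms would be handed seq.data rather than the struct itself, so they see a plain random-access range of bytes, while anything expecting a string still accepts the wrapper directly.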