Making all strings UTF ranges has some risk of WTF

Andrei Alexandrescu Wed, 03 Feb 2010 18:05:30 -0800

It's no secret that string et al. are not a magic recipe for writing correct Unicode code. However, things are pretty good and could be further improved by operating the following changes in std.array and std.range:

- make front() and back() for UTF-8 and UTF-16 automatically decode the first and last Unicode character

- make popFront() and popBack() skip one entire Unicode character (instead of just one code unit)


- alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings

- change hasLength to return false for UTF-8 and UTF-16 strings

These changes effectively make UTF-8 and UTF-16 bidirectional ranges, with the quirk that you still have a sort of a random-access operator.
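A minimal sketch of what the four changes would look like from user code, assuming a Phobos in which strings are decoded as proposed above. The helper name codePoints is hypothetical and used only for illustration.

```d
import std.range : ElementType, empty, front, popFront,
    isBidirectionalRange, isRandomAccessRange, hasLength;

// The proposed classification of UTF-8 strings, as compile-time checks.
static assert(isBidirectionalRange!string);
static assert(!isRandomAccessRange!string);
static assert(!hasLength!string);
static assert(is(ElementType!string == dchar));

// Hypothetical helper: collects the code points of a UTF-8 string by
// walking it as a range.
dchar[] codePoints(string s)
{
    dchar[] result;
    for (; !s.empty; s.popFront()) // popFront skips one whole code point
        result ~= s.front;         // front decodes one whole code point
    return result;
}
```

The quirk mentioned above remains visible: indexing is still available and still works on code units, so "h\u00E9llo"[1] is the first byte of the encoded é, not the character itself.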

I'm very strongly in favor of this change. Bidirectional strings allow beautiful correct algorithms to be written that handle encoded strings without any additional effort; with these changes, everything applicable of std.algorithm works out of the box (with the appropriate fixes here and there), which is really remarkable.

The remaining WTF is the length property. Traditionally, a range offering length also implies the expectation that a range of length n allows you to call popFront n times and then assert that the range is empty. However, if you check e.g. hasLength!string it will yield false, although the string does have an accessible member by that name and of the appropriate type.

Although Phobos always checks its assumptions, people might occasionally write code that just uses .length without checking hasLength. Then, they'll be annoyed when the code fails with UTF-8 and UTF-16 strings.

(The "real" length of the range is not stored, but can be computed by using str.walkLength() in std.range.)
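To make the mismatch concrete, here is a small sketch, again assuming decoding popFront as proposed. The names popsUntilEmpty and lengthMatchesPops are hypothetical. The second function tests exactly the expectation described above: that a range of length n can be popped n times.

```d
import std.range : empty, popFront, walkLength;

// Counts how many times popFront can be called before the range is empty.
size_t popsUntilEmpty(string s)
{
    size_t n;
    for (; !s.empty; s.popFront())
        ++n;
    return n;
}

// True when .length (code units) equals the number of pops (code points).
bool lengthMatchesPops(string s)
{
    return s.length == popsUntilEmpty(s);
}
```

For pure ASCII such as "hello" the two numbers agree. For "h\u00E9llo" they do not: .length reports 6 code units, while walkLength and the pop count both report 5.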


What can be done about that? I see a number of solutions:

(a) Do not operate the change at all.

(b) Operate the change and mention that in range algorithms you should check hasLength and only then use "length" under the assumption that it really means "elements count".

(c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define a different name for that. Any other name (codeUnits, codes etc.) would do. The entire point is to not make algorithms believe strings have a .length property.

(d) Have std.range define a distinct property called e.g. "count" and then specialize it appropriately. Then change all references to .length in std.algorithm and elsewhere to .count.
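As a rough illustration of the discipline in (b) and the kind of specialized property floated in (d), a generic function might look like the sketch below. The name elementCount is hypothetical and is not an existing Phobos symbol; the sketch assumes hasLength!string is false as proposed.

```d
import std.range : hasLength, isInputRange, walkLength;

// Hypothetical helper: returns the number of elements in a range,
// consulting .length only when hasLength says it can be trusted.
size_t elementCount(R)(R r)
    if (isInputRange!R)
{
    static if (hasLength!R)
        return r.length;     // O(1), and really means "elements count"
    else
        return r.walkLength; // O(n) walk, e.g. UTF-8 and UTF-16 strings
}
```

With this, elementCount([1, 2, 3]) uses .length directly, while elementCount("h\u00E9llo") walks the string and counts 5 code points instead of 6 code units.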


What would you do? Any ideas are welcome.


Andrei
