Jonathan M Davis wrote:
On Friday, March 02, 2012 20:41:35 Ali Çehreli wrote:
On 03/02/2012 06:30 PM, Piotr Szturmaj wrote:
  >  Hello,
  >
  >  For this code:
  >
  >  auto c = "test"c;
  >  auto w = "test"w;
  >  auto d = "test"d;
  >  pragma(msg, typeof(c.front));
  >  pragma(msg, typeof(w.front));
  >  pragma(msg, typeof(d.front));
  >
  >  compiler prints:
  >
  >  dchar
  >  dchar
  >  immutable(dchar)
  >
  >  IMO it should print this:
  >
  >  immutable(char)
  >  immutable(wchar)
  >  immutable(dchar)
  >
  >  Is it a bug?

No, that's by design. When used as input ranges, slices of any
character type are exposed as ranges of dchar.

Indeed.

Strings are always treated as ranges of dchar, because it generally makes no
sense to operate on individual chars or wchars. A char is a UTF-8 code unit. A
wchar is a UTF-16 code unit. And a dchar is a UTF-32 code unit. The _only_ one
of those which is guaranteed to be a code point is dchar, since in UTF-32, all
code points are a single code unit. If you were to operate on individual chars
or wchars, you'd be operating on pieces of characters rather than whole
characters, which wreaks havoc with Unicode.
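
To make the code unit / code point distinction concrete, here is a minimal sketch. The sample strings are chosen only for illustration; it assumes nothing beyond `walkLength` from std.range, which counts elements by iterating the range.

```d
import std.range : walkLength;

void main()
{
    // U+00E9 (e with acute) takes two UTF-8 code units but is one code point.
    string s = "\u00E9";
    assert(s.length == 2);       // .length counts code units
    assert(s.walkLength == 1);   // range iteration counts code points

    // A code point outside the BMP is a surrogate pair in UTF-16.
    wstring w = "\U0001F600"w;
    assert(w.length == 2);
    assert(w.walkLength == 1);

    // In UTF-32 every code point is exactly one code unit.
    dstring d = "\U0001F600"d;
    assert(d.length == 1);
    assert(d.walkLength == 1);
}
```

Indexing or slicing `s` by code unit could therefore land in the middle of an encoded character.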

Now, technically speaking, a code point isn't necessarily a full character,
since you can also combine code points (e.g. adding a subscript to a letter),
and a full character is what's called a grapheme, and unfortunately, at the
moment, Phobos doesn't have a way to operate on graphemes, but operating on
code points is _far_ more correct than operating on code units. It's also more
efficient.
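
As a small illustration of why a code point is not always a full character, the sketch below compares a letter followed by a combining accent with its precomposed form. The strings are picked purely as an example.

```d
import std.range : walkLength;

void main()
{
    // 'e' followed by U+0301 (combining acute accent).
    string decomposed = "e\u0301";
    assert(decomposed.length == 3);      // 1 + 2 UTF-8 code units
    assert(decomposed.walkLength == 2);  // two code points, one visible character

    // The precomposed form of the same character is a single code point.
    string precomposed = "\u00E9";
    assert(precomposed.walkLength == 1);

    // == compares code units and performs no normalization.
    assert(decomposed != precomposed);
}
```

So iterating by dchar still splits the decomposed form into two elements, even though it renders as one character.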

Unfortunately, in order to code completely efficiently with Unicode, you have
to understand quite a bit about it, which most programmers don't, but by
operating on ranges of code points, Phobos manages to be correct in the
majority of cases.

I know about Unicode, code units/points and their encoding.

So, yes. It's very much on purpose that all strings are treated as ranges of
dchar.

foreach gives the opportunity to handle any string by char, wchar or dchar. The default of dchar is appropriate there, but why for ranges?
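
For reference, a minimal sketch of what foreach allows when the element type is spelled out explicitly. The sample string is only an illustration.

```d
void main()
{
    // "été": 3 code points, 5 UTF-8 code units.
    string s = "\u00E9t\u00E9";

    size_t units, points, wunits;
    foreach (char c; s)  ++units;    // iterates raw UTF-8 code units
    foreach (dchar c; s) ++points;   // decodes to code points on the fly
    foreach (wchar c; s) ++wunits;   // transcodes to UTF-16 code units

    assert(units == 5);
    assert(points == 3);
    assert(wunits == 3);
}
```

With foreach the caller picks the element type at the loop; with the range primitives the element type of a string is fixed to dchar.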

I was afraid it was on purpose, because it has some bad consequences. It breaks genericity when dealing with ranges. Consider a custom range of char:

struct CharRange
{
    @property bool empty();
    @property char front();
    void popFront();
}

typeof(CharRange.front) and ElementType!CharRange both yield _char_, while for string they yield _dchar_. This discrepancy forces the range writer to special-case strings. I'm currently trying to write a ByDchar range:
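
The discrepancy can be checked at compile time. In this sketch the `data` field and the function bodies of CharRange are hypothetical, filled in only so that the example compiles; the declarations above leave them unspecified.

```d
import std.range : ElementType, isInputRange;

struct CharRange
{
    string data;  // hypothetical backing store, for illustration only
    @property bool empty() { return data.length == 0; }
    @property char front() { return data[0]; }   // returns the raw code unit
    void popFront() { data = data[1 .. $]; }
}

static assert(isInputRange!CharRange);

// A custom range of char keeps its element type...
static assert(is(ElementType!CharRange == char));

// ...but the built-in string types are presented as ranges of dchar.
static assert(is(ElementType!string == dchar));
static assert(is(ElementType!wstring == dchar));
static assert(is(ElementType!dstring == immutable(dchar)));

// Indexing, unlike front, still yields the code unit type.
static assert(is(typeof(string.init[0]) == immutable(char)));

void main() {}
```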

template ByDchar(R)
     if (isInputRange!R && isSomeChar!(ElementType!R))
{
    alias ElementType!R E;
    static if (is(E == dchar))
        alias R ByDchar;
    else static if (is(E == char))
    {
        struct ByDchar
        {
            ...
        }
    }
    else static if (is(E == wchar))
    {
        ...
    }
}

The problem with that range is that when it takes a string type, it aliases that type to itself, because ElementType!R yields dchar. This is why I'm talking about "bad consequences": I just want to iterate a string by _char_, not _dchar_.
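
One possible way to get a range of code units, sketched under an assumption: it relies on `std.utf.byCodeUnit`, which was added to Phobos in releases later than the ones discussed in this thread, so it may not be available in the version at hand.

```d
import std.range : ElementType;
import std.traits : Unqual;
import std.utf : byCodeUnit;   // assumption: a Phobos release that provides it

void main()
{
    string s = "\u00E9t\u00E9";   // 3 code points, 5 UTF-8 code units

    // Wrap the string so that the range primitives work on code units.
    auto units = s.byCodeUnit;
    static assert(is(Unqual!(ElementType!(typeof(units))) == char));

    size_t n;
    foreach (c; units) ++n;       // iterates via empty/front/popFront
    assert(n == 5);
}
```

A template like ByDchar could then test the unqualified element type of the wrapped range instead of special-casing string types, though that is only one way to approach it.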
