On 18/03/11 5:53 PM, Jonathan M Davis wrote:
On Friday, March 18, 2011 03:32:35 spir wrote:
On 03/18/2011 10:29 AM, Peter Alexander wrote:
On 13/03/11 12:05 AM, Jonathan M Davis wrote:
So, when you're using a range of char[] or wchar[], you're really using
a range of dchar. These ranges are bi-directional. They can't be
sliced, and they can't be indexed (since doing so would likely be
invalid). This generally works very well. It's exactly what you want in
most cases. The problem is that that means that the range that you're
iterating over is effectively of a different type than
the original char[] or wchar[].
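
To make the mismatch concrete, here is a minimal sketch, assuming a Phobos in which narrow strings are auto-decoded by the range primitives in std.range. The literal is only an illustration.

```d
import std.range : ElementType, hasSlicing, isBidirectionalRange,
                   isRandomAccessRange, walkLength;

void main()
{
    char[] a;
    // Indexing the array itself yields a code unit of type char...
    static assert(is(typeof(a[0]) == char));

    // ...but as a range, the same array has dchar elements.
    static assert(is(ElementType!(char[]) == dchar));
    static assert(is(ElementType!(wchar[]) == dchar));

    // Bidirectional, but neither random-access nor sliceable as a range.
    static assert(isBidirectionalRange!(char[]));
    static assert(!isRandomAccessRange!(char[]));
    static assert(!hasSlicing!(char[]));

    string s = "h\u00E9llo";    // U+00E9 takes two UTF-8 code units
    assert(s.length == 6);      // array length counts code units
    assert(s.walkLength == 5);  // range length counts decoded code points
}
```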

This has to be the worst language design decision /ever/.

You can't just mess around with fundamental principles like "the first
element in an array of T has type T" for the sake of a minor
convenience. How are we supposed to do generic programming if common
sense reasoning about types doesn't hold?

This is just std::vector<bool> from C++ all over again. Can we not learn
from mistakes of the past?

I partially agree, but compare with a simple ASCII text: you could iterate
over its chars (=codes=bytes), words, lines... Or according to specific
schemes for your app (eg reverse order, every number in it, every word at
start of line...). A piece of text is not only a stream of codes.

The problem is there is no good decision, in the case of char[] or wchar[].
We would have to choose a kind of "natural" sense of what it means to
iterate over a text, but there is no such thing. What does it *mean*? What is
the natural unit of a text?
Bytes or words are code units, which mean nothing on their own. Code points
(<-> dchars) are not guaranteed to mean anything either (as shown by past
discussion: one code point may be the base 'a' and the following one the
combining '^', both in "â"). Code points do not represent "characters" in the
common sense. So, it is very clear that implicitly iterating over dchars is a wrong
choice. But what else? I would rather get rid of wchar and dchar and deal
with plain stream of bytes supposed to represent utf8. Until we get a good
solution to operate at the level of "human" characters.
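
The 'a' plus '^' case can be checked directly. This is a small sketch, assuming UTF-8 string literals and the same auto-decoding behaviour of walkLength on narrow strings. The variable names are illustrative only.

```d
import std.range : walkLength;

void main()
{
    string precomposed = "\u00E2";   // U+00E2, the letter as one code point
    string decomposed  = "a\u0302";  // 'a' followed by U+0302 COMBINING CIRCUMFLEX ACCENT

    // Both render as the same "human" character, yet they differ at every
    // level that the language currently sees.
    assert(precomposed.length == 2);      // UTF-8 code units
    assert(decomposed.length == 3);
    assert(precomposed.walkLength == 1);  // code points
    assert(decomposed.walkLength == 2);

    // Without normalization, a plain comparison treats them as different.
    assert(precomposed != decomposed);
}
```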

Iterating over dchars works in _most_ cases. Iterating over chars only works for
pure ASCII. The additional overhead for dealing with graphemes instead of code
points is almost certainly prohibitive, it _usually_ isn't necessary, and we
don't have an actual grapheme solution yet. So, treating char[] and wchar[] as
if their elements were valid on their own is _not_ going to work. Treating them
along with dchar[] as ranges of dchar _mostly_ works. We definitely should have a
way to handle them as ranges of graphemes for those who need to, but the code
point vs grapheme issue is nowhere near as critical as the code unit vs code
point issue.
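
A rough illustration of the code unit vs code point difference, assuming nothing beyond the built-in foreach decoding over strings (the sample word is arbitrary):

```d
void main()
{
    string s = "na\u00EFve";  // U+00EF occupies two UTF-8 code units

    size_t units, fragments;
    foreach (char c; s)       // iterate raw code units
    {
        ++units;
        if (c >= 0x80)
            ++fragments;      // part of a multi-byte sequence, not valid on its own
    }

    size_t points;
    foreach (dchar c; s)      // foreach decodes to code points
        ++points;

    assert(units == 6);
    assert(fragments == 2);
    assert(points == 5);
}
```

Treating each char as a character would see six "characters" here, two of which are meaningless alone; iterating by dchar sees the expected five.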

I don't really want to get into the whole unicode discussion again. It has been
discussed quite a bit on the D list already. There is no perfect solution. The
current solution _mostly_ works, and, for the most part IMHO, is the correct
solution. We _do_ need a full-on grapheme handling solution, but a lot of stuff
doesn't need that and the overhead for dealing with it would be prohibitive. The
main problem with using code points rather than graphemes is the lack of
normalization, and a _lot_ of string code can get by just fine without that.

So, we have a really good 90% solution and we still need a 100% solution, but
using the 100% all of the time would almost certainly not be acceptable due to
performance issues, and doing stuff by code unit instead of code point would be
_really_ bad. So, what we have is good and will likely stay as is. We just need
a proper grapheme solution for those who need it.

- Jonathan M Davis


P.S. Unicode is just plain ugly.... :(

I must be missing something, because the solution seems obvious to me:

char[], wchar[], and dchar[] should be simple arrays like int[] with no unicode semantics.

string, wstring, and dstring should not be aliases to arrays, but instead should be separate types that behave the way char[], wchar[], and dchar[] do currently.

Is there any problem with this approach?
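
Roughly what I have in mind, as a sketch only: the type name Text and its member codeUnits are made up for illustration and are not anything in Phobos, and the decoding relies on std.utf.decode and std.utf.stride.

```d
import std.range : ElementType, isInputRange;
import std.utf : decode, stride;

// Hypothetical string type: owns the Unicode semantics so that the
// underlying array does not have to.
struct Text
{
    private string data;

    @property bool empty() { return data.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i);  // decode the first code point
    }

    void popFront() { data = data[stride(data, 0) .. $]; }

    // Explicit access to the raw storage size, in code units.
    @property size_t codeUnits() { return data.length; }
}

static assert(isInputRange!Text);
static assert(is(ElementType!Text == dchar));

void main()
{
    auto t = Text("h\u00E9");
    assert(t.codeUnits == 3);
    assert(t.front == 'h');
    t.popFront();
    assert(t.front == '\u00E9');
    t.popFront();
    assert(t.empty);
}
```

With something like this, generic code over an array of T would always see elements of type T, and only the dedicated string type would iterate by dchar.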
