Re: Ranges

Jonathan M Davis Sat, 12 Mar 2011 16:06:24 -0800

On Saturday 12 March 2011 14:02:00 Jonas Drewsen wrote:
> Hi,
> 
>     I'm working a bit with ranges atm. but there are definitely some
> things that are not clear to me yet. Can anyone tell me why the char
> arrays cannot be copied but the int arrays can?
> 
> import std.stdio;
> import std.algorithm;
> 
> void main(string[] args) {
> 
>    // This works
>    int[] a1 = [1,2,3,4];
>    int[] a2 = [5,6,7,8];
>    copy(a1, a2);
> 
>    // This does not!
>    char[] a3 = ['1','2','3','4'];
>    char[] a4 = ['5','6','7','8'];
>    copy(a3, a4);
> 
> }
> 
> Error message:
> 
> test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if
> (isInputRange!(Range1) && isOutputRange!(Range2,ElementType!(Range1)))
> does not match any function template declaration
> 
> test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if
> (isInputRange!(Range1) && isOutputRange!(Range2,ElementType!(Range1)))
> cannot deduce template function from argument types !()(char[],char[])


Character arrays / strings are not exactly normal. And there's a very good
reason for it: Unicode.

In Unicode, a character is generally a single code point (there are also
graphemes, which combine code points to add accents and superscripts and
whatnot to create a single character, but we'll ignore that in this
discussion - it's complicated enough as it is). Depending on the encoding,
that code point may be made up of one - or more - code units. UTF-8 uses
8-bit code units. UTF-16 uses 16-bit code units. And UTF-32 uses 32-bit code
units. char is a UTF-8 code unit. wchar is a UTF-16 code unit. dchar is a
UTF-32 code unit. UTF-32 is the _only_ one of those three which _always_ has
one code unit per code point.
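
To make the code unit / code point distinction concrete, here is a minimal
sketch that counts how many code units one code point needs in each encoding.
UnitCounts and unitCounts are names made up for this illustration; the only
library call is std.utf.codeLength.

```d
import std.utf : codeLength;

// Hypothetical helper type: code units needed per encoding.
struct UnitCounts
{
    size_t utf8;
    size_t utf16;
    size_t utf32;
}

// Returns how many char, wchar and dchar code units encode the code point c.
UnitCounts unitCounts(dchar c)
{
    return UnitCounts(codeLength!char(c),
                      codeLength!wchar(c),
                      codeLength!dchar(c));
}
```

For example, unitCounts('A') gives 1/1/1, unitCounts('\u00E9') (e with an
acute accent) gives 2/1/1, and unitCounts('\U0001D11E') (a code point outside
the basic multilingual plane) gives 4/2/1.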

With an array of integers you can index it and slice it and be sure that
everything that you're doing is valid. If you look at a single element, you
know that it's a valid int. If you slice it, you know that every int in there
is valid. If you're dealing with a dstring or dchar[], then the same still
holds.

A dstring or dchar[] is an array of UTF-32 code units. Every code point is a
single code unit, so every element in the array is a valid code point. You can
take an arbitrary element in that array and know that it's a valid code point.
You can slice it wherever you want and you still have a valid dstring or
dchar[]. The same does _not_ hold for char[] and wchar[].

char[] and wchar[] are arrays of UTF-8 and UTF-16 code units respectively. In
both of those encodings, multiple code units may be required to create a
single code point. So, for instance, a code point in UTF-8 could have 4 code
units. That means that _4_ elements of that char[] make up a _single_ code
point. You'd need _all_ 4 of those elements to create a single, valid
character. So, you _can't_ just take an arbitrary element in a char[] or
wchar[] and expect it to be valid. You _can't_ just slice it anywhere. The
resulting array stands a good chance of being invalid. You have to slice on
code point boundaries - otherwise you could slice characters in half and end
up with an invalid string. So, unlike other arrays, it just doesn't work to
treat char[] and wchar[] as random access ranges of their element type. What
the programmer cares about is characters - dchars - not chars or wchars.
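
As a rough illustration of why slicing in the middle of a code point is a
problem, the sketch below checks a slice with std.utf.validate, which throws
a UTFException on malformed input. isValidUtf8 is just a name I picked for
the example.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: true if s is well-formed UTF-8, false otherwise.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);   // throws UTFException on an invalid sequence
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```

With string s = "\u00E9t\u00E9" (5 code units for 3 code points), s[0 .. 2]
is a complete code point and passes, while s[0 .. 1] cuts the first character
in half and fails.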

So, the way this is handled is that char[], wchar[], and dchar[] are all
treated as ranges of dchar. In the case of dchar[], this is nothing special.
You can index it and slice it as normal. So, it is a random access range.
However, in the case of char[] and wchar[], that means that when you're
iterating over them, you're not dealing with a single element of the array at
a time. front returns a dchar, and popFront() pops off however many elements
made up front. It's like with foreach. If you iterate a char[] with an
inferred type or with char, then each individual code unit is given:

foreach(c; myStr) {}

But if you iterate over with dchar, then each code point is given as a dchar:

foreach(dchar c; myStr) {}

If you were to try and iterate over a char[] by char, then you would be
looking at code units rather than code points, which is _rarely_ what you
want. If you're dealing with anything other than pure ASCII, you _will_ have
bugs if you do that. You're supposed to use dchar with foreach and character
arrays. That way, each value you process is a valid character. Ranges do the
same, only you don't give them an iteration type, so they're _always_
iterating over dchar.
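
The difference between the two foreach forms shows up as soon as the string
contains anything non-ASCII. A small sketch (countByChar and countByDchar are
made-up names for the illustration):

```d
// Counts loop iterations when foreach hands out UTF-8 code units.
size_t countByChar(string s)
{
    size_t n;
    foreach (char c; s)
        ++n;
    return n;
}

// Counts loop iterations when foreach decodes to code points.
size_t countByDchar(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}
```

For "h\u00E9llo" the first function returns 6 and the second returns 5; for
pure ASCII such as "hello" both return 5, which is why this kind of bug tends
to go unnoticed.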

So, when you're using a range of char[] or wchar[], you're really using a
range of dchar. These ranges are bi-directional. They can't be sliced, and
they can't be indexed (since doing so would likely be invalid). This generally
works very well. It's exactly what you want in most cases. The problem is that
the range that you're iterating over is effectively of a different type than
the original char[] or wchar[].
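
You can ask the range traits in std.range directly how a given array type is
classified. The following is a sketch of compile-time checks that I'd expect
to hold as long as Phobos decodes narrow strings when treating them as ranges:

```d
import std.range;

// char[] is seen as a bidirectional range of dchar...
static assert(is(ElementType!(char[]) == dchar));
static assert(isBidirectionalRange!(char[]));

// ...but not as a random access or sliceable range.
static assert(!isRandomAccessRange!(char[]));
static assert(!hasSlicing!(char[]));

// dchar[] keeps the full set of array capabilities.
static assert(is(ElementType!(dchar[]) == dchar));
static assert(isRandomAccessRange!(dchar[]));
static assert(hasSlicing!(dchar[]));
```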

You can't just take two ranges of dchar of the same length and necessarily
have them fit in the same char[] or wchar[]. They have the same length,
because they have the same number of code points. However, they could have a
different number of code _units_, so the lengths of the actual arrays could
differ. So, you can't just take an arbitrary range of dchar and copy it into
an arbitrary char[] or wchar[], even if the two hold the same number of
characters.
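
A quick sketch of that mismatch, using std.range.walkLength to count code
points and .length to count code units (sameCodePoints is a hypothetical
helper name):

```d
import std.range : walkLength;

// True if both strings contain the same number of code points.
bool sameCodePoints(string a, string b)
{
    return a.walkLength == b.walkLength;
}
```

"abc" and "\u00E4\u00F6\u00FC" (three accented letters) have the same number
of code points, so sameCodePoints returns true, yet the first array has a
.length of 3 and the second a .length of 6.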

The way that this is dealt with in the case of a function like copy is that
what you're copying _to_ must be an output range. char[] and wchar[] are _not_
output ranges of dchar, because of their differing number of code units per
code point. So, they don't work with copy. You need to use a dchar[] as the
output range if you want to use strings with copy.
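
Applied to the example from the original post, that could look something like
the sketch below. copyToDchars is a name I made up; the buffer is sized by the
number of code units, which is never smaller than the number of code points,
and I've only reasoned this through for ASCII input like the '1' to '4' in the
original post.

```d
import std.algorithm : copy;

// Hypothetical helper: copies src into a freshly allocated dchar[].
dchar[] copyToDchars(const(char)[] src)
{
    // At least as many slots as there are code points in src.
    auto dst = new dchar[](src.length);

    // copy returns the part of dst that was not filled.
    auto rest = copy(src, dst);
    return dst[0 .. $ - rest.length];
}
```

With char[] a3 = ['1','2','3','4'], copyToDchars(a3) gives a dchar[] equal to
"1234"d.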

Now, in some cases, it might be possible to special case some of the range
functions to treat char[] and wchar[] as arrays instead of ranges (in the case
of copy, that's probably possible if both arguments are of the same type), but
that can't be done in the general case. You could open an enhancement request
for copy to treat char[] and wchar[] as arrays if _both_ of the arguments are
of the same type.

- Jonathan M Davis
