Re: The Case For Autodecode

2016-06-04 Thread Steven Schveighoffer via Digitalmars-d

On 6/4/16 4:57 AM, Patrick Schluter wrote:

On Friday, 3 June 2016 at 20:18:31 UTC, Steven Schveighoffer wrote:

On 6/3/16 3:52 PM, ag0aep6g wrote:



Does it work for char -> wchar, too?


It does not. 0xFFFF is a valid code point, and I think so are all the
other values that would result. In fact, I think there are no invalid
code units for wchar.


https://codepoints.net/specials

U+FFFF would be fine, better at least than a surrogate.



U+FFFF is still a valid code point, even if it's not assigned any
Unicode character.


But the result would be U+FF80 to U+FFFF, and I'm sure some of those are
valid code points.


-Steve


Re: The Case For Autodecode

2016-06-04 Thread Observer via Digitalmars-d

On Friday, 3 June 2016 at 11:24:40 UTC, ag0aep6g wrote:
Finally, this is not the only argument in favor of *keeping* 
autodecoding, of course. Not wanting to break user code is the 
big one there, I guess.


I'm not familiar with the details of autodecoding, but one thing
strikes me about this whole discussion.  It seems to me that it
is just nibbling around the edges of how one should implement full
Unicode support.  And it seems to me that that topic, and how
autodecoding plays into it, won't be properly understood except by
comparison with mature software that has undergone many years of
testing and revision.  Two examples stand out to me:

* Perl 5 has undergone a gradual evolution, over many releases,
  to get this right.  It might also be the case that Perl 6 is
  even cleaner.

* The International Components for Unicode (ICU) package, with
  supported libraries for C, C++, and Java.  This is the industry-
  standard definition of what it means to handle Unicode in these
  languages.  See http://site.icu-project.org/ for details.

Both of these implementations have seen many years of real-world
use, so I would tend to look to them for guidance over trying to
develop my own opinion based on some small set of particular use
cases I might happen to have encountered.


Re: The Case For Autodecode

2016-06-04 Thread Patrick Schluter via Digitalmars-d
On Friday, 3 June 2016 at 20:18:31 UTC, Steven Schveighoffer 
wrote:

On 6/3/16 3:52 PM, ag0aep6g wrote:

On 06/03/2016 09:09 PM, Steven Schveighoffer wrote:

Except many chars *do* properly convert. This should work:

char c = 'a';
dchar d = c;
assert(d == 'a');


Yeah, that's what I meant by "standalone code unit". Code units that on
their own represent a code point would not be touched.


But you can get a standalone code unit that is part of a coded 
sequence quite easily


foo(string s)
{
   auto x = s[0];
   dchar d = x;
}



As I mentioned in my earlier reply, some kind of "bounds checking" for
the conversion could be a possibility.

Hm... an interesting possibility:

dchar _dchar_convert(char c)
{
return cast(int)cast(byte)c; // get sign extension for non-ASCII
}


So when the char's most significant bit is set, this fills the upper
bits of the dchar with 1s, right? And a set most significant bit in a
char means it's part of a multibyte sequence, while in a dchar it means
that the dchar is invalid, because they only go up to U+10FFFF. Huh. Neat.


An interesting thing is that I think the CPU can do this for us.


Does it work for char -> wchar, too?


It does not. 0xFFFF is a valid code point, and I think so are
all the other values that would result. In fact, I think there
are no invalid code units for wchar.


https://codepoints.net/specials

U+FFFF would be fine, better at least than a surrogate.



Re: The Case For Autodecode

2016-06-03 Thread ag0aep6g via Digitalmars-d

On 06/03/2016 11:13 PM, Steven Schveighoffer wrote:

No, but I like the idea of preserving the erroneous character you tried
to convert.


Makes sense.


But is there an invalid wchar? I looked through the wikipedia article on
UTF 16, and it didn't seem to say there was one.

If we use U+FFFD, that signifies a coding problem but is still a valid
code point. However, doing a wchar in the D800 - D8FF range without
being followed by a code unit in the DC00 - DFFF range is an invalid
sequence. D throws if it encounters such a thing.


The Unicode FAQ has an answer to this exact question, but it also only 
says that "[u]npaired surrogates are invalid" [1].


It also mentions "noncharacters" which are "permanently reserved [...] 
for internal use". "For example, they might be used internally as a 
particular kind of object placeholder in a string." [2] - Not too bad.


And then there is the replacement character, of course. "[U]sed to 
replace an incoming character whose value is unknown or unrepresentable 
in Unicode" [3].



[1] http://www.unicode.org/faq/utf_bom.html#utf16-7
[2] http://www.unicode.org/faq/private_use.html#noncharacters
[3] http://www.fileformat.info/info/unicode/char/0fffd/index.htm


Re: The Case For Autodecode

2016-06-03 Thread Steven Schveighoffer via Digitalmars-d

On 6/3/16 4:39 PM, ag0aep6g wrote:

On 06/03/2016 10:18 PM, Steven Schveighoffer wrote:

But you can get a standalone code unit that is part of a coded sequence
quite easily

foo(string s)
{
auto x = s[0];
dchar d = x;
}


I don't think we're disagreeing on anything.

I'm calling UTF-8 code units below 0x80 "standalone" code units. They're
never part of multibyte sequences. Your _dchar_convert returns them
unscathed.


Ah, I thought you meant standalone as in it was assigned to a standalone 
char variable vs. part of an array or range. My mistake.


Re-reading your original message, I see that should have been clear to me...


So we need most efficient logic that does this:

if(c & 0x80)
 return wchar(0xd800 + c);


Is this going to be faster than returning a constant invalid wchar?


No, but I like the idea of preserving the erroneous character you tried 
to convert.


But is there an invalid wchar? I looked through the wikipedia article on 
UTF 16, and it didn't seem to say there was one.


If we use U+FFFD, that signifies a coding problem but is still a valid 
code point. However, doing a wchar in the D800 - D8FF range without 
being followed by a code unit in the DC00 - DFFF range is an invalid 
sequence. D throws if it encounters such a thing.


-Steve
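
A minimal sketch of that last point (assuming std.utf.decode and
std.exception.assertThrown; illustrative, not from the original post):

import std.exception : assertThrown;
import std.utf : decode, UTFException;

unittest
{
    wchar[] lone = [cast(wchar) 0xD800]; // a high surrogate with no low surrogate after it
    size_t i = 0;
    assertThrown!UTFException(decode(lone, i)); // D throws on the unpaired surrogate
}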


Re: The Case For Autodecode

2016-06-03 Thread ag0aep6g via Digitalmars-d

On 06/03/2016 10:18 PM, Steven Schveighoffer wrote:

But you can get a standalone code unit that is part of a coded sequence
quite easily

foo(string s)
{
auto x = s[0];
dchar d = x;
}


I don't think we're disagreeing on anything.

I'm calling UTF-8 code units below 0x80 "standalone" code units. They're 
never part of multibyte sequences. Your _dchar_convert returns them 
unscathed.


Higher code units are always part of multibyte sequences (or invalid 
already). Your function returns invalid code points for them.


_dchar_convert does exactly what I meant, except that I had in mind 
returning the replacement character for non-standalone code units. But I 
see that that may not be feasible, and it's probably not necessary.


[...]

So we need most efficient logic that does this:

if(c & 0x80)
 return wchar(0xd800 + c);


Is this going to be faster than returning a constant invalid wchar?


else
 return wchar(c);

More expensive, but more correct!

wchar to dchar conversion is pretty sound, as the surrogate pairs are
invalid code points for dchar.

-Steve




Re: The Case For Autodecode

2016-06-03 Thread Steven Schveighoffer via Digitalmars-d

On 6/3/16 3:52 PM, ag0aep6g wrote:

On 06/03/2016 09:09 PM, Steven Schveighoffer wrote:

Except many chars *do* properly convert. This should work:

char c = 'a';
dchar d = c;
assert(d == 'a');


Yeah, that's what I meant by "standalone code unit". Code units that on
their own represent a code point would not be touched.


But you can get a standalone code unit that is part of a coded sequence 
quite easily


foo(string s)
{
   auto x = s[0];
   dchar d = x;
}




As I mentioned in my earlier reply, some kind of "bounds checking" for
the conversion could be a possibility.

Hm... an interesting possibility:

dchar _dchar_convert(char c)
{
return cast(int)cast(byte)c; // get sign extension for non-ASCII
}


So when the char's most significant bit is set, this fills the upper
bits of the dchar with 1s, right? And a set most significant bit in a
char means it's part of a multibyte sequence, while in a dchar it means
that the dchar is invalid, because they only go up to U+10FFFF. Huh. Neat.


An interesting thing is that I think the CPU can do this for us.


Does it work for char -> wchar, too?


It does not. 0xFFFF is a valid code point, and I think so are all the 
other values that would result. In fact, I think there are no invalid 
code units for wchar. Of course, a surrogate pair requires another code 
unit to be valid, so we can at least promote a char to a wchar in the 
surrogate pair range (and always in the low or high surrogate range so a 
naive transcoding of a char range to wchar will result in an invalid 
sequence if there are any non-ascii characters).


So we need most efficient logic that does this:

if(c & 0x80)
return wchar(0xd800 + c);
else
return wchar(c);

More expensive, but more correct!

wchar to dchar conversion is pretty sound, as the surrogate pairs are 
invalid code points for dchar.


-Steve
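
A sketch of the snippet above as a complete function (one reading of the
logic described there, mapping non-ASCII code units onto lone high
surrogates; not library code):

wchar _wchar_convert(char c)
{
    if (c & 0x80)
        return cast(wchar)(0xD800 + c); // lands in 0xD880 - 0xD8FF, a lone high surrogate
    else
        return cast(wchar) c;           // ASCII converts unchanged
}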


Re: The Case For Autodecode

2016-06-03 Thread ag0aep6g via Digitalmars-d

On 06/03/2016 09:09 PM, Steven Schveighoffer wrote:

Except many chars *do* properly convert. This should work:

char c = 'a';
dchar d = c;
assert(d == 'a');


Yeah, that's what I meant by "standalone code unit". Code units that on 
their own represent a code point would not be touched.



As I mentioned in my earlier reply, some kind of "bounds checking" for
the conversion could be a possibility.

Hm... an interesting possibility:

dchar _dchar_convert(char c)
{
return cast(int)cast(byte)c; // get sign extension for non-ASCII
}


So when the char's most significant bit is set, this fills the upper 
bits of the dchar with 1s, right? And a set most significant bit in a 
char means it's part of a multibyte sequence, while in a dchar it means 
that the dchar is invalid, because they only go up to U+10FFFF. Huh. Neat.


Does it work for char -> wchar, too?


Re: The Case For Autodecode

2016-06-03 Thread Steven Schveighoffer via Digitalmars-d

On 6/3/16 3:12 PM, Steven Schveighoffer wrote:

On 6/3/16 3:09 PM, Steven Schveighoffer wrote:


Hm... an interesting possibility:

dchar _dchar_convert(char c)
{
   return cast(int)cast(byte)c; // get sign extension for non-ASCII
}


Allows this too:

dchar d = char.init; // calls conversion function
assert(d == dchar.init);


Hm... actually doesn't work. dchar.init is 0xFFFF.

-Steve
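
A small sketch of the values involved (standard D init values, showing
why the assert fails):

unittest
{
    static assert(char.init == 0xFF);
    static assert(dchar.init == 0xFFFF);
    // sign extension of 0xFF yields 0xFFFF_FFFF, not 0xFFFF, so the
    // converted char.init does not compare equal to dchar.init
    assert(cast(uint) cast(byte) char.init == 0xFFFF_FFFF);
}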


Re: The Case For Autodecode

2016-06-03 Thread Steven Schveighoffer via Digitalmars-d

On 6/3/16 3:09 PM, Steven Schveighoffer wrote:


Hm... an interesting possibility:

dchar _dchar_convert(char c)
{
   return cast(int)cast(byte)c; // get sign extension for non-ASCII
}


Allows this too:

dchar d = char.init; // calls conversion function
assert(d == dchar.init);

:)

-Steve


Re: The Case For Autodecode

2016-06-03 Thread Steven Schveighoffer via Digitalmars-d

On 6/3/16 2:55 PM, ag0aep6g wrote:

On 06/03/2016 08:36 PM, Steven Schveighoffer wrote:

but a direct cast
of the bits from char does NOT mean the same thing as a dchar.


That gives me an idea. A bitwise reinterpretation of int to float is
nonsensical, too. Yet int implicitly converts to float and (for small
values) preserves the meaning. I mean, implicit conversion doesn't have
to mean bitwise reinterpretation.


I'm pretty sure the CPU handles this, though.


How about replacing non-standalone code units with replacement character
(U+FFFD) in implicit widening conversions?

For example:


char c = "ö"[0];
wchar w = c;
assert(w == '\uFFFD');


Would probably just be band-aid, though.


Except many chars *do* properly convert. This should work:

char c = 'a';
dchar d = c;
assert(d == 'a');

As I mentioned in my earlier reply, some kind of "bounds checking" for 
the conversion could be a possibility.


Hm... an interesting possibility:

dchar _dchar_convert(char c)
{
   return cast(int)cast(byte)c; // get sign extension for non-ASCII
}

-Steve


Re: The Case For Autodecode

2016-06-03 Thread ag0aep6g via Digitalmars-d

On 06/03/2016 08:36 PM, Steven Schveighoffer wrote:

but a direct cast
of the bits from char does NOT mean the same thing as a dchar.


That gives me an idea. A bitwise reinterpretation of int to float is 
nonsensical, too. Yet int implicitly converts to float and (for small 
values) preserves the meaning. I mean, implicit conversion doesn't have 
to mean bitwise reinterpretation.


How about replacing non-standalone code units with replacement character 
(U+FFFD) in implicit widening conversions?


For example:


char c = "ö"[0];
wchar w = c;
assert(w == '\uFFFD');


Would probably just be band-aid, though.
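
A minimal sketch of that widening rule (assuming any code unit >= 0x80
counts as non-standalone; illustrative only):

wchar widen(char c)
{
    // standalone (ASCII) code units pass through, everything else
    // becomes the replacement character U+FFFD
    return c < 0x80 ? cast(wchar) c : '\uFFFD';
}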


Re: The Case For Autodecode

2016-06-03 Thread Patrick Schluter via Digitalmars-d
On Friday, 3 June 2016 at 18:36:45 UTC, Steven Schveighoffer 
wrote:


The real problem here is that char implicitly casts to dchar. 
That should not be allowed.



Indeed.




Re: The Case For Autodecode

2016-06-03 Thread ag0aep6g via Digitalmars-d

On 06/03/2016 07:51 PM, Patrick Schluter wrote:

You mean that '¶' is represented internally as 1 byte 0xB6 and that it
can be handled as such without error? This would mean that char literals
are broken. The only valid way to represent '¶' in memory is 0xC2 0xB6.
Sorry if I misunderstood, I'm only starting to learn D.


There is no single char for '¶', that's right, and D gets that right. 
That's not what happens.


But there is a single wchar for it. wchar is a UTF-16 code unit, 2 
bytes. UTF-16 encodes '¶' as a single code unit, so that's correct.


The problem is that you can accidentally search for a wchar in a range 
of chars. Every char is compared to the wchar by numeric value. But the 
numeric values of a char don't mean the same as those of a wchar, so you 
get nonsensical results.


A similar implicit conversion lets you search for a large number in a 
byte[]:



byte[] arr = [1, 2, 3];
foreach(x; arr) if (x == 1000) writeln("found it!");


You won't ever find 1000 in a byte[], of course. The byte type simply 
can't store the value. But you can compare a byte with an int. And that 
comparison is meaningful, unlike the comparison of a char with a wchar.


You can also produce false positives with numeric types, by mixing 
signed and unsigned types:



int[] arr = [1, -1, 3];
foreach(x; arr) if (x == uint.max) writeln("found it!");


uint.max is a large number, -1 is a small number. They're considered 
equal here because of an implicit conversion that messes with the 
meaning of the bits.


False negatives are not possible with numeric types. At least not in the 
same way as with differently sized Unicode code units.


Re: The Case For Autodecode

2016-06-03 Thread Steven Schveighoffer via Digitalmars-d

On 6/3/16 1:51 PM, Patrick Schluter wrote:

On Friday, 3 June 2016 at 11:24:40 UTC, ag0aep6g wrote:

This is mostly me trying to make sense of the discussion.

So everyone hates autodecoding. But Andrei seems to hate it a good bit
less than everyone else. As far as I could follow, he has one reason
for that, which might not be clear to everyone:

char converts implicitly to dchar, so the compiler lets you search for
a dchar in a range of chars. But that gives nonsensical results. For
example, you won't find 'ö' in  "ö".byChar, but you will find '¶' in
there ('¶' is U+00B6, 'ö' is U+00F6, and 'ö' is encoded as 0xC3 0xB6
in UTF-8).


You mean that '¶' is represented internally as 1 byte 0xB6 and that it
can be handled as such without error? This would mean that char literals
are broken. The only valid way to represent '¶' in memory is 0xC2 0xB6.
Sorry if I misunderstood, I'm only starting to learn D.


Not if '¶' is a dchar.

What is happening in the example is that find is looking at the 
"ö".byChar range and saying "hm... can I compare dchar('¶') to char? 
Well, char implicitly casts to dchar, so I'm good!", but a direct cast 
of the bits from char does NOT mean the same thing as a dchar. It has to 
go through a decoding first.


The real problem here is that char implicitly casts to dchar. That 
should not be allowed.


-Steve


Re: The Case For Autodecode

2016-06-03 Thread Patrick Schluter via Digitalmars-d

On Friday, 3 June 2016 at 11:24:40 UTC, ag0aep6g wrote:

This is mostly me trying to make sense of the discussion.

So everyone hates autodecoding. But Andrei seems to hate it a 
good bit less than everyone else. As far as I could follow, he 
has one reason for that, which might not be clear to everyone:


char converts implicitly to dchar, so the compiler lets you 
search for a dchar in a range of chars. But that gives 
nonsensical results. For example, you won't find 'ö' in  
"ö".byChar, but you will find '¶' in there ('¶' is U+00B6, 'ö' 
is U+00F6, and 'ö' is encoded as 0xC3 0xB6 in UTF-8).


You mean that '¶' is represented internally as 1 byte 0xB6 and 
that it can be handled as such without error? This would mean 
that char literals are broken. The only valid way to represent 
'¶' in memory is 0xC2 0xB6.

Sorry if I misunderstood, I'm only starting to learn D.




Re: The Case For Autodecode

2016-06-03 Thread ag0aep6g via Digitalmars-d

On 06/03/2016 03:56 PM, Kagamin wrote:

A lot of discussion is disagreement on understanding of correctness of
unicode support. I see 4 possible meanings here:
1. Implemented according to spec.
2. Provides level 1 unicode support.
3. Provides level 2 unicode support.
4. Achieves the goal of unicode, i.e. text processing according to
natural language rules.


Speaking of that, the document that Walter dug up [1], which talks about 
support levels, is about regular expression engines in particular. It's 
not about general language support.


The version he linked to is also pretty old. A more recent revision [2] 
calls level 1 (code points) the "minimally useful level of support", 
speaks warmly about level 2 (graphemes), and says that level 3 (locale 
dependent behavior) is "only useful for specific applications".



[1] http://unicode.org/reports/tr18/tr18-5.1.html
[2] http://www.unicode.org/reports/tr18/tr18-17.html


Re: The Case For Autodecode

2016-06-03 Thread Kagamin via Digitalmars-d

On Friday, 3 June 2016 at 11:24:40 UTC, ag0aep6g wrote:
Finally, this is not the only argument in favor of *keeping* 
autodecoding, of course. Not wanting to break user code is the 
big one there, I guess.


A lot of discussion is disagreement on understanding of 
correctness of unicode support. I see 4 possible meanings here:

1. Implemented according to spec.
2. Provides level 1 unicode support.
3. Provides level 2 unicode support.
4. Achieves the goal of unicode, i.e. text processing according 
to natural language rules.


Re: The Case For Autodecode

2016-06-03 Thread Steven Schveighoffer via Digitalmars-d

On 6/3/16 7:24 AM, ag0aep6g wrote:

This is mostly me trying to make sense of the discussion.

So everyone hates autodecoding. But Andrei seems to hate it a good bit
less than everyone else. As far as I could follow, he has one reason for
that, which might not be clear to everyone:


I don't hate autodecoding. What I hate is that char[] autodecodes.

If strings were some auto-decoding type that wasn't immutable(char)[], 
that would be absolutely fine with me. In fact, I see this as the only 
way to fix this, since it shouldn't break any code.
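
A minimal sketch of what such a wrapper type could look like (assuming
std.utf.decode; not an actual Phobos type):

import std.utf : decode;

struct DecodingString
{
    private immutable(char)[] data;

    @property bool empty() const { return data.length == 0; }

    @property dchar front() const
    {
        auto s = data;
        size_t i = 0;
        return decode(s, i); // decode one code point without consuming it
    }

    void popFront()
    {
        size_t i = 0;
        decode(data, i);     // find the length of the leading code point
        data = data[i .. $];
    }
}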



char converts implicitly to dchar, so the compiler lets you search for a
dchar in a range of chars. But that gives nonsensical results. For
example, you won't find 'ö' in  "ö".byChar, but you will find '¶' in
there ('¶' is U+00B6, 'ö' is U+00F6, and 'ö' is encoded as 0xC3 0xB6 in
UTF-8).


Question: why couldn't the compiler emit (in non-release builds) a 
runtime check to make sure you aren't converting non-ASCII characters to 
dchars? That is, like out of bounds checking, but for char -> dchar 
conversions, or any other invalid mechanism?


Yep, it's going to kill a lot of performance. But it's going to catch a 
lot of problems.


One thing to point out here, is that autodecoding only happens on 
arrays, and even then, only in certain cases.


-Steve
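
A hypothetical sketch of what such a check could look like if the
compiler lowered char -> dchar conversions to a helper (the name and the
check are illustrative, not something the compiler does today):

dchar checkedCharToDchar(char c)
{
    // only standalone (ASCII) code units convert to a meaningful dchar
    assert(c <= 0x7F, "implicit char -> dchar conversion of a non-ASCII code unit");
    return c;
}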