Re: The Case For Autodecode
On 6/4/16 4:57 AM, Patrick Schluter wrote: On Friday, 3 June 2016 at 20:18:31 UTC, Steven Schveighoffer wrote: On 6/3/16 3:52 PM, ag0aep6g wrote: Does it work for char -> wchar, too? It does not. 0xFFFF is a valid code point, and I think so are all the other values that would result. In fact, I think there are no invalid code units for wchar. https://codepoints.net/specials U+FFFF would be fine, better at least than a surrogate. U+FFFF is still a valid code point, even if it's not assigned any unicode character. But the result would be U+FF80 to U+FFFF, and I'm sure some of those are valid code points. -Steve
Re: The Case For Autodecode
On Friday, 3 June 2016 at 11:24:40 UTC, ag0aep6g wrote: Finally, this is not the only argument in favor of *keeping* autodecoding, of course. Not wanting to break user code is the big one there, I guess. I'm not familiar with the details of autodecoding, but one thing strikes me about this whole discussion. It seems to me that it is just nibbling around the edges of how one should implement full Unicode support. And it seems to me that that topic, and how autodecoding plays into it, won't be properly understood except by comparison with mature software that has undergone many years of testing and revision. Two examples stand out to me: * Perl 5 has undergone a gradual evolution, over many releases, to get this right. It might also be the case that Perl 6 is even cleaner. * The International Components for Unicode (ICU) package, with supported libraries for C, C++, and Java. This is the industry-standard definition of what it means to handle Unicode in these languages. See http://site.icu-project.org/ for details. Both of these implementations have seen many years of real-world use, so I would tend to look to them for guidance over trying to develop my own opinion based on some small set of particular use cases I might happen to have encountered.
Re: The Case For Autodecode
On Friday, 3 June 2016 at 20:18:31 UTC, Steven Schveighoffer wrote: On 6/3/16 3:52 PM, ag0aep6g wrote: On 06/03/2016 09:09 PM, Steven Schveighoffer wrote: Except many chars *do* properly convert. This should work: char c = 'a'; dchar d = c; assert(d == 'a'); Yeah, that's what I meant by "standalone code unit". Code units that on their own represent a code point would not be touched. But you can get a standalone code unit that is part of a coded sequence quite easily: foo(string s) { auto x = s[0]; dchar d = x; } As I mentioned in my earlier reply, some kind of "bounds checking" for the conversion could be a possibility. Hm... an interesting possibility: dchar _dchar_convert(char c) { return cast(int)cast(byte)c; // get sign extension for non-ASCII } So when the char's most significant bit is set, this fills the upper bits of the dchar with 1s, right? And a set most significant bit in a char means it's part of a multibyte sequence, while in a dchar it means that the dchar is invalid, because they only go up to U+10FFFF. Huh. Neat. An interesting thing is that I think the CPU can do this for us. Does it work for char -> wchar, too? It does not. 0xFFFF is a valid code point, and I think so are all the other values that would result. In fact, I think there are no invalid code units for wchar. https://codepoints.net/specials U+FFFF would be fine, better at least than a surrogate.
Re: The Case For Autodecode
On 06/03/2016 11:13 PM, Steven Schveighoffer wrote: No, but I like the idea of preserving the erroneous character you tried to convert. Makes sense. But is there an invalid wchar? I looked through the wikipedia article on UTF-16, and it didn't seem to say there was one. If we use U+FFFD, that signifies a coding problem but is still a valid code point. However, a wchar in the D800 - D8FF range that isn't followed by a code unit in the DC00 - DFFF range is an invalid sequence. D throws if it encounters such a thing. The Unicode FAQ has an answer to this exact question, but it also only says that "[u]npaired surrogates are invalid" [1]. It also mentions "noncharacters" which are "permanently reserved [...] for internal use". "For example, they might be used internally as a particular kind of object placeholder in a string." [2] - Not too bad. And then there is the replacement character, of course. "[U]sed to replace an incoming character whose value is unknown or unrepresentable in Unicode" [3]. [1] http://www.unicode.org/faq/utf_bom.html#utf16-7 [2] http://www.unicode.org/faq/private_use.html#noncharacters [3] http://www.fileformat.info/info/unicode/char/0fffd/index.htm
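For reference, D does reject such sequences when it decodes them: a lone high surrogate throws. A minimal sketch, assuming std.utf.decode and std.exception.assertThrown behave as in the Phobos of that era:

import std.utf : decode, UTFException;
import std.exception : assertThrown;

void main()
{
    auto s = [cast(wchar) 0xD800];            // a lone high surrogate, no trailing code unit
    size_t i = 0;
    assertThrown!UTFException(decode(s, i));  // decoding the unpaired surrogate throws
}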
Re: The Case For Autodecode
On 6/3/16 4:39 PM, ag0aep6g wrote: On 06/03/2016 10:18 PM, Steven Schveighoffer wrote: But you can get a standalone code unit that is part of a coded sequence quite easily: foo(string s) { auto x = s[0]; dchar d = x; } I don't think we're disagreeing on anything. I'm calling UTF-8 code units below 0x80 "standalone" code units. They're never part of multibyte sequences. Your _dchar_convert returns them unscathed. Ah, I thought you meant standalone as in it was assigned to a standalone char variable vs. part of an array or range. My mistake. Re-reading your original message, I see that should have been clear to me... So we need the most efficient logic that does this: if(c & 0x80) return wchar(0xd800 + c); Is this going to be faster than returning a constant invalid wchar? No, but I like the idea of preserving the erroneous character you tried to convert. But is there an invalid wchar? I looked through the wikipedia article on UTF-16, and it didn't seem to say there was one. If we use U+FFFD, that signifies a coding problem but is still a valid code point. However, a wchar in the D800 - D8FF range that isn't followed by a code unit in the DC00 - DFFF range is an invalid sequence. D throws if it encounters such a thing. -Steve
Re: The Case For Autodecode
On 06/03/2016 10:18 PM, Steven Schveighoffer wrote: But you can get a standalone code unit that is part of a coded sequence quite easily: foo(string s) { auto x = s[0]; dchar d = x; } I don't think we're disagreeing on anything. I'm calling UTF-8 code units below 0x80 "standalone" code units. They're never part of multibyte sequences. Your _dchar_convert returns them unscathed. Higher code units are always part of multibyte sequences (or invalid already). Your function returns invalid code points for them. _dchar_convert does exactly what I meant, except that I had in mind returning the replacement character for non-standalone code units. But I see that that may not be feasible, and it's probably not necessary. [...] So we need the most efficient logic that does this: if(c & 0x80) return wchar(0xd800 + c); Is this going to be faster than returning a constant invalid wchar? else return wchar(c); More expensive, but more correct! wchar to dchar conversion is pretty sound, as the surrogate pairs are invalid code points for dchar. -Steve
Re: The Case For Autodecode
On 6/3/16 3:52 PM, ag0aep6g wrote: On 06/03/2016 09:09 PM, Steven Schveighoffer wrote: Except many chars *do* properly convert. This should work: char c = 'a'; dchar d = c; assert(d == 'a'); Yeah, that's what I meant by "standalone code unit". Code units that on their own represent a code point would not be touched. But you can get a standalone code unit that is part of a coded sequence quite easily: foo(string s) { auto x = s[0]; dchar d = x; } As I mentioned in my earlier reply, some kind of "bounds checking" for the conversion could be a possibility. Hm... an interesting possibility: dchar _dchar_convert(char c) { return cast(int)cast(byte)c; // get sign extension for non-ASCII } So when the char's most significant bit is set, this fills the upper bits of the dchar with 1s, right? And a set most significant bit in a char means it's part of a multibyte sequence, while in a dchar it means that the dchar is invalid, because they only go up to U+10FFFF. Huh. Neat. An interesting thing is that I think the CPU can do this for us. Does it work for char -> wchar, too? It does not. 0xFFFF is a valid code point, and I think so are all the other values that would result. In fact, I think there are no invalid code units for wchar. Of course, a surrogate pair requires another code unit to be valid, so we can at least promote a char to a wchar in the surrogate pair range (and always in the low or high surrogate range, so a naive transcoding of a char range to wchar will result in an invalid sequence if there are any non-ASCII characters). So we need the most efficient logic that does this: if(c & 0x80) return wchar(0xd800 + c); else return wchar(c); More expensive, but more correct! wchar to dchar conversion is pretty sound, as the surrogate pairs are invalid code points for dchar. -Steve
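To make that proposed conversion concrete, here is a minimal sketch of it as an ordinary function. The name toWcharChecked is made up for illustration; the actual proposal concerns the implicit char -> wchar conversion performed by the compiler.

import std.utf : isValidDchar;

wchar toWcharChecked(char c)
{
    if (c & 0x80)
        return cast(wchar)(0xD800 + c);  // 0xD880 .. 0xD8FF: an unpaired surrogate, never valid on its own
    return cast(wchar) c;                // ASCII code units map to themselves
}

void main()
{
    assert(toWcharChecked('a') == 'a');
    assert(!isValidDchar(toWcharChecked("ö"[0])));  // the lead byte 0xC3 lands in the surrogate range
}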
Re: The Case For Autodecode
On 06/03/2016 09:09 PM, Steven Schveighoffer wrote: Except many chars *do* properly convert. This should work: char c = 'a'; dchar d = c; assert(d == 'a'); Yeah, that's what I meant by "standalone code unit". Code units that on their own represent a code point would not be touched. As I mentioned in my earlier reply, some kind of "bounds checking" for the conversion could be a possibility. Hm... an interesting possibility: dchar _dchar_convert(char c) { return cast(int)cast(byte)c; // get sign extension for non-ASCII } So when the char's most significant bit is set, this fills the upper bits of the dchar with 1s, right? And a set most significant bit in a char means it's part of a multibyte sequence, while in a dchar it means that the dchar is invalid, because they only go up to U+10FFFF. Huh. Neat. Does it work for char -> wchar, too?
Re: The Case For Autodecode
On 6/3/16 3:12 PM, Steven Schveighoffer wrote: On 6/3/16 3:09 PM, Steven Schveighoffer wrote: Hm... an interesting possibility: dchar _dchar_convert(char c) { return cast(int)cast(byte)c; // get sign extension for non-ASCII } Allows this too: dchar d = char.init; // calls conversion function assert(d == dchar.init); Hm... actually doesn't work. dchar.init is 0xFFFF. -Steve
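For the record, the relevant .init values are char.init == 0xFF and dchar.init == 0xFFFF, and the sign extension of 0xFF fills all 32 bits, which is why the assert above cannot pass. A small check illustrating just those values:

void main()
{
    static assert(char.init == 0xFF);
    static assert(dchar.init == 0xFFFF);
    // sign-extending char.init yields 0xFFFFFFFF, not dchar.init
    assert(cast(uint) cast(int) cast(byte) char.init == 0xFFFF_FFFF);
}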
Re: The Case For Autodecode
On 6/3/16 3:09 PM, Steven Schveighoffer wrote: Hm... an interesting possibility: dchar _dchar_convert(char c) { return cast(int)cast(byte)c; // get sign extension for non-ASCII } Allows this too: dchar d = char.init; // calls conversion function assert(d == dchar.init); :) -Steve
Re: The Case For Autodecode
On 6/3/16 2:55 PM, ag0aep6g wrote: On 06/03/2016 08:36 PM, Steven Schveighoffer wrote: but a direct cast of the bits from char does NOT mean the same thing as a dchar. That gives me an idea. A bitwise reinterpretation of int to float is nonsensical, too. Yet int implicitly converts to float and (for small values) preserves the meaning. I mean, implicit conversion doesn't have to mean bitwise reinterpretation. I'm pretty sure the CPU handles this, though. How about replacing non-standalone code units with the replacement character (U+FFFD) in implicit widening conversions? For example: char c = "ö"[0]; wchar w = c; assert(w == '\uFFFD'); Would probably just be a band-aid, though. Except many chars *do* properly convert. This should work: char c = 'a'; dchar d = c; assert(d == 'a'); As I mentioned in my earlier reply, some kind of "bounds checking" for the conversion could be a possibility. Hm... an interesting possibility: dchar _dchar_convert(char c) { return cast(int)cast(byte)c; // get sign extension for non-ASCII } -Steve
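A minimal sketch of what that sign-extension conversion does in practice, assuming 'ö' is the two UTF-8 code units 0xC3 0xB6 as elsewhere in the thread. The function body is the one from the post, wrapped with an explicit cast(dchar) so it compiles standalone:

dchar _dchar_convert(char c)
{
    return cast(dchar) cast(int) cast(byte) c; // get sign extension for non-ASCII
}

void main()
{
    assert(_dchar_convert('a') == 'a');   // ASCII code units survive unchanged
    dchar d = _dchar_convert("ö"[0]);     // 0xC3, the lead byte of a multibyte sequence
    assert(d > 0x10FFFF);                 // well outside the valid code point range
}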
Re: The Case For Autodecode
On 06/03/2016 08:36 PM, Steven Schveighoffer wrote: but a direct cast of the bits from char does NOT mean the same thing as a dchar. That gives me an idea. A bitwise reinterpretation of int to float is nonsensical, too. Yet int implicitly converts to float and (for small values) preserves the meaning. I mean, implicit conversion doesn't have to mean bitwise reinterpretation. How about replacing non-standalone code units with the replacement character (U+FFFD) in implicit widening conversions? For example: char c = "ö"[0]; wchar w = c; assert(w == '\uFFFD'); Would probably just be a band-aid, though.
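As a sketch of that replacement idea, written as an ordinary function rather than the implicit conversion itself (the name widenOrReplace is made up for illustration):

wchar widenOrReplace(char c)
{
    // standalone (ASCII) code units keep their value; anything that can only
    // be part of a multibyte sequence becomes the replacement character
    return (c & 0x80) ? cast(wchar) 0xFFFD : cast(wchar) c;
}

void main()
{
    assert(widenOrReplace('a') == 'a');
    assert(widenOrReplace("ö"[0]) == '\uFFFD');
}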
Re: The Case For Autodecode
On Friday, 3 June 2016 at 18:36:45 UTC, Steven Schveighoffer wrote: The real problem here is that char implicitly casts to dchar. That should not be allowed. Indeed.
Re: The Case For Autodecode
On 06/03/2016 07:51 PM, Patrick Schluter wrote: You mean that '¶' is represented internally as 1 byte 0xB6 and that it can be handled as such without error? This would mean that char literals are broken. The only valid way to represent '¶' in memory is 0xC2 0xB6. Sorry if I misunderstood, I'm only starting to learn D. There is no single char for '¶', that's right, and D gets that right. That's not what happens. But there is a single wchar for it. wchar is a UTF-16 code unit, 2 bytes. UTF-16 encodes '¶' as a single code unit, so that's correct. The problem is that you can accidentally search for a wchar in a range of chars. Every char is compared to the wchar by numeric value. But the numeric values of a char don't mean the same as those of a wchar, so you get nonsensical results. A similar implicit conversion lets you search for a large number in a byte[]: byte[] arr = [1, 2, 3]; foreach(x; arr) if (x == 1000) writeln("found it!"); You won't ever find 1000 in a byte[], of course. The byte type simply can't store the value. But you can compare a byte with an int. And that comparison is meaningful, unlike the comparison of a char with a wchar. You can also produce false positives with numeric types, by mixing signed and unsigned types: int[] arr = [1, -1, 3]; foreach(x; arr) if (x == uint.max) writeln("found it!"); uint.max is a large number, -1 is a small number. They're considered equal here because of an implicit conversion that messes with the meaning of the bits. False negatives are not possible with numeric types. At least not in the same way as with differently sized Unicode code units.
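The false positive and false negative from the start of the thread can be reproduced directly. This sketch assumes std.utf.byChar and std.algorithm's canFind as they existed around the time of the thread:

import std.algorithm.searching : canFind;
import std.utf : byChar;

void main()
{
    // '¶' is U+00B6; 'ö' is U+00F6 and is encoded as 0xC3 0xB6 in UTF-8.
    // Comparing the dchar '¶' against raw code units matches the trail byte 0xB6.
    assert("ö".byChar.canFind('¶'));   // false positive
    assert(!"ö".byChar.canFind('ö'));  // false negative: 'ö' never equals a single code unit
    assert("ö".canFind('ö'));          // with autodecoding, the search behaves as expected
}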
Re: The Case For Autodecode
On 6/3/16 1:51 PM, Patrick Schluter wrote: On Friday, 3 June 2016 at 11:24:40 UTC, ag0aep6g wrote: This is mostly me trying to make sense of the discussion. So everyone hates autodecoding. But Andrei seems to hate it a good bit less than everyone else. As far as I could follow, he has one reason for that, which might not be clear to everyone: char converts implicitly to dchar, so the compiler lets you search for a dchar in a range of chars. But that gives nonsensical results. For example, you won't find 'ö' in "ö".byChar, but you will find '¶' in there ('¶' is U+00B6, 'ö' is U+00F6, and 'ö' is encoded as 0xC3 0xB6 in UTF-8). You mean that '¶' is represented internally as 1 byte 0xB6 and that it can be handled as such without error? This would mean that char literals are broken. The only valid way to represent '¶' in memory is 0xC2 0xB6. Sorry if I misunderstood, I'm only starting to learn D. Not if '¶' is a dchar. What is happening in the example is that find is looking at the "ö".byChar range and saying "hm... can I compare dchar('¶') to char? Well, char implicitly casts to dchar, so I'm good!", but a direct cast of the bits from char does NOT mean the same thing as a dchar. It has to go through a decoding first. The real problem here is that char implicitly casts to dchar. That should not be allowed. -Steve
Re: The Case For Autodecode
On Friday, 3 June 2016 at 11:24:40 UTC, ag0aep6g wrote: This is mostly me trying to make sense of the discussion. So everyone hates autodecoding. But Andrei seems to hate it a good bit less than everyone else. As far as I could follow, he has one reason for that, which might not be clear to everyone: char converts implicitly to dchar, so the compiler lets you search for a dchar in a range of chars. But that gives nonsensical results. For example, you won't find 'ö' in "ö".byChar, but you will find '¶' in there ('¶' is U+00B6, 'ö' is U+00F6, and 'ö' is encoded as 0xC3 0xB6 in UTF-8). You mean that '¶' is represented internally as 1 byte 0xB6 and that it can be handled as such without error? This would mean that char literals are broken. The only valid way to represent '¶' in memory is 0xC2 0xB6. Sorry if I misunderstood, I'm only starting to learn D.
Re: The Case For Autodecode
On 06/03/2016 03:56 PM, Kagamin wrote: A lot of the discussion is disagreement over what correct unicode support means. I see 4 possible meanings here: 1. Implemented according to spec. 2. Provides level 1 unicode support. 3. Provides level 2 unicode support. 4. Achieves the goal of unicode, i.e. text processing according to natural language rules. Speaking of that, the document that Walter dug up [1], which talks about support levels, is about regular expression engines in particular. It's not about general language support. The version he linked to is also pretty old. A more recent revision [2] calls level 1 (code points) the "minimally useful level of support", speaks warmly about level 2 (graphemes), and says that level 3 (locale dependent behavior) is "only useful for specific applications". [1] http://unicode.org/reports/tr18/tr18-5.1.html [2] http://www.unicode.org/reports/tr18/tr18-17.html
Re: The Case For Autodecode
On Friday, 3 June 2016 at 11:24:40 UTC, ag0aep6g wrote: Finally, this is not the only argument in favor of *keeping* autodecoding, of course. Not wanting to break user code is the big one there, I guess. A lot of the discussion is disagreement over what correct unicode support means. I see 4 possible meanings here: 1. Implemented according to spec. 2. Provides level 1 unicode support. 3. Provides level 2 unicode support. 4. Achieves the goal of unicode, i.e. text processing according to natural language rules.
Re: The Case For Autodecode
On 6/3/16 7:24 AM, ag0aep6g wrote: This is mostly me trying to make sense of the discussion. So everyone hates autodecoding. But Andrei seems to hate it a good bit less than everyone else. As far as I could follow, he has one reason for that, which might not be clear to everyone: I don't hate autodecoding. What I hate is that char[] autodecodes. If strings were some auto-decoding type that wasn't immutable(char)[], that would be absolutely fine with me. In fact, I see this as the only way to fix this, since it shouldn't break any code. char converts implicitly to dchar, so the compiler lets you search for a dchar in a range of chars. But that gives nonsensical results. For example, you won't find 'ö' in "ö".byChar, but you will find '¶' in there ('¶' is U+00B6, 'ö' is U+00F6, and 'ö' is encoded as 0xC3 0xB6 in UTF-8). Question: why couldn't the compiler emit (in non-release builds) a runtime check to make sure you aren't converting non-ASCII characters to dchars? That is, like out of bounds checking, but for char -> dchar conversions, or any other invalid mechanism? Yep, it's going to kill a lot of performance. But it's going to catch a lot of problems. One thing to point out here is that autodecoding only happens on arrays, and even then, only in certain cases. -Steve
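A rough sketch of what such a check would amount to, written here as an ordinary function rather than compiler-inserted code (checkedWiden is a made-up name):

dchar checkedWiden(char c)
{
    // a non-release build would catch the lossy conversion here;
    // compiling with -release removes the assert, like other debug checks
    assert(c < 0x80, "implicit char -> dchar conversion of a non-ASCII code unit");
    return c; // ASCII code units convert losslessly
}

void main()
{
    assert(checkedWiden('a') == 'a');
    // checkedWiden("ö"[0]);  // would trip the assert in a debug build
}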