Re: Switch ignores case (?)
On 11/24/16 7:16 AM, ketmar wrote: On Thursday, 24 November 2016 at 11:40:24 UTC, Antonio Corbi wrote: Could it be possible to ping M. Nowak to include the fix for this bug (due to its importance) in the final 2.072.1 release? there is no real fix yet. what i provided is a quick hack, not tested on anything except GNU/Linux, and without unitests. somebody should create a real fix first. and it seems that we doesn't even have a bug opened in bugzilla yet! ;-) Yeah, there is: https://issues.dlang.org/show_bug.cgi?id=16739 Having a fix is obviously possible, we can do it very slowly on Windows :) slow is better than incorrect! -Steve
Re: Switch ignores case (?)
On Thursday, 24 November 2016 at 11:40:24 UTC, Antonio Corbi wrote: Could it be possible to ping M. Nowak to include the fix for this bug (due to its importance) in the final 2.072.1 release? there is no real fix yet. what i provided is a quick hack, not tested on anything except GNU/Linux, and without unitests. somebody should create a real fix first. and it seems that we doesn't even have a bug opened in bugzilla yet! ;-)
Re: Switch ignores case (?)
On Wednesday, 23 November 2016 at 22:13:38 UTC, ketmar wrote: On Wednesday, 23 November 2016 at 22:00:58 UTC, Steven Schveighoffer wrote: I can't see why you need to deal with the glue layer at all -- just tell the glue layer that it's a list of strings and not dstrings ;) 'cause that is how s2ir.d is done in dmd. ;-) it actually sorts *objects*, and objects knows how to order 'emselves. at this stage it is not trivial to treat objects as byte arrays (easy, but not a one-liner). sure, other compilers may do that differently, and it doesn't really matter how exactly the code for switch is generated until it works correctly. ;-) Could it be possible to ping M. Nowak to include the fix for this bug (due to its importance) in the final 2.072.1 release? Antonio
Re: Switch ignores case (?)
On Thursday, 24 November 2016 at 10:12:40 UTC, ketmar wrote: thanks. tbh, i am surprised myself -- it is completely fubared, but nobody noticed. maybe that says something about real-world useability of dstring/wstring... ;-) Well, I came across it, because I wanted to work around autodecode, our old "friend" (who need enemies, if you have a friend like that?). Except, it doesn't work as expected. For my needs, I can work around it with `(c == "\u2019")`, because switch works for the rest of the (common) cases like full stop, question mark etc. The whole string handling issue in D needs to be fixed asap, because text processing is one of _the_ big things in IT these days. Think of all the data harvesting and evaluation and whatnot.
Re: Switch ignores case (?)
On Thursday, 24 November 2016 at 09:56:25 UTC, Chris wrote: Great job, ketmar! I'm only surprised that this bug wasn't discovered earlier, I mean it goes back to (at least) dmd 2.040. thanks. tbh, i am surprised myself -- it is completely fubared, but nobody noticed. maybe that says something about real-world useability of dstring/wstring... ;-)
Re: Switch ignores case (?)
On Wednesday, 23 November 2016 at 22:13:38 UTC, ketmar wrote: On Wednesday, 23 November 2016 at 22:00:58 UTC, Steven Schveighoffer wrote: I can't see why you need to deal with the glue layer at all -- just tell the glue layer that it's a list of strings and not dstrings ;) 'cause that is how s2ir.d is done in dmd. ;-) it actually sorts *objects*, and objects knows how to order 'emselves. at this stage it is not trivial to treat objects as byte arrays (easy, but not a one-liner). sure, other compilers may do that differently, and it doesn't really matter how exactly the code for switch is generated until it works correctly. ;-) Great job, ketmar! I'm only surprised that this bug wasn't discovered earlier, I mean it goes back to (at least) dmd 2.040.
Re: Switch ignores case (?)
On Wednesday, 23 November 2016 at 22:00:58 UTC, Steven Schveighoffer wrote: I can't see why you need to deal with the glue layer at all -- just tell the glue layer that it's a list of strings and not dstrings ;) 'cause that is how s2ir.d is done in dmd. ;-) it actually sorts *objects*, and objects knows how to order 'emselves. at this stage it is not trivial to treat objects as byte arrays (easy, but not a one-liner). sure, other compilers may do that differently, and it doesn't really matter how exactly the code for switch is generated until it works correctly. ;-)
Re: Switch ignores case (?)
On 11/23/16 4:44 PM, ketmar wrote: On Wednesday, 23 November 2016 at 21:40:52 UTC, Steven Schveighoffer wrote: So the better way is to sort based on byte array, and just use memcmp for everything. i am not completely sure that this is really better. sorting and generating tables is done in glue layer, which can be different for different codegens. and `.compare` method, that is used for sorting strings may be used in other places too. by introducing something like `switchCompare()` we will add unnecessary complexity to the compiler, will make sort order less obvious, and won't really get significant speedup, as binary search doesn't really do alot of comparisons anyway. I'm not certain of how difficult this will be, or how much risk there is. What I am certain of is that the user doesn't care that the compiler really isn't sorting the strings alphabetically, and that he wants the fastest code possible for a switch statement. I can't see why you need to deal with the glue layer at all -- just tell the glue layer that it's a list of strings and not dstrings ;) I also don't actually know that memcmp is faster than wmemcmp, so maybe there is even an advantage to changing behavior for dstring searching. i thing it is better to be fixed in druntime. This can be a solution, for sure. And in reality, it's not *that* much slower -- you are still doing a binary search. I wonder if there is a "binary search among arrays" algorithm that can be optimized for this. -Steve
Re: Switch ignores case (?)
On Wednesday, 23 November 2016 at 21:40:52 UTC, Steven Schveighoffer wrote: So the better way is to sort based on byte array, and just use memcmp for everything. i am not completely sure that this is really better. sorting and generating tables is done in glue layer, which can be different for different codegens. and `.compare` method, that is used for sorting strings may be used in other places too. by introducing something like `switchCompare()` we will add unnecessary complexity to the compiler, will make sort order less obvious, and won't really get significant speedup, as binary search doesn't really do alot of comparisons anyway. i thing it is better to be fixed in druntime.
Re: Switch ignores case (?)
On Wednesday, 23 November 2016 at 21:31:08 UTC, Steven Schveighoffer wrote: I think it makes the most sense to remove the memcmp, and do binary search based on actual char values. yeah. there is wmemcmp, which can be used to speed up one of the cases ('cause wchar_t has different size on windows and GNU/Linux), as i did in my hackfix. yet i am not sure that wmemcmp is really there on windows in all C runtimes (hence HACKfix ;-).
Re: Switch ignores case (?)
On 11/23/16 4:31 PM, Steven Schveighoffer wrote: On 11/23/16 2:30 PM, ketmar wrote: this can be fixed either by using slow char-by-char comparisons in druntime, or by fixing codegen, so it would sort strings as byte arrays. I think it makes the most sense to remove the memcmp, and do binary search based on actual char values. On second thought, this is compiler-generated data, will never change. We may as well make it as efficient as possible. So the better way is to sort based on byte array, and just use memcmp for everything. -Steve
Re: Switch ignores case (?)
On 11/23/16 2:30 PM, ketmar wrote: On Wednesday, 23 November 2016 at 19:07:49 UTC, Chris wrote: It has something to do with the smart quote, e.g.: it is wrong binary search in `_d_switch_string()`. strings for switch are lexically sorted, and compiler calls `_d_switch_string()` to select one. the thing is that comparison in `_d_switch_string()` is done with `memcmp()`. still not clear? ok, let's see how cases are encoded: body _d_switch_dstring() 'U0027' (ca) table[0] = 1, 'U0027' table[1] = 1, 'U2019' or, in memory: table[0] = 1, 0x27, 0x00 table[1] = 1, 0x19, 0x20 so, memcmp for `table[1]` returns... 1! 'cause 0x27 is greater than 0x19. and binsearch is broken from here on. the same is true for `_d_switch_ustring()`, of course. this can be fixed either by using slow char-by-char comparisons in druntime, or by fixing codegen, so it would sort strings as byte arrays. Oh wow, so this is really an endian issue. On a big endian machine, the code would work. Interesting! I think it makes the most sense to remove the memcmp, and do binary search based on actual char values. Thanks for finding this. -Steve
Re: Switch ignores case (?)
quickfix: diff --git a/src/rt/switch_.d b/src/rt/switch_.d index ec124e3..83572fe 100644 --- a/src/rt/switch_.d +++ b/src/rt/switch_.d @@ -27,6 +27,32 @@ private import core.stdc.string; extern (C): +import core.stdc.wchar_ : wchar_t, wmemcmp; + + +static if (wchar_t.sizeof == wchar.sizeof) { + alias memcmpw = wmemcmp; + int memcmpd (const(void)* s0, const(void)* s1, size_t len) { +while (len--) { + if (int r = *cast(const(int)*)s0-*cast(const(int)*)s1) return (r < 0 ? -1 : 1); + s0 += dchar.sizeof; + s1 += dchar.sizeof; +} +return 0; + } +} else static if (wchar_t.sizeof == dchar.sizeof) { + int memcmpw (const(void)* s0, const(void)* s1, size_t len) { +while (len--) { + if (int r = *cast(const(ushort)*)s0-*cast(const(ushort)*)s1) return (r < 0 ? -1 : 1); + s0 += wchar.sizeof; + s1 += wchar.sizeof; +} +return 0; + } + alias memcmpd = wmemcmp; +} else static assert(0, "wut?!"); + + int _d_switch_string(char[][] table, char[] ca) in { @@ -189,7 +215,7 @@ in { int c; -c = memcmp(table[j - 1].ptr, table[j].ptr, len1 * wchar.sizeof); +c = memcmpw(table[j - 1].ptr, table[j].ptr, len1); assert(c < 0); // c==0 means a duplicate } } @@ -205,7 +231,7 @@ out (result) for (auto i = 0u; i < table.length; i++) { if (table[i].length == ca.length) -{ c = memcmp(table[i].ptr, ca.ptr, ca.length * wchar.sizeof); +{ c = memcmpw(table[i].ptr, ca.ptr, ca.length); assert(c != 0); } } @@ -218,7 +244,7 @@ out (result) assert(i < table.length); if (table[i].length == ca.length) { -c = memcmp(table[i].ptr, ca.ptr, ca.length * wchar.sizeof); +c = memcmpw(table[i].ptr, ca.ptr, ca.length); if (c == 0) { assert(i == result); @@ -253,7 +279,7 @@ body auto c = cast(sizediff_t)(ca.length - pca.length); if (c == 0) { -c = memcmp(ca.ptr, pca.ptr, ca.length * wchar.sizeof); +c = memcmpw(ca.ptr, pca.ptr, ca.length); if (c == 0) { //printf("found %d\n", mid); return cast(int)mid; @@ -317,7 +343,7 @@ in assert(len1 <= len2); if (len1 == len2) { -auto c = memcmp(table[j - 1].ptr, table[j].ptr, len1 * dchar.sizeof); +auto c = memcmpd(table[j - 1].ptr, table[j].ptr, len1); assert(c < 0); // c==0 means a duplicate } } @@ -331,7 +357,7 @@ out (result) for (auto i = 0u; i < table.length; i++) { if (table[i].length == ca.length) -{ auto c = memcmp(table[i].ptr, ca.ptr, ca.length * dchar.sizeof); +{ auto c = memcmpd(table[i].ptr, ca.ptr, ca.length); assert(c != 0); } } @@ -344,7 +370,7 @@ out (result) assert(i < table.length); if (table[i].length == ca.length) { -auto c = memcmp(table[i].ptr, ca.ptr, ca.length * dchar.sizeof); +auto c = memcmpd(table[i].ptr, ca.ptr, ca.length); if (c == 0) { assert(i == result); @@ -379,7 +405,7 @@ body auto c = cast(sizediff_t)(ca.length - pca.length); if (c == 0) { -c = memcmp(ca.ptr, pca.ptr, ca.length * dchar.sizeof); +c = memcmpd(ca.ptr, pca.ptr, ca.length); if (c == 0) { //printf("found %d\n", mid); return cast(int)mid; -- 2.9.2
Re: Switch ignores case (?)
On Wednesday, 23 November 2016 at 20:49:01 UTC, Chris wrote: Actually, I came across a compiler message that gave me something like \x19\x20 which I found odd. This sure needs fixing. After all, it's quite a basic feature. So it's back to the old `if` again (for now). yeah, until this is fixed, `switch` over wstring/dstring can be considered completely broken, and better be avoided. `switch` over normal string is unaffected, of course.
Re: Switch ignores case (?)
On Wednesday, 23 November 2016 at 19:30:01 UTC, ketmar wrote: On Wednesday, 23 November 2016 at 19:07:49 UTC, Chris wrote: It has something to do with the smart quote, e.g.: it is wrong binary search in `_d_switch_string()`. strings for switch are lexically sorted, and compiler calls `_d_switch_string()` to select one. the thing is that comparison in `_d_switch_string()` is done with `memcmp()`. still not clear? ok, let's see how cases are encoded: body _d_switch_dstring() 'U0027' (ca) table[0] = 1, 'U0027' table[1] = 1, 'U2019' or, in memory: table[0] = 1, 0x27, 0x00 table[1] = 1, 0x19, 0x20 so, memcmp for `table[1]` returns... 1! 'cause 0x27 is greater than 0x19. and binsearch is broken from here on. the same is true for `_d_switch_ustring()`, of course. this can be fixed either by using slow char-by-char comparisons in druntime, or by fixing codegen, so it would sort strings as byte arrays. Actually, I came across a compiler message that gave me something like \x19\x20 which I found odd. This sure needs fixing. After all, it's quite a basic feature. So it's back to the old `if` again (for now).
Re: Switch ignores case (?)
On Wednesday, 23 November 2016 at 19:07:49 UTC, Chris wrote: It has something to do with the smart quote, e.g.: it is wrong binary search in `_d_switch_string()`. strings for switch are lexically sorted, and compiler calls `_d_switch_string()` to select one. the thing is that comparison in `_d_switch_string()` is done with `memcmp()`. still not clear? ok, let's see how cases are encoded: body _d_switch_dstring() 'U0027' (ca) table[0] = 1, 'U0027' table[1] = 1, 'U2019' or, in memory: table[0] = 1, 0x27, 0x00 table[1] = 1, 0x19, 0x20 so, memcmp for `table[1]` returns... 1! 'cause 0x27 is greater than 0x19. and binsearch is broken from here on. the same is true for `_d_switch_ustring()`, of course. this can be fixed either by using slow char-by-char comparisons in druntime, or by fixing codegen, so it would sort strings as byte arrays.
Re: Switch ignores case (?)
On Wednesday, November 23, 2016 19:07:49 Chris via Digitalmars-d-learn wrote: > (I think I marked it as "P1" (default option), maybe it's "P2", > didn't know what to choose) AFAIK, there are no devs that pay any attention to those. Some attention is paid to regression vs enhancement vs etc. But the priority field really doesn't mean anything for how D bugs get fixed or who fixes what when. - Jonathan M Davis
Re: Switch ignores case (?)
On Wednesday, 23 November 2016 at 18:34:28 UTC, Steven Schveighoffer wrote: Please file here: https://issues.dlang.org/enter_bug.cgi I think this has been there forever. Happens in 2.040 too (the earliest dmd I have on my system). -Steve Here it is: https://issues.dlang.org/show_bug.cgi?id=16739 (I think I marked it as "P1" (default option), maybe it's "P2", didn't know what to choose) It has something to do with the smart quote, e.g.: import std.array; import std.conv; import std.stdio; void main() { auto tokens = to!(dchar[][])(["D"d, "’"d, "Addario"d, "'"d, "."d]); // Or use this below: //~ dstring[] tokens = ["D"d, "’"d, "Addario"d, "'"d]; while (!tokens.empty) { switch (tokens[0]) { case "\u2019"d: writeln("Apostrophe smart " ~ tokens[0]); break; case "\u0027"d: writeln("Apostrophe straight " ~ tokens[0]); break; case "D"d: writeln("Letter 'D'"); break; case "."d: writeln("fullstop " ~ tokens[0]); break; default: writeln("Other " ~ tokens[0]); break; } tokens = tokens[1..$]; } } prints: Letter 'D' Other ’ // <== not expected Other Addario Apostrophe straight ' fullstop . Both LDC and DMD produce the same error. I discovered this while writing a tokenizer. It is a bit upsetting, because it is really an essential thing. The workaround for now would be if (token[0] == "\u2019"d) goto Wherever; which works, by the way.
Re: Switch ignores case (?)
On 11/23/16 1:03 PM, Chris wrote: On Wednesday, 23 November 2016 at 17:33:04 UTC, Steven Schveighoffer wrote: I tested this locally with different ideas. This definitely looks like a codegen bug. I was able to reduce it to: void main() { switch("'"d) { case "'"d: writeln("a"); break; case "’"d: writeln("b"); break; default: writeln("default"); } } prints "default" What seems to fix it is removing the case statement for the "smart apostrophe". So I'd look there for where the bug is triggering. Yep, removing one of the two cases works. I tried it with different versions of dmd back to 2.070.0, and it always gives me the same (wrong) result. I haven't tried ldc yet. Please file here: https://issues.dlang.org/enter_bug.cgi I think this has been there forever. Happens in 2.040 too (the earliest dmd I have on my system). -Steve
Re: Switch ignores case (?)
On Wednesday, 23 November 2016 at 17:33:04 UTC, Steven Schveighoffer wrote: I tested this locally with different ideas. This definitely looks like a codegen bug. I was able to reduce it to: void main() { switch("'"d) { case "'"d: writeln("a"); break; case "’"d: writeln("b"); break; default: writeln("default"); } } prints "default" What seems to fix it is removing the case statement for the "smart apostrophe". So I'd look there for where the bug is triggering. -Steve Yep, removing one of the two cases works. I tried it with different versions of dmd back to 2.070.0, and it always gives me the same (wrong) result. I haven't tried ldc yet.
Re: Switch ignores case (?)
On 11/23/16 11:28 AM, Chris wrote: Only one of the two cases is considered. What am I doing wrong? ` import std.array; import std.conv; import std.stdio; void main() { auto tokens = to!(dchar[][])(["D"d, "’"d, "Addario"d, "'"d]); // Or use this below: //~ dstring[] tokens = ["D"d, "’"d, "Addario"d, "'"d]; while (!tokens.empty) { switch (tokens[0]) { case "\u2019"d: writeln("Apostrophe smart " ~ tokens[0]); break; case "\u0027"d: writeln("Apostrophe straight " ~ tokens[0]); break; default: writeln("Other " ~ tokens[0]); break; } tokens = tokens[1..$]; } } ` Prints: Other D Apostrophe smart ’ Other Addario Other ' I would have expected: Other D Apostrophe smart ’ Other Addario Apostrophe straight ' <== expected I tested this locally with different ideas. This definitely looks like a codegen bug. I was able to reduce it to: void main() { switch("'"d) { case "'"d: writeln("a"); break; case "’"d: writeln("b"); break; default: writeln("default"); } } prints "default" What seems to fix it is removing the case statement for the "smart apostrophe". So I'd look there for where the bug is triggering. -Steve