Re: DMD 2.090.1: SIGILL, Illegal instruction on (ahem) intel Pentium III
On Thursday, 27 February 2020 at 00:36:49 UTC, kdevel wrote:

> ```d
> // test.d
> void main () { int [] v = new int [10]; }
> ```
>
> $ [...]linux/bin32/dmd test
> $ gdb test
> [...]
> (gdb) r
> [...]
> Program received signal SIGILL, Illegal instruction.
> 0x0809ad14 in _D2gc4impl12conservativeQw3Gcx10smallAllocMFNbkKkkxC8TypeInfoZPv ()
> [...]
> (gdb) disass
> [...]
> 0x0809ad14 <_D2gc4impl12conservativeQw3Gcx10smallAllocMFNbkKkkxC8TypeInfoZPv+172>: movsd -0x58(%ebp),%xmm0
>
> Does this exception relate to [1], and shall I file a bug, or do I have to decommission my PIII?

I think so. Have you tried:

1) LDC on your machine?
2) the DMD version before this change?
Re: Strange counter-performance in an alternative `decimalLength9` function
On Thursday, 27 February 2020 at 03:58:15 UTC, Bruce Carneal wrote:

>> Maybe you talked about another implementation of decimalLength9?
>
> Yes. It's one I wrote after I saw your post. Pseudo-code:
>
> ```d
> auto d9_branchless(uint v) { return 1 + (v >= 10) + (v >= 100) ... }
> ```
>
> Using ldc to target x86 with the above yields a series of cmpl/seta instruction pairs in the function body, followed by a sum and a return. No branching. Let me know if the above is unclear or insufficient.

No thanks, it's crystal clear now.
Re: Strange counter-performance in an alternative `decimalLength9` function
On Thursday, 27 February 2020 at 03:58:15 UTC, Bruce Carneal wrote:

> On Wednesday, 26 February 2020 at 23:09:34 UTC, Basile B. wrote:
>> On Wednesday, 26 February 2020 at 20:44:31 UTC, Bruce Carneal wrote:
>>> After shuffling the input, branchless wins by 2.4X (240%).
> [snip]
> Let me know if the above is unclear or insufficient.

The 2.4X improvement came when using a shuffled, uniform digit-count distribution: equal numbers of 1-digit values, 2-digit values, ..., put into an array and shuffled. Branchless *loses* by 8.5% or so on my Zen1 when the array is not shuffled (i.e. when branch predictions are nearly perfect).
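For concreteness, here is one way the benchmark input described above could be constructed: equal counts of values of each decimal length, then a Fisher-Yates shuffle to defeat the branch predictor. This is a sketch in C (the names `fill_equiprobable` and `shuffle` are my own, not from the thread):

```c
#include <stdint.h>
#include <stdlib.h>

#define PER_BUCKET 1000
#define TOTAL (9 * PER_BUCKET)

/* Fill buf with PER_BUCKET values of each decimal length 1..9.
   [lo, hi) is the range of all d-digit uint values. */
static void fill_equiprobable(uint32_t *buf)
{
    uint32_t lo = 1, hi = 10;
    size_t n = 0;
    for (int d = 1; d <= 9; ++d) {
        for (int i = 0; i < PER_BUCKET; ++i)
            buf[n++] = lo + (uint32_t)rand() % (hi - lo);
        lo = hi;
        if (d < 9) hi *= 10;  /* 10^10 would overflow uint32_t */
    }
}

/* Fisher-Yates shuffle, so consecutive digit lengths are unpredictable. */
static void shuffle(uint32_t *buf, size_t n)
{
    for (size_t i = n - 1; i > 0; --i) {
        size_t j = (size_t)rand() % (i + 1);
        uint32_t t = buf[i]; buf[i] = buf[j]; buf[j] = t;
    }
}
```

With the shuffle commented out you get the "sorted equi-probable" case where branch predictions are nearly perfect; with it you get the case where branchless wins.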
Re: Strange counter-performance in an alternative `decimalLength9` function
On Wednesday, 26 February 2020 at 23:09:34 UTC, Basile B. wrote:

>> After shuffling the input, branchless wins by 2.4X (240%).
>
> I've replaced the input by the front of a rndGen (which pops count times, starting with a custom seed) and I never see decimalLength9_3 (which seems to be the closest to the original in terms of performance) doing better.

The uniform random uint distribution yields a highly skewed digit-count distribution. Note, for example, that only 10 of the 2**32 possible outputs from a uint PRNG can be represented by a single decimal digit. Your current winner will be very hard to beat if the inputs are uniform random: that implementation's high-to-low cascade of early-exit ifs aligns with the PRNG output. If you aren't sure what the input distribution is, or want guaranteed good worst-case behavior, then branchless may be the better way to go.

> Maybe you talked about another implementation of decimalLength9?

Yes. It's one I wrote after I saw your post. Pseudo-code:

```d
auto d9_branchless(uint v) { return 1 + (v >= 10) + (v >= 100) ... }
```

Using ldc to target x86 with the above yields a series of cmpl/seta instruction pairs in the function body, followed by a sum and a return. No branching. Let me know if the above is unclear or insufficient.
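Written out in full, the branchless counter sketched above looks like this (a C rendering for illustration; the D version would use the same comparisons):

```c
#include <stdint.h>

/* Branchless decimal digit count for a 9-digit-max uint:
   each (v >= 10^k) comparison compiles to a cmp/set pair,
   and the results are summed, so there is nothing for the
   branch predictor to miss. */
static unsigned d9_branchless(uint32_t v)
{
    return 1
        + (v >= 10u)
        + (v >= 100u)
        + (v >= 1000u)
        + (v >= 10000u)
        + (v >= 100000u)
        + (v >= 1000000u)
        + (v >= 10000000u)
        + (v >= 100000000u);
}
```

The cost is always 8 comparisons, but they are independent and branch-free, which is exactly why it wins on shuffled input.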
DMD 2.090.1: SIGILL, Illegal instruction on (ahem) intel Pentium III
```d
// test.d
void main () { int [] v = new int [10]; }
```

```
$ [...]linux/bin32/dmd test
$ gdb test
[...]
(gdb) r
[...]
Program received signal SIGILL, Illegal instruction.
0x0809ad14 in _D2gc4impl12conservativeQw3Gcx10smallAllocMFNbkKkkxC8TypeInfoZPv ()
[...]
(gdb) disass
[...]
0x0809ad14 <_D2gc4impl12conservativeQw3Gcx10smallAllocMFNbkKkkxC8TypeInfoZPv+172>: movsd -0x58(%ebp),%xmm0
```

Does this exception relate to [1], and shall I file a bug, or do I have to decommission my PIII?

Stefan

[1] https://dlang.org/changelog/2.087.0.html#xmm-linux-changelog
Re: Strange counter-performance in an alternative `decimalLength9` function
On Wednesday, 26 February 2020 at 20:44:31 UTC, Bruce Carneal wrote:

> The winning function implementation lines up with that distribution. It would not fare as well with higher-entropy input. Using sorted equi-probable inputs (N 1-digit numbers, N 2-digit numbers, ...), decimalLength9_0 beats a simple branchless implementation by about 10%. After shuffling the input, branchless wins by 2.4X (240%).

I've replaced the input by the front of a rndGen (which pops count times, starting with a custom seed) and I never see decimalLength9_3 (which seems to be the closest to the original in terms of performance) doing better. Maybe you talked about another implementation of decimalLength9?
Re: Strange counter-performance in an alternative `decimalLength9` function
On Wednesday, 26 February 2020 at 22:07:30 UTC, Johan wrote:

> Hi Basile, I recently saw this presentation: https://www.youtube.com/watch?v=Czr5dBfs72U It has some ideas that may help you make sure your measurements are good and may give you ideas to find the performance bottleneck or where to optimize. llvm-mca is featured on godbolt.org: https://mca.godbolt.org/z/YWp3yv cheers, Johan

Yes, llvm-mca looks excellent, although I don't know if it's worth continuing... You see, this function is certainly not a bottleneck; it's just that I wanted to try to do better than the naive implementation. Fundamentally the problem is that:

1. the original is smaller and faster to decode;
2. the alternatives (especially the third) are conceptually better, but the cost of the jump table plus lzcnt wastes the gain.
Re: Strange counter-performance in an alternative `decimalLength9` function
On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:

> So after reading the translation of RYU I was interested to see if the decimalLength() function can be written to be faster, as it cascades up to 8 CMPs. ... Then, bad surprise: even with ldmd (so ldc2, basically) fed with the args from the script line, maybe the fdecimalLength9 version is slightly faster. Only *slightly*. Good news, I've lost my time. So I tried an alternative version that uses a table of delegates instead of a switch (ffdecimalLength9) and, surprise, "tada", it is like **100x** slower than the other two. How is that possible?

Hi Basile,

I recently saw this presentation: https://www.youtube.com/watch?v=Czr5dBfs72U It has some ideas that may help you make sure your measurements are good, and may give you ideas for finding the performance bottleneck or deciding where to optimize.

llvm-mca is featured on godbolt.org: https://mca.godbolt.org/z/YWp3yv

cheers,
Johan
Re: Strange counter-performance in an alternative `decimalLength9` function
On Wednesday, 26 February 2020 at 19:44:05 UTC, Bruce Carneal wrote:

> On Wednesday, 26 February 2020 at 13:50:11 UTC, Basile B. wrote:
>> On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
>>> ...
>> foreach (i; 0 .. count) sum += funcs[func](i);
>
> The input stream is highly predictable and strongly skewed towards higher digit counts. The winning function implementation lines up with that distribution. It would not fare as well with higher-entropy input.

Using sorted equi-probable inputs (N 1-digit numbers, N 2-digit numbers, ...), decimalLength9_0 beats a simple branchless implementation by about 10%. After shuffling the input, branchless wins by 2.4X (240%).
Re: SQLite 3 support?
On Wednesday, 26 February 2020 at 20:06:20 UTC, mark wrote:

> There seems to be some support for SQLite 3 in the standard library's `etc` package, judging by the stable docs: https://dlang.org/phobos/etc_c_sqlite3.html But this isn't visible when looking at the stable (ddox) docs. Is this the best SQLite 3 library to use, or is a third-party library better? For example https://github.com/biozic/d2sqlite3

I use d2sqlite3 regularly; no problems at all with it. I have no experience with the standard-library one.

Jordan
Re: SQLite 3 support?
On Wednesday, 26 February 2020 at 20:06:20 UTC, mark wrote:

> There seems to be some support for SQLite 3 in the standard library's `etc` package, judging by the stable docs: https://dlang.org/phobos/etc_c_sqlite3.html But this isn't visible when looking at the stable (ddox) docs. Is this the best SQLite 3 library to use, or is a third-party library better? For example https://github.com/biozic/d2sqlite3

What's in Phobos is just a D binding to the SQLite 3 C API, and it's probably not very up to date either. I used d2sqlite3 a lot some time ago and I'd definitely recommend it. It also provides a nice higher-level API that helps you work with SQLite in a more D-friendly and productive way. But I'm not following sqlite3 updates much nowadays.
SQLite 3 support?
There seems to be some support for SQLite 3 in the standard library's `etc` package, judging by the stable docs: https://dlang.org/phobos/etc_c_sqlite3.html

But this isn't visible when looking at the stable (ddox) docs. Is this the best SQLite 3 library to use, or is a third-party library better? For example https://github.com/biozic/d2sqlite3
Re: Strange counter-performance in an alternative `decimalLength9` function
On Wednesday, 26 February 2020 at 13:50:11 UTC, Basile B. wrote:

> On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
>> ...
> foreach (i; 0 .. count) sum += funcs[func](i);

The input stream is highly predictable and strongly skewed towards higher digit counts. The winning function implementation lines up with that distribution. It would not fare as well with higher-entropy input.
Re: Strange counter-performance in an alternative `decimalLength9` function
On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:

> How is that possible ?

It turns out that there was a problem with the benchmarking method: when the inputs come from a command-line argument, the different optimization passes of LLVM can no longer play games with the literal constants. It appears that none of the alternatives based on the "most significant bit trick" are faster (at least when covering a decent range of numbers):

```d
#!ldmd -boundscheck=off -release -inline -O -mcpu=native -mattr=+sse2,+sse3,+sse4.1,+sse4.2,+fast-lzcnt,+avx,+avx2,+cmov,+bmi,+bmi2
import core.memory;
import core.bitop;
import std.stdio;
import std.range;
import std.algorithm;
import std.getopt;

ubyte decimalLength9_0(const uint v)
{
    if (v >= 100000000) { return 9; }
    if (v >= 10000000) { return 8; }
    if (v >= 1000000) { return 7; }
    if (v >= 100000) { return 6; }
    if (v >= 10000) { return 5; }
    if (v >= 1000) { return 4; }
    if (v >= 100) { return 3; }
    if (v >= 10) { return 2; }
    return 1;
}

ubyte decimalLength9_1(const uint v) pure nothrow
{
    if (v == 0) // BSR and LZCNT UB when input is 0
        return 1;
    const ubyte lzc = cast(ubyte) bsr(v);
    ubyte result;
    switch (lzc)
    {
        case 0: case 1: case 2:     result = 1; break;
        case 3:                     result = v >= 10 ? 2 : 1; break;
        case 4: case 5:             result = 2; break;
        case 6:                     result = v >= 100 ? 3 : 2; break;
        case 7: case 8:             result = 3; break;
        case 9:                     result = v >= 1000 ? 4 : 3; break;
        case 10: case 11: case 12:  result = 4; break;
        case 13:                    result = v >= 10000 ? 5 : 4; break;
        case 14: case 15:           result = 5; break;
        case 16:                    result = v >= 100000 ? 6 : 5; break;
        case 17: case 18:           result = 6; break;
        case 19:                    result = v >= 1000000 ? 7 : 6; break;
        case 20: case 21: case 22:  result = 7; break;
        case 23:                    result = v >= 10000000 ? 8 : 7; break;
        case 24: case 25:           result = 8; break;
        case 26:                    result = v >= 100000000 ? 9 : 8; break;
        case 27: case 28: case 29: case 30: case 31: result = 9; break;
        default: assert(false);
    }
    return result;
}

private ubyte decimalLength9_2(const uint v) pure nothrow
{
    if (v == 0) // BSR and LZCNT UB when input is 0
        return 1;
    const ubyte lzc = cast(ubyte) bsr(v);
    static immutable pure nothrow ubyte function(uint)[32] tbl =
    [
        0 : (uint a) => ubyte(1),
        1 : (uint a) => ubyte(1),
        2 : (uint a) => ubyte(1),
        3 : (uint a) => a >= 10 ? ubyte(2) : ubyte(1),
        4 : (uint a) => ubyte(2),
        5 : (uint a) => ubyte(2),
        6 : (uint a) => a >= 100 ? ubyte(3) : ubyte(2),
        7 : (uint a) => ubyte(3),
        8 : (uint a) => ubyte(3),
        9 : (uint a) => a >= 1000 ? ubyte(4) : ubyte(3),
        10: (uint a) => ubyte(4),
        11: (uint a) => ubyte(4),
        12: (uint a) => ubyte(4),
        13: (uint a) => a >= 10000 ? ubyte(5) : ubyte(4),
        14: (uint a) => ubyte(5),
        15: (uint a) => ubyte(5),
        16: (uint a) => a >= 100000 ? ubyte(6) : ubyte(5),
        17: (uint a) => ubyte(6),
        18: (uint a) => ubyte(6),
        19: (uint a) => a >= 1000000 ? ubyte(7) : ubyte(6),
        20: (uint a) => ubyte(7),
        21: (uint a) => ubyte(7),
        22: (uint a) => ubyte(7),
        23: (uint a) => a >= 10000000 ? ubyte(8) : ubyte(7),
        24: (uint a) => ubyte(8),
        25: (uint a) => ubyte(8),
        26: (uint a) => a >= 100000000 ? ubyte(9) : ubyte(8),
        27: (uint a) => ubyte(9),
        28: (uint a) => ubyte(9),
        29: (uint a) => ubyte(9),
        30: (uint a) => ubyte(9),
        31: (uint a) => ubyte(9),
    ];
    return tbl[lzc](v);
}

ubyte decimalLength9_3(const uint v) pure nothrow
{
    if (v == 0) // BSR and LZCNT UB when input is 0
        return 1;
    ubyte result;
    const ubyte lzc = cast(ubyte) bsr(v);
    const ubyte[32] decimalLength9LUT =
    [
        0 : ubyte(1),  1 : ubyte(1),  2 : ubyte(1),  3 : ubyte(10),
        4 : ubyte(2),  5 : ubyte(2),  6 : ubyte(11), 7 : ubyte(3),
        8 : ubyte(3),  9 : ubyte(12), 10: ubyte(4),  11: ubyte(4),
        12: ubyte(4),  13: ubyte(12), 14: ubyte(5),  15: ubyte(5),
        16: ubyte(13), 17: ubyte(6),  18: ubyte(6),  19: ubyte(14),
        20: ubyte(7),  21: ubyte(7),  22: ubyte(7),  23: ubyte(15),
        24: ubyte(8),  25: ubyte(8),  26: ubyte(16), 27: ubyte(9),
        28: ubyte(9),  29: ubyte(9),  30: ubyte(9),  31: ubyte(9),
    ];
    ubyte resultOrSelector = decimalLength9LUT[lzc];
    if (resultOrSelector < 10)
        result = resultOrSelector;
    else switch (lzc)
    {
        case 3 : result = v >= 10 ? 2 : 1; break;
        case 6 : result = v >= 100 ? 3 : 2; break;
        case 9 : result = v >= 1000 ? 4 : 3; break;
        case 13: result = v >= 10000 ? 5 : 4; break;
        case 16: result = v >= 100000 ? 6 : 5; break;
        case 19: result = v >= 1000000 ? 7 : 6; break;
        case 23: result = v >= 10000000 ? 8 : 7; break;
        case 26: result = v >= 100000000 ? 9 : 8; break;
        default: assert(false);
    }
    return result;
}
```
Re: Strange counter-performance in an alternative `decimalLength9` function
On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:

> So after reading the translation of RYU I was interested to see if the decimalLength() function can be written to be faster, as it cascades up to 8 CMPs. [...]

It can be made faster using binary search, though not by much. You get ceil(log2(number of places)) comparisons.
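The binary-search idea mentioned above could look like the following (a sketch in C with my own naming, not code from the thread): with 9 possible lengths, the comparison tree is at most ceil(log2(9)) = 4 deep, versus up to 8 comparisons for the linear cascade.

```c
#include <stdint.h>

/* Binary-search digit count: split the 1..9 digit range roughly in
   half at each step, so the worst case is 4 comparisons instead of 8. */
static unsigned d9_binsearch(uint32_t v)
{
    if (v >= 100000u) {                  /* 6..9 digits */
        if (v >= 10000000u)              /* 8..9 digits */
            return v >= 100000000u ? 9 : 8;
        return v >= 1000000u ? 7 : 6;    /* 6..7 digits */
    }
    if (v >= 100u) {                     /* 3..5 digits */
        if (v >= 10000u) return 5;
        return v >= 1000u ? 4 : 3;
    }
    return v >= 10u ? 2 : 1;             /* 1..2 digits */
}
```

Unlike the branchless sum, every comparison here is a conditional branch, so its speed still depends on how predictable the input distribution is.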
Re: String switch is odd using betterC
On Wednesday, 26 February 2020 at 08:32:50 UTC, Abby wrote:

> Ok, so this is enough to produce the same result; it seems that there is a problem with string switch when there are more than 6 cases.
>
> ```d
> extern(C) void main() { auto s = "F"; final switch(s) { case "A": break; case "B": break; case "C": break; case "D": break; case "E": break; case "F": break; case "G": break; } }
> ```

The explanation can be found in druntime/import/core/internal/switch_.d: the `__switch` template does a simple binary search for fewer than 7 cases, but calls `.idup` on each case label for 7 or more cases. There's a comment there about why it's being done, but it seems to be a far more complicated fix than necessary; `static immutable cases = [caseLabels];` works just as well, it seems. The current code was added in this commit: https://github.com/dlang/druntime/commit/fa665f6618af7dbc09ed5ba1333f385017b7ece8

Anyway, reported here: https://issues.dlang.org/show_bug.cgi?id=20613

--
Simen
Re: String switch is odd using betterC
On Wednesday, 26 February 2020 at 08:32:50 UTC, Abby wrote:

> Ok, so this is enough to produce the same result; it seems that there is a problem with string switch when there are more than 6 cases.
>
> ```d
> extern(C) void main() { auto s = "F"; final switch(s) { case "A": break; case "B": break; case "C": break; case "D": break; case "E": break; case "F": break; case "G": break; } }
> ```

This looks like a possible cause: https://github.com/dlang/druntime/blob/e018a72084e54ecc7466e97c96e4557b96b7f905/src/core/internal/switch_.d#L34
Re: String switch is odd using betterC
On Wednesday, 26 February 2020 at 08:25:00 UTC, Abby wrote:

> Any idea why?

Ok, so this is enough to produce the same result; it seems that there is a problem with string switch when there are more than 6 cases.

```d
extern(C) void main()
{
    auto s = "F";
    final switch(s)
    {
        case "A": break;
        case "B": break;
        case "C": break;
        case "D": break;
        case "E": break;
        case "F": break;
        case "G": break;
    }
}
```
String switch is odd using betterC
I have a simple enum of strings:

```d
enum Alphabet : string
{
    a = "A", b = "B", c = "C", d = "D", e = "E", f = "F", g = "G"
}
```

and then a simple final switch:

```d
extern(C) void main()
{
    auto s = Alphabet.f;
    final switch(s)
    {
        case Alphabet.a: break;
        case Alphabet.b: break;
        case Alphabet.c: break;
        case Alphabet.d: break;
        case Alphabet.e: break;
        case Alphabet.f: break;
        case Alphabet.g: break;
    }
}
```

The problem I have is that this causes:

```
/dlang/dmd/linux/bin64/../../src/druntime/import/object.d(2999): Error: TypeInfo cannot be used with -betterC
```

The odd thing is that when I remove g from my enum it compiles just fine, so it seems that this compilation error occurs only when my enum has more than 6 members. Any idea why?
Re: String switch is odd using betterC
On Wednesday, 26 February 2020 at 08:25:00 UTC, Abby wrote:

> /dlang/dmd/linux/bin64/../../src/druntime/import/object.d(2999): Error: TypeInfo cannot be used with -betterC
>
> The odd thing is that when I remove g from my enum it compiles just fine, so it seems that this compilation error occurs only when my enum has more than 6 members. Any idea why?

Ok, so this is enough to reproduce the problem:

```d
extern(C) void main()
{
    auto s = "F";
    final switch(s)
    {
        case "A": break;
        case "B": break;
        case "C": break;
        case "D": break;
        case "E": break;
        case "F": break;
        case "G": break;
    }
}
```