My earlier experiment was on GHC-7.10.3. I repeated this on GHC-8.0.1 and the assembly traced was exactly the same except for a marginal improvement. The 8.0.1 code generator removed the r14/r11 swap but the rest of the register ring shift remains the same. I have updated the github gist with the 8.0.1 trace:
https://gist.github.com/harendra-kumar/7d34c6745f604a15a872768e57cd2447 thanks, harendra On 13 June 2016 at 00:08, Harendra Kumar <harendra.ku...@gmail.com> wrote: > Hi, > > I am implementing unicode normalization in Haskell. I challenged myself to > match the performance with the best C/C++ implementation, the best being > the ICU library. I am almost there, beating it in one of the benchmarks and > within 30% for others. I am out of all application level tricks that I > could think of and now need help from the compiler. > > I started with a bare minimum loop and adding functionality incrementally > watching where the performance trips. At one point I saw that adding just > one 'if' condition reduced the performance by half. I looked at what's > going on at the assembly level. Here is a github gist of the assembly > instructions executed in the fast path of the loop, corresponding cmm > snippets and also the full cmm corresponding to the loop: > > https://gist.github.com/harendra-kumar/7d34c6745f604a15a872768e57cd2447 > > I have annotated the assembly code with labels matching the corresponding > CMM. > > With the addition of another "if" condition the loop which was pretty > simple till now suddenly got bloated with a lot of register reassignments. > Here is a snippet of the register movements added: > > # _n4se: > # swap r14 <-> r11 > => 0x408d6b: mov %r11,0x98(%rsp) > => 0x408d73: mov %r14,%r11 > => 0x408d76: mov 0x98(%rsp),%r14 > > # reassignments > # rbx -> r10 -> r9 -> r8 -> rdi -> rsi -> rdx -> rcx -> rbx > => 0x408d7e: mov %rbx,0x90(%rsp) > => 0x408d86: mov %rcx,%rbx > => 0x408d89: mov %rdx,%rcx > => 0x408d8c: mov %rsi,%rdx > => 0x408d8f: mov %rdi,%rsi > => 0x408d92: mov %r8,%rdi > => 0x408d95: mov %r9,%r8 > => 0x408d98: mov %r10,%r9 > => 0x408d9b: mov 0x90(%rsp),%r10 > . > . > . > loop logic here which uses only %rax, %r10 and %r9 . > . > . > . > # _n4s8: > # shuffle back to original assignments > => 0x4090dc: mov %r14,%r11 > => 0x4090df: mov %r9,%r10 > => 0x4090e2: mov %r8,%r9 > => 0x4090e5: mov %rdi,%r8 > => 0x4090e8: mov %rsi,%rdi > => 0x4090eb: mov %rdx,%rsi > => 0x4090ee: mov %rcx,%rdx > => 0x4090f1: mov %rbx,%rcx > => 0x4090f4: mov %rax,%rbx > => 0x4090f7: mov 0x88(%rsp),%rax > > => 0x4090ff: jmpq 0x408d2a > > > The registers seem to be getting reassigned here, data flowing from one to > the next. In this particular path a lot of these register movements seem > unnecessary and are only undone at the end without being used. > > Maybe this is because these are reusable blocks and the movement is > necessary when used in some other path? > > Can this be avoided? Or at least avoided in a certain fast path somehow by > hinting the compiler? Any pointers to the GHC code will be appreciated. I > am not yet much familiar with the GHC code but can dig deeper pretty > quickly. But before that I hope someone knowledgeable in this area can shed > some light on this at a conceptual level or if at all it can be improved. I > can provide more details and experiment more if needed. > > Thanks, > Harendra >
_______________________________________________ Glasgow-haskell-users mailing list Glasgow-haskell-users@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/glasgow-haskell-users