My earlier experiment was on GHC-7.10.3. I repeated this on GHC-8.0.1 and
the assembly traced was exactly the same except for a marginal improvement.
The 8.0.1 code generator removed the r14/r11 swap but the rest of the
register ring shift remains the same. I have updated the github gist with
the 8.0.1 trace:

https://gist.github.com/harendra-kumar/7d34c6745f604a15a872768e57cd2447

thanks,
harendra

On 13 June 2016 at 00:08, Harendra Kumar <harendra.ku...@gmail.com> wrote:

> Hi,
>
> I am implementing unicode normalization in Haskell. I challenged myself to
> match the performance with the best C/C++ implementation, the best being
> the ICU library. I am almost there, beating it in one of the benchmarks and
> within 30% for others. I am out of all application level tricks that I
> could think of and now need help from the compiler.
>
> I started with a bare minimum loop and adding functionality incrementally
> watching where the performance trips. At one point I saw that adding just
> one 'if' condition reduced the performance by half. I looked at what's
> going on at the assembly level. Here is a github gist of the assembly
> instructions executed in the fast path of the loop, corresponding cmm
> snippets and also the full cmm corresponding to the loop:
>
> https://gist.github.com/harendra-kumar/7d34c6745f604a15a872768e57cd2447
>
> I have annotated the assembly code with labels matching the corresponding
> CMM.
>
> With the addition of another "if" condition the loop which was pretty
> simple till now suddenly got bloated with a lot of register reassignments.
> Here is a snippet of the register movements added:
>
> # _n4se:
> # swap r14 <-> r11
> => 0x408d6b: mov    %r11,0x98(%rsp)
> => 0x408d73: mov    %r14,%r11
> => 0x408d76: mov    0x98(%rsp),%r14
>
> # reassignments
> # rbx -> r10 -> r9 -> r8 -> rdi -> rsi -> rdx -> rcx -> rbx
> => 0x408d7e: mov    %rbx,0x90(%rsp)
> => 0x408d86: mov    %rcx,%rbx
> => 0x408d89: mov    %rdx,%rcx
> => 0x408d8c: mov    %rsi,%rdx
> => 0x408d8f: mov    %rdi,%rsi
> => 0x408d92: mov    %r8,%rdi
> => 0x408d95: mov    %r9,%r8
> => 0x408d98: mov    %r10,%r9
> => 0x408d9b: mov    0x90(%rsp),%r10
> .
> .
> .
> loop logic here which uses only %rax, %r10 and %r9 .
> .
> .
> .
> # _n4s8:
> # shuffle back to original assignments
> => 0x4090dc: mov    %r14,%r11
> => 0x4090df: mov    %r9,%r10
> => 0x4090e2: mov    %r8,%r9
> => 0x4090e5: mov    %rdi,%r8
> => 0x4090e8: mov    %rsi,%rdi
> => 0x4090eb: mov    %rdx,%rsi
> => 0x4090ee: mov    %rcx,%rdx
> => 0x4090f1: mov    %rbx,%rcx
> => 0x4090f4: mov    %rax,%rbx
> => 0x4090f7: mov    0x88(%rsp),%rax
>
> => 0x4090ff: jmpq   0x408d2a
>
>
> The registers seem to be getting reassigned here, data flowing from one to
> the next. In this particular path a lot of these register movements seem
> unnecessary and are only undone at the end without being used.
>
> Maybe this is because these are reusable blocks and the movement is
> necessary when used in some other path?
>
> Can this be avoided? Or at least avoided in a certain fast path somehow by
> hinting the compiler? Any pointers to the GHC code will be appreciated. I
> am not yet much familiar with the GHC code but can dig deeper pretty
> quickly. But before that I hope someone knowledgeable in this area can shed
> some light on this at a conceptual level or if at all it can be improved. I
> can provide more details and experiment more if needed.
>
> Thanks,
> Harendra
>
_______________________________________________
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/glasgow-haskell-users

Reply via email to