> It seems to make sense to bump cost of idiv a bit, given the fact that there
> are register pressure implications.
> I would like to however understand what code sequences we produce that are
> estimated to be long but ends up being shorter in practice.  Would be possible
> to try to give me some examples of constants where it is important to bump
> cost to 8?  It is possible we can simply fix cost estimation in divmod
> expansion instead.

Attached t.c.bz2 is a good source file to experiment with.
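
For readers without the attachment, here is a guess at what the relevant functions in t.c look like. The names are taken from the disassembly below; the signatures and bodies are my assumption, inferred from the "int_x / N" and "unsigned_x / N" descriptions, and are not copied from the real file.

```c
/* Hypothetical reconstruction, not the actual contents of t.c. */

/* signed int divided by a power-of-two constant */
int id_x_16(int x)
{
    return x / 16;
}

int id_x_2(int x)
{
    return x / 2;
}

/* unsigned int divided by a non-power-of-two constant */
unsigned ud_x_641(unsigned x)
{
    return x / 641;
}
```

One function per constant, combined with -ffunction-sections, gives each division its own section in the objdump output.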

With last month's svn snapshot of gcc, I did the following:

/usr/app/gcc-4.4.svn.20090528/bin/gcc -g0 -Os -fomit-frame-pointer \
    -ffunction-sections -c t.c
objdump -dr t.o >t.asm

with and without the patch, and compared the results. (-ffunction-sections is
used merely because it makes "objdump -dr" output much more suitable for
diffing.)

Here is the diff between the code generated by unpatched and patched gcc for
int_x / 16:

 Disassembly of section .text.id_x_16:
 0000000000000000 <id_x_16>:
-   0:  89 f8                   mov    %edi,%eax
-   2:  ba 10 00 00 00          mov    $0x10,%edx
-   7:  89 d1                   mov    %edx,%ecx
-   9:  99                      cltd
-   a:  f7 f9                   idiv   %ecx
-   c:  c3                      retq
+   0:  8d 47 0f                lea    0xf(%rdi),%eax
+   3:  85 ff                   test   %edi,%edi
+   5:  0f 49 c7                cmovns %edi,%eax
+   8:  c1 f8 04                sar    $0x4,%eax
+   b:  c3                      retq

int_x / 2:

 Disassembly of section .text.id_x_2:
 0000000000000000 <id_x_2>:
    0:  89 f8                   mov    %edi,%eax
-   2:  ba 02 00 00 00          mov    $0x2,%edx
-   7:  89 d1                   mov    %edx,%ecx
-   9:  99                      cltd
-   a:  f7 f9                   idiv   %ecx
-   c:  c3                      retq
+   2:  c1 e8 1f                shr    $0x1f,%eax
+   5:  01 f8                   add    %edi,%eax
+   7:  d1 f8                   sar    %eax
+   9:  c3                      retq

As you can see, the code becomes smaller and *much* faster (not even a mul insn
is used now).
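
For illustration, the two patched sequences above can be written back in C roughly as follows. This is only a sketch: the function names are made up, and it assumes that >> on a negative int is an arithmetic shift (true for gcc, but implementation-defined in ISO C).

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* lea 0xf(%rdi),%eax; test; cmovns; sar $4:
   add 15 only when x is negative, so the shift rounds toward zero. */
static int32_t div16_shift(int32_t x)
{
    int32_t t = (x < 0) ? x + 15 : x;
    return t >> 4;              /* assumes arithmetic right shift */
}

/* shr $0x1f; add; sar:
   the sign bit (0 or 1) is the rounding bias for division by 2. */
static int32_t div2_shift(int32_t x)
{
    int32_t sign = (int32_t)((uint32_t)x >> 31);
    return (x + sign) >> 1;     /* assumes arithmetic right shift */
}

/* Returns 1 if both helpers agree with plain division on a few edge
   cases. A spot check, not a proof. */
static int shift_div_spot_check(void)
{
    static const int32_t samples[] = {
        0, 1, -1, 15, 16, 17, -15, -16, -17, INT_MAX, INT_MIN
    };
    unsigned i;

    for (i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        if (div16_shift(samples[i]) != samples[i] / 16)
            return 0;
        if (div2_shift(samples[i]) != samples[i] / 2)
            return 0;
    }
    return 1;
}
```

The bias is needed because C division truncates toward zero while an arithmetic shift rounds toward minus infinity; without it, -1 / 16 would come out as -1 instead of 0.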

Here is an example of unsigned_x / 641. In this case, code size is the same,
but the code is faster:

 Disassembly of section .text.ud_x_641:
 0000000000000000 <ud_x_641>:
-   0:  ba 81 02 00 00          mov    $0x281,%edx
-   5:  89 f8                   mov    %edi,%eax
-   7:  89 d1                   mov    %edx,%ecx
-   9:  31 d2                   xor    %edx,%edx
-   b:  f7 f1                   div    %ecx
+   0:  89 f8                   mov    %edi,%eax
+   2:  48 69 c0 81 3d 66 00    imul   $0x663d81,%rax,%rax
+   9:  48 c1 e8 20             shr    $0x20,%rax
    d:  c3                      retq
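
A C rendering of the patched sequence, as a sketch (the function names are made up for this example). The constant works because 641 * 0x663d81 = 2^32 + 1, so for any 32-bit x the high half of the 64-bit product is exactly x / 641.

```c
#include <assert.h>
#include <stdint.h>

/* Same steps as the patched code: zero-extend x to 64 bits, multiply
   by 0x663d81, and keep the upper 32 bits of the product. */
static uint32_t div641_mul(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * UINT64_C(0x663d81)) >> 32);
}

/* Returns 1 if the multiply form agrees with x / 641 on a few edge
   cases. A spot check, not a proof. */
static int div641_spot_check(void)
{
    static const uint32_t samples[] = {
        0u, 1u, 640u, 641u, 1281u, 1282u,
        0x7fffffffu, 0x80000000u, 0xfffffffeu, 0xffffffffu
    };
    unsigned i;

    for (i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        if (div641_mul(samples[i]) != samples[i] / 641u)
            return 0;
    }
    return 1;
}
```

The product never overflows 64 bits: the largest value is below 2^32 * 0x663d81, which is well under 2^63.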

There is not a single instance of code growth. Either newer gcc is better, or
maybe the code growth cases occur in 32-bit code only.

I will attach t64.asm.diff; take a look if you want to see all the changes in
the generated code.



