On Wed, Mar 18, 2015 at 1:50 AM, Denys Vlasenko
<vda.li...@googlemail.com> wrote:
> On Sat, Feb 21, 2015 at 12:51 AM, Rasmus Villemoes
> <li...@rasmusvillemoes.dk> wrote:
>> The most expensive part of decimal conversion is the divisions by 10
>> (albeit done using reciprocal multiplication with appropriately chosen
>> constants). I decided to see if one could eliminate around half of
>> these multiplications by emitting two digits at a time, at the cost of
>> a 200 byte lookup table, and it does indeed seem like there is
>> something to be gained, especially on 64 bits.
> ...
> ...
>> +char *put_dec_trunc8(char *buf, unsigned r)
>>  {
>> +       unsigned q;
>> +
>> +       /* 1 <= r < 10^8 */
>> +       if (r < 100)
>> +               goto out_r;
>> +
>> +       /* 100 <= r < 10^8 */
>> +       q = (r * (u64)0x28f5c29) >> 32;
>> +       *((u16 *)buf) = decpair[r - 100*q];
>> +       buf += 2;
>> +
>> +       /* 1 <= q < 10^6 */
>> +       if (q < 100)
>> +               goto out_q;
>> +
>> +       /*  100 <= q < 10^6 */
>> +       r = (q * (u64)0x28f5c29) >> 32;
>> +       *((u16 *)buf) = decpair[q - 100*r];
>> +       buf += 2;
>> +
>> +       /* 1 <= r < 10^4 */
>> +       if (r < 100)
>> +               goto out_r;
>> +
>> +       /* 100 <= r < 10^4 */
>> +       q = (r * 0x147b) >> 19;
>> +       *((u16 *)buf) = decpair[r - 100*q];
>> +       buf += 2;
>> +out_q:
>> +       /* 1 <= q < 100 */
>> +       r = q;
>> +out_r:
>> +       /* 1 <= r < 100 */
>> +       *((u16 *)buf) = decpair[r];
>> +       buf += 2;
>> +       if (buf[-1] == '0')
>> +               buf--;
>>         return buf;
>>  }
>
> Your code does four 16-bit stores.

I quoted wrong part of your email.
Of course, I meant put_dec_full8(), not put_dec_trunc8().

> The version below does two 32-bit ones instead,
> and it is also marginally smaller.
>
> char *put_dec_full8(char *buf, unsigned r)
> {
>         unsigned q;
>         u32 v;
>
>         /* 0 <= r < 10^8 */
>         q = (r * (u64)0x28f5c29) >> 32;
>         v = (u32)decpair[r - 100*q] << 16;
>
>         /* 0 <= q < 10^6 */
>         r = (q * (u64)0x28f5c29) >> 32;
>         v = v | decpair[q - 100*r];
>         ((u32*)buf)[0] = v;
>
>         /* 0 <= r < 10^4 */
>         q = (r * 0x147b) >> 19;
>         v = (u32)decpair[r - 100*q] << 16;
>
>         /* 0 <= q < 100 */
>         v = v | decpair[q];
>         ((u32*)buf)[1] = v;
>
>         return buf + 8;
> }
>
> It may be faster not only because of having fewer stores,
> but because on x86, this code (moving 16-bit halves):
>
>         movw    decpair(%ebx,%ebx), %dx
>         movw    %dx, 4(%eax)
>         movw    decpair(%ecx,%ecx), %dx
>         movw    %dx, 6(%eax)
>
> suffers from register merge stall when 16-bit value
> is read into lower part of %edx. 32-bit code
> has no such stalls:
>
>         movzwl  decpair(%ebx,%ebx), %edx
>         sall    $16, %edx
>         movzwl  decpair(%ecx,%ecx), %ecx
>         orl     %ecx, %edx
>         movl    %edx, 4(%eax)
>
>
> Before you ask:
> I didn't manage to extend this trick to a code
> with one 64-bit store which is not larger.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to