Re: [ft-devel] FT_MulFix assembly

2010-08-06 Thread Werner LEMBERG
> I see implementations for ia32 and arm; would other platforms
> benefit from assembly implementations of MulFix?
As usual: patches are highly welcome.
Werner
___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mai

Re: [ft-devel] FT_MulFix assembly

2010-08-08 Thread James Cloos
My first cut at FT_MulFix_x86_64() is:
  static __inline__ FT_Int32
  FT_MulFix_x86_64 (FT_Int32 a, FT_Int32 b) {
    register FT_Int32 r;
    __asm__ __volatile__ (
      "movslq %%edx, %%rdx\n"
      "cltq\n"
      "imul %%rdx\n"
      "addq %%rdx, %%rax\n"
      "addq $0x8000, %%rax\n"

Re: [ft-devel] FT_MulFix assembly

2010-08-12 Thread Werner LEMBERG
> I have to finish the patch, but I thought I'd offer the algorithm
> for review, if anyone wants to.
I haven't enough knowledge to comment, but thanks for working on it!
Werner

Re: [ft-devel] FT_MulFix assembly

2010-09-05 Thread James Cloos
The final result for amd64 looks like:
  static __inline__ long
  FT_MulFix_x86_64( long a, long b ) {
    register long result;
    __asm__ __volatile__ (
      "movq %1, %%rax\n"
      "imul %2\n"
      "addq %%rdx, %%rax\n"
      "addq $0x8000, %%rax\n"
      "sarq

Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread Graham Asher
Have you done an ARM version? Forgive my inattentiveness if you've already announced one. It just struck me that this sort of optimisation is even more necessary on mobile devices.
Graham
James Cloos wrote:
The final result for amd64 looks like: static __inline__ long FT_MulFix_x86_64(

Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread Miles Bader
James Cloos writes:
>   __asm__ __volatile__ (
>     "movq %1, %%rax\n"
>     "imul %2\n"
>     "addq %%rdx, %%rax\n"
>     "addq $0x8000, %%rax\n"
>     "sarq $16, %%rax\n"
>     : "=a"(result)
>     : "g"(a), "g"(b)
>     : "rdx" );
>
> The above code has a latency of 1+5+

Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread Miles Bader
Incidentally, you wrote:
> The assembly generated by the C code is 45 lines and 158 octets long,
> contains six conditional jumps, three each of explicit compares and
> tests, and still benchmarks are just as fast. Out-of-order processing
> wins out over hand-coded asm. :-/
... but when I follow

Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread Miles Bader
Miles Bader writes:
> The compiler generates the following assembly:
>
>     mov     %esi, %eax
>     mov     %edi, %edi
>     imulq   %rdi, %rax
>     addq    $32768, %rax
>     shrq    $16, %rax
>
> The movs there are obviously a bit silly (compiler bug?), but that
> output seems reaso

Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread James Cloos
> "GA" == Graham Asher writes:
GA> Have you done an ARM version? Forgive my inattentiveness if you've
GA> already announced one. It just struck me that this sort of
GA> optimisation is even more necessary on mobile devices.
I386, arm and arm-thumb versions were already there.
-JimC
-- Jame

Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread James Cloos
> "MB" == Miles Bader writes:
MB> The compiler generates the following assembly:
MB>     mov     %esi, %eax
MB>     mov     %edi, %edi
MB>     imulq   %rdi, %rax
MB>     addq    $32768, %rax
MB>     shrq    $16, %rax
That does not match the C code though; it rounds negative values wrong. T

Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread Miles Bader
On Tue, Sep 7, 2010 at 4:28 AM, James Cloos wrote:
>> "MB" == Miles Bader writes:
>
> MB> The compiler generates the following assembly:
>
> MB>     mov     %esi, %eax
> MB>     mov     %edi, %edi
> MB>     imulq   %rdi, %rax
> MB>     addq    $32768, %rax
> MB>     shrq    $16, %rax
>
> That

Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread James Cloos
> "MB" == Miles Bader writes:
>> The C version does away-from-zero rounding.
MB> Do you have test cases that show this? I tried using random inputs,
MB> but even up to billions of iterations, I can't seem to find a set of
MB> inputs where my function yields different results from yours.
Th
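For readers unfamiliar with the term, here is a minimal sketch of away-from-zero rounding for a 16.16 fixed-point product. It is not FreeType's source; the name mulfix_away_from_zero is hypothetical, and the result is assumed to fit in 32 bits. The idea is to round the magnitude and reapply the sign afterwards, so that ties land symmetrically on either side of zero.

```c
#include <stdint.h>

/* Hypothetical helper, not FreeType's code: 16.16 fixed-point multiply
   that rounds ties away from zero by rounding the magnitude first. */
static int32_t mulfix_away_from_zero(int32_t a, int32_t b)
{
    /* The result is negative when exactly one operand is negative. */
    int     negative = (a < 0) != (b < 0);

    /* Widen before negating so INT32_MIN does not overflow. */
    int64_t ma = a < 0 ? -(int64_t)a : (int64_t)a;
    int64_t mb = b < 0 ? -(int64_t)b : (int64_t)b;

    /* The operand is non-negative here, so add-half-then-shift
       rounds ties upward in magnitude. */
    int64_t q = (ma * mb + 0x8000) >> 16;

    /* Assumes the rounded result fits in 32 bits. */
    return (int32_t)(negative ? -q : q);
}
```

With this sketch, a product of exactly one half (for example 1 times 0x8000) rounds to 1, and a product of exactly minus one half rounds to -1.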

Re: [ft-devel] FT_MulFix assembly

2010-09-07 Thread Miles Bader
James Cloos writes:
>>> The C version does away-from-zero rounding.
>
> MB> Do you have test cases that show this? I tried using random inputs,
> MB> but even up to billions of iterations, I can't seem to find a set of
> MB> inputs where my function yields different results from yours.
>
> The C

Re: [ft-devel] FT_MulFix assembly

2010-09-07 Thread James Cloos
> "MB" == Miles Bader writes:
MB> Hm, are you sure that's not backwards? When I tried the git C version[*],
MB> as well as your most recent FT_MulFix_x86_64, it returned 0x8506...
Odd. Adding your algo to my test app, I get:
  7AFA8000, , 8505, 8505, 8506 #a

Re: [ft-devel] FT_MulFix assembly

2010-09-07 Thread Miles Bader
James Cloos writes:
> Since FT's C version uses longs, though, this:
>
>   int another (long a, long b) {
>     long r = (long)a * (long)b;
>     long s = r >> 31;
>     return (r + s + 0x8000) >> 16;
>   }
That's not correct though, is it? The variable "s" should be the "all sign" portion of the mu
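The objection above is cut off in the archive, so the following is only a sketch of one way to read it. It assumes a 64-bit product (int64_t standing in for a 64-bit long) and that >> on a negative value is an arithmetic shift, which is implementation-defined in C but is what gcc and clang do. The helper names are hypothetical. Under those assumptions, shifting a 64-bit value by 31 leaves the upper 33 bits of the product rather than a pure sign mask, whereas shifting by 63 yields 0 or -1.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only. Both assume an arithmetic
   right shift on negative values (implementation-defined in C). */

/* Keeps bits 31..63 of r: only a sign mask if r fits in 32 bits. */
static int64_t shift_by_31(int64_t r)
{
    return r >> 31;
}

/* Replicates the sign bit: 0 for r >= 0, -1 for r < 0. */
static int64_t sign_mask_64(int64_t r)
{
    return r >> 63;
}
```

For instance, for the positive value 0x80000000 the first helper returns 1 while the second returns 0, so only the second behaves as a sign mask across the whole 64-bit range.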

Re: [ft-devel] FT_MulFix assembly

2010-09-18 Thread James Cloos
The difference between:
  int miles (int32_t a, int32_t b) {
    return (((long)a * (long)b) + 0x8000) >> 16;
  }
and:
  int another (long a, long b) {
    long r = a * b;
    long s = r >> SIZEOF_LONG_LESS_ONE;
    return (r + s + 0x8000) >> 16;
  }
only shows up for products which are negative and wh
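A self-contained restatement of the two functions may help in trying them side by side. This is a sketch under stated assumptions, not the posted patch: int64_t stands in for a 64-bit long, the constant 63 stands in for SIZEOF_LONG_LESS_ONE, another() takes int32_t arguments here so both functions share a signature, and >> on negative values is assumed to be an arithmetic shift.

```c
#include <stdint.h>

/* Restated from the excerpt with fixed-width types (an assumption):
   add half, then shift. Rounds ties toward positive infinity. */
static int32_t miles(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b + 0x8000) >> 16);
}

/* Restated from the excerpt; 63 replaces SIZEOF_LONG_LESS_ONE, assuming
   a 64-bit product and an arithmetic right shift. */
static int32_t another(int32_t a, int32_t b)
{
    int64_t r = (int64_t)a * (int64_t)b;
    int64_t s = r >> 63;   /* 0 when r >= 0, -1 when r < 0 */

    /* Adding s lowers a negative product by one before rounding, which
       pushes an exact negative tie to the more negative result. */
    return (int32_t)((r + s + 0x8000) >> 16);
}
```

One input where the two sketches disagree is a = -1, b = 0x8000, whose product is exactly minus one half in 16.16: miles() returns 0 and another() returns -1. For the positive tie a = 1, b = 0x8000 both return 1.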

Re: [ft-devel] FT_MulFix assembly

2010-09-19 Thread Werner LEMBERG
> Werner: Miles' version is shorter, is only wrong by one ulp and only
> when the product overflows and is negative. My variation,
> called another() above, fixes that slight difference.
> Which would you prefer, if anything?
I tend to prefer the faster one - I