On Monday 21 February 2011 13:07:31 김태균 wrote:
> Hi,
> Thank you for the reply.
> 
> > Regarding performance, improving it by a factor of two still leaves it a
> > little bit too slow on hardware which has SIMD. On x86, support for SSE2
> > is pretty much ubiquitous, so it is quite natural to use it if it proves
> > to be beneficial. But for low end embedded machines with primitive
> > processors without SIMD, it may indeed be very good to have any kind of
> > performance improvement.
> 
> Yes, right.
> I will utilize SIMD as much as I can. (NEON is available on some of our
> target machines.)

Great. Contributions in this area would definitely be useful. But you may
have started this work a bit too late ;) I have been looking into improving
bilinear scaling performance for the last couple of weeks already and have
just submitted some initial SSE2 and ARM NEON optimizations for it (btw,
testing is very much welcome). And there is still a lot of work to do before
all the bilinear scaling related performance bottlenecks are eliminated.

> But I have to consider not only high end machines but also low end ones
> which do not support SIMD.
> That's why I'm trying to optimize the non-SIMD general code path.

Well, in your original e-mail you mentioned that you are interested in
getting good performance on an Intel quad core. That's why, without having
any other information, I suggested SSE2 as a solution for this problem :)

What kind of hardware do the rest of your target machines have? A lot of ARM
processors, beginning with armv5te, have special instructions for fast signed
16-bit multiplication. If we know what the target hardware supports, we may
modify the bilinear interpolation code to make better use of it.
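
To give a rough idea of what I mean (just a sketch, not code from pixman; the
lerp_channel helper is made up), a per-channel linear interpolation step can
be written so that both multiply operands fit into signed 16 bits, which is
exactly the shape of multiplication that armv5te's SMULBB (signed
16x16 -> 32) handles in one instruction:

#include <stdint.h>

static inline uint32_t
lerp_channel (uint32_t a, uint32_t b, int w)    /* a, b: 0..255, w: 0..256 */
{
    int16_t d   = (int16_t) ((int) b - (int) a);   /* -255..255 */
    int16_t w16 = (int16_t) w;                     /* 0..256    */

    /* a * (256 - w) + b * w == (a << 8) + d * w; the product d * w16 is
     * a signed 16x16 -> 32 multiply, i.e. a single SMULBB on armv5te. */
    return ((a << 8) + (int32_t) d * w16) >> 8;
}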

The current bilinear interpolation code has one problem: it needs 16-bit
unsigned multiplications (uint16_t * uint16_t -> uint32_t), which are also
not so efficient for MMX/SSE2. Maybe going down from 256 levels to 128 levels
could allow the use of signed 16-bit multiplications and provide more
optimization possibilities on a wide range of hardware? Also SSSE3 may be
worth considering, because it has the PMADDUBSW instruction
(uint8_t * int8_t -> int16_t).
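
Just to illustrate the idea (again only a rough sketch with a made-up
bilinear_channel_128 helper, not the existing implementation): with 0..128
weights, every intermediate product of an 8-bit channel value and a weight
stays below 255 * 128 = 32640, i.e. within int16_t range, so a SIMD version
could get away with plain signed 16-bit multiplies:

#include <stdint.h>

static inline uint8_t
bilinear_channel_128 (uint8_t tl, uint8_t tr,
                      uint8_t bl, uint8_t br,
                      int wx, int wy)              /* weights are 0..128 */
{
    /* Horizontal pass: each sum is at most 255 * 128 = 32640, so it fits
     * into int16_t and could be done with signed 16-bit multiplies. */
    int16_t top = (tl * (128 - wx) + tr * wx) >> 7;
    int16_t bot = (bl * (128 - wx) + br * wx) >> 7;

    /* Vertical pass: again only signed 16-bit ranges are involved. */
    return (uint8_t) ((top * (128 - wy) + bot * wy) >> 7);
}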

ARM NEON, in contrast, is not challenging at all and even a bit boring,
because it is totally orthogonal and easily supports all kinds of vector
multiplications (8-bit and 16-bit, both signed and unsigned, both ordinary
and long variants). I guess it would work fine with any interpolation method,
like it did with the current one.
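
For example, a weighted blend of eight pixels' worth of channel data is
basically two widening 8-bit multiplies (illustrative fragment only;
blend_8px is a made-up helper, not anything in pixman):

#include <arm_neon.h>
#include <stdint.h>

static inline uint8x8_t
blend_8px (uint8x8_t a, uint8x8_t b, uint8_t w)    /* w: 0..255 */
{
    /* acc = a * (256 - w) + b * w, built from widening 8-bit multiplies:
     * vmull_u8/vmlal_u8 multiply eight unsigned 8-bit lanes by eight
     * 8-bit weights and widen the products to 16 bits in one go. */
    uint16x8_t acc = vmull_u8 (a, vdup_n_u8 (255 - w));
    acc = vaddw_u8 (acc, a);                 /* a*(255-w) + a = a*(256-w) */
    acc = vmlal_u8 (acc, b, vdup_n_u8 (w));

    /* Narrow back to 8 bits */
    return vshrn_n_u16 (acc, 8);
}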

I also tried to benchmark your change to the bilinear code and got something
like 23% better scaling performance overall on an Intel Core i7. I guess you
benchmarked a 2x performance improvement for that function alone, but not
for the full rendering pipeline, right? It's a good improvement, but not even
close to the performance effect of using SSE2 or NEON (or maybe even
armv5te). So I would consider looking at the supported instruction set on
your target hardware first.

For these experiments, I typically do benchmarks with the 'scaling-bench'
program from:
  http://cgit.freedesktop.org/~siamashka/pixman/log/?h=playground/test-n-bench

-- 
Best regards,
Siarhei Siamashka