I finished adding SSE2 optimizations for the Inverse DWT decoding
routines this evening.
Here are the current performance numbers from my Atom D510 test system:
Without SSE:
                                             |-----------------------|
                  PROFILER                   |    elapsed seconds    |
|--------------------------------------------|-----------------------|
| code section                  | iterations |     total |      avg. |
|-------------------------------|------------|-----------|-----------|
| rfx_decode_rgb                |      57385 | 54.530000 |  0.000950 |
| rfx_decode_component          |     172155 | 42.120000 |  0.000245 |
| rfx_rlgr_decode               |     172155 | 10.560000 |  0.000061 |
| rfx_differential_decode       |     172155 |  0.240000 |  0.000001 |
| rfx_quantization_decode       |     172155 |  3.980000 |  0.000023 |
| rfx_dwt_2d_decode             |     172155 | 26.250000 |  0.000152 |
| rfx_decode_YCbCr_to_RGB       |      57385 | 10.260000 |  0.000179 |
|--------------------------------------------------------------------|
With SSE:
                                             |-----------------------|
                  PROFILER                   |    elapsed seconds    |
|--------------------------------------------|-----------------------|
| code section                  | iterations |     total |      avg. |
|-------------------------------|------------|-----------|-----------|
| rfx_decode_rgb                |      47871 | 20.000000 |  0.000418 |
| rfx_decode_component          |     143613 | 17.010000 |  0.000118 |
| rfx_rlgr_decode               |     143613 | 12.230000 |  0.000085 |
| rfx_differential_decode       |     143613 |  0.150000 |  0.000001 |
| rfx_quantization_decode_SSE2  |     143613 |  0.730000 |  0.000005 |
| rfx_dwt_2d_decode_SSE2        |     143613 |  3.060000 |  0.000021 |
| rfx_decode_YCbCr_to_RGB_SSE2  |      47871 |  1.020000 |  0.000021 |
|--------------------------------------------------------------------|
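As you can see, we are currently getting a little more than 100%
performance gain by using SSE: the average rfx_decode_rgb call drops
from 0.000950 s to 0.000418 s, or about 2.3x faster. It is noticeably
faster and more responsive as well. Looking at just the methods that now
have SSE2 versions (quantization, DWT and YCbCr to RGB), taken together
we are getting > 500% improvement.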
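For anyone who wants to check these figures, here is a small Python sketch that recomputes the speedups from the "avg." columns of the two tables. The dictionary and function names are only for illustration, and the inputs are the rounded averages as printed, so the results are approximate. The three-calls-per-decode weighting is an assumption inferred from the iteration counts (172155 / 57385 = 3).

```python
# Average seconds per call, copied from the "avg." columns of both tables.
WITHOUT_SSE = {"decode_rgb": 0.000950, "quant": 0.000023,
               "dwt": 0.000152, "ycbcr": 0.000179}
WITH_SSE = {"decode_rgb": 0.000418, "quant": 0.000005,
            "dwt": 0.000021, "ycbcr": 0.000021}

# Calls per rfx_decode_rgb call, inferred from the iteration counts
# (assumption: quant and dwt run once per colour component).
CALLS = {"quant": 3, "dwt": 3, "ycbcr": 1}


def speedup(name):
    """How many times faster the SSE build is for one method."""
    return WITHOUT_SSE[name] / WITH_SSE[name]


def combined_speedup():
    """Speedup of the three SSE2 routines together, weighted by call count."""
    before = sum(WITHOUT_SSE[n] * c for n, c in CALLS.items())
    after = sum(WITH_SSE[n] * c for n, c in CALLS.items())
    return before / after


for name in WITH_SSE:
    print(f"{name:10s} {speedup(name):5.2f}x")
print(f"{'sse total':10s} {combined_speedup():5.2f}x")
```

A ratio of 2x corresponds to a 100% gain, so a "> 500% improvement" means a ratio above roughly 6x.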
Weighting each average from the SSE run by how often the method is called
(the per-component methods run three times per rfx_decode_rgb call)
gives this break-down:
61.00% rlgr
0.72% diff
3.59% quant (sse)
15.07% dwt (sse)
5.02% ycbcr (sse)
14.59% other
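The break-down above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the short names and the `breakdown` helper are made up for illustration, the averages are the rounded "avg." values from the SSE table, and "other" is simply whatever is left of the rfx_decode_rgb average once the listed methods are subtracted.

```python
# Average seconds per call from the "With SSE" table.
AVG = {"rlgr": 0.000085, "diff": 0.000001, "quant": 0.000005,
       "dwt": 0.000021, "ycbcr": 0.000021}

# Calls per rfx_decode_rgb call, taken from the iteration counts
# (143613 / 47871 == 3 for the per-component methods).
CALLS = {"rlgr": 3, "diff": 3, "quant": 3, "dwt": 3, "ycbcr": 1}

# Average seconds for one complete rfx_decode_rgb call.
TOTAL = 0.000418


def breakdown():
    """Return each method's share of one rfx_decode_rgb call, in percent."""
    shares = {name: 100.0 * AVG[name] * CALLS[name] / TOTAL for name in AVG}
    # Everything not covered by a profiled method lands in "other".
    shares["other"] = 100.0 - sum(shares.values())
    return shares


for name, pct in breakdown().items():
    print(f"{pct:6.2f}% {name}")
```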
So, the one large remaining non-SSE method (rfx_rlgr_decode) currently
accounts for about 61% (85*3 / 418) of the total RemoteFX processing
time. This method might be hard to optimize using SSE, though, as it
appears to be more stream/logic based than loop/calculation based. It
is definitely worth taking a further look at, however, to see if there
are other optimizations that can be made.
It might also be worth taking a look at the 'other' category. I assume
this includes the final assembly of the RGB data into its output
format. We might still be able to optimize this using SSE.
FYI... I probably won't be able to push updates quite as fast over the
next 2 weeks, as we are at the end of a large project at work that is
requiring extra effort to get across the finish line. I would still
like to see if there is any more performance we can get out of this code
though. If someone on the list has SSE optimization experience, I would
love a code review... particularly around order of operations and cache
usage. We might be able to get another couple of percent improvement
with some very minor changes.
Lastly... I should get my new AMD Zacate-based board tomorrow. Over the
next couple of weeks, I want to take a stab at an alternate
OpenCL-accelerated version of this RemoteFX code as well. Does anyone
else have interest or experience in this type of acceleration?
Thanks,
Steve