I finished adding SSE2 optimizations for the Inverse DWT decoding
routines this evening.
Here are the current performance numbers from my Atom D510 test system:
Without SSE:
|---|
PROFILER |
On 6/14/2011 10:02 PM, Marc-André Moreau wrote:
> Ah, so we either add Kernel.framework as a dependency on Mac OS X, or
> we wrap a call to the cpuid instruction
>
> Any preference?
>
I have no preference. Are you going to make the change, or do you want
me (or someone else) to work on it?
> By
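For anyone following the cpuid discussion, here is a minimal sketch of what "wrapping a call to the cpuid instruction" could look like. It assumes GCC or Clang inline assembly on x86; `example_cpuid` and `example_has_sse2` are made-up names for illustration and are not taken from the FreeRDP tree. The SSE2 flag is bit 26 of EDX for CPUID leaf 1.

```c
#include <stdint.h>

/* Hypothetical helper: execute CPUID for the given leaf (x86 only).
 * regs[] receives EAX, EBX, ECX, EDX in that order. On other
 * architectures every register is reported as zero. */
static void example_cpuid(uint32_t leaf, uint32_t regs[4])
{
#if defined(__x86_64__) || defined(__i386__)
    /* Note: very old GCC releases reject the EBX output in 32-bit PIC code. */
    __asm__ volatile("cpuid"
                     : "=a"(regs[0]), "=b"(regs[1]), "=c"(regs[2]), "=d"(regs[3])
                     : "a"(leaf), "c"(0));
#else
    (void)leaf;
    regs[0] = regs[1] = regs[2] = regs[3] = 0;
#endif
}

/* Hypothetical helper: returns 1 if the CPU reports SSE2, otherwise 0. */
static int example_has_sse2(void)
{
    uint32_t regs[4];

    example_cpuid(0, regs);      /* leaf 0: EAX holds the highest basic leaf */
    if (regs[0] < 1)
        return 0;

    example_cpuid(1, regs);      /* leaf 1: feature flags */
    return (int)((regs[3] >> 26) & 1u);   /* EDX bit 26 = SSE2 */
}
```

A decoder could call something like `example_has_sse2()` once at start-up and pick the SSE2 or the plain C routine based on the result, which avoids depending on either cpuid.h or Kernel.framework.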
Marc,
On 6/14/2011 7:01 PM, Marc-André Moreau wrote:
> Hi Steve,
>
> I noticed the addition of cpuid.h, which is not found on Mac OS X. Is
> there a more portable alternative for detecting SSE support level?
> Can't the cpuinfo instruction be used for this?
That's weird. I was under the assump
1:09 AM, S. Erisman wrote:
> Hey Vic,
>
> On 6/10/2011 12:32 AM, Vic Lee wrote:
>> Hi Steve,
>>
>> Yes, both are faster, but the SSE version is still quite a bit slower than
>> the original one. Here are my test results.
>>
>> Before pulling:
>> | rfx_dec
On 6/10/2011 10:59 AM, S. Erisman wrote:
The _mm_* functions _do_ indeed get compiled down to SSE assembly
instructions.
For reference... Here is what the non-SSE code compiles down to:
rfx_decode_YCbCr_to_RGB():
   0:   55                      push   %ebp
   1:   31 d2
Vic,
On 6/10/2011 9:36 AM, Vic Lee wrote:
> That's quite strange because it processes 8 coefficients in parallel
> and shouldn't be slower.
>
I agree. At this point I have no idea how it can still be slower, but
it is. Granted this is my first time writing SSE code, and for all I
know, I am d
Vic,
On 6/10/2011 4:16 AM, Martin Fleisz wrote:
I am not quite sure how those _mm_* functions work internally, but if
they are really functions, it will definitely hurt performance. I
think using the SSE2 assembly instructions directly (like paddw) should
be much better.
Vic
The _mm_* funct
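To make the intrinsics-versus-assembly point concrete, here is a small sketch (not the librfx code) that adds two arrays of 16-bit coefficients eight at a time. `example_add_coeffs` is a hypothetical name. With GCC or Clang and SSE2 enabled, `_mm_add_epi16` is expected to be inlined as a single `paddw` rather than emitted as a function call, and a scalar loop covers the tail and non-SSE2 builds.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for n 16-bit coefficients. */
static void example_add_coeffs(int16_t *dst, const int16_t *a,
                               const int16_t *b, size_t n)
{
    size_t i = 0;

#if defined(__SSE2__)
    /* Eight 16-bit lanes per 128-bit register. Unaligned loads and
     * stores are used so the caller's buffers need no special alignment. */
    for (; i + 8 <= n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi16(va, vb));
    }
#endif

    /* Scalar tail (and the whole array when SSE2 is not available). */
    for (; i < n; i++)
        dst[i] = (int16_t)(a[i] + b[i]);
}
```

Compiling this with `-O2 -msse2 -S` and looking at the generated assembly is a quick way to confirm whether the intrinsic really became an inline instruction on a given compiler.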
Hey Vic,
On 6/10/2011 12:32 AM, Vic Lee wrote:
> Hi Steve,
>
> Yes, both are faster, but the SSE version is still quite a bit slower than
> the original one. Here are my test results.
>
> Before pulling:
> | rfx_decode_YCbCr_to_RGB_SSE2 | 2123 | 1.75 | 0.000824 |
> | rfx_decode_YCbCr_to_RGB |
On 6/9/2011 10:04 PM, S. Erisman wrote:
> Vic,
>
> On 6/9/2011 10:05 PM, Vic Lee wrote:
>> Hi Steve,
>>
>> The RemoteFX algorithm does not specify the minimum required bits, but
>> according to a forum post in MSDN, MS's implementation uses 16-bit signed
>>
Vic,
On 6/9/2011 10:05 PM, Vic Lee wrote:
> Hi Steve,
>
> The RemoteFX algorithm does not specify the minimum required bits, but
> according to a forum post in MSDN, MS's implementation uses 16-bit signed
> integers, so I believe it should be enough.
>
Thanks for the response. I actually found my
Martin,
On 6/9/2011 7:09 AM, Martin Fleisz wrote:
One thing that will definitely hurt performance is if our memory is
not 16-byte aligned. We should also make it possible to override the
memory allocation in rfx_pool so that it uses _mm_malloc/_mm_free and
returns correctly aligned buffers.
We should a
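As a rough illustration of the alignment suggestion, the sketch below wraps `_mm_malloc`/`_mm_free` behind a pair of allocation helpers. The `example_*` names are hypothetical and this is not the actual rfx_pool interface; on builds without SSE2 it falls back to C11 `aligned_alloc`, which is an assumption about the toolchain.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* pulls in _mm_malloc / _mm_free on GCC and Clang */
#endif

/* Hypothetical helper: allocate a buffer aligned to 16 bytes. */
static void *example_aligned_alloc(size_t size)
{
#if defined(__SSE2__)
    return _mm_malloc(size, 16);
#else
    /* aligned_alloc wants a size that is a multiple of the alignment. */
    return aligned_alloc(16, (size + 15) & ~(size_t)15);
#endif
}

/* Hypothetical helper: release a buffer from example_aligned_alloc(). */
static void example_aligned_free(void *ptr)
{
#if defined(__SSE2__)
    _mm_free(ptr);
#else
    free(ptr);
#endif
}

/* Returns 1 if ptr is safe for aligned 128-bit loads and stores. */
static int example_is_aligned16(const void *ptr)
{
    return ((uintptr_t)ptr & 15u) == 0;
}
```

With buffers allocated this way, the aligned load and store intrinsics (`_mm_load_si128`, `_mm_store_si128`) can be used instead of the unaligned variants. Memory from `_mm_malloc` must be released with `_mm_free`, not `free`.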
Marc,
I took your suggestions into account, revised my earlier patch, and
committed my changes to a new fork:
https://github.com/serisman/FreeRDP
... more comments below ...
On 6/7/2011 9:29 PM, Marc-André Moreau wrote:
> Hi Steve,
>
> Well, that was fast :) I had started thinking of the d
Marc,
On 6/7/2011 11:35 PM, Marc-André Moreau wrote:
> Hi Steve,
>
> I just tried your patch - awesome!
>
Thanks. That was the first SSE code I have ever written and it ended up
being pretty easy. Once we have high level agreement on the structure
needed around these optimizations there is def
Vic,
On 6/7/2011 7:18 PM, Vic Lee wrote:
> Hi Steve,
>
> It looks like this might not affect only fullscreen toggling
> (depending on the window manager, I guess it might happen in other
> cases as well). This patch should fix it more properly.
>
> diff --git a/X11/xf_decode.c b/X11/xf_deco
Marc,
On 6/6/2011 9:20 AM, Marc-André Moreau wrote:
I read more about SSE, and then about NEON, which is the equivalent for
ARM.
My first impression is damn, how could I not see this before? This
thing looks very well suited not only for acceleration of RemoteFX
decoding, but there's a chance
Marc,
I vote to merge your github fork. I tried it out last night, and it
seems pretty stable.
Could you also include a fix for the fullscreen toggle (while using
RemoteFX) issue that I sent to the list the other day? It should be as
simple as clearing or resetting the clip region at the b
On 6/5/2011 9:50 PM, Otavio Salvador wrote:
> On Mon, Jun 6, 2011 at 02:25, S. Erisman wrote:
>> I tried out your RemoteFX code over the weekend, and it works very nicely.
> It didn't work for me. How did you configure the Windows Server to use
> it? What worries me is that t
On 5/25/2011 8:42 PM, Vic Lee wrote:
> Hi,
>
> I have finally completed the RemoteFX software decoding feature. It's
> written as a separate and relatively independent library, librfx. I only
> added it to xfreerdp, but the library is portable, so there shouldn't be
> a problem using it in other UIs.
>
> I
On 3/2/2011 8:13 AM, Marc-André Moreau wrote:
>
> By regular hardware, do you mean hardware that does not include the
> special RemoteFX chip? Adding RemoteFX support without the chip means
> implementing the codec in software, and that is way too much for a
> student to do in a summer. I have l