> On Tue, Jun 25, 2013 at 3:36 PM, Michael Zolotukhin
> <michael.v.zolotuk...@gmail.com> wrote:
> > Ping.
> >
> > On 20 June 2013 20:56, Michael Zolotukhin
> > <michael.v.zolotuk...@gmail.com> wrote:
> >> It seems that one of the tests needed a small fix.
> >> Attached is a corrected version.
> 
> Jan, do you plan to review this patch? It touches the area that you
> worked on extensively some time ago, so your expert opinion would be
> much appreciated here.

Yes, I looked at the patch in detail this week (I am currently travelling with
sporadic internet access, and my days before leaving were extremely hectic).  The
patch is OK except for the expr.c bits, which I cannot approve myself and which
ought to go in separately anyway.

The reason I took so long to decide on the patch is that I did some experiments
with the SSE loop, and it is hard to find scenarios with the current
alignment/block-size code where it is a win over a library call with the current
implementation (or with Ondra's). So I did not find a block-size/chip combination
where the inline SSE loop would be the default codegen strategy. One case
where the inline loop wins is when the extra register pressure imposed by
the call causes IRA to do a poor job on the surrounding code.
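(For readers outside the thread: the "inline SSE loop" under discussion is of the
kind sketched below. This is an illustrative C-level sketch, not GCC's actual
RTL expansion; it assumes the size is a multiple of 16 and uses unaligned
loads/stores, whereas real codegen also emits alignment prologue/epilogue code.)

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical sketch of an inline SSE copy loop: move 16 bytes per
   iteration with unaligned vector loads and stores.  The point of the
   comparison above is that a plain memcpy() call often beats this once
   the library picks a tuned implementation at runtime. */
static void sse_copy(char *dst, const char *src, size_t size)
{
    for (size_t i = 0; i < size; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), v);
    }
}
```

The upside the thread mentions is precisely that this expansion clobbers no
call-clobbered registers beyond what it uses, so IRA has more freedom in the
surrounding code than around a call.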

I think this can be improved by putting the loop kernels themselves out of line
with custom calling conventions. However, that seems a bit hard to get right
with respect to unwind info.

In general, IMO it makes sense to add the inline codegen for both memcpy and
memset so it stays on the radar for all chips (I tested only Core and recent
AMD chips, not Atoms etc.).

Michael, my apologies for taking so long to decide here. Do you think you can
work on the memset and move_by_pieces/store_by_pieces bits?
I think the by-pieces code is a lot more promising than the actual inline SSE
loops.
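(Again for context: the move_by_pieces/store_by_pieces machinery expands a
small, constant-size copy or clear into a handful of word-sized moves instead
of a loop or a call. A C-level sketch of the effect, with a hypothetical helper
name; GCC does this at RTL expansion time, not in source:)

```c
#include <stdint.h>
#include <string.h>

/* What a by-pieces expansion of memcpy(dst, src, 24) amounts to:
   three 8-byte moves, no loop, no call.  The fixed-size memcpy()
   calls below are the portable way to spell an unaligned word move;
   compilers lower them to single load/store instructions. */
static void copy_24_by_pieces(void *dst, const void *src)
{
    uint64_t a, b, c;
    memcpy(&a, (const char *)src,      8);
    memcpy(&b, (const char *)src + 8,  8);
    memcpy(&c, (const char *)src + 16, 8);
    memcpy((char *)dst,      &a, 8);
    memcpy((char *)dst + 8,  &b, 8);
    memcpy((char *)dst + 16, &c, 8);
}
```

For small known sizes this wins over both a call and an inline loop, since
there is no loop overhead and no call-clobbered registers, which is why the
by-pieces path looks more promising here.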

Honza
> 
> Thanks and best regards,
> Uros.
