> On Tue, Jun 25, 2013 at 3:36 PM, Michael Zolotukhin
> <michael.v.zolotuk...@gmail.com> wrote:
> > Ping.
> >
> > On 20 June 2013 20:56, Michael Zolotukhin
> > <michael.v.zolotuk...@gmail.com> wrote:
> >> It seems that one of the tests needed a small fix.
> >> Attached is a corrected version.
>
> Jan, do you plan to review this patch? It touches the area that you
> worked on extensively some time ago, so your expert opinion would be
> much appreciated here.
Yes, I looked at the patch in detail this week (I am currently on a
trip with sporadic internet access, and my days before leaving were
extremely hectic). The patch is OK except for the expr.c bits, which I
cannot approve myself and which ought to go in separately anyway.

The reason I took so long to decide on the patch is that I did some
experiments with the SSE loop, and it is hard to find scenarios with
the current alignment/block-size code where it is a win over a library
call on the current implementation (or with Ondra's). So I did not
find a block size/chip combination where the current inline SSE loop
would be the default codegen strategy.

One case where the inline loop wins is when the extra register
pressure imposed by the call causes IRA to do a poor job on the
surrounding code. I think this can be improved by putting the loop
kernels themselves offline with custom calling conventions, though
that seems a bit hard to get right with unwind info.

In general, IMO it makes sense to add the inline codegen for both
memcpy and memset so it stays on the radar on all chips (I tested only
Core and recent AMD chips, not Atoms etc.).

Michael, my apologies for taking so long to decide here. Do you think
you can work on the memset and move_by_pieces/store_by_pieces bits? I
think the by-pieces code is a lot more promising than the actual
inline SSE loops.

Honza

> Thanks and best regards,
> Uros.