On Sun, Jun 30, 2013 at 11:06:47AM +0200, Jan Hubicka wrote:
> > On Tue, Jun 25, 2013 at 3:36 PM, Michael Zolotukhin
> > <michael.v.zolotuk...@gmail.com> wrote:
> > > Ping.
> > >
> > > On 20 June 2013 20:56, Michael Zolotukhin
> > > <michael.v.zolotuk...@gmail.com> wrote:
> > >> It seems that one of the tests needed a small fix.
> > >> Attached is a corrected version.
> >
> > Jan, do you plan to review this patch? It touches the area that you
> > worked on extensively some time ago, so your expert opinion would be
> > much appreciated here.
>
> Yes, I looked at the patch in detail this week (I am currently traveling
> with sporadic internet access, and my days before leaving were extremely
> hectic). The patch is OK except for the expr.c bits, which I cannot
> approve myself and which ought to go in separately anyway.
>
> The reason I took so long to decide on the patch is that I did some
> experiments with the SSE loop, and it is hard to find scenarios with the
> current alignment/block-size code where it is a win over a library call
> on the current implementation (or with Ondra's). So I did not find a
> block-size/chip combination where the current inline SSE loop would be
> the default codegen strategy. One of the cases where the inline loop wins
> is when the extra register pressure imposed by the call causes IRA to do
> a poor job on the surrounding code.

That does not measure real-world issues like icache pressure etc. I would
like more definitive data. One possibility is to compile gcc with various
options and repeatedly run it to compile a simple project for a day or so.
I am interested in up to which size inlining code makes sense; my guess is
that beyond 64 bytes, and with no FP registers, inlining is unproductive.

I tried this approach to see how LD_PRELOADing memcpy affects performance,
and it becomes measurable only after about a day: the performance impact is
small (in my case about 0.5%), so you really need that long to converge.

Ondra