On Sun, Jun 30, 2013 at 11:06:47AM +0200, Jan Hubicka wrote:
> > On Tue, Jun 25, 2013 at 3:36 PM, Michael Zolotukhin
> > <michael.v.zolotuk...@gmail.com> wrote:
> > > Ping.
> > >
> > > On 20 June 2013 20:56, Michael Zolotukhin
> > > <michael.v.zolotuk...@gmail.com> wrote:
> > >> It seems that one of the tests needed a small fix.
> > >> Attached is a corrected version.
> > 
> > Jan, do you plan to review this patch? It touches the area that you
> > worked on extensively some time ago, so your expert opinion would be
> > much appreciated here.
> 
> Yes, I looked at the patch in detail this week (I am currently travelling
> with sporadic internet access, and my days before leaving were extremely
> hectic).  The patch is OK except for the expr.c bits, which I cannot
> approve myself and which ought to go in separately anyway.
> 
> The reason I took so long to decide on the patch is that I did some
> experiments with the SSE loop, and it is hard to find scenarios with the
> current alignment/block-size code where it is a win over a library call on
> the current implementation (or with Ondra's).  So I did not find a block
> size/chip combination where the current inline SSE loop would be the
> default codegen strategy.  One case where the inline loop wins is when the
> extra register pressure imposed by the call causes IRA to do a poor job on
> the surrounding code.
>
It does not measure real-world issues like icache pressure etc.  I would
like more definitive data.  One possibility is to compile gcc with various
options and repeatedly run it to compile a simple project for a day or so.

I am interested in up to which size inlining the copy code makes sense; my
guess is that beyond 64 bytes, with no FP registers available, inlining is
unproductive.

I tried this to see how LD_PRELOADing memcpy affects performance, and the
effect is measurable after about a day.  The performance impact is small; in
my case it was about 0.5%, so you really need that long for the measurement
to converge.

Ondra
