Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-14 Thread George Spelvin
Just for everyone's information, here's the updated benchmark code on the same Phenom. The REP MOVSQ code is indeed much faster.
vendor_id : AuthenticAMD
cpu family : 16
model : 2
model name : AMD Phenom(tm) 9850 Quad-Core Processor
stepping : 3
microcode :
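For readers unfamiliar with the two approaches being compared in this thread, here is a minimal user-space sketch of a page copy done with REP MOVSQ next to a plain unrolled loop of 64-bit moves. It is not the kernel's copy_page and not the benchmark code from this message; the names copy_page_rep, copy_page_unrolled and PAGE_BYTES are hypothetical, a 4 KiB page is assumed, and the inline assembly assumes GCC or Clang on x86-64 (other targets fall back to memcpy).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES 4096 /* assumption: 4 KiB pages */

/* Copy one page with the string-move instruction on x86-64.
 * Hypothetical helper, not the kernel's copy_page. */
void copy_page_rep(void *to, const void *from)
{
    assert(to != NULL && from != NULL);
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    size_t qwords = PAGE_BYTES / 8;
    /* RDI = destination, RSI = source, RCX = count of 8-byte words.
     * The direction flag is clear on function entry per the ABI. */
    __asm__ volatile("rep movsq"
                     : "+D"(to), "+S"(from), "+c"(qwords)
                     :
                     : "memory");
#else
    memcpy(to, from, PAGE_BYTES); /* portable fallback */
#endif
}

/* Plain-C stand-in for an unrolled copy: four 64-bit moves per
 * iteration. Both buffers must be 8-byte aligned. */
void copy_page_unrolled(void *to, const void *from)
{
    uint64_t *d = to;
    const uint64_t *s = from;

    assert(to != NULL && from != NULL);
    for (size_t i = 0; i < PAGE_BYTES / 8; i += 4) {
        d[i + 0] = s[i + 0];
        d[i + 1] = s[i + 1];
        d[i + 2] = s[i + 2];
        d[i + 3] = s[i + 3];
    }
}
```

Which variant is faster depends on the microarchitecture, which is the point under discussion in the thread; the sketch only shows that both produce the same copy.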

RE: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-14 Thread Ma, Ling
2 6:58 PM
> To: Ma, Ling
> Cc: Konrad Rzeszutek Wilk; mi...@elte.hu; h...@zytor.com;
> t...@linutronix.de; linux-kernel@vger.kernel.org; i...@google.com;
> George Spelvin
> Subject: Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging
> instruction sequence and saving r

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-14 Thread Borislav Petkov
On Fri, Oct 12, 2012 at 08:04:11PM +0200, Borislav Petkov wrote:
> Right, so benchmark shows around 20% speedup on Bulldozer but this is
> a microbenchmark and before pursue this further, we need to verify
> whether this brings any palpable speedup with a real benchmark, I
> don't know, kernbench,

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-12 Thread Borislav Petkov
On Fri, Oct 12, 2012 at 05:02:57PM -0400, George Spelvin wrote:
> Here are some Phenom results for that benchmark. The average time
> increases from 700 to 760 cycles (+8.6%).
I was afraid something like that would show up. Btw, in looking at this more and IINM, we use the REP MOVSQ version on A

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-12 Thread George Spelvin
Here are some Phenom results for that benchmark. The average time increases from 700 to 760 cycles (+8.6%).
vendor_id : AuthenticAMD
cpu family : 16
model : 2
model name : AMD Phenom(tm) 9850 Quad-Core Processor
stepping : 3
microcode : 0x183
cpu MHz
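The quoted slowdown can be checked from the two cycle counts alone. A minimal sketch, where percent_change is a hypothetical helper name and only the 700 and 760 cycle figures come from the message:

```c
#include <assert.h>

/* Relative change from an old to a new measurement, in percent.
 * Hypothetical helper for checking the figure quoted above. */
double percent_change(double before, double after)
{
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}
```

For 700 and 760 cycles this gives (760 - 700) / 700 * 100, roughly 8.57, which rounds to the +8.6% reported.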

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-12 Thread Borislav Petkov
On Fri, Oct 12, 2012 at 09:07:43AM +, Ma, Ling wrote:
> > > > So is that also true for AMD CPUs?
> > > Although Bulldozer put 32byte instruction into decoupled 16byte entry
> > > buffers, it still decode 4 instructions per cycle, so 4 instructions
> > > will be fed into execution unit and
> > >

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-12 Thread Andi Kleen
On Fri, Oct 12, 2012 at 02:54:54PM +, Ma, Ling wrote:
> > If you can't test the CPUs who run this code I think it's safer if you
> > add a new variant for Atom, not change the existing well tested code.
> > Otherwise you risk performance regressions on these older CPUs.
> >
> I found one older m

RE: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-12 Thread Ma, Ling
> If you can't test the CPUs who run this code I think it's safer if you
> add a new variant for Atom, not change the existing well tested code.
> Otherwise you risk performance regressions on these older CPUs.
I found one older machine, and tested the code on it, the results between them are alm

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-12 Thread Andi Kleen
> I tested new and original version on core2, the patch improved performance
> about 9%,
That's not useful because core2 doesn't use this variant, it uses the rep string variant. Primary user is P4.
> Although core2 is out-of-order pipeline and weaken instruction sequence
> requirement,
> beca

RE: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-12 Thread Ma, Ling
> > > So is that also true for AMD CPUs?
> > Although Bulldozer put 32byte instruction into decoupled 16byte entry
> > buffers, it still decode 4 instructions per cycle, so 4 instructions
> > will be fed into execution unit and
> > 2 loads ,1 write will be issued per cycle.
> >
> I'd be very interes

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-11 Thread Borislav Petkov
On Fri, Oct 12, 2012 at 03:37:50AM +, Ma, Ling wrote:
> > > Load and write operation occupy about 35% and 10% respectively for
> > > most industry benchmarks. Fetched 16-aligned bytes code include about
> > > 4 instructions, implying 1.34(0.35 * 4) load, 0.4 write.
> > > Modern CPU support 2 lo

RE: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-11 Thread Ma, Ling
> > Load and write operation occupy about 35% and 10% respectively for
> > most industry benchmarks. Fetched 16-aligned bytes code include about
> > 4 instructions, implying 1.34(0.35 * 4) load, 0.4 write.
> > Modern CPU support 2 load and 1 write per cycle, so throughput from
> > write is bottlene

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-11 Thread Konrad Rzeszutek Wilk
On Thu, Oct 11, 2012 at 08:29:08PM +0800, ling...@intel.com wrote:
> From: Ma Ling
>
> Load and write operation occupy about 35% and 10% respectively
> for most industry benchmarks. Fetched 16-aligned bytes code include
> about 4 instructions, implying 1.34(0.35 * 4) load, 0.4 write.
> Modern

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-11 Thread Andi Kleen
ling...@intel.com writes:
> From: Ma Ling
>
> Load and write operation occupy about 35% and 10% respectively
> for most industry benchmarks. Fetched 16-aligned bytes code include
> about 4 instructions, implying 1.34(0.35 * 4) load, 0.4 write.
> Modern CPU support 2 load and 1 write per cycle,

[PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-10 Thread ling . ma
From: Ma Ling
Load and write operation occupy about 35% and 10% respectively for most industry benchmarks. Fetched 16-aligned bytes code include about 4 instructions, implying 1.34(0.35 * 4) load, 0.4 write. Modern CPU support 2 load and 1 write per cycle, so throughput from write is bottlenec
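The per-fetch estimate in the description can be reproduced with a small sketch. The 35% and 10% instruction mix and the figure of about 4 instructions per 16-byte fetch are taken from the description; the function name ops_per_fetch is hypothetical. Note that 0.35 * 4 evaluates to 1.4, not the 1.34 written in the description, while 0.10 * 4 does give the stated 0.4.

```c
#include <assert.h>

/* Expected number of operations of one kind in a single fetch window,
 * given the fraction of instructions of that kind and the number of
 * instructions per window. Hypothetical helper for illustration only. */
double ops_per_fetch(double fraction, int insns_per_fetch)
{
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(insns_per_fetch > 0);
    return fraction * insns_per_fetch;
}
```

With these inputs, ops_per_fetch(0.35, 4) is the average number of loads per fetch window and ops_per_fetch(0.10, 4) the average number of writes, the two quantities the description goes on to compare against the 2 loads and 1 write per cycle it attributes to modern CPUs.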