Just for everyone's information, here's the updated benchmark code on
the same Phenom. The REP MOVSQ code is indeed much faster.
vendor_id : AuthenticAMD
cpu family : 16
model : 2
model name : AMD Phenom(tm) 9850 Quad-Core Processor
stepping: 3
microcode :
2 6:58 PM
> To: Ma, Ling
> Cc: Konrad Rzeszutek Wilk; mi...@elte.hu; h...@zytor.com;
> t...@linutronix.de; linux-kernel@vger.kernel.org; i...@google.com;
> George Spelvin
> Subject: Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging
> instruction sequence and saving r
On Fri, Oct 12, 2012 at 08:04:11PM +0200, Borislav Petkov wrote:
> Right, so benchmark shows around 20% speedup on Bulldozer but this is
> a microbenchmark and before we pursue this further, we need to verify
> whether this brings any palpable speedup with a real benchmark, I
> don't know, kernbench,
On Fri, Oct 12, 2012 at 05:02:57PM -0400, George Spelvin wrote:
> Here are some Phenom results for that benchmark. The average time
> increases from 700 to 760 cycles (+8.6%).
I was afraid something like that would show up.
Btw, in looking at this more and IINM, we use the REP MOVSQ version on
A
Here are some Phenom results for that benchmark. The average time
increases from 700 to 760 cycles (+8.6%).
vendor_id : AuthenticAMD
cpu family : 16
model : 2
model name : AMD Phenom(tm) 9850 Quad-Core Processor
stepping: 3
microcode : 0x183
cpu MHz
On Fri, Oct 12, 2012 at 09:07:43AM +, Ma, Ling wrote:
> > > > So is that also true for AMD CPUs?
> > > Although Bulldozer put 32byte instruction into decoupled 16byte entry
> > > buffers, it still decode 4 instructions per cycle, so 4 instructions
> > > will be fed into execution unit and
> > >
On Fri, Oct 12, 2012 at 02:54:54PM +, Ma, Ling wrote:
> > If you can't test the CPUs who run this code I think it's safer if you
> > add a new variant for Atom, not change the existing well tested code.
> > Otherwise you risk performance regressions on these older CPUs.
>
> I found one older m
> If you can't test the CPUs who run this code I think it's safer if you
> add a new variant for Atom, not change the existing well tested code.
> Otherwise you risk performance regressions on these older CPUs.
I found one older machine, and tested the code on it, the results between them
are alm
> I tested new and original version on core2, the patch improved performance
> about 9%,
That's not useful because core2 doesn't use this variant, it uses the
rep string variant. Primary user is P4.
> Although core2 is out-of-order pipeline and weaken instruction sequence
> requirement,
> beca
> > > So is that also true for AMD CPUs?
> > Although Bulldozer put 32byte instruction into decoupled 16byte entry
> > buffers, it still decode 4 instructions per cycle, so 4 instructions
> > will be fed into execution unit and
> > 2 loads ,1 write will be issued per cycle.
>
> I'd be very interes
On Fri, Oct 12, 2012 at 03:37:50AM +, Ma, Ling wrote:
> > > Load and write operation occupy about 35% and 10% respectively for
> > > most industry benchmarks. Fetched 16-aligned bytes code include about
> > > 4 instructions, implying 1.34(0.35 * 4) load, 0.4 write.
> > > Modern CPU support 2 lo
> > Load and write operation occupy about 35% and 10% respectively for
> > most industry benchmarks. Fetched 16-aligned bytes code include about
> > 4 instructions, implying 1.34(0.35 * 4) load, 0.4 write.
> > Modern CPU support 2 load and 1 write per cycle, so throughput from
> > write is bottlene
> > Load and write operation occupy about 35% and 10% respectively for
> > most industry benchmarks. Fetched 16-aligned bytes code include about
> > 4 instructions, implying 1.34(0.35 * 4) load, 0.4 write.
> > Modern CPU support 2 load and 1 write per cycle, so throughput from
> > write is bottlene
On Thu, Oct 11, 2012 at 08:29:08PM +0800, ling...@intel.com wrote:
> From: Ma Ling
>
> Load and write operation occupy about 35% and 10% respectively
> for most industry benchmarks. Fetched 16-aligned bytes code include
> about 4 instructions, implying 1.34(0.35 * 4) load, 0.4 write.
> Modern
ling...@intel.com writes:
> From: Ma Ling
>
> Load and write operation occupy about 35% and 10% respectively
> for most industry benchmarks. Fetched 16-aligned bytes code include
> about 4 instructions, implying 1.34(0.35 * 4) load, 0.4 write.
> Modern CPU support 2 load and 1 write per cycle,
From: Ma Ling
Load and write operation occupy about 35% and 10% respectively
for most industry benchmarks. Fetched 16-aligned bytes code include
about 4 instructions, implying 1.34(0.35 * 4) load, 0.4 write.
Modern CPU support 2 load and 1 write per cycle, so throughput from write is
bottlenec
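
As a quick check of the arithmetic in the patch description above, the
per-fetch estimate can be reproduced in a few lines of Python. This is only
a sketch: the function name and constant names are illustrative and do not
come from the patch. The 35% / 10% shares and the 4-instruction fetch window
are the only figures taken from the message. Note that 0.35 * 4 works out to
1.4, not the 1.34 quoted.

```python
# Figures quoted in the patch description.
LOAD_FRACTION = 0.35         # share of instructions that are loads
STORE_FRACTION = 0.10        # share of instructions that are writes
INSTRUCTIONS_PER_FETCH = 4   # approx. instructions in a 16-byte aligned fetch


def ops_per_fetch(fraction, instructions=INSTRUCTIONS_PER_FETCH):
    """Expected number of operations of one kind in a single fetch window.

    Hypothetical helper, named here only for illustration.
    """
    return fraction * instructions


loads = ops_per_fetch(LOAD_FRACTION)
stores = ops_per_fetch(STORE_FRACTION)

print(f"expected loads per fetch:  {loads:.2f}")
print(f"expected writes per fetch: {stores:.2f}")
```

The same helper can be called with a different window size, e.g.
ops_per_fetch(LOAD_FRACTION, 3), to see how the estimate scales if fewer
instructions fit in the fetched bytes.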