Re: [PATCH] [x86]: Compiler Option Os is better on latest x86

2013-01-29 Thread Ingo Molnar

* valdis.kletni...@vt.edu  wrote:

> On Fri, 25 Jan 2013 09:11:01 -0500, ling.ma.prog...@gmail.com said:
> 
> > Based on above reasons, we compiled linux kernel 3.6.9 with O2 and Os
> > respectively. The results show Os improve performance netperf 4.8%,
> > 2.7% for volano as below
> 
> Am I allowed to NAK this?  What the numbers given so far 
> *actually* show is 4.8% more instructions executed, *not* 4.8% 
> better performance.

Cycles and elapsed time are down in both tests - the speedup 
seems statistically a wash in the first test and significant for 
the second workload.
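
To put rough numbers on the "statistically a wash" reading for netperf, here is a minimal sketch in Python. It takes the cycle counts and elapsed times from the perf stat output in the original posting and compares the O2-to-Os change against the reported run-to-run spread. Summing the two spreads as a noise bound is a crude assumption made for illustration, not a proper significance test, and the variable and helper names are hypothetical. The Os elapsed-time spread does not appear in the posting, so only the O2 spread is used there.

```python
# Mean values and "+-" spreads (percent) copied from the perf stat output
# for netperf, 3 runs each. Names are illustrative only.
cycles_o2, cycles_o2_spread = 13_166_254_384, 0.18
cycles_os, cycles_os_spread = 13_142_706_297, 0.23
elapsed_o2, elapsed_o2_spread = 10.007215371, 0.03
elapsed_os = 10.004859867  # no spread is shown for this run


def change_pct(base, new):
    """Relative change from base to new, in percent."""
    return (new - base) / base * 100.0


cycles_change = change_pct(cycles_o2, cycles_os)
elapsed_change = change_pct(elapsed_o2, elapsed_os)

# Crude noise bound (assumption): add up the two reported spreads.
cycles_noise = cycles_o2_spread + cycles_os_spread

print(f"cycles:  {cycles_change:+.3f}% (noise bound {cycles_noise:.2f}%)")
print(f"elapsed: {elapsed_change:+.3f}% (O2 spread {elapsed_o2_spread:.2f}%)")
print("cycles change within noise:", abs(cycles_change) <= cycles_noise)
```

Both changes are smaller than the spreads printed by perf, which is what "a wash" means here.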

the instruction count might be an artifact of byte wise versus 
word wise REP; MOV.

> I'm having a *very* hard time convincing myself that what 
> we're seeing isn't simply the expected behavior of loops *not* 
> being unrolled and similar non-optimizations done by -Os, so 
> more instructions get executed to do the same amount of work.
> 
> Rather than "run for 10 seconds and count instructions", can 
> we "run for 50,000 syscalls and count clock time" or similar 
> that shows an *actual* improvement?

Look at the numbers: they count a whole lot of other things as 
well beyond instructions - elapsed time being the most important 
one.

But more numbers never hurt.

Thanks,

Ingo


Re: [PATCH] [x86]: Compiler Option Os is better on latest x86

2013-01-28 Thread Valdis . Kletnieks
On Fri, 25 Jan 2013 09:11:01 -0500, ling.ma.prog...@gmail.com said:

> Based on above reasons, we compiled linux kernel 3.6.9 with O2 and Os
> respectively. The results show Os improve performance netperf 4.8%,
> 2.7% for volano as below

Am I allowed to NAK this?  What the numbers given so far *actually*
show is 4.8% more instructions executed, *not* 4.8% better performance.

I'm having a *very* hard time convincing myself that what we're seeing isn't
simply the expected behavior of loops *not* being unrolled and similar
non-optimizations done by -Os, so more instructions get executed to do the same
amount of work.

Rather than "run for 10 seconds and count instructions", can we
"run for 50,000 syscalls and count clock time" or similar that shows
an *actual* improvement?
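
As an illustration of the "fixed amount of work, measure clock time" approach asked for above, here is a minimal sketch in Python. It is not a kernel benchmark: the choice of getppid() as the system call, the function name, and the default count are assumptions made for the example, and interpreter overhead will dominate the measurement. The point is only the shape of the test: the work is fixed and the elapsed time is the result.

```python
import os
import time


def time_syscalls(n=50_000):
    """Return wall-clock seconds taken to issue n getppid() system calls."""
    start = time.perf_counter()
    for _ in range(n):
        os.getppid()  # cheap system call, chosen here only as an example
    return time.perf_counter() - start


elapsed = time_syscalls()
print(f"50,000 getppid() calls took {elapsed:.6f} s")
```

Running this on kernels built with O2 and with Os, several times each, would give directly comparable elapsed times rather than instruction counts over a fixed interval.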





[PATCH] [x86]: Compiler Option Os is better on latest x86

2013-01-25 Thread ling . ma . program
From: Ma Ling 

  Currently we use O2 as the compiler option for better performance,
although it enlarges code size. On modern CPUs, larger instruction
and unified caches and sophisticated instruction prefetch reduce the cost of
instruction cache misses. Meanwhile, flags such as -falign-functions,
-falign-jumps, -falign-loops and -falign-labels help improve CPU front-end
throughput, because the CPU fetches instructions in 16-byte aligned code
blocks per cycle.

  In order to save power and get higher performance, Sandy Bridge
introduced a decoded cache: instructions are kept in it
after the decode stage. When the CPU refetches an instruction, the decoded
cache can provide a 32-byte aligned instruction block instead of 16 bytes
from the I-cache, and the shorter pipeline reduces the branch miss penalty.
This requires that as much hot code as possible be placed in the decoded
cache. Sandy Bridge, Ivy Bridge and Haswell all implement this feature, so
Os (optimize for size) should be better than O2 on them.

Based on the above reasons, we compiled Linux kernel 3.6.9 with O2 and Os
respectively. The results show that Os improves performance by 4.8% for
netperf and by 2.7% for volano, as shown below.

O2 + netperf
Performance counter stats for 'netperf' (3 runs):

       5416.157986 task-clock                #    0.541 CPUs utilized             ( +-  0.19% )
           348,249 context-switches          #    0.064 M/sec                     ( +-  0.17% )
                 0 CPU-migrations            #    0.000 M/sec                     ( +-  0.00% )
               353 page-faults               #    0.000 M/sec                     ( +-  0.16% )
    13,166,254,384 cycles                    #    2.431 GHz                       ( +-  0.18% )
     8,827,499,807 stalled-cycles-frontend   #   67.05% frontend cycles idle      ( +-  0.29% )
     5,951,234,060 stalled-cycles-backend    #   45.20% backend  cycles idle      ( +-  0.44% )
     8,122,481,914 instructions              #     0.62  insns per cycle
                                             #     1.09  stalled cycles per insn  ( +-  0.17% )
     1,415,864,138 branches                  #  261.415 M/sec                     ( +-  0.17% )
        16,975,308 branch-misses             #    1.20% of all branches           ( +-  0.61% )

      10.007215371 seconds time elapsed                                           ( +-  0.03% )

Os + netperf

Performance counter stats for 'netperf' (3 runs):

       5395.386704 task-clock                #    0.539 CPUs utilized             ( +-  0.14% )
           345,880 context-switches          #    0.064 M/sec                     ( +-  0.25% )
                 0 CPU-migrations            #    0.000 M/sec                     ( +-  0.00% )
               354 page-faults               #    0.000 M/sec                     ( +-  0.00% )
    13,142,706,297 cycles                    #    2.436 GHz                       ( +-  0.23% )
     8,379,382,641 stalled-cycles-frontend   #   63.76% frontend cycles idle      ( +-  0.50% )
     5,513,722,219 stalled-cycles-backend    #   41.95% backend  cycles idle      ( +-  0.71% )
     8,554,202,795 instructions              #     0.65  insns per cycle
                                             #     0.98  stalled cycles per insn  ( +-  0.25% )
     1,530,020,505 branches                  #  283.579 M/sec                     ( +-  0.25% )
        17,710,406 branch-misses             #    1.16% of all branches           ( +-  1.00% )

      10.004859867 seconds time elapsed

Over the same run time (10.004859867 seconds), IPC is 0.65 with Os versus
0.62 with O2, so Os improved performance by 4.8%.
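
For reference, the 4.8% figure can be reproduced from the two-decimal IPC values that perf prints. The Python sketch below recomputes it from the raw netperf counters above and also shows how the instruction and cycle counts changed individually. It is only arithmetic on the posted numbers; the variable names are illustrative, and IPC on its own does not say how much work was completed.

```python
# Raw counters copied from the perf stat output for netperf.
insns_o2, cycles_o2 = 8_122_481_914, 13_166_254_384
insns_os, cycles_os = 8_554_202_795, 13_142_706_297

ipc_o2 = insns_o2 / cycles_o2
ipc_os = insns_os / cycles_os

# Gain computed from the two-decimal IPC values, as perf prints them.
gain_rounded = (round(ipc_os, 2) / round(ipc_o2, 2) - 1) * 100
# Gain computed from the unrounded IPC values.
gain_raw = (ipc_os / ipc_o2 - 1) * 100

# How each counter moved on its own, in percent.
insn_growth = (insns_os / insns_o2 - 1) * 100
cycle_change = (cycles_os / cycles_o2 - 1) * 100

print(f"IPC O2 = {ipc_o2:.4f}, IPC Os = {ipc_os:.4f}")
print(f"IPC gain from rounded values: {gain_rounded:.1f}%")
print(f"IPC gain from raw counters:   {gain_raw:.1f}%")
print(f"instructions: {insn_growth:+.1f}%, cycles: {cycle_change:+.2f}%")
```

The IPC difference comes almost entirely from the higher instruction count with Os, since the cycle count is nearly unchanged between the two builds.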

O2 + volano
Performance counter stats for './loopclient.sh openjdk' (3 runs):

     210627.115313 task-clock                #    0.781 CPUs utilized             ( +-  0.92% )
        13,812,610 context-switches          #    0.066 M/sec                     ( +-  0.17% )
         2,352,755 CPU-migrations            #    0.011 M/sec                     ( +-  0.84% )
           208,333 page-faults               #    0.001 M/sec                     ( +-  1.58% )
   525,627,073,405 cycles                    #    2.496 GHz                       ( +-  0.96% )
   428,177,571,365 stalled-cycles-frontend   #   81.46% frontend cycles idle      ( +-  1.09% )
   370,885,224,739 stalled-cycles-backend    #   70.56% backend  cycles idle      ( +-  1.18% )
   187,662,577,544 instructions              #     0.36  insns per cycle
                                             #     2.28  stalled cycles per insn  ( +-  0.31% )
    35,684,976,425 branches                  #  169.423 M/sec                     ( +-  0.45% )
     1,062,086,942 branch-misses             #    2.98% of all branches           ( +-  0.08% )

     269.764578435 seconds time elapsed
 
Os + volano
Performance counter stats for './loopclient.sh openjdk' (3 runs):

     209545.786941 task-clock                #    0.778 CPUs utilized             ( +-  0.66% )
13,864,142 context-switches