On Saturday, June 18, 2016 at 12:21:21 AM UTC+2, gordo...@gmail.com wrote:
On Friday, June 17, 2016 at 4:48:06 PM UTC+2, gordo...@gmail.com wrote: 
> > I am not on a high end Intel CPU now, but when I was I found that with a 
> > buffer size adjusted to the L1 cache size (8192 32-bit words or 32 
> > Kilobytes) that eratspeed ran on an Intel 2700K @ 3.5 GHz at about 3.5 
> > clock cycles per loop (about 405,000,000 loops for this range). 

> > My current AMD Bulldozer CPU has a severe cache bottleneck and can't come 
> > close to this speed by a factor of about two. 

> Which Bulldozer version do you have: original Bulldozer, Piledriver, 
> Steamroller/Kaveri or Excavator/Carrizo? 

One of the originals, an FX8120 (4 cores, 8 processors) @ 3.1 GHz. 

Your CPU is bdver1. My CPU is bdver3.

By the clock frequency, I assume you were running these tests on a high end 
Intel CPU? 

The bdver3 CPU has some optimizations compared to bdver1, but I don't know 
whether this affects the eratspeed code. Is your IPC lower than 2.80 when you 
compile https://play.golang.org/p/Sd6qlMQcHF with Go1.7-tip and run "perf stat 
--repeat=10 -- ./eratspeed-go1.7tip"?

I am on Windows 7 64-bit, so I don't have perf, which is specific to Linux.

However, we can easily estimate the figure you request as follows:

1) The number of instructions is likely to be the same on your machine and 
mine, as we are both using the same source code with the same compiler version 
and the same compiler settings.

2) My run time, adjusted to your clock speed, is about a third longer than 
yours; therefore my instruction rate will be about 75% of yours, or roughly 
2.1 instructions per cycle.

3) The difference is almost certainly mostly backend stall time, which for 
your machine is 25%, or 150 ms of your roughly 600 ms run time; my run time at 
your clock frequency would be about 200 ms more, for a total of about 350 ms 
of stall out of 800 ms, so for my machine the backend stall time will be about 
43.75%.
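
As a quick sanity check on 2) and 3), here is a minimal Go sketch of that 
back-of-the-envelope arithmetic; the 2.80 IPC is the figure you quoted, the 
~600 ms run is only implied by 150 ms being 25% of it, and the 200 ms is my 
extra run time scaled to your clock, so these are rough thread numbers, not 
measurements:

package main

import "fmt"

// Rough check of points 2) and 3); all inputs are the approximate
// figures discussed in this thread, not measured values.
func main() {
    const (
        yourIPC     = 2.80               // IPC you reported from perf stat
        yourStallMs = 150.0              // backend stall, 25% of your run
        yourRunMs   = yourStallMs / 0.25 // ~600 ms, implied by the 25%
        extraMs     = 200.0              // my extra run time at your clock
    )
    myRunMs := yourRunMs + extraMs         // ~800 ms
    myIPC := yourIPC * yourRunMs / myRunMs // same instructions, longer run
    myStallPct := (yourStallMs + extraMs) / myRunMs * 100

    fmt.Printf("estimated IPC on bdver1:           %.2f\n", myIPC)        // ~2.10
    fmt.Printf("estimated backend stall on bdver1: %.2f%%\n", myStallPct) // 43.75%
}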

4) The reason bdver1 is so bad is that it has something called a Write 
Combining Cache (WCC) to take care of Bulldozer's write-through L1 data cache, 
which is fine and good; the problem is that the WCC is only 4 Kilobytes in 
size, which means that for a loop like this, which mostly does writes 
(backend), the effective L1 cache size is only 4 Kilobytes instead of 16 
Kilobytes, causing many misses and high cache latency.
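
To put a number on 4), here is a minimal sketch (not the actual eratspeed 
code, just the arithmetic) of how much smaller the culling buffer has to be 
to fit in the cache the writes can actually use:

package main

import "fmt"

// wordsThatFit returns how many 32-bit sieve words fit in cacheBytes of
// effective cache; eratspeed uses an 8192-word buffer for a 32 KB L1.
func wordsThatFit(cacheBytes int) int {
    return cacheBytes / 4
}

func main() {
    fmt.Println("i7-2700K, 32 KB L1D:", wordsThatFit(32*1024), "words") // 8192
    fmt.Println("bdver1, 4 KB WCC:   ", wordsThatFit(4*1024), "words")  // 1024
}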

5) This is even worse for multi-threading, as the WCC is shared between the 
two processors per core; thus the effective size per processor is only 2 
Kilobytes, and much of the advantage of having so many processors is wasted.

6) Later bd versions use something other than the WCC (AFAIK undisclosed in 
the literature) which doesn't have as bad a bottleneck, although it is still 
there, as in your bdver3.

7) Intel processors do not have this problem, as their L1 data cache is not 
write-through (although they have other problems for multi-processing).

8) Without this problem, our CPUs' execution times for eratspeed would be 
about 75% of the current time for yours and about 56% for mine (1 - 0.25 and 
1 - 0.4375 respectively), though not quite that good, as there would still be 
a small amount of latency from instructions being scheduled too close 
together.

9) We AMD loyalists can only wait and hope for AMD Zen, due toward the end of 
this year, which is said to adopt much of the Intel design philosophy but do 
it better (as I am sure AMD hopes, too).

The other interesting data point, from your comparison of the Clang C results 
and these golang results for eratspeed, is that **on your machine** the Clang 
C version runs in about 75% of the time of the golang version, i.e. golang is 
about 33% slower.  Part of that gain is Clang minimizing the number of (more 
complex) instructions, but part of it is also reducing backend latency by 
reorganizing the instructions.  I don't have Clang installed on my machine, 
but I do have Haskell, which can use the LLVM backend that Clang uses and 
produces comparable instruction rates.  Golang eratspeed runs about 20% slower 
than that on my machine, but that is of course with the severe handicap of the 
high backend stalling.  The difference will get much bigger in percentage 
terms on more efficient CPUs.
