Oh, I got your point and tried to emphasize that. I wasn't arguing against you anywhere that I know of. I totally agree with your (a), (b), and (c). I suspect we don't disagree on anything real at all, but are just discussing different aspects. I tried to express praise for your approach (if you happened to need to do this _particular_ thing) and gave a reference link. :-) Please note that nothing I say below should be taken as a specific rebuttal to anything you said.
My point was less about exactly zero vs. non-zero abstraction cost (I even noted my own 5% variation) and more about the cost being very sensitive to compilation strategy, with PGO as a particular example. Looking at the generated C code, I would say this is not so much "Nim language iterator overhead aka 'cost'" as it is "C compiler assembly code generation variability". That's an important distinction, because slowness does not necessarily translate into "work for Nim core to improve anything". Backends vary and could change a bunch of stuff and be different in the next version of gcc/clang/whatever, for example. So Nim kind of needs to target a "broad cross section" of what works well on the various backends, not overfit to one. (Not saying you said otherwise, just backing up the importance of the distinction.)

You mentioned compiling my code on Ryzen, but not using gcc's PGO. PGO is mostly just a very strong version of your (c) hints idea: the compiler first generates extra instrumentation code to measure what actually happens at run time, then recompiles using that profile to make better code generation decisions. (Yeah, a person or some GA picking optimization flags/luck/etc. could still maybe do even better sometimes, as I already alluded to, but at ever-increasing developer cost.) To add a little more color quantitatively, I just did the same PGO experiment with the same version of gcc on a Ryzen 2950X clocked at 3.80 GHz and saw times go from 140 ms down to 85 ms (best of 10 trials for both). That's not quite the same 2x speed-up as in the Intel case. Maybe Ryzen has better CPU predictors or predictor warm-ups, or maybe it gets a bit luckier with the "non-PGO" -O3 yadda-yadda defaults, etc. Whatever the causes, 1.65x is still a lot more than the PGO boosts I usually see of 1.05x to 1.15x (such as with your algo, for example). { Of course, the boost is still not as good as a better algorithm. I never said so, nor meant to imply it. If a better algo is handy and speed matters, then by all means use it! At least some of the time, a better algo will not be as easy to come by as PGO. }

Anyway, I realize it's a tangent from your original interest in this problem, but if you're as serious about (c) as a general guideline as you seem, and if you haven't already, then you should look into PGO sometime. My bet is that for this problem, and for most backend compilers, a PGO Nim would be faster than non-PGO "pure C". Even though such a goal is explicitly _not_ your objective, I think it's still interesting. Nim is one of the very few languages I've tried that in "typical" practice is often "about as efficient as C" (with all the usual disclaimers as to algos, etc.). PGO can also cost very little extra developer time. Once you get a little script set up, you just type `nim-pgo pythag './pythag > /dev/null'` for some `pythag.nim` file instead of `nim c -d:release pythag.nim` (a rough sketch of such a wrapper is at the end of this post). If the program had less output (say, just the sum of the c values in c^2 = a^2 + b^2), the shell IO redirect wouldn't even be there. For this program, it took me almost no time to try PGO. And, of course, YMMV - that 2x effect on the i7-6700k is among the biggest PGO boosts I've ever seen. That seemed noteworthy. So, I noted it to y'all. :-)

In terms of comparing Ryzen timings, running in a VM should not make much difference on this problem, as it is a pure user-space CPU hot loop except for the time queries. That's really a best-case scenario for VMs.
Of course, if your host OS is time-sharing the CPU between your VM instance and some other process, then that could slow things down a lot - arbitrarily much, really, if the host OS is overloaded. It might help to take the minimum time over more trials. You should be able to get 100..200 ms of one core all to yourself at _some_ point. :-) My guess is that you could get that 210 ms down to under 100 ms (not that it really matters for anything practical in this exact case).

A better algo is better, but a lot of people - not just in this forum - are, rightly or wrongly, discussing this problem in the context of "programming language performance", which is the only reason I bothered to write both of these posts. The clang/ldc/rust/etc. numbers in Timothee's very first link were all more closely clustered than the 1.65x of the Ryzen PGO experiment (217 ms / 153 ms = only 1.42x).

TL;DR - A) For this problem, compiler back-end code generation effects dominate Nim front-end overheads (and possibly the front-end overheads of other languages), for which the huge PGO boost is mostly just supporting evidence; and B) PGO - try it someday, it just might make a bigger difference than you expect without much dev work on your part.
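For anyone curious what that `nim-pgo` wrapper might look like, here is a minimal sketch. The gcc flags (`-fprofile-generate`, `-fprofile-use`) and the Nim options (`--passC`, `--passL`) are real; the script name, argument handling, and lack of error checking are just illustrative assumptions, not the exact script I use.

```sh
#!/bin/sh
# Hypothetical PGO wrapper sketch: nim-pgo <name> '<command that exercises the binary>'
# e.g.:  nim-pgo pythag './pythag > /dev/null'
name=$1
run=$2

# 1) Build an instrumented binary; gcc adds profiling counters to it.
nim c -d:release --passC:-fprofile-generate --passL:-fprofile-generate "$name"

# 2) Run it on a representative workload; gcc writes .gcda profile data
#    next to the compiled object files (Nim's nimcache by default).
eval "$run"

# 3) Rebuild, letting gcc use the recorded profile to guide inlining,
#    branch layout, unrolling, and other code generation decisions.
nim c -d:release --passC:-fprofile-use --passL:-fprofile-use "$name"
```

After step 3 the final binary is built just as before, only with the profile feedback folded in; any run that exercises the hot loop works as the training input.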