Oh, I got your point and tried to emphasize that. I wasn't arguing against you anywhere that I know of. I totally agree with your (a), (b), and (c). I suspect we don't disagree on anything real at all, but are just discussing different aspects. I tried to express praise for your approach (if you happened to need to do this _particular_ thing) and gave a reference link. :-) Please note that nothing I say below should be taken as a specific rebuttal to anything you said.
My point was less about exactly zero vs. non-zero abstraction cost (I even noted my own 5% variation) and more about the cost being very sensitive to compilation strategy, with PGO as a particular example. Looking at the generated C code, I would say this is not so much "Nim language iterator overhead aka 'cost'" as it is "C compiler assembly code generation variability". That's an important distinction, because slowness does not necessarily translate into "work for Nim core to improve anything". Backends vary and could change a bunch of stuff and be different in the next version of gcc/clang/whatever, for example. So Nim kind of needs to target a "broad cross section" of what works well on the various backends, not overfit to one. (Not saying you said otherwise, just backing up the importance of the distinction.)

You mentioned compiling my code on Ryzen, but not using gcc's PGO. PGO is mostly just a very strong version of your (c) hints idea: the compiler first generates extra instrumentation code to measure what actually happens at run time, then recompiles using that profile to make better code generation decisions. (Yeah, a person or some GA picking optimization flags/luck/etc. could still maybe do even better sometimes, as I already alluded to, but at ever-increasing developer cost.) To add a little more color quantitatively, I just did the same PGO experiment with the same version of gcc on a Ryzen 2950X clocked at 3.80 GHz and saw times go from 140 ms down to 85 ms (best of 10 trials for both). That's not quite the same 2x speed-up as in the Intel case. Maybe Ryzen has better CPU predictors or predictor warm-ups, or maybe it gets a bit luckier with the "non-PGO" -O3 yadda-yadda defaults, etc. Whatever the causes, 1.65x is still a lot more than the PGO boosts I usually see of 1.05x to 1.15x (such as with your algo, for example). { Of course, the boost is still not as good as a better algorithm. I never said so, nor meant to imply it. If a better algo is handy and speed matters, then by all means use it! At least some of the time, a better algo will not be as easy to come by as PGO. }

Anyway, I realize it's a tangent from your original interest in this problem, but if you're as serious about (c) as a general guideline as you seem, and if you haven't already, then you should look into PGO sometime. My bet is that for this problem, and for most backend compilers, a PGO Nim would be faster than non-PGO "pure C". Even though such a goal is explicitly _not_ your objective, I think it's still interesting. Nim is one of the very few languages I've tried that in "typical" practice is often "about as efficient as C" (with all the usual disclaimers as to algos, etc.). PGO can also cost very little extra developer time. Once you get a little script set up, you just type `nim-pgo pythag './pythag > /dev/null'` for some `pythag.nim` file instead of `nim c -d:release pythag.nim` (a rough sketch of such a wrapper is at the end of this post). If the program had less output (say, just the sum of the c values in c^2 = a^2 + b^2), the shell IO redirect wouldn't even be there. For this program, it took me almost no time to try PGO. And, of course, YMMV - that 2x effect on the i7-6700k is among the biggest PGO boosts I've ever seen. That seemed noteworthy. So, I noted it to y'all. :-)

In terms of comparing Ryzen timings, running in a VM should not make much difference on this problem, as it is a pure user-space CPU hot loop except for the time queries. That's really a best-case scenario for VMs.
Of course, if your host OS is time-sharing the CPU between your VM instance and some other process, then that could slow things down a lot - arbitrarily much, really, if the host OS is overloaded. It might help to take the minimum time over more trials. You should be able to get 100..200 ms of one core all to yourself at _some_ point. :-) My guess is that you could get that 210 ms down to under 100 ms (not that it really matters for anything practical in this exact case).

A better algo is better, but a lot of people - not just in this forum - are, rightly or wrongly, discussing this problem in the context of "programming language performance", which is the only reason I bothered to write both of these posts. The clang/ldc/rust/etc. numbers in Timothee's very first link were all more closely clustered than the 1.65x of the Ryzen PGO experiment (217 ms / 153 ms = only 1.42x).

TL;DR - A) For this problem, compiler back-end code generation effects dominate Nim front-end overheads (and possibly the front-end overheads of other languages), for which the huge PGO boost is mostly just supporting evidence; and B) PGO - try it someday, it just might make a bigger difference than you expect without much dev work on your part.
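For anyone curious what that `nim-pgo` wrapper might look like, here is a minimal sketch. The gcc flags (`-fprofile-generate`, `-fprofile-use`) and the Nim options (`--passC`, `--passL`) are real; the script name, argument handling, and lack of error checking are just illustrative assumptions, not the exact script I use.

```sh
#!/bin/sh
# Hypothetical PGO wrapper sketch: nim-pgo <name> '<command that exercises the binary>'
# e.g.:  nim-pgo pythag './pythag > /dev/null'
name=$1
run=$2

# 1) Build an instrumented binary; gcc adds profiling counters to it.
nim c -d:release --passC:-fprofile-generate --passL:-fprofile-generate "$name"

# 2) Run it on a representative workload; gcc writes .gcda profile data
#    next to the compiled object files (Nim's nimcache by default).
eval "$run"

# 3) Rebuild, letting gcc use the recorded profile to guide inlining,
#    branch layout, unrolling, and other code generation decisions.
nim c -d:release --passC:-fprofile-use --passL:-fprofile-use "$name"
```

After step 3 the final binary is built just as before, only with the profile feedback folded in; any run that exercises the hot loop works as the training input.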