Don wrote:
Adam D. Ruppe wrote:
On Tue, Apr 13, 2010 at 11:10:24AM -0400, Clemens wrote:
That's strange. Looking at src/backend/cod4.c, function cdbscan, in the dmd sources, bsr seems to be implemented in terms of the bsr opcode [1] (which I guess is the reason it's an intrinsic in the first place). I would have expected this to be much, much faster than a user function. Anyone care enough to check the generated assembly?

The opcode is fairly slow anyway (as far as opcodes go) - odds are the
implementation inside the processor is similar to Jerome's method, and
the main savings come from it loading fewer bytes into the pipeline.

I remember a line from a blog, IIRC it was the author of the C++ FQA
writing it, saying hardware and software are pretty much the same thing -
moving an instruction to hardware doesn't mean it will be any faster,
since it is the same algorithm, just done in processor microcode instead of
user opcodes.

It's fast on Intel, slow on AMD. I bet the speed difference comes from inlining max().

Specifically, bsr is 7 uops on AMD, 1 uop on Intel since the original Pentium. AMD's performance is shameful.

And bsr() is supported in the compiler; in fact DMC uses it extensively, which is why it's included in DMD!
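
For anyone who wants to compare against the opcode, here is a minimal sketch of a software stand-in for bsr that binary-searches for the highest set bit with shifts. It is not Jerome's code and not what dmd generates; the name soft_bsr and the choice to return 0 for a zero input are assumptions made for illustration (the real opcode leaves the result undefined in that case).

```c
#include <stdint.h>

/* Hypothetical helper: index (0..31) of the highest set bit in v.
   For v == 0 this returns 0, whereas the bsr opcode leaves the
   destination undefined, so callers should check for zero first. */
unsigned soft_bsr(uint32_t v)
{
    unsigned r = 0;
    /* Halve the search window at each step: 16, 8, 4, 2, 1 bits. */
    if (v & 0xFFFF0000u) { v >>= 16; r += 16; }
    if (v & 0x0000FF00u) { v >>= 8;  r += 8;  }
    if (v & 0x000000F0u) { v >>= 4;  r += 4;  }
    if (v & 0x0000000Cu) { v >>= 2;  r += 2;  }
    if (v & 0x00000002u) { r += 1; }
    return r;
}
```

That is five test-and-shift steps for a 32-bit value, so timing it against the intrinsic on both an Intel and an AMD box would show whether the opcode or the inlining accounts for the difference.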
