Re: Peephole optimisation: isWhitespace()

2020-08-15 Thread Nathan Sidwell

On 8/14/20 12:43 PM, Stefan Kanthak wrote:

Hi @ll,

in his ACM queue article ,
Matt Godbolt used the function

| bool isWhitespace(char c)
| {
| return c == ' '
|   || c == '\r'
|   || c == '\n'
|   || c == '\t';
| }

as an example, for which GCC 9.1 emits the following assembly for AMD64
processors (see ):

|xoreax, eax  ; result = false
|cmpdil, 32   ; is c > 32
|ja .L4   ; if so, exit with false
|movabs rax, 4294977024   ; rax = 0x12600
|shrx   rax, rax, rdi ; rax >>= c
|andeax, 1; result = rax & 1
|.L4:
|ret

This code is but not optimal!


What evidence do you have that your alternative sequence performs better?  Have 
you benchmarked it?  (I tried, but your code doesn't assemble)


It is more instructions and cannot speculate past the setnz (As I understand it, 
x86_64 speculates branch instructions, but doesn't speculate cmov -- so 
perversely branches are faster!)



The following equivalent and branchless code works on i386 too,
it needs neither an AMD64 processor nor the SHRX instruction,
which is not available on older processors:




  movecx, edi
  moveax, 2600h; eax = (1 << '\r') | (1 << '\n') | (1 << 
'\t')
  test   cl, cl
  setnz  al; eax |= (c != '\0')
  shreax, cl   ; eax >>= (c % ' ')


^^ operand type mismatch on this instruction


  xoredx, edx
  cmpecx, 33   ; CF = c <= ' '
  adcedx, edx  ; edx = (c <= ' ')
  andeax, edx
  ret


regards
Stefan Kanthak




--
Nathan Sidwell


Re: Peephole optimisation: isWhitespace()

2020-08-16 Thread Stefan Kanthak
"Nathan Sidwell"  wrote:

> On 8/14/20 12:43 PM, Stefan Kanthak wrote:
>> Hi @ll,
>> 
>> in his ACM queue article ,
>> Matt Godbolt used the function
>> 
>> | bool isWhitespace(char c)
>> | {
>> | return c == ' '
>> |   || c == '\r'
>> |   || c == '\n'
>> |   || c == '\t';
>> | }
>> 
>> as an example, for which GCC 9.1 emits the following assembly for AMD64
>> processors (see ):
>> 
>> |xoreax, eax  ; result = false
>> |cmpdil, 32   ; is c > 32
>> |ja .L4   ; if so, exit with false
>> |movabs rax, 4294977024   ; rax = 0x12600
>> |shrx   rax, rax, rdi ; rax >>= c
>> |andeax, 1; result = rax & 1
>> |.L4:
>> |ret
>> 
>> This code is but not optimal!
> 
> What evidence do you have that your alternative sequence performs
> better?

45+ years experience in writing assembly code!

> Have you benchmarked it?

Of course! Did you?
I didn't include the numbers in my initial post since I don't have
a processor which supports BMI2 and thus can't run the original code.
I benchmarked the following equivalent code (input character is in
ECX instead of EDI):

GCC:  x64:   x86:
xoreax, eax   movabs rax, 12600h mov   eax, 2600h
cmpcl, 32test  cl, cl
ja L4setnz al
movabs rax, 12600hshrrax, cl shr   eax, cl
shrrax, clxoredx, edxxor   edx, edx
andeax, 1 cmpecx, 33 cmp   edx, 33
  setb   dl  setb  dl
L4:   andeax, edxand   eax, edx
ret   retret

Left column 1 billion sequential characters
for (int i=10; i; --i) ...(i);
right column 1 billion random characters, in cycles per character:

GCC:   2.43.4
x64:   3.02.5
x86:   4.34.5

> (I tried, but your code doesn't assemble)

Wrong: YOUR code doesn't assemble, mine does! And it works well too.

> It is more instructions

Because I dared to show code for the old(er) i386 alias x86 processor,
not for the AMD64 alias x86_64.
I expect people commenting on assembly to understand (just a little bit)
of it and to get the idea of the code I posted first, then eventually be
able to come up with the following x86_64 code (without a fancy SHRX,
i.e. for ALL AMD64 processors, not just those supporting BMI2):

 movecx, edi
 movabs rax, 12600h   ; rax = (1 << ' ') | (1 << '\r') | (1 << 
'\n') | (1 << '\t')
 shrrax, cl   ; rax >>= c
 xoredi, edi
 cmpecx, 33   ; CF = c <= ' '
.if 0
 adcedi, edi  ; edi = (c <= ' ')
.else
 setb   dil
.endif
 andeax, edi  ; result = rax & (c <= ' ')
 ret


If someone (or GCC) but really insists to (ab)use SHRX:

 xorecx, ecx
 cmpdil, 33   ; is !(c > 32)
 setb   cl
 movabs rax, 4294977024   ; rax = 0x12600
 shrx   rax, rax, rdi ; rax >>= c
 andeax, ecx  ; result = rax & (c <= ' ')
 ret

Now start counting: first the instructions, then the cycles lost
due to wrong branch prediction (hint: 0 for my code).

> and cannot speculate past the setnz

OUCH!
Unfortunately, AMD64 and i386 processors are not THAT retarded.

> (As I understand it, x86_64 speculates branch instructions, but
> doesn't speculate cmov -- so perversely branches are faster!)

There's no CMOV in my code.
What are you trying to demonstrate?

JFTR: how much speculative executed instructions are discarded if
  the branch prediction is wrong?

>> The following equivalent and branchless code works on i386 too,
>> it needs neither an AMD64 processor nor the SHRX instruction,
>> which is not available on older processors:
> 
>> 
>>   movecx, edi
>>   moveax, 2600h; eax = (1 << '\r') | (1 << '\n') | (1 << 
>> '\t')
>>   test   cl, cl
>>   setnz  al; eax |= (c != '\0')
>>   shreax, cl   ; eax >>= (c % ' ')
> 
> ^^ operand type mismatch on this instruction

I recommend to fix YOUR typo!
I bet you wrote "shr eax, c" instead of "shr eax, cl" there.

>>   xoredx, edx
>>   cmpecx, 33   ; CF = c <= ' '
>>   adcedx, edx  ; edx = (c <= ' ')
>>   andeax, edx
>>   ret

Stefan


Re: Peephole optimisation: isWhitespace()

2020-08-16 Thread Nathan Sidwell

On 8/16/20 9:54 AM, Stefan Kanthak wrote:

"Nathan Sidwell"  wrote:



What evidence do you have that your alternative sequence performs
better?


45+ years experience in writing assembly code!






Have you benchmarked it?


Of course! Did you?
I didn't include the numbers in my initial post since I don't have
a processor which supports BMI2 and thus can't run the original code.
I benchmarked the following equivalent code (input character is in
ECX instead of EDI):


you seem very angry about being asked for data.  As I said, I couldn't benchmark 
your code, because of the incorrect assembly.


As some one with 45+years of writing assembly, you'll be aware that processor 
micro architectures have changed dramatically over that time, and one can very 
easily be misled by 'intuition'.


Because I dared to show code for the old(er) i386 alias x86 processor,
not for the AMD64 alias x86_64.


Which I did find bizarre -- if you're targeting an x86_64 ISA, why are you 
writing code for a different processor?


anyway, you've made it clear you do not wish to engage in constructive 
discussion.

BTW, I have come up with a sequence as short as GCC's but without the 
conditional branch.  Sadly the margin is too small to write it.


Good day, sir

nathan

--
Nathan Sidwell


Re: Peephole optimisation: isWhitespace()

2020-08-17 Thread Stefan Kanthak
"Nathan Sidwell" 

> On 8/16/20 9:54 AM, Stefan Kanthak wrote:
>> "Nathan Sidwell"  wrote:

[...]

>>> Have you benchmarked it?
>> 
>> Of course! Did you?

[...]

> you seem very angry about being asked for data.

As much as you hallucinated about CMOV or processors not speculating beyond
SETNZ?

>  As I said, I couldn't benchmark your code, because of the incorrect assembly.

As I wrote, that's your fault: every x86 and its evolution x64 know SHR EAX, CL

[...]

> anyway, you've made it clear you do not wish to engage in constructive 
> discussion.

Use a mirror!

> BTW, I have come up with a sequence as short as GCC's but without the 
> conditional branch.  Sadly the margin is too small to write it.

Reality surely bites!

Stefan


Re: Peephole optimisation: isWhitespace()

2020-08-17 Thread Allan Sandfeld Jensen
On Freitag, 14. August 2020 18:43:12 CEST Stefan Kanthak wrote:
> Hi @ll,
> 
> in his ACM queue article ,
> Matt Godbolt used the function
> 
> | bool isWhitespace(char c)
> | {
> | 
> | return c == ' '
> | 
> |   || c == '\r'
> |   || c == '\n'
> |   || c == '\t';
> | 
> | }
> 
> as an example, for which GCC 9.1 emits the following assembly for AMD64
> 
> processors (see ):
> |xoreax, eax  ; result = false
> |cmpdil, 32   ; is c > 32
> |ja .L4   ; if so, exit with false
> |movabs rax, 4294977024   ; rax = 0x12600
> |shrx   rax, rax, rdi ; rax >>= c
> |andeax, 1; result = rax & 1
> |
> |.L4:
> |ret
> 
No it doesn't. As your example shows if you took the time to read it, it is 
what gcc emit when generating code to run on a _haswell_ architecture. If you 
remove -march=haswell from the command line you get:

xor eax, eax
cmp dil, 32
ja  .L1
movabs  rax, 4294977024
mov ecx, edi
shr rax, cl
and eax, 1

It uses one mov more, but no shrx. 

'Allan




Re: Peephole optimisation: isWhitespace()

2020-08-17 Thread Stefan Kanthak
"Allan Sandfeld Jensen"  wrote:

> On Freitag, 14. August 2020 18:43:12 CEST Stefan Kanthak wrote:
>> Hi @ll,
>> 
>> in his ACM queue article ,
>> Matt Godbolt used the function
>> 
>> | bool isWhitespace(char c)
>> | {
>> | 
>> | return c == ' '
>> | 
>> |   || c == '\r'
>> |   || c == '\n'
>> |   || c == '\t';
>> | 
>> | }
>> 
>> as an example, for which GCC 9.1 emits the following assembly for AMD64
>> 
>> processors (see ):
>> |xoreax, eax  ; result = false
>> |cmpdil, 32   ; is c > 32
>> |ja .L4   ; if so, exit with false
>> |movabs rax, 4294977024   ; rax = 0x12600
>> |shrx   rax, rax, rdi ; rax >>= c
>> |andeax, 1; result = rax & 1
>> |
>> |.L4:
>> |ret
>> 
> No it doesn't. As your example shows if you took the time to read it, it is 
> what gcc emit when generating code to run on a _haswell_ architecture.

Matt's article does NOT specify the architecture for THIS example.
He specified it for another example he named "(q)":

| When targeting the Haswell microarchitecture, GCC 8.2 compiles this code
| to the assembly in (q) (https://godbolt.org/z/acm19_bits):

WHat about CAREFUL reading?

> If you remove -march=haswell from the command line you get:
> 
>xor eax, eax
>cmp dil, 32
>ja  .L1
>movabs  rax, 4294977024
>mov ecx, edi
>shr rax, cl
>and eax, 1
> 
> It uses one mov more, but no shrx. 

The SHRX is NOT the point here; its the avoidable conditional branch that
matters!

 mov ecx, edi
 movabs  rax, 4294977024
 shr rax, cl
 xor edi, edi
 cmp ecx, 33
 setbdil
 and eax, edi

Stefan


Re: Peephole optimisation: isWhitespace()

2020-08-24 Thread Richard Biener via Gcc
On Mon, Aug 17, 2020 at 7:09 PM Stefan Kanthak  wrote:
>
> "Allan Sandfeld Jensen"  wrote:
>
> > On Freitag, 14. August 2020 18:43:12 CEST Stefan Kanthak wrote:
> >> Hi @ll,
> >>
> >> in his ACM queue article ,
> >> Matt Godbolt used the function
> >>
> >> | bool isWhitespace(char c)
> >> | {
> >> |
> >> | return c == ' '
> >> |
> >> |   || c == '\r'
> >> |   || c == '\n'
> >> |   || c == '\t';
> >> |
> >> | }
> >>
> >> as an example, for which GCC 9.1 emits the following assembly for AMD64
> >>
> >> processors (see ):
> >> |xoreax, eax  ; result = false
> >> |cmpdil, 32   ; is c > 32
> >> |ja .L4   ; if so, exit with false
> >> |movabs rax, 4294977024   ; rax = 0x12600
> >> |shrx   rax, rax, rdi ; rax >>= c
> >> |andeax, 1; result = rax & 1
> >> |
> >> |.L4:
> >> |ret
> >>
> > No it doesn't. As your example shows if you took the time to read it, it is
> > what gcc emit when generating code to run on a _haswell_ architecture.
>
> Matt's article does NOT specify the architecture for THIS example.
> He specified it for another example he named "(q)":
>
> | When targeting the Haswell microarchitecture, GCC 8.2 compiles this code
> | to the assembly in (q) (https://godbolt.org/z/acm19_bits):
>
> WHat about CAREFUL reading?
>
> > If you remove -march=haswell from the command line you get:
> >
> >xor eax, eax
> >cmp dil, 32
> >ja  .L1
> >movabs  rax, 4294977024
> >mov ecx, edi
> >shr rax, cl
> >and eax, 1
> >
> > It uses one mov more, but no shrx.
>
> The SHRX is NOT the point here; its the avoidable conditional branch that
> matters!

Whether or not the conditional branch sequence is faster depends on whether
the branch is well-predicted which very much depends on the data you
feed the isWhitespace function with but I guess since this is the
c == ' ' test it _will_ be a well-predicted branch which means the
conditional branch sequence will be usually faster.  The proposed
change turns the control into a data dependence which constrains
instruction scheduling and retirement.  Indeed a mispredicted branch
will likely be more costly.

x86 CPUs do not perform data speculation.

Richard.

>  mov ecx, edi
>  movabs  rax, 4294977024
>  shr rax, cl
>  xor edi, edi
>  cmp ecx, 33
>  setbdil
>  and eax, edi
>
> Stefan


Re: Peephole optimisation: isWhitespace()

2020-08-24 Thread Alexander Monakov via Gcc
On Mon, 24 Aug 2020, Richard Biener via Gcc wrote:

> Whether or not the conditional branch sequence is faster depends on whether
> the branch is well-predicted which very much depends on the data you
> feed the isWhitespace function with but I guess since this is the
> c == ' ' test it _will_ be a well-predicted branch which means the
> conditional branch sequence will be usually faster.  The proposed
> change turns the control into a data dependence which constrains
> instruction scheduling and retirement.  Indeed a mispredicted branch
> will likely be more costly.

There's also the question how the caller is using the return value. In all
likelihood, the caller branches on it, so making isWhitespace branchless
just moves the misprediction cost to the caller.

On x86, we should be aiming to produce the BT instruction. GIMPLE reassoc
nicely transforms multiple branches into a bit test, but unfortunately it
uses right shift, while RTL matches for a left shift, but not right..
With hand-written code it's easy to make GCC produce BT as desired:

void is_ws_cb(unsigned char c, void f(void))
{
unsigned long long mask = 1ll<<' ' | 1<<'\t' | 1<<'\r' | 1<<'\n';
if (c <= 32 && (mask & (1ll<

Re: Peephole optimisation: isWhitespace()

2020-08-24 Thread Stefan Kanthak
"Richard Biener"  wrote:

> On Mon, Aug 17, 2020 at 7:09 PM Stefan Kanthak  
> wrote:
>>
>> "Allan Sandfeld Jensen"  wrote:
>>
>> > On Freitag, 14. August 2020 18:43:12 CEST Stefan Kanthak wrote:
>> >> Hi @ll,
>> >>
>> >> in his ACM queue article ,
>> >> Matt Godbolt used the function
>> >>
>> >> | bool isWhitespace(char c)
>> >> | {
>> >> |
>> >> | return c == ' '
>> >> |
>> >> |   || c == '\r'
>> >> |   || c == '\n'
>> >> |   || c == '\t';
>> >> |
>> >> | }
>> >>
>> >> as an example, for which GCC 9.1 emits the following assembly for AMD64
>> >>
>> >> processors (see ):
>> >> |xoreax, eax  ; result = false
>> >> |cmpdil, 32   ; is c > 32
>> >> |ja .L4   ; if so, exit with false
>> >> |movabs rax, 4294977024   ; rax = 0x12600
>> >> |shrx   rax, rax, rdi ; rax >>= c
>> >> |andeax, 1; result = rax & 1
>> >> |
>> >> |.L4:
>> >> |ret

[...]
 
> Whether or not the conditional branch sequence is faster depends on whether
> the branch is well-predicted which very much depends on the data you
> feed the isWhitespace function with

Correct.

> but I guess since this is the c == ' ' test it _will_ be a well-predicted 
> branch

Also correct, but you miss a point: the typical use case is

while (isWhitespace(*ptr)) ptr++;

> which means the conditional branch sequence will be usually faster.

And this is wrong!
The (well-predicted) branch is usually NOT taken, so both code variants
usually execute (with one exception the same) 6 or 7 instructions.

> The proposed change turns the control into a data dependence which
> constrains instruction scheduling and retirement.

It doesn't matter: the branch has the same data dependency too!

> Indeed a mispredicted branch will likely be more costly.

And no branch is even better: the branch predictor has a limited capacity,
so every removed branch instruction can help improve its efficiency.

> x86 CPUs do not perform data speculation.

>>  mov ecx, edi
>>  movabs  rax, 4294977024
>>  shr rax, cl
>>  xor edi, edi
>>  cmp ecx, 33
>>  setbdil
>>  and eax, edi

I already presented measured numbers: with random data, the branch-free
code is faster, with ordered data the original code.

Left column 1 billion sequential characters
for (int i=10; i; --i) ...(i);
right column 1 billion random characters, in cycles per character:

GCC:   2.43.4
branch-free:   3.02.5

Now perform a linear interpolation and find the break-even point at
p=0.4, with p=0 for ordered data and p=1 for random data, or just use
the average of these numbers: 2.9 cycles vs. 2.75 cycles.
That's small, but measurable!

Stefan


Re: Peephole optimisation: isWhitespace()

2020-08-24 Thread Richard Biener via Gcc
On Mon, Aug 24, 2020 at 1:22 PM Stefan Kanthak  wrote:
>
> "Richard Biener"  wrote:
>
> > On Mon, Aug 17, 2020 at 7:09 PM Stefan Kanthak  
> > wrote:
> >>
> >> "Allan Sandfeld Jensen"  wrote:
> >>
> >> > On Freitag, 14. August 2020 18:43:12 CEST Stefan Kanthak wrote:
> >> >> Hi @ll,
> >> >>
> >> >> in his ACM queue article ,
> >> >> Matt Godbolt used the function
> >> >>
> >> >> | bool isWhitespace(char c)
> >> >> | {
> >> >> |
> >> >> | return c == ' '
> >> >> |
> >> >> |   || c == '\r'
> >> >> |   || c == '\n'
> >> >> |   || c == '\t';
> >> >> |
> >> >> | }
> >> >>
> >> >> as an example, for which GCC 9.1 emits the following assembly for AMD64
> >> >>
> >> >> processors (see ):
> >> >> |xoreax, eax  ; result = false
> >> >> |cmpdil, 32   ; is c > 32
> >> >> |ja .L4   ; if so, exit with false
> >> >> |movabs rax, 4294977024   ; rax = 0x12600
> >> >> |shrx   rax, rax, rdi ; rax >>= c
> >> >> |andeax, 1; result = rax & 1
> >> >> |
> >> >> |.L4:
> >> >> |ret
>
> [...]
>
> > Whether or not the conditional branch sequence is faster depends on whether
> > the branch is well-predicted which very much depends on the data you
> > feed the isWhitespace function with
>
> Correct.
>
> > but I guess since this is the c == ' ' test it _will_ be a well-predicted 
> > branch
>
> Also correct, but you miss a point: the typical use case is
>
> while (isWhitespace(*ptr)) ptr++;
>
> > which means the conditional branch sequence will be usually faster.
>
> And this is wrong!
> The (well-predicted) branch is usually NOT taken, so both code variants
> usually execute (with one exception the same) 6 or 7 instructions.

Whether or not the branch is predicted taken does not matter, what
matters is that the continuation is not data dependent on the branch
target computation and thus can execute in parallel to it.

> > The proposed change turns the control into a data dependence which
> > constrains instruction scheduling and retirement.
>
> It doesn't matter: the branch has the same data dependency too!
>
> > Indeed a mispredicted branch will likely be more costly.
>
> And no branch is even better: the branch predictor has a limited capacity,
> so every removed branch instruction can help improve its efficiency.
>
> > x86 CPUs do not perform data speculation.
>
> >>  mov ecx, edi
> >>  movabs  rax, 4294977024
> >>  shr rax, cl
> >>  xor edi, edi
> >>  cmp ecx, 33
> >>  setbdil
> >>  and eax, edi
>
> I already presented measured numbers: with random data, the branch-free
> code is faster, with ordered data the original code.
>
> Left column 1 billion sequential characters
> for (int i=10; i; --i) ...(i);
> right column 1 billion random characters, in cycles per character:

I guess feeding it Real Text (TM) is the only relevant benchmark,
doing sth like

  for (;;)
 cnt[isWhitespace(*ptr++)]++;

> GCC:   2.43.4
> branch-free:   3.02.5

I'd call that unconclusive data - you also failed to show your test data
is somehow relevant.  We do know that mispredicted branches are bad.
You show well-predicted branches are good.  By simple statistics
singling out 4 out of 255 values will make the branches well-predicted.

> Now perform a linear interpolation and find the break-even point at
> p=0.4, with p=0 for ordered data and p=1 for random data, or just use
> the average of these numbers: 2.9 cycles vs. 2.75 cycles.
> That's small, but measurable!


Re: Peephole optimisation: isWhitespace()

2020-08-24 Thread Stefan Kanthak
"Richard Biener"  wrote:


> On Mon, Aug 24, 2020 at 1:22 PM Stefan Kanthak  
> wrote:
>>
>> "Richard Biener"  wrote:
>>
>> > On Mon, Aug 17, 2020 at 7:09 PM Stefan Kanthak  
>> > wrote:
>> >>
>> >> "Allan Sandfeld Jensen"  wrote:
>> >>
>> >> > On Freitag, 14. August 2020 18:43:12 CEST Stefan Kanthak wrote:

[...]

> Whether or not the branch is predicted taken does not matter, what
> matters is that the continuation is not data dependent on the branch
> target computation and thus can execute in parallel to it.

My benchmark shows that this doesn't matter!

>> > The proposed change turns the control into a data dependence which
>> > constrains instruction scheduling and retirement.
>>
>> It doesn't matter: the branch has the same data dependency too!
>>
>> > Indeed a mispredicted branch will likely be more costly.
>>
>> And no branch is even better: the branch predictor has a limited capacity,
>> so every removed branch instruction can help improve its efficiency.
>>
>> > x86 CPUs do not perform data speculation.
>>
>> >>  mov ecx, edi
>> >>  movabs  rax, 4294977024
>> >>  shr rax, cl
>> >>  xor edi, edi
>> >>  cmp ecx, 33
>> >>  setbdil
>> >>  and eax, edi
>>
>> I already presented measured numbers: with random data, the branch-free
>> code is faster, with ordered data the original code.
>>
>> Left column 1 billion sequential characters
>> for (int i=10; i; --i) ...(i);
>> right column 1 billion random characters, in cycles per character:
> 
> I guess feeding it Real Text (TM) is the only relevant benchmark,
> doing sth like
> 
>  for (;;)
> cnt[isWhitespace(*ptr++)]++;

I approximated that using a PRNG...

>> GCC:   2.43.4
>> branch-free:   3.02.5
> 
> I'd call that unconclusive data - you also failed to show your test data
> is somehow relevant.

Since nobody can predict real world data all test data are irrelevant,
somehow. I thus call your argument a NULL argument.

> We do know that mispredicted branches are bad.
> You show well-predicted branches are good.

Wrong: I show that no branches are still better.

> By simple statistics singling out 4 out of 255 values will make the
> branches well-predicted.

Your statistic is wrong:
1. the branch singles out 224 of 256 values, i.e. 7/8 of all data;
2. whitespace lies in the 1/8 which is not singled out.

>> Now perform a linear interpolation and find the break-even point at
>> p=0.4, with p=0 for ordered data and p=1 for random data, or just use
>> the average of these numbers: 2.9 cycles vs. 2.75 cycles.
>> That's small, but measurable!

Stefan


Re: Peephole optimisation: isWhitespace()

2020-08-25 Thread Stefan Kanthak
I wrote:

> "Richard Biener"  wrote:

[...]
 
>> Whether or not the branch is predicted taken does not matter, what
>> matters is that the continuation is not data dependent on the branch
>> target computation and thus can execute in parallel to it.
> 
> My benchmark shows that this doesn't matter!

>>> I already presented measured numbers: with random data, the branch-free
>>> code is faster, with ordered data the original code.

[...]

>> We do know that mispredicted branches are bad.
>> You show well-predicted branches are good.
> 
> Wrong: I show that no branches are still better.
> 
>> By simple statistics singling out 4 out of 255 values will make the
>> branches well-predicted.
> 
> Your statistic is wrong:
> 1. the branch singles out 224 of 256 values, i.e. 7/8 of all data;
> 2. whitespace lies in the 1/8 which is not singled out.
> 
>>> Now perform a linear interpolation and find the break-even point at
>>> p=0.4, with p=0 for ordered data and p=1 for random data, or just use
>>> the average of these numbers: 2.9 cycles vs. 2.75 cycles.
>>> That's small, but measurable!

I recommend you ALL start to read Daniel Lemire's article


| Even when it is impossible to remove all branches, reducing the number
| of branches "almost always taken" or "almost never taken" may help the
| processor better predict the remaining branches.

JFTR: I didn't know his article before, but I hope that you are willing
  to learn.

Stefan