On 27/01/2019 17:26, Richard Henderson wrote:
> On 1/27/19 7:19 AM, Mark Cave-Ayland wrote:
>> Could this make the loop slower? I certainly haven't noticed any obvious
>> performance difference during testing (OS X uses merge quite a bit for
>> display rendering), and I'd hope that with a good compiler and modern branch
>> prediction then any effect here would be negligible.
>
> I would expect the i < n/2 loop to be faster, because the assignments are
> unconditional. FWIW.
Do you have any idea as to how much faster? Is it something that would show up
as significant within the context of QEMU?

As well as eliminating the HI_IDX/LO_IDX constants, I do find the updated
version much easier to read, so I would prefer to keep it if possible. What
about splitting the loop into 2 separate ones, e.g.
    for (i = 0; i < ARRAY_SIZE(r->element); i += 2) {
        result.access(i) = a->access(i >> 1);
    }

    for (i = 1; i < ARRAY_SIZE(r->element); i += 2) {
        result.access(i) = b->access(i >> 1);
    }
Would you expect this to perform better than the version proposed in the
patchset?
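
To make the comparison concrete, here is a minimal standalone sketch of the
two shapes being discussed: your i < n/2 loop with unconditional assignments,
and the two separate loops from above. It is only an illustration: N, the
plain uint8_t arrays and both function names are hypothetical stand-ins for
the real element accessors, and it says nothing about which variant is faster,
only that both produce the same interleaved result.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical element count, e.g. 16 bytes in a 128-bit vector. */
#define N 16

/* Single loop over half the elements; both stores are unconditional. */
static void merge_half_loop(uint8_t *r, const uint8_t *a, const uint8_t *b)
{
    uint8_t result[N];
    size_t i;

    for (i = 0; i < N / 2; i++) {
        result[i * 2] = a[i];
        result[i * 2 + 1] = b[i];
    }
    /* Copy via a temporary so that r may alias a or b. */
    memcpy(r, result, sizeof(result));
}

/* Two separate loops: even slots come from a, odd slots from b. */
static void merge_split_loops(uint8_t *r, const uint8_t *a, const uint8_t *b)
{
    uint8_t result[N];
    size_t i;

    for (i = 0; i < N; i += 2) {
        result[i] = a[i >> 1];
    }
    for (i = 1; i < N; i += 2) {
        result[i] = b[i >> 1];
    }
    memcpy(r, result, sizeof(result));
}
```

Both functions write a[0], b[0], a[1], b[1], ... into r, so any difference
between them would be in the generated code rather than the result, and would
need to be measured.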
ATB,
Mark.