Did a quick test. For random bitmaps and my trivial test code, the branch-less code is 3.5x faster than branch one.
https://quick-bench.com/q/UD22IIdMgKO9HU1PsPezj05Kkro

On 6/23/21 11:21 PM, Wes McKinney wrote:
One project I was interested in getting to but haven't had the time
was introducing branch-free code into vector_selection.cc and reducing
the use of if-statements to try to improve performance.

One way to do this is to take code that looks like this:

if (BitUtil::GetBit(filter_data_, filter_offset_ + in_position)) {
   BitUtil::SetBit(out_is_valid_, out_offset_ + out_position_);
   out_data_[out_position_++] = values_data_[in_position];
}
++in_position;

and change it to a branch-free version

bool advance = BitUtil::GetBit(filter_data_, filter_offset_ + in_position);
BitUtil::SetBitTo(out_is_valid_, out_offset_ + out_position_, advance);
out_data_[out_position_] = values_data_[in_position];
out_position_ += advance; // may need static_cast<int> here
++in_position;

Since more people are working on kernels and computing now, I thought
this might be an interesting project for someone to explore and see
what improvements are possible (and what the differences between e.g.
x86 and ARM architecture are like when it comes to reducing
branching). Another thing to look at might be batch-at-a-time
bitpacking in the output bitmap versus bit-at-a-time.

Reply via email to