Re: [C++] Reducing branching in compute/kernels/vector_selection.cc

2021-07-07 Thread Niranda Perera
I have created a PR for the changes we discussed.
https://github.com/apache/arrow/pull/10679

It would be great if you could take a look. I'm still benchmarking the
results, and I also have some ideas for reducing the branches in the main
while loop of the primitive-types implementation; I will try to add those as
well.

Best


-- 
Niranda Perera
https://niranda.dev/
@n1r44 


Re: [C++] Reducing branching in compute/kernels/vector_selection.cc

2021-06-29 Thread Niranda Perera
Ah, sorry, my mistake! There's a separate `ExecNonNull()` method.


-- 
Niranda Perera
https://niranda.dev/
@n1r44 


Re: [C++] Reducing branching in compute/kernels/vector_selection.cc

2021-06-29 Thread Antoine Pitrou



On 29/06/2021 at 17:58, Niranda Perera wrote:

So, FWIU, in vector selection, the output array would always have a
non-null validity buffer, wouldn't it?


Why?






Re: [C++] Reducing branching in compute/kernels/vector_selection.cc

2021-06-29 Thread Niranda Perera
So, FWIU, in vector selection, the output array would always have a
non-null validity buffer, wouldn't it?



-- 
Niranda Perera
https://niranda.dev/
@n1r44 


Re: [C++] Reducing branching in compute/kernels/vector_selection.cc

2021-06-29 Thread Antoine Pitrou




On 29/06/2021 at 17:49, Niranda Perera wrote:

Hi all,

I'm looking into this now, and I feel like there's a potential memory
corruption at the very end of the out_data_ array.

algo:
bool advance = BitUtil::GetBit(filter_data_, filter_offset_ + in_position);
BitUtil::SetBitTo(out_is_valid_, out_offset_ + out_position_, advance);
out_data_[out_position_] = values_data_[in_position];
out_position_ += advance; // may need static_cast here
++in_position;
say,
in.data =   [1, 2, 3, 4] // all valid
filt.data = [1, 1, 0, 0] // all valid
At in_position = 1, out_position_ becomes 2. And until the end of the loop
(i.e. in_position <= 3), out_data_[2] will be updated (but with
out_is_valid_[2] set to 0). I believe the following would be the output at
the end of the loop:
out.data =  [1, 2, 4]
out.valid = [1, 1, 0]

Wouldn't this be a problem?


Indeed, you have to allocate N + 1 entries for the result array instead 
of N.  Probably not a big deal, IMHO.


Regards

Antoine.


Re: [C++] Reducing branching in compute/kernels/vector_selection.cc

2021-06-29 Thread Niranda Perera
Hi all,

I'm looking into this now, and I feel like there's a potential memory
corruption at the very end of the out_data_ array.

algo:
bool advance = BitUtil::GetBit(filter_data_, filter_offset_ + in_position);
BitUtil::SetBitTo(out_is_valid_, out_offset_ + out_position_, advance);
out_data_[out_position_] = values_data_[in_position];
out_position_ += advance; // may need static_cast here
++in_position;
say,
in.data =   [1, 2, 3, 4] // all valid
filt.data = [1, 1, 0, 0] // all valid
At in_position = 1, out_position_ becomes 2. And until the end of the loop
(i.e. in_position <= 3), out_data_[2] will be updated (but with
out_is_valid_[2] set to 0). I believe the following would be the output at
the end of the loop:
out.data =  [1, 2, 4]
out.valid = [1, 1, 0]

Wouldn't this be a problem?



Re: [C++] Reducing branching in compute/kernels/vector_selection.cc

2021-06-24 Thread Nate Bauernfeind
> Basically, it reset/set the borrow bit in eflag register based on the if
condition, and runs `outpos = outpos - (-1) - borrow_bit`.

That's clever, and I clearly didn't see that!


Re: [C++] Reducing branching in compute/kernels/vector_selection.cc

2021-06-24 Thread Yibo Cai




On 6/25/21 6:58 AM, Nate Bauernfeind wrote:

FYI, the bench was slightly broken; but the results stand.


benchmark::DoNotOptimize(output[rand()]);

Since rand() has a domain of 0 to MAX_INT it blows past the output array
(of length 4k). It segfaults in GCC; I'm not sure why the Clang benchmark
is happy with that.

I modified [1] it to:

benchmark::DoNotOptimize(output[rand() % N]);


The benchmarks run:
The Clang11 -O3 speed up is 2.3x.
The GCC10.2 -O3 speed up is 2.6x.

Interestingly, I added a second branching benchmark. The original branching
does this:
```
if (bitmap[i/8] & (1 << (i%8))) {
  output[outpos++] = input[i];
}
```


My additional branching benchmark pulls the assignment outside of the
branch:
```
output[outpos] = input[i];
if (bitmap[i/8] & (1 << (i%8))) {
  ++outpos;
}
```


From the disassembler, the compiler optimizes this C code into the
branch-less assembly below.

10.06%  sar    %cl,%edx
10.74%  and    $0x1,%edx
 9.93%  cmp    $0x1,%edx
        sbb    $0x,%esi

Basically, it reset/set the borrow bit in eflag register based on the if 
condition, and runs `outpos = outpos - (-1) - borrow_bit`.
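Written out in C++, the conditional increment becomes plain arithmetic on the filter bit, which is roughly the form the compiler lowers to the cmp/sbb sequence above (a hypothetical standalone helper, not Arrow code):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Branch-free compaction: the store is unconditional and the increment is
// computed from the bit itself, so there is no conditional jump to predict.
// Since out_pos <= i at the time of each store, an output buffer of n
// entries is sufficient (trailing stores are simply dead).
size_t CountAndCompact(const uint8_t* bitmap, const int32_t* input,
                       int32_t* output, size_t n) {
  size_t out_pos = 0;
  for (size_t i = 0; i < n; ++i) {
    output[out_pos] = input[i];                 // unconditional store
    out_pos += (bitmap[i / 8] >> (i % 8)) & 1;  // branch-free increment
  }
  return out_pos;
}
```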






Re: [C++] Reducing branching in compute/kernels/vector_selection.cc

2021-06-24 Thread Nate Bauernfeind
FYI, the bench was slightly broken; but the results stand.

> benchmark::DoNotOptimize(output[rand()]);
Since rand() has a range of 0 to RAND_MAX, it blows past the output array
(of length 4k). It segfaults in GCC; I'm not sure why the Clang benchmark
is happy with that.

I modified it [1] to:
> benchmark::DoNotOptimize(output[rand() % N]);

The benchmarks run:
The Clang11 -O3 speed up is 2.3x.
The GCC10.2 -O3 speed up is 2.6x.

Interestingly, I added a second branching benchmark. The original branching
does this:
```
  if (bitmap[i/8] & (1 << (i%8))) {
output[outpos++] = input[i];
  }
```

My additional branching benchmark pulls the assignment outside of the
branch:
```
  output[outpos] = input[i];
  if (bitmap[i/8] & (1 << (i%8))) {
++outpos;
  }
```

The benchmarks run:
The GCC10.2 -O3 speed up compared to the original branching code is 2.3x.
(are you as surprised as me?)
The GCC10.2 -O3 speed up compared to the original non-branching code is
0.9x. (yes; it is slightly slower)

Point: reducing the footprint of a mispredicted branch is a worthwhile
investment when you can't simply get rid of the conditional.

Nate
[1] https://quick-bench.com/q/kDFoF2pOuvPo9aVFufkVJMjWf-g



Re: [C++] Reducing branching in compute/kernels/vector_selection.cc

2021-06-24 Thread Niranda Perera
I created a JIRA for this. I will make the changes in the select kernels and
report back with benchmark results:
https://issues.apache.org/jira/browse/ARROW-13170




-- 
Niranda Perera
https://niranda.dev/
@n1r44 


Re: [C++] Reducing branching in compute/kernels/vector_selection.cc

2021-06-23 Thread Yibo Cai
Did a quick test. For random bitmaps and my trivial test code, the
branch-less code is 3.5x faster than the branchy one.

https://quick-bench.com/q/UD22IIdMgKO9HU1PsPezj05Kkro

On 6/23/21 11:21 PM, Wes McKinney wrote:

One project I was interested in getting to but haven't had the time
was introducing branch-free code into vector_selection.cc and reducing
the use of if-statements to try to improve performance.

One way to do this is to take code that looks like this:

if (BitUtil::GetBit(filter_data_, filter_offset_ + in_position)) {
  BitUtil::SetBit(out_is_valid_, out_offset_ + out_position_);
  out_data_[out_position_++] = values_data_[in_position];
}
++in_position;

and change it to a branch-free version

bool advance = BitUtil::GetBit(filter_data_, filter_offset_ + in_position);
BitUtil::SetBitTo(out_is_valid_, out_offset_ + out_position_, advance);
out_data_[out_position_] = values_data_[in_position];
out_position_ += advance; // may need static_cast here
++in_position;

Since more people are working on kernels and computing now, I thought
this might be an interesting project for someone to explore and see
what improvements are possible (and what the differences between e.g.
x86 and ARM architecture are like when it comes to reducing
branching). Another thing to look at might be batch-at-a-time
bitpacking in the output bitmap versus bit-at-a-time.



Re: [C++] Reducing branching in compute/kernels/vector_selection.cc

2021-06-23 Thread Niranda Perera
Hi Wes,

I am interested in this. In this PR [1] we are exposing BitmapWordReader/
Writer [2] publicly, which may help with the 'batch-at-a-time' scenario.

[1] https://github.com/apache/arrow/pull/10487
[2]
https://github.com/apache/arrow/blob/bcce18e5d4d83f0831de71b363ad91470376084c/cpp/src/arrow/util/bitmap_reader.h#L149-L231
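As a rough illustration of the batch-at-a-time idea (hypothetical code operating on plain 64-bit words, not the BitmapWordReader API itself): filter words that are all ones or all zeros are dispatched in bulk, and only mixed words fall back to per-bit, branch-free copying.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Word-at-a-time filtering sketch. Assumes the input length is a multiple
// of 64; real code would handle the trailing partial word separately.
size_t FilterByWord(const uint64_t* filter_words, const int32_t* values,
                    int32_t* out, size_t n_words) {
  size_t out_pos = 0;
  for (size_t w = 0; w < n_words; ++w) {
    const uint64_t word = filter_words[w];
    const int32_t* in = values + w * 64;
    if (word == ~uint64_t{0}) {
      // All 64 bits set: copy the whole batch with no per-bit work.
      for (int b = 0; b < 64; ++b) out[out_pos++] = in[b];
    } else if (word != 0) {
      // Mixed word: per-bit, branch-free (unconditional store, arithmetic
      // increment). Since out_pos never exceeds the global element index,
      // an `out` buffer of the full input length is sufficient.
      for (int b = 0; b < 64; ++b) {
        out[out_pos] = in[b];
        out_pos += (word >> b) & 1;
      }
    }
    // All-zero word: nothing selected, skip it entirely.
  }
  return out_pos;
}
```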



-- 
Niranda Perera
https://niranda.dev/
@n1r44