[ 
https://issues.apache.org/jira/browse/ARROW-9842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17310428#comment-17310428
 ] 

Yibo Cai edited comment on ARROW-9842 at 3/29/21, 7:16 AM:
-----------------------------------------------------------

Tested on skylake, clang-9, -msse4.2.

Updated POC patch to evaluate best chunk size to accumulate bits before packing.
[^movemask-in-chunks.diff]

To my surprise, big chunks actually hurts performance (tested with 
arrow-compute-scalar-compare-benchmark). Chunk size 16 gives ~3G items/sec, 
while chunk size 256 gives ~2G.

My theory is for big chunk size, cpu has to stall for memory loading. For small 
chunk size, cpu can interleave memory loading and latter computation (packing 
bytes to bits). I see IPC (instruction per cycle) drops from 2.4 to 2.2, when 
chunk size increases from 16 to 64.

Disassembler shows for big chunk size, clang disables vectorization of the hot 
loop calling generator repeatedly. At first glance it looks strange, but it 
turns out compiler did quite good job. Forcing compiler to vectorize the 
generator hot loop actually results to worse performance. I think the reason is 
what I listed above. I see IPC drops to 2.0. Compiler recognises the pattern 
and generates code try to use cpu pipelines more effectively.

Just for the record, I simplified the code and pasted to godbolt. The generated 
code looks much better than in arrow.
https://godbolt.org/z/ETW3vMvqY


was (Author: yibocai):
Tested on skylake, clang-9, -msse4.2.

Updated POC patch to evaluate best chunk size to accumulate bits before packing.
[^movemask-in-chunks.diff]

To my surprise, big chunks actually hurts performance (tested with 
arrow-compute-scalar-compare-benchmark). Chunk size 16 gives ~3G items/sec, 
while chunk size 256 gives ~2G.

My theory is for big chunk size, cpu has to stall for memory loading. For small 
chunk size, cpu can interleave memory loading and latter computation (packing 
bytes to bits). I see IPC (instruction per cycle) drops from 2.4 to 2.2, when 
chunk size increases from 16 to 64.

Disassembler shows for big chunk size, clang disables vectorization of the hot 
loop calling generator repeatedly. At first glance it looks strange, but it 
turns out compiler did quite good job. Forcing compiler to vectorize the 
generator hot loop actually results to worse performance. I think the reason is 
what I listed above. I see IPC drops to 2.0. Compiler recognises the pattern 
and generates code try to use cpu pipelines more effectively.

> [C++] Explore alternative strategy for Compare kernel implementation for 
> better performance
> -------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9842
>                 URL: https://issues.apache.org/jira/browse/ARROW-9842
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 5.0.0
>
>         Attachments: movemask-in-chunks.diff, movemask.patch
>
>
> The compiler may be able to vectorize comparison options if the bitpacking of 
> results is deferred until the end (or in chunks). Instead, a temporary 
> bytemap can be populated on a chunk-by-chunk basis and then the bytemaps can 
> be bitpacked into the output buffer. This may also reduce the code size of 
> the compare kernels (which are actually quite large at the moment)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to