[ https://issues.apache.org/jira/browse/ARROW-9842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17310428#comment-17310428 ]

Yibo Cai commented on ARROW-9842:
---------------------------------

Updated the POC patch to evaluate the best chunk size for accumulating bytes before packing them into bits.
 [^movemask-in-chunks.diff]

To my surprise, big chunks actually hurt performance (tested with 
arrow-compute-scalar-compare-benchmark). Chunk size 16 gives ~3G items/sec, 
while chunk size 256 gives only ~2G.
My theory is that with a big chunk size, the CPU has to stall on memory loads. 
With a small chunk size, the CPU can interleave memory loads with the 
subsequent computation (packing bytes into bits). I see IPC (instructions per 
cycle) drop from 2.4 to 2.2 when the chunk size increases from 16 to 64.
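For reference, a minimal scalar sketch of the chunked strategy under discussion (hypothetical names; the real patch uses SIMD movemask where this sketch packs bits with a plain loop): compare a chunk of values into a temporary bytemap, which the compiler can auto-vectorize, then bitpack that chunk into the output bitmap before moving on.

```cpp
#include <cstddef>
#include <cstdint>

// Sketch only: kChunkSize = 16 matched the best throughput above.
constexpr size_t kChunkSize = 16;

// Compares left[i] > scalar for i in [0, n) and sets bit i of
// out_bitmap accordingly. out_bitmap must be zero-initialized and
// large enough to hold n bits.
void CompareGreaterChunked(const int32_t* left, int32_t scalar,
                           size_t n, uint8_t* out_bitmap) {
  uint8_t bytemap[kChunkSize];
  size_t i = 0;
  for (; i + kChunkSize <= n; i += kChunkSize) {
    // Step 1: branch-free byte-per-result compare; vectorizes well.
    for (size_t j = 0; j < kChunkSize; ++j) {
      bytemap[j] = left[i + j] > scalar;
    }
    // Step 2: pack the chunk's bytes into bits (scalar stand-in for
    // a SIMD movemask).
    for (size_t j = 0; j < kChunkSize; ++j) {
      out_bitmap[(i + j) / 8] |=
          static_cast<uint8_t>(bytemap[j] << ((i + j) % 8));
    }
  }
  // Tail: handle the remaining elements one by one.
  for (; i < n; ++i) {
    out_bitmap[i / 8] |=
        static_cast<uint8_t>((left[i] > scalar) ? 1u : 0u) << (i % 8);
  }
}
```

Keeping the chunk small means the bytemap stays in registers or L1, which is consistent with the interleaving theory above.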

> [C++] Explore alternative strategy for Compare kernel implementation for 
> better performance
> -------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9842
>                 URL: https://issues.apache.org/jira/browse/ARROW-9842
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 5.0.0
>
>         Attachments: movemask-in-chunks.diff, movemask.patch
>
>
> The compiler may be able to vectorize comparison options if the bitpacking of 
> results is deferred until the end (or in chunks). Instead, a temporary 
> bytemap can be populated on a chunk-by-chunk basis and then the bytemaps can 
> be bitpacked into the output buffer. This may also reduce the code size of 
> the compare kernels (which are actually quite large at the moment).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
