[ https://issues.apache.org/jira/browse/ARROW-9842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17310428#comment-17310428 ]
Yibo Cai commented on ARROW-9842:
---------------------------------

Updated POC patch to evaluate the best chunk size for accumulating bits before packing. [^movemask-in-chunks.diff]

To my surprise, big chunks actually hurt performance (tested with arrow-compute-scalar-compare-benchmark). Chunk size 16 gives ~3G items/sec, while chunk size 256 gives ~2G. My theory is that with a big chunk size, the CPU has to stall waiting for memory loads, while with a small chunk size it can interleave memory loads with the later computation (packing bytes to bits). I see IPC (instructions per cycle) drop from 2.4 to 2.2 when the chunk size increases from 16 to 64.

> [C++] Explore alternative strategy for Compare kernel implementation for
> better performance
> -------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9842
>                 URL: https://issues.apache.org/jira/browse/ARROW-9842
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 5.0.0
>
>         Attachments: movemask-in-chunks.diff, movemask.patch
>
>
> The compiler may be able to vectorize comparison operations if the bitpacking of
> results is deferred until the end (or in chunks). Instead, a temporary
> bytemap can be populated on a chunk-by-chunk basis and then the bytemaps can
> be bitpacked into the output buffer. This may also reduce the code size of
> the compare kernels (which are actually quite large at the moment)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)