[ https://issues.apache.org/jira/browse/ARROW-9029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated ARROW-9029: ---------------------------------- Labels: pull-request-available (was: ) > [C++] Implement BitmapScanner interface to accelerate processing of > mostly-not-null data > ---------------------------------------------------------------------------------------- > > Key: ARROW-9029 > URL: https://issues.apache.org/jira/browse/ARROW-9029 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ > Reporter: Wes McKinney > Assignee: Wes McKinney > Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > In analytics, it is common for data to be all not-null or mostly not-null. > Data with > 50% nulls tends to be more exceptional. In this light, our > {{BitmapReader}} class which allows iteration of each bit in a bitmap can be > computationally suboptimal for mostly set validity bitmaps. > I propose instead a new interface for use in kernel implementations, for lack > of a better term {{BitmapScanner}}. This works as follows: > * Uses popcount to accumulate consecutive 64-bit words from a bitmap where > all values are set, up to some limit (e.g. anywhere from 8 to 128 words or > more -- we can use benchmarks to determine what is a good limit). The length > of this "all-on" run is returned to the caller in a single function call, so > that this "run" of data can be processed without any bit-by-bit bitmap > checking > * If words containing unset bits is encountered, the scanner will similarly > accumulate non-full words until the next full word is encountered or a limit > is hit. The length of this "has nulls" run is returned to the caller, which > then proceeds bit-by-bit to process the data > For data with a lot of nulls, this may degrade performance somewhat but > probably not that much empirically. However, data that is mostly-not-null > should benefit from this. > This BitmapScanner utility can probably also be used to accelerate the > implementation of Filter for mostly-not-null data -- This message was sent by Atlassian Jira (v8.3.4#803005)