[ https://issues.apache.org/jira/browse/ARROW-10058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203136#comment-17203136 ]
Antoine Pitrou commented on ARROW-10058:
----------------------------------------

Thanks for looking at this! Two questions:

1) have you tried a higher bit count? I think 5 or even 6 bits would be reasonable, that's 1024 or 4096 entries - significantly smaller than L1 cache, and depending on actual values not all the table may be "hot"

2) have you tried computing {{popcount(mask)}} directly instead of extracting it from the lookup table? At least on x86, it seems popcount latency is generally good (~3 cycles)


> [C++] Investigate performance of LevelsToBitmap without BMI2
> ------------------------------------------------------------
>
>                 Key: ARROW-10058
>                 URL: https://issues.apache.org/jira/browse/ARROW-10058
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: C++
>            Reporter: Antoine Pitrou
>            Priority: Major
>         Attachments: opt-level-conv.diff
>
>
> Currently, when some Parquet nested data involves some repetition levels,
> converting the levels to bitmap goes through a slow scalar path unless the
> BMI2 instruction set is available and efficient (the latter using the PEXT
> instruction to process 16 levels at once).
> It may be possible to emulate PEXT for 5- or 6-bit masks by using a lookup
> table, allowing to process 5-6 levels at once.
> (also, it would be good to add nested reading benchmarks for non-trivial
> nesting; currently we only benchmark one-level struct and one-level list)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
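
For illustration only, here is a minimal sketch of the lookup-table idea discussed above, assuming 5-bit masks. The table is indexed by the (mask, value) pair, which gives 2^10 = 1024 one-byte entries (2^12 = 4096 for 6 bits), and the number of extracted bits is computed with a popcount of the mask rather than stored in the table. All names ({{kBits}}, {{PextScalar}}, {{BuildPextTable}}, {{PextLookup}}, {{SelectedCount}}) are hypothetical; this is not the code from opt-level-conv.diff or from Arrow.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstdint>

// Hypothetical names throughout; an illustrative sketch, not Arrow's code.
constexpr int kBits = 5;                  // levels handled per table lookup
constexpr uint32_t kRange = 1u << kBits;  // 32 possible masks and 32 values

// Reference bit-by-bit PEXT: gather the bits of `value` selected by `mask`
// into the low bits of the result, preserving their relative order.
inline uint8_t PextScalar(uint32_t value, uint32_t mask) {
  uint32_t out = 0;
  int out_pos = 0;
  for (int i = 0; i < kBits; ++i) {
    if (mask & (1u << i)) {
      out |= ((value >> i) & 1u) << out_pos;
      ++out_pos;
    }
  }
  return static_cast<uint8_t>(out);
}

// Precompute every (mask, value) combination: 32 * 32 = 1024 entries.
inline std::array<uint8_t, kRange * kRange> BuildPextTable() {
  std::array<uint8_t, kRange * kRange> table{};
  for (uint32_t mask = 0; mask < kRange; ++mask) {
    for (uint32_t value = 0; value < kRange; ++value) {
      table[(mask << kBits) | value] = PextScalar(value, mask);
    }
  }
  return table;
}

static const std::array<uint8_t, kRange * kRange> kPextTable = BuildPextTable();

// Emulated PEXT for kBits-wide inputs: a single table load.
inline uint32_t PextLookup(uint32_t value, uint32_t mask) {
  assert(value < kRange && mask < kRange);
  return kPextTable[(mask << kBits) | value];
}

// Number of bits the lookup produced, computed directly from the mask
// instead of being read from the table. Compilers may lower this to a
// hardware popcount instruction when the target supports one.
inline int SelectedCount(uint32_t mask) {
  return static_cast<int>(std::bitset<32>(mask).count());
}
```

For example, with value 0x16 (binary 10110) and mask 0x1A (binary 11010), the selected bits are positions 1, 3 and 4, so {{PextLookup}} returns binary 101 (5) and {{SelectedCount}} returns 3. Whether a larger table or a direct popcount is actually faster would have to be measured with benchmarks.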