[ https://issues.apache.org/jira/browse/ARROW-10058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202999#comment-17202999 ]
Yibo Cai commented on ARROW-10058:
----------------------------------

POC with 4 bit lookup table (uint8[16][16]) to map [mask][data] directly to the pext-ed result. See big performance improvement (637M/s -> 1074M/s).

POC patch [^opt-level-conv.diff]. Will propose a formal PR.

Benchmark result of release/parquet-level-conversion-benchmark

*Current code*
{code:bash}
---------------------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------
BM_DefinitionLevelsToBitmapRepeatedAllMissing        1072 ns         1072 ns       651457 bytes_per_second=1.77856G/s
BM_DefinitionLevelsToBitmapRepeatedAllPresent        1226 ns         1226 ns       570829 bytes_per_second=1.55599G/s
BM_DefinitionLevelsToBitmapRepeatedMostPresent       3065 ns         3065 ns       228285 bytes_per_second=637.151M/s
{code}

*With lookup table*
{code:bash}
---------------------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------
BM_DefinitionLevelsToBitmapRepeatedAllMissing        1093 ns         1093 ns       640348 bytes_per_second=1.74501G/s
BM_DefinitionLevelsToBitmapRepeatedAllPresent        1244 ns         1244 ns       564592 bytes_per_second=1.53301G/s
BM_DefinitionLevelsToBitmapRepeatedMostPresent       1817 ns         1817 ns       384456 bytes_per_second=1074.7M/s
{code}


> [C++] Investigate performance of LevelsToBitmap without BMI2
> ------------------------------------------------------------
>
>                 Key: ARROW-10058
>                 URL: https://issues.apache.org/jira/browse/ARROW-10058
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: C++
>            Reporter: Antoine Pitrou
>            Priority: Major
>         Attachments: opt-level-conv.diff
>
>
> Currently, when some Parquet nested data involves some repetition levels,
> converting the levels to bitmap goes through a slow scalar path unless the
> BMI2 instruction set is available and efficient (the latter using the PEXT
> instruction to process 16 levels at once).
> It may be possible to emulate PEXT for 5- or 6-bit masks by using a lookup
> table, allowing to process 5-6 levels at once.
> (also, it would be good to add nested reading benchmarks for non-trivial
> nesting; currently we only benchmark one-level struct and one-level list)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
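
For readers without the attachment, the lookup-table idea from the comment (a uint8[16][16] table indexed as [mask][data]) can be sketched roughly as follows. This is only an illustration and is not taken from opt-level-conv.diff: the names PextTable, SlowPext4, PopCount4, MakePextTable and TablePext16 are hypothetical, and handling a 16-bit word one nibble at a time is an assumption about how such a table could be applied.

```cpp
#include <array>
#include <cstdint>

// Hypothetical names; the attached patch may be structured differently.
using PextTable = std::array<std::array<uint8_t, 16>, 16>;

// Reference bit-by-bit extraction for a 4-bit mask/data pair:
// gathers the data bits selected by mask into the low bits of the result.
inline uint8_t SlowPext4(unsigned mask, unsigned data) {
  uint8_t out = 0;
  int out_bit = 0;
  for (int bit = 0; bit < 4; ++bit) {
    if (mask & (1u << bit)) {
      if (data & (1u << bit)) out |= static_cast<uint8_t>(1u << out_bit);
      ++out_bit;
    }
  }
  return out;
}

// Number of set bits in a 4-bit value.
inline int PopCount4(unsigned v) {
  int n = 0;
  for (int bit = 0; bit < 4; ++bit) n += (v >> bit) & 1u;
  return n;
}

// Build table[mask][data] once (16 * 16 = 256 bytes).
inline PextTable MakePextTable() {
  PextTable table{};
  for (unsigned mask = 0; mask < 16; ++mask) {
    for (unsigned data = 0; data < 16; ++data) {
      table[mask][data] = SlowPext4(mask, data);
    }
  }
  return table;
}

// Emulate PEXT on a 16-bit word, one nibble at a time (an assumption,
// not necessarily what the patch does).
inline uint32_t TablePext16(uint16_t data, uint16_t mask) {
  static const PextTable table = MakePextTable();
  uint32_t result = 0;
  int out_bits = 0;
  for (int shift = 0; shift < 16; shift += 4) {
    const unsigned m = (mask >> shift) & 0xFu;
    const unsigned d = (data >> shift) & 0xFu;
    // Append this nibble's extracted bits above those already produced.
    result |= static_cast<uint32_t>(table[m][d]) << out_bits;
    out_bits += PopCount4(m);
  }
  return result;
}
```

For example, TablePext16(0xA5F0, 0x0FF0) selects bits 4 to 11 of the data and returns 0x5F. A 4-bit table costs only 256 bytes; the 5- or 6-bit masks suggested in the issue description would cover more levels per lookup at the price of a larger table.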