sandugood opened a new pull request, #4672: URL: https://github.com/apache/datafusion-comet/pull/4672
## Which issue does this PR close? Closes #4626 ## Rationale for this change Currently, there is a bottleneck in performance during bulk append in the `SparkUnsafeArray` implementation (in `macro` and respective `Builder` types for `bool`, `date32` and `timestamp`). If the array is `NULLABLE` there is a hotspot: - we check each corresponding element at the index for nullability using `Self::is_null_in_bitset()` which is suboptimal (see the benchmarking results below) ## What changes are included in this PR? With this PR we change the flow of execution for nullable arrays using native arrow methods: - appending nulls to the current builder - getting `.slices_mut() `from the passed builder to get the reference to its content (data and null buffer) - checking if we land on byte boundary - switching between Spark and Arrow null mask bits. All of these changes are also documented in the code as docstrings. This part of native code wasn't well documented, so I thought it would be great to add a bunch of comments for future contributors. ## How are these changes tested? All of the basic tests pass. Here is additional result from benchmarking in array_element_append.rs (benches for `NULLABLE` case) | array_type | main (current) | incoming PR | | ---| --- | --- | | i32 | 30.030μs | 1.877μs | | i64 | 27.572μs | 3.074μs | | f64 | 30.059μs | 3.082μs | | date32 | 28.693μs | 1.905μs | | timestamp | 46.832μs | 3.153μs | -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
