rluvaton commented on issue #9083: URL: https://github.com/apache/arrow-rs/issues/9083#issuecomment-3756262948
The POC PR looks good enough in terms of performance (though it still contains a lot of commented-out code and experimentation) that I want to start a discussion. POC PR:
- #9164

**First, here is how I plan to integrate this in the way that is easiest for reviewers, and which optimization from that PR each step contains.**

### How I plan to add this in a way that is easy to review and understand

If we decide to proceed, I will create the following chain of PRs for easier review and understanding. Each one will focus on a single thing, and every PR except the first will be small.

#### 1. Just copy the code, with small tweaks
1. Copy the entire `RowConverter` and all the conversion code to a new directory named `unordered_row`
2. Rename the types to `Unordered*`
3. Remove the `Ord` trait, as we will no longer guarantee ordering (even though at this step the code is just a copy)
4. Update the tests
5. Change the API to take `Fields` instead of `SortField` - this is needed so I won't introduce a breaking change later

This will be the largest PR, but it only exists to establish a baseline and to be easy to review, especially for people who are already familiar with `RowConverter`.

#### 2. Add benchmarks
Run all the existing row format benchmarks on the unordered row format as well, so the next pull requests have a baseline to compare against.

#### 3. Remove all sort- and order-related code
This is split out because the first PR is a plain copy, which keeps it easier to review.

#### 4. Implement field reordering
Because we no longer guarantee ordering, we can reorder fields so that:
1. Fields with the same type are grouped together, which makes the later optimization of encoding multiple columns at once possible
2. The layout is optimized for cache lines (we won't write u64, u8, u64; instead we write u64, u64, u8 so the data stays mostly aligned - and later we can add padding if we want)
3. Encoding fields of the same type one right after another improves code locality

(A rough sketch of this reordering is included after step 8, below.)

This will also help with the store bottleneck as described here:
- https://github.com/apache/arrow-rs/issues/9083#issuecomment-3703789536

#### 5. Encode nulls at the beginning of the row, bit packed, plus a metadata bit
Currently we encode each null in a full byte, which is wasteful, so we will bit pack all the nullability information to save space. There will also be an optional metadata bit/byte (depending on the number of columns) that records whether all columns are valid or unknown, so such rows can be skipped quickly. The matching PR will have a detailed explanation of the code. (A rough sketch is also included after step 8, below.)

#### 6. Don't encode nulls for non-nullable fields
Skip null encoding for non-nullable fields to save computation.

This will also help with the store bottleneck as described here:
- https://github.com/apache/arrow-rs/issues/9083#issuecomment-3703789536

#### 7. Implement encoding 4 primitive columns of the same type at a time
Because we can change the ordering of the encoded data, we will group columns with the same type together and also group by nullability. That lets us encode 4 columns at a time to really take advantage of all the CPU's execution units, and to reduce how often we have to jump around in memory and come back.

#### 8. Implement a custom boolean encoding that saves space
Instead of encoding 1 bit for the null and 1 byte for the boolean, we will encode both in a single byte (or, if we decide not to encode the nulls in the same byte, we can encode the boolean in a single byte on its own, or bit pack the booleans like nulls).
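To make step 4 a little more concrete, here is a minimal sketch of the reordering idea. The helper name and the "encoded width" input are hypothetical, not the actual arrow-rs API: given each column's fixed encoded width, pick an encoding order that writes the widest columns first and keeps equal-width columns adjacent.

```rust
/// Minimal sketch of the field reordering from step 4 (hypothetical helper,
/// not the actual arrow-rs API). Given each column's fixed encoded width,
/// return the order in which columns would be written into the row:
/// widest first, ties broken by column index so same-width columns stay
/// adjacent for the "encode 4 columns at a time" step.
fn encoding_order(encoded_widths: &[usize]) -> Vec<usize> {
    let mut order: Vec<usize> = (0..encoded_widths.len()).collect();
    // Wider columns first keeps the row mostly aligned (u64s before u32s
    // before u8s); the index tie-break keeps the order deterministic.
    order.sort_by(|&a, &b| encoded_widths[b].cmp(&encoded_widths[a]).then(a.cmp(&b)));
    order
}

fn main() {
    // Columns declared as (u64, u8, u64, u32) would be written as
    // u64, u64, u32, u8, i.e. in index order [0, 2, 3, 1].
    assert_eq!(encoding_order(&[8, 1, 8, 4]), vec![0, 2, 3, 1]);
}
```

The real version would also group by data type (not just width) and by nullability, since steps 6-7 rely on that grouping.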
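And a minimal sketch of the bit-packed null prefix from step 5. The exact layout here (one marker byte followed by a packed validity bitmap) is an assumption for illustration, not the final format:

```rust
/// Minimal sketch of the bit-packed null prefix from step 5 (the layout is a
/// hypothetical illustration, not the final format). Instead of one full byte
/// per column, we emit one marker byte plus ceil(columns / 8) validity bytes,
/// and fully valid rows keep only the marker byte.
fn encode_null_prefix(validity: &[bool], out: &mut Vec<u8>) {
    let all_valid = validity.iter().all(|&v| v);
    // Marker byte: tells the decoder whether every column is valid, so fully
    // valid rows can be skipped quickly without reading a bitmap.
    out.push(all_valid as u8);
    if all_valid {
        return;
    }
    // Pack one validity bit per column, least significant bit first.
    for chunk in validity.chunks(8) {
        let mut byte = 0u8;
        for (i, &valid) in chunk.iter().enumerate() {
            byte |= (valid as u8) << i;
        }
        out.push(byte);
    }
}

fn main() {
    let mut row = Vec::new();
    // 9 nullable columns, only the second one is null.
    let validity = [true, false, true, true, true, true, true, true, true];
    encode_null_prefix(&validity, &mut row);
    // Marker 0 ("not all valid"), then the packed bitmap.
    assert_eq!(row, vec![0, 0b1111_1101, 0b0000_0001]);
}
```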
#### 9. Change how we encode binary-based data to be simpler and with fewer branches
The logic: encode the length, then encode the bytes (just copy them as-is).

To avoid small strings (say 1-8 bytes) having an encoded length larger than the value itself, we will use a dynamic type to encode the length: if the length fits in a `u8` we encode it as a `u8`, if it fits in a `u16` we encode it as a `u16`, and so on. This means that if most strings are shorter than `65,536` bytes we only spend 2 bytes on the length. Encoding a string of length `65,535` this way takes only `65,535 + 2` bytes (plus a control byte), compared to roughly `67,500` bytes in the current implementation, which puts a marker byte between every block of 32 bytes.

This also reduces branching, since we no longer copy in blocks or iterate over blocks, and the branch that picks the length type covers wide ranges, so it is much less likely to be mispredicted. Branch misprediction here was one of the main bottlenecks, as described in detail here:
- https://github.com/apache/arrow-rs/issues/9083#issuecomment-3703789605

(A rough sketch of the length-width selection is at the end of this comment.)

#### 10. Improve list encoding
Encode the number of items first, then the lengths, and then copy the child rows as-is. This allows for large contiguous data copies.

#### 11. Improve list encoding even further
Instead of encoding the lengths as 32-bit or 64-bit values, we will encode them all in the smallest type that can hold the worst case, to reduce space. (A rough sketch of steps 10-11 is also at the end of this comment.)

--------

The row converter will guarantee as little as possible, to allow further optimizations to be made.
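As referenced in step 9, here is a rough sketch of the dynamic length-width encoding. The tag values and layout are assumptions for illustration, not the final format; the point is that the length is stored in the smallest integer width that fits it, and the payload is one contiguous copy with no per-32-byte markers.

```rust
/// Minimal sketch of the variable-width length prefix from step 9 (the tag
/// byte and layout are an assumed illustration, not the final format). The
/// length is stored in the smallest integer type that holds it, a one-byte
/// tag records which width was used, and the payload is copied as-is with no
/// 32-byte block markers.
fn encode_bytes(value: &[u8], out: &mut Vec<u8>) {
    let len = value.len();
    if len <= u8::MAX as usize {
        out.push(1); // tag: length stored as u8
        out.push(len as u8);
    } else if len <= u16::MAX as usize {
        out.push(2); // tag: length stored as u16
        out.extend_from_slice(&(len as u16).to_le_bytes());
    } else {
        out.push(4); // tag: length stored as u32
        out.extend_from_slice(&(len as u32).to_le_bytes());
    }
    // One contiguous copy of the payload, no per-block branching.
    out.extend_from_slice(value);
}

fn main() {
    let mut row = Vec::new();
    encode_bytes(b"hello", &mut row);
    assert_eq!(row, vec![1, 5, b'h', b'e', b'l', b'l', b'o']);

    // A 65,535 byte value costs 65,535 + 3 bytes (tag + u16 length).
    let big = vec![0u8; 65_535];
    let mut row = Vec::new();
    encode_bytes(&big, &mut row);
    assert_eq!(row.len(), 65_535 + 3);
}
```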
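And a rough sketch of the list layout from steps 10-11, again with an assumed layout for illustration: the item count first, then all child lengths in one width chosen from the worst case in that list, then the already encoded child rows copied back to back.

```rust
/// Minimal sketch of the list encoding from steps 10-11 (the layout is an
/// assumed illustration, not the final format). The number of items comes
/// first, then every child length in the smallest width that fits the largest
/// child, then the already encoded child rows copied as one run each.
fn encode_list(child_rows: &[&[u8]], out: &mut Vec<u8>) {
    out.extend_from_slice(&(child_rows.len() as u32).to_le_bytes());
    let max_len = child_rows.iter().map(|r| r.len()).max().unwrap_or(0);
    // One width for all lengths in this list, picked from the worst case.
    let width: u8 = if max_len <= u8::MAX as usize {
        1
    } else if max_len <= u16::MAX as usize {
        2
    } else {
        4
    };
    out.push(width);
    for &row in child_rows {
        match width {
            1 => out.push(row.len() as u8),
            2 => out.extend_from_slice(&(row.len() as u16).to_le_bytes()),
            _ => out.extend_from_slice(&(row.len() as u32).to_le_bytes()),
        }
    }
    // Lengths are all known up front, so each child row is one plain copy.
    for &row in child_rows {
        out.extend_from_slice(row);
    }
}

fn main() {
    let a: &[u8] = b"ab";
    let b: &[u8] = b"c";
    let mut out = Vec::new();
    encode_list(&[a, b], &mut out);
    // Item count (u32) + width tag + two u8 lengths + the two payloads.
    assert_eq!(out, vec![2, 0, 0, 0, 1, 2, 1, b'a', b'b', b'c']);
}
```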
