rluvaton commented on issue #9083:
URL: https://github.com/apache/arrow-rs/issues/9083#issuecomment-3756262948

   The POC PR looks good in terms of performance (though it still contains a lot of commented-out code and experimentation), so I want to start a discussion.
   
   POC PR:
   - #9164
   
   **First I lay out how I plan to integrate this in the way that is easiest for reviewers, and which optimizations I made in that PR, step by step.**
   
   ### How I plan to add those in a way that is easy to review and understand
   
   If we decide to proceed, I will create the following chain of PRs for better review and easier understanding. Each one will focus on one thing, and every PR except the first will be small.
   
   #### 1. Just copy the code with small tweaks
   1. Copy the entire RowConverter and all the conversion code to a new 
directory named `unordered_row`
   2. Rename to `Unordered*`
   3. Remove the `Ord` trait, as we will no longer guarantee ordering (even though at this step the code is just a copy)
   4. Update tests
   5. Change the API to accept `Fields` instead of `SortField` - this is needed so I won't introduce a breaking change later
   
   This will be the largest PR, but it is only meant to establish a baseline and be easy to review, especially for people who are already familiar with `RowConverter`.
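   
   To make the new constructor shape concrete, here is a minimal sketch of what the API could look like after this step, assuming the type ends up being called `UnorderedRowConverter` (the name and struct layout are placeholders, not the final API):
   
   ```rust
   use arrow_schema::{DataType, Field, Fields};
   
   // Placeholder sketch: takes plain `Fields` instead of `Vec<SortField>`,
   // since sort options no longer apply to the unordered format.
   pub struct UnorderedRowConverter {
       fields: Fields,
       // per-field encoders, buffers, etc. would live here
   }
   
   impl UnorderedRowConverter {
       pub fn new(fields: Fields) -> Self {
           Self { fields }
       }
   
       /// The fields this converter encodes, in their original order.
       pub fn fields(&self) -> &Fields {
           &self.fields
       }
   }
   
   fn main() {
       let fields = Fields::from(vec![
           Field::new("a", DataType::Int64, true),
           Field::new("b", DataType::Utf8, false),
       ]);
       let converter = UnorderedRowConverter::new(fields);
       assert_eq!(converter.fields().len(), 2);
   }
   ```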
   
   #### 2. Add benchmarks
   Just run all row format benchmarks against the unordered row format as well, so the next pull requests have a baseline.
   
   #### 3. Remove all sort and order related code
   The sort- and order-related code is still present only because the first PR is a plain copy made for easier review; this PR removes it.
   
   #### 4. Implement an optimization for field reordering
   Because we no longer guarantee ordering, we can reorder fields so that:
   1. Fields with the same type can be grouped together, which makes the later optimization of encoding multiple columns at once possible
   2. We can optimize for cache lines (e.g. instead of writing u64, u8, u64 we write u64, u64, u8 so the layout stays mostly aligned - and later we can add padding if we want)
   3. Encoding fields with the same type right after one another improves code locality
   
   This will also help with the store bottleneck as described here:
   - https://github.com/apache/arrow-rs/issues/9083#issuecomment-3703789536
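   
   A minimal sketch of one possible reordering heuristic, grouping columns by a coarse encoding class and putting wider fixed-width types first; the class values and helper names below are illustrative only, not the planned implementation:
   
   ```rust
   use arrow_schema::{DataType, Field, Fields};
   
   // Coarse "encoding class": fixed-width primitives first (widest first via the
   // secondary key), then variable-length data, then everything else.
   fn encoding_class(dt: &DataType) -> (u8, usize) {
       match dt {
           DataType::Int64 | DataType::UInt64 | DataType::Float64 => (0, 0),
           DataType::Int32 | DataType::UInt32 | DataType::Float32 => (0, 1),
           DataType::Int16 | DataType::UInt16 => (0, 2),
           DataType::Int8 | DataType::UInt8 | DataType::Boolean => (0, 3),
           DataType::Utf8 | DataType::Binary => (1, 0),
           _ => (2, 0),
       }
   }
   
   /// Returns the order in which columns would be encoded; the original column
   /// indices are kept so rows can still be mapped back to the input order.
   fn encoding_order(fields: &Fields) -> Vec<usize> {
       let mut order: Vec<usize> = (0..fields.len()).collect();
       order.sort_by_key(|&i| encoding_class(fields[i].data_type()));
       order
   }
   
   fn main() {
       let fields = Fields::from(vec![
           Field::new("a", DataType::UInt64, true),
           Field::new("b", DataType::UInt8, true),
           Field::new("c", DataType::UInt64, true),
       ]);
       // the two u64 columns are grouped and the u8 column moves to the end: [0, 2, 1]
       println!("{:?}", encoding_order(&fields));
   }
   ```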
   
   #### 5. Encode nulls bit-packed at the beginning of the row + a metadata bit
   Currently we encode each null in a full byte, which is wasteful, so we will bit-pack all the nullability information to save space.
   
   There will also be an optional metadata bit/byte (depending on the number of columns to encode) that records whether all columns are valid or unknown, so rows can be skipped quickly.
   
   A detailed explanation of the code will be in the matching PR.
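   
   A rough sketch of the idea, using made-up helper names: pack column validity into a prefix of ceil(columns / 8) bytes, preceded by a metadata byte that marks "all columns valid" so decoding can skip the bitmap entirely.
   
   ```rust
   // Sketch only: the final layout (bit vs byte metadata, bit order) is still open.
   fn encode_null_prefix(valid: &[bool], out: &mut Vec<u8>) {
       let all_valid = valid.iter().all(|&v| v);
       // metadata byte: 1 = every column valid, 0 = read the bitmap that follows
       out.push(all_valid as u8);
       if all_valid {
           return;
       }
       let mut packed = vec![0u8; (valid.len() + 7) / 8];
       for (i, &v) in valid.iter().enumerate() {
           if v {
               packed[i / 8] |= 1 << (i % 8);
           }
       }
       out.extend_from_slice(&packed);
   }
   
   fn main() {
       let mut row = Vec::new();
       encode_null_prefix(&[true, false, true, true], &mut row);
       // 1 metadata byte + 1 bitmap byte instead of 4 one-byte null markers
       assert_eq!(row, vec![0, 0b0000_1101]);
   }
   ```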
   
   #### 6. Don't encode nulls for non-nullable fields
   Skip null encoding for non-nullable fields to save computation.
   
   This will also help with the store bottleneck as described here:
   - https://github.com/apache/arrow-rs/issues/9083#issuecomment-3703789536
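   
   As a rough illustration (the helper below is hypothetical), only the nullable columns would contribute to the bit-packed prefix from step 5:
   
   ```rust
   // Sketch: a non-nullable field never needs a validity bit, so both the bitmap
   // space and the per-row validity branch can be dropped when building the encoder.
   fn null_prefix_bytes(nullable_column_count: usize) -> usize {
       (nullable_column_count + 7) / 8
   }
   
   fn main() {
       // a schema with 10 columns of which only 3 are nullable needs a single prefix byte
       assert_eq!(null_prefix_bytes(3), 1);
       assert_eq!(null_prefix_bytes(0), 0);
   }
   ```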
   
   #### 7. Implement encoding 4 same-type primitive columns at a time
   Because we can change the ordering of the encoded data, we will group together columns with the same type and also group by nullability.
   
   That lets us encode 4 columns at a time, which takes better advantage of the CPU's execution units and reduces the number of jumps around memory compared to coming back to each row once per column.
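   
   A minimal sketch of the shape of this loop, assuming four adjacent non-null `u64` columns and a fixed row width; the layout and helper are illustrative, not the planned code:
   
   ```rust
   // Write the values of four u64 columns for each row in one pass, so a row's
   // bytes are produced together instead of revisiting the row once per column.
   fn encode_u64_cols_x4(cols: [&[u64]; 4], num_rows: usize, row_width: usize, out: &mut [u8]) {
       for row in 0..num_rows {
           let base = row * row_width;
           for c in 0..4 {
               let offset = base + c * 8;
               out[offset..offset + 8].copy_from_slice(&cols[c][row].to_le_bytes());
           }
       }
   }
   
   fn main() {
       let (a, b, c, d) = ([1u64, 2], [3u64, 4], [5u64, 6], [7u64, 8]);
       let mut rows = vec![0u8; 2 * 32];
       encode_u64_cols_x4([&a[..], &b[..], &c[..], &d[..]], 2, 32, &mut rows);
       assert_eq!(&rows[0..8], &1u64.to_le_bytes());
       assert_eq!(&rows[32..40], &2u64.to_le_bytes());
   }
   ```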
   
   #### 8. Implement a custom boolean encoding that saves space
   Instead of encoding 1 bit for the null and 1 byte for the boolean value, we will encode both in a single byte (alternatively, we can encode just the boolean in a single byte if we decide not to encode the nulls in the same byte, or bit-pack booleans like nulls).
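   
   A tiny sketch of one way to fold the null flag and the value into a single byte (0 = null, 1 = false, 2 = true); whether nulls end up here or in the bit-packed prefix from step 5 is still an open design choice:
   
   ```rust
   fn encode_bool(value: Option<bool>) -> u8 {
       match value {
           None => 0,
           Some(false) => 1,
           Some(true) => 2,
       }
   }
   
   fn decode_bool(byte: u8) -> Option<bool> {
       match byte {
           0 => None,
           1 => Some(false),
           _ => Some(true),
       }
   }
   
   fn main() {
       for v in [None, Some(false), Some(true)] {
           assert_eq!(decode_bool(encode_bool(v)), v);
       }
   }
   ```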
   
   #### 9. Change how we encode binary-based data to be simpler and reduce branches + logic
   We will encode the length and then the bytes (just copying the bytes as-is).
   
   To avoid small strings (e.g. 1-8 bytes) ending up with an encoded length larger than the actual data, we will use a dynamic type to encode the length: if the length fits in a `u8` we encode it as a u8, if it fits in a `u16` we encode it as a u16, and so on.
   
   This means that if most strings are shorter than `65,536` bytes, we will only use 2 bytes to encode the length.
   
   And if I want to encode a string of length `65,535` with this implementation, it will only use `65,535 + 2` bytes (+ a control byte), compared to roughly `67,500` bytes in the current implementation, due to the marker byte between each 32-byte block.
   
   This will also reduce the number of branches, since we no longer copy in blocks or iterate over blocks, and the branch that picks which integer type to use for the length covers a fairly wide range, so it is much less likely to be mispredicted.
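   
   A rough sketch of the dynamic length prefix: one control byte stating which integer width holds the length, then the length itself, then the raw bytes (the control values 1/2/4/8 are placeholders, not a final wire format):
   
   ```rust
   fn encode_binary(value: &[u8], out: &mut Vec<u8>) {
       let len = value.len();
       if len <= u8::MAX as usize {
           out.push(1);
           out.push(len as u8);
       } else if len <= u16::MAX as usize {
           out.push(2);
           out.extend_from_slice(&(len as u16).to_le_bytes());
       } else if len <= u32::MAX as usize {
           out.push(4);
           out.extend_from_slice(&(len as u32).to_le_bytes());
       } else {
           out.push(8);
           out.extend_from_slice(&(len as u64).to_le_bytes());
       }
       // copy the payload as-is: no 32-byte blocks and no per-block marker bytes
       out.extend_from_slice(value);
   }
   
   fn main() {
       let mut out = Vec::new();
       encode_binary(&vec![b'x'; 65_535], &mut out);
       // 1 control byte + 2 length bytes + payload
       assert_eq!(out.len(), 65_535 + 3);
   }
   ```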
   
   BTW branch misprediction here was one of the main bottlenecks as described 
in detail here:
   - https://github.com/apache/arrow-rs/issues/9083#issuecomment-3703789605
   
   #### 10. Improve list encoding
   Encode the number of items first, then the lengths, and then copy the child rows as-is. This allows one large contiguous data copy.
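   
   A rough sketch of that layout (the child rows are assumed to be encoded already; the `u32` lengths here are just for illustration and get shrunk in the next step):
   
   ```rust
   // Count, then every child-row length, then all child-row bytes back-to-back,
   // so the data copy is one large contiguous write per list.
   fn encode_list(child_rows: &[&[u8]], out: &mut Vec<u8>) {
       out.extend_from_slice(&(child_rows.len() as u32).to_le_bytes());
       for row in child_rows {
           out.extend_from_slice(&(row.len() as u32).to_le_bytes());
       }
       for row in child_rows {
           out.extend_from_slice(row);
       }
   }
   
   fn main() {
       let a: &[u8] = b"abc";
       let b: &[u8] = b"de";
       let mut out = Vec::new();
       encode_list(&[a, b], &mut out);
       // 4 bytes count + 2 * 4 bytes lengths + 5 payload bytes
       assert_eq!(out.len(), 4 + 8 + 5);
   }
   ```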
   
   #### 11. Improve list encoding even further
   Instead of encoding the lengths as 32-bit or 64-bit values, we will encode them all using the same type: the smallest type that can hold the worst-case length, so we can reduce space.
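   
   A sketch of picking that width (hypothetical helper; the width would be chosen once per batch from the largest child-row length and used for every length):
   
   ```rust
   fn length_width(max_len: usize) -> usize {
       if max_len <= u8::MAX as usize {
           1
       } else if max_len <= u16::MAX as usize {
           2
       } else if max_len <= u32::MAX as usize {
           4
       } else {
           8
       }
   }
   
   fn main() {
       // all child rows in this batch are short, so 1 byte per length is enough
       assert_eq!(length_width(200), 1);
       assert_eq!(length_width(70_000), 4);
   }
   ```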
   
   
   --------
   
   The row converter will guarantee as little as possible, to allow further optimizations to be made.

