Re: [I] [Epic] Implement `RunArray` (Run Length Encoding (RLE) / Run End Encoding (REE) support) [arrow-rs]

via GitHub Tue, 24 Jun 2025 08:13:59 -0700


rich-t-kid-datadog commented on issue #3520:
URL: https://github.com/apache/arrow-rs/issues/3520#issuecomment-3000909289


   ### **Road Map of REE epic:**
   
   Current Order:
   
   **Support REE in cast Kernels**
   
   **Support Binary Expressions**
   
   **Write SQL page of expected REE array workloads**
   
   **Support string functions for REE**
   
   Support REE in cast kernels: 
   
   Summary of PR:
   
   _This PR adds support for casting to and from RunEndEncoded (REE) arrays 
within the Arrow Rust (arrow-rs) library. It also expands the casting logic to 
handle logical types, ensuring that REE arrays can participate in conversions 
involving logical representations as well._
   
   [Pull Request for REE arrow 
cast](https://github.com/apache/arrow-rs/pull/7713)
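   
   To make "casting to and from REE" concrete, here is a minimal sketch of the 
encode/decode round trip. It uses a hypothetical `SimpleRee` struct as a 
stand-in for arrow-rs's `RunArray`, ignores nulls, and fixes the run-end type 
to `i32`; it is not the code from the PR.

```rust
/// Hypothetical, simplified stand-in for a run-end encoded array:
/// `run_ends[i]` is the exclusive logical end index of run `i`,
/// and `values[i]` is the value repeated across that run.
#[derive(Debug, PartialEq)]
struct SimpleRee<T> {
    run_ends: Vec<i32>,
    values: Vec<T>,
}

/// "Cast to REE": collapse consecutive equal elements into runs.
fn encode<T: PartialEq + Clone>(flat: &[T]) -> SimpleRee<T> {
    let mut run_ends: Vec<i32> = Vec::new();
    let mut values: Vec<T> = Vec::new();
    for (i, v) in flat.iter().enumerate() {
        if values.last() == Some(v) {
            // Same value as the current run: move its end forward.
            *run_ends.last_mut().unwrap() = (i + 1) as i32;
        } else {
            // New value: start a new run ending after this element.
            run_ends.push((i + 1) as i32);
            values.push(v.clone());
        }
    }
    SimpleRee { run_ends, values }
}

/// "Cast from REE": expand each run back into its logical elements.
fn decode<T: Clone>(ree: &SimpleRee<T>) -> Vec<T> {
    let mut out = Vec::new();
    let mut start = 0usize;
    for (end, v) in ree.run_ends.iter().zip(&ree.values) {
        let end = *end as usize;
        out.extend(std::iter::repeat(v.clone()).take(end - start));
        start = end;
    }
    out
}
```

   For instance, `encode(&[3, 3, 3, 3, 4, 4, 1])` yields run ends `[4, 6, 7]` 
with values `[3, 4, 1]`, and `decode` restores the original seven elements.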
   
   
   
   REE support in binary expressions (arithmetic, comparison):
   
   _Goals:_
   
   Enable support for arithmetic and comparison expressions (==, !=, <, >, 
etc.) over RunEndEncodedArray (REE) inputs.
   
   Support REE participation in:
   
   Equality (eq, neq)
   
   Partial equality (grouping, hashing, join keys)
   
   Arithmetic expressions (where applicable)
   
   Ensure consistency with how Arrow currently handles DictionaryArray and 
other nested/logical types in binary expressions.
   
   
   
   _Issue faced:_
   
   What defines equality between REE arrays?
   
   Should two REE arrays be considered equal if:
   
   Their logical arrays (i.e., expanded views) are equal?
   
   Their physical run-end structure is equal?
   
   Their value arrays are equal, but the run lengths differ?
   
   Which attributes should be taken into account when hashing REE arrays? 
Should the same logical array stored with two different run-end types be 
considered equal, and should both produce the same hash?
   
   Example:
   
    REE_1 = run_ends(int64): [4,9,22] values: [3,4,1]
   
    REE_2 = run_ends(int16): [4,9,22] values: [3,4,1]
   
   Should REE_1 == REE_2 hold because their logical representations are the 
same, or should the comparison report inequality because the data types do not 
match? And what about hashing?
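   
   If the answer turns out to be "compare logical values", one possible shape 
is sketched below. This is only an illustration of that option, not a decision 
and not arrow-rs API: `logical_eq` and `logical_hash` are hypothetical helpers 
over plain slices, they ignore nulls, and they assume strictly increasing run 
ends. Both widen run ends to `i64` so the run-end type does not affect the 
result, and neither expands the runs.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Logical equality of two run-end encoded sequences whose run-end types
/// may differ. Walks both run lists in lockstep without expanding them.
fn logical_eq<A, B, T>(ends_a: &[A], vals_a: &[T], ends_b: &[B], vals_b: &[T]) -> bool
where
    A: Copy + Into<i64>,
    B: Copy + Into<i64>,
    T: PartialEq,
{
    let len_a: i64 = ends_a.last().map_or(0, |&e| e.into());
    let len_b: i64 = ends_b.last().map_or(0, |&e| e.into());
    if len_a != len_b {
        return false;
    }
    let (mut i, mut j) = (0, 0);
    while i < ends_a.len() && j < ends_b.len() {
        // The current runs overlap on some logical range, so they must agree.
        if vals_a[i] != vals_b[j] {
            return false;
        }
        let (ea, eb): (i64, i64) = (ends_a[i].into(), ends_b[j].into());
        // Advance whichever run finishes first (both on a tie).
        if ea <= eb {
            i += 1;
        }
        if eb <= ea {
            j += 1;
        }
    }
    true
}

/// A hash consistent with `logical_eq`: adjacent runs holding the same value
/// are merged first, so the physical run split does not change the hash.
fn logical_hash<R, T>(ends: &[R], vals: &[T]) -> u64
where
    R: Copy + Into<i64>,
    T: Hash + PartialEq,
{
    let mut h = DefaultHasher::new();
    for i in 0..ends.len() {
        // Skip a boundary when the next run repeats this value.
        if i + 1 < ends.len() && vals[i] == vals[i + 1] {
            continue;
        }
        vals[i].hash(&mut h);
        let end: i64 = ends[i].into();
        end.hash(&mut h);
    }
    h.finish()
}
```

   Under this definition REE_1 and REE_2 from the example compare equal and 
hash identically, and so do `[2,4] / [3,3]` and `[4] / [3]`, which differ only 
in how the runs are split. A type-strict definition would instead reject the 
int64 versus int16 pair before looking at any data.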
   
    
   
   **REE support in string functions**
   
   _Goals:_
   
   Enable REE arrays with string values (RunEndEncoded<Index, Utf8> or 
Utf8View) to work seamlessly with Arrow's string functions like:
   
   length, concat, substr
   
   starts_with, ends_with, contains
   
   Regex-based ops (like, match, etc.)
   
   String functions operate over the logical representation of an REE array, 
not the physical one.
   
   _Issues:_
   
   Off the top of my head, I'd assume there is a performance cost to repeatedly 
expanding REE arrays into their logical form to perform string operations. There 
may be smarter ways to operate on the data inside REE arrays without needing to 
decode them each time.
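   
   One such approach, for unary element-wise functions only, is to apply the 
function once per run on the values child and reuse the run ends unchanged. 
The sketch below shows the idea with a hypothetical `map_values` helper over 
plain slices; it is not an existing arrow-rs kernel and it ignores nulls.

```rust
/// Apply `f` once per run instead of once per logical row.
/// The run ends are copied through unchanged, so the result is still
/// run-end encoded and has the same logical length as the input.
fn map_values<T, U>(
    run_ends: &[i32],
    values: &[T],
    f: impl Fn(&T) -> U,
) -> (Vec<i32>, Vec<U>) {
    (run_ends.to_vec(), values.iter().map(f).collect())
}
```

   For example, `map_values(&[4, 9, 22], &["foo", "ab", "hello"], |s: &&str| s.len())` 
computes three string lengths rather than 22. Two caveats: the output may 
contain adjacent runs with equal values (two different strings can share a 
length), which is still valid but no longer minimal, and binary functions such 
as `concat` over two REE inputs would first need their runs aligned.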
   
   **Write SQL page of expected uses of REE array workloads.**
   
   Write a sqllogictest file with REE-encoded data and try basic queries 
on it (most or all of which will likely fail due to lack of support). For 
example, something like this: 
https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/dictionary.slt
 
   
   _Whatever operations fail, work on implementing those first for REE._

