rich-t-kid-datadog commented on issue #3520: URL: https://github.com/apache/arrow-rs/issues/3520#issuecomment-3000909289
### **Road Map of REE epic:**

Current order:

1. **Support REE in cast kernels**
2. **Support binary expressions**
3. **Write SQL page of expected REE array workloads**
4. **Support string functions for REE**

**Support REE in cast kernels**

Summary of PR: _This PR adds support for casting to and from RunEndEncoded (REE) arrays within the Arrow Rust (arrow-rs) library. It also expands the casting logic to handle logical types, ensuring that REE arrays can participate in conversions involving logical representations as well._

[Pull Request for REE arrow cast](https://github.com/apache/arrow-rs/pull/7713)

**REE support in binary expressions (arithmetic, comparison)**

_Goals:_

- Enable support for arithmetic and comparison expressions (==, !=, <, >, etc.) over RunEndEncodedArray (REE) inputs.
- Support REE participation in:
  - Equality (eq, neq)
  - Partial equality (grouping, hashing, join keys)
  - Arithmetic expressions (where applicable)
- Ensure consistency with how Arrow currently handles DictionaryArray and other nested/logical types in binary expressions.

_Issue faced:_

What defines equality between REE arrays? Should two REE arrays be considered equal if:

- Their logical arrays (i.e., expanded views) are equal?
- Their physical run-end structure is equal?
- Their value arrays are equal, but the run lengths differ?

Which attributes should be taken into account when hashing REE arrays? Should the same array encoded with two different run-end types be considered equal, and should both encodings produce the same hash?

Example:

- `REE_1 = run_ends(int64): [4,9,22] values: [3,4,1]`
- `REE_2 = run_ends(int16): [4,9,22] values: [3,4,1]`

Should `REE_1 == REE_2` hold because their logical representations are the same, or should the comparison report inequality because the data types don't match? What about hashing?

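
To make the equality question above concrete, here is a rough sketch in plain Rust. It does not use arrow-rs: `Ree`, `from_i16`, `logical_eq`, `physical_eq`, and `logical_hash` are all hypothetical names made up for illustration, and run ends are assumed to be strictly increasing. It shows one possible answer (compare and hash the logical content, ignoring how runs are split and how wide the run-end type is), not a decision about what arrow-rs should do.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a run-end encoded array (NOT arrow-rs's RunArray).
/// `run_ends[i]` is the exclusive logical end index of run `i`.
struct Ree {
    run_ends: Vec<i64>,
    values: Vec<i32>,
}

impl Ree {
    /// Build from 16-bit run ends by widening them, so the index width
    /// no longer matters for comparison or hashing.
    fn from_i16(run_ends: &[i16], values: &[i32]) -> Ree {
        Ree {
            run_ends: run_ends.iter().map(|&e| i64::from(e)).collect(),
            values: values.to_vec(),
        }
    }

    /// Logical length is the last run end (0 for an empty array).
    fn len(&self) -> i64 {
        self.run_ends.last().copied().unwrap_or(0)
    }
}

/// Physical equality: identical run ends and identical values.
fn physical_eq(a: &Ree, b: &Ree) -> bool {
    a.run_ends == b.run_ends && a.values == b.values
}

/// Logical equality: walk both run lists in lockstep, never expanding.
fn logical_eq(a: &Ree, b: &Ree) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let (mut i, mut j) = (0, 0);
    while i < a.run_ends.len() && j < b.run_ends.len() {
        // The current runs overlap, so their values must match.
        if a.values[i] != b.values[j] {
            return false;
        }
        // Advance whichever run ends first (both on a tie).
        let (ea, eb) = (a.run_ends[i], b.run_ends[j]);
        if ea <= eb {
            i += 1;
        }
        if eb <= ea {
            j += 1;
        }
    }
    true
}

/// Hash of the logical content: adjacent runs with equal values are merged
/// first, so differently split encodings of the same data hash the same.
fn logical_hash(a: &Ree) -> u64 {
    let mut h = DefaultHasher::new();
    for i in 0..a.values.len() {
        // A merged run closes at the last run or when the next value differs.
        let is_last = i + 1 == a.values.len();
        if is_last || a.values[i + 1] != a.values[i] {
            a.values[i].hash(&mut h);
            a.run_ends[i].hash(&mut h);
        }
    }
    h.finish()
}

fn main() {
    // The example from this comment: same run ends and values, different index width.
    let ree_1 = Ree { run_ends: vec![4, 9, 22], values: vec![3, 4, 1] };
    let ree_2 = Ree::from_i16(&[4, 9, 22], &[3, 4, 1]);
    // Same logical data, but the first run is split into two physical runs.
    let ree_3 = Ree { run_ends: vec![2, 4, 9, 22], values: vec![3, 3, 4, 1] };

    println!("logical  1 vs 2: {}", logical_eq(&ree_1, &ree_2));
    println!("logical  1 vs 3: {}", logical_eq(&ree_1, &ree_3));
    println!("physical 1 vs 3: {}", physical_eq(&ree_1, &ree_3));
    println!("same hash 1 vs 3: {}", logical_hash(&ree_1) == logical_hash(&ree_3));
}
```

If equality is defined logically, the hash has to be derived from a canonical form like the merged runs here; otherwise two arrays that compare equal could land in different hash buckets, which would break grouping and join keys.
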
**REE support in string functions**

_Goals:_

Enable REE arrays with string values (`RunEndEncoded<Index, Utf8>` or `Utf8View`) to work seamlessly with Arrow's string functions, such as:

- length, concat, substr
- starts_with, ends_with, contains
- Regex-based ops (like, match, etc.)

String functions operate over the logical representation of an REE array, not the physical one.

_Issues:_

Off the top of my head, I'd assume there is a performance cost to repeatedly transforming REE arrays into their logical form to perform string operations. There may be smarter ways to keep track of the data inside REE arrays without needing to decode them continuously.

**Write SQL page of expected uses of REE array workloads**

Write a sqllogictest file with REE-encoded data and try to run basic queries on it (most or all of which will likely fail due to lack of support). For example, something like this: https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/dictionary.slt

_Whatever operations fail, work on implementing those first for REE._

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
