alamb commented on code in PR #7875:
URL: https://github.com/apache/arrow-rs/pull/7875#discussion_r2187650306
##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -1193,9 +1197,19 @@ mod tests {
b"abcdefghi",
b"abcdefghij",
b"abcdefghijk",
- b"abcdefghijkl", // 12 bytes, max inline
- b"bar",
- b"bar\0", // special case to test length tiebreaker
+ b"abcdefghijkl",
+ //
───────────────────────────────────────────────────────────────────────
+ // This pair verifies that we didn’t accidentally reverse the
inline bytes:
+ // without our fix, “backend one” would compare as if it were
+ // “eno dnekcab”, so “one” might end up sorting _after_ “two”.
+ b"backend one", // special case: tests byte-order reversal bug
+ b"backend two",
Review Comment:
Can we also please include strings that have the same exact inline prefix
the length differ as well as content ?
Something like
* `LongerThan12Bytes`
* `LongerThan12Bytez`
* `LongerThan12Bytes\0`
* `LongerThan12Byt`
##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -605,29 +605,33 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
/// key("bar\0") = 0x0000000000000000000062617200000004
/// ⇒ key("bar") < key("bar\0")
/// ```
+ ///
+ /// ### Why the old code failed
+ /// Fixed in: <https://github.com/apache/arrow-rs/pull/7875>
+ /// In the previous implementation, we did:
+ /// ```ignore
+ /// let word_be = u128::from_le_bytes(raw.to_le_bytes()).to_be();
+ /// ```
+ /// That single `.to_be()` flips **all 16 bytes**, including the data
region, effectively
+ /// reversing each string. As a result:
+ /// - `"backend one"` was compared as if it were `"eno dnekcab"`,
inverting lexicographical order
+ /// - any multi‑byte string ordering was completely incorrect
#[inline(always)]
pub fn inline_key_fast(raw: u128) -> u128 {
- // Convert the raw u128 (little-endian) into bytes for manipulation
let raw_bytes = raw.to_le_bytes();
- // Extract the length (first 4 bytes), convert to big-endian u32, and
promote to u128
- let len_le = &raw_bytes[0..4];
- let len_be = u32::from_le_bytes(len_le.try_into().unwrap()).to_be() as
u128;
-
- // Extract the inline string bytes (next 12 bytes), place them into
the lower 12 bytes of a 16-byte array,
- // padding the upper 4 bytes with zero to form a little-endian u128
value
- let mut inline_bytes = [0u8; 16];
- inline_bytes[4..16].copy_from_slice(&raw_bytes[4..16]);
-
- // Convert to big-endian to ensure correct lexical ordering
- let inline_u128 = u128::from_le_bytes(inline_bytes).to_be();
+ // Parse native length from LE, then write BE bytes
+ let length = u32::from_le_bytes(raw_bytes[0..4].try_into().unwrap());
Review Comment:
I think most of the other code in this repo extracts the length from a view
using `as u32`. I suggest we follow the same convention
```suggestion
let length = raw as u32;
```
##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -605,29 +605,33 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
/// key("bar\0") = 0x0000000000000000000062617200000004
/// ⇒ key("bar") < key("bar\0")
/// ```
+ ///
+ /// ### Why the old code failed
Review Comment:
I think it would be more helpful to future readers to explain here to
explain how the calculation works, rather than explaining why a previous
attempt didnt' work 🤔
##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -1193,9 +1197,19 @@ mod tests {
b"abcdefghi",
b"abcdefghij",
b"abcdefghijk",
- b"abcdefghijkl", // 12 bytes, max inline
- b"bar",
- b"bar\0", // special case to test length tiebreaker
+ b"abcdefghijkl",
+ //
───────────────────────────────────────────────────────────────────────
+ // This pair verifies that we didn’t accidentally reverse the
inline bytes:
+ // without our fix, “backend one” would compare as if it were
+ // “eno dnekcab”, so “one” might end up sorting _after_ “two”.
+ b"backend one", // special case: tests byte-order reversal bug
Review Comment:
I wonder why this wasn't this caught withthe `xyy` and `xyz` test case?
##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -1193,9 +1197,19 @@ mod tests {
b"abcdefghi",
b"abcdefghij",
b"abcdefghijk",
- b"abcdefghijkl", // 12 bytes, max inline
- b"bar",
- b"bar\0", // special case to test length tiebreaker
+ b"abcdefghijkl",
+ //
───────────────────────────────────────────────────────────────────────
+ // This pair verifies that we didn’t accidentally reverse the
inline bytes:
+ // without our fix, “backend one” would compare as if it were
+ // “eno dnekcab”, so “one” might end up sorting _after_ “two”.
+ b"backend one", // special case: tests byte-order reversal bug
+ b"backend two",
Review Comment:
Also, given the endian swapping going on, can we please also include a few
strings that are more than 256 bytes long (so the length requires 2 bytes to
store)?
##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -605,29 +605,33 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
/// key("bar\0") = 0x0000000000000000000062617200000004
Review Comment:
I think this diagram needs be updated as well
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]