[GitHub] [arrow-rs] alamb commented on a diff in pull request #1048: Optimized sorting for ordered dictionaries

GitBox Fri, 03 Jun 2022 07:57:19 -0700


alamb commented on code in PR #1048:
URL: https://github.com/apache/arrow-rs/pull/1048#discussion_r889011102



##########
arrow/src/array/array_dictionary.rs:
##########
@@ -163,6 +165,92 @@ impl<'a, K: ArrowPrimitiveType> DictionaryArray<K> {
         self.is_ordered
     }
 
+    /// Returns a DictionaryArray referencing the same data
+    /// with the [DictionaryArray::is_ordered] flag set to `true`.
+    /// Note that this does not actually reorder the values in the dictionary.
+    pub fn as_ordered(&self) -> Self {

Review Comment:
   I wonder if this API should be marked as `unsafe` as it relies on the user 
"doing the right thing"? 
   
   Perhaps we could call it something like `pub unsafe fn 
as_orderd_unchecked()` ?



##########
arrow/src/array/array_dictionary.rs:
##########
@@ -163,6 +165,92 @@ impl<'a, K: ArrowPrimitiveType> DictionaryArray<K> {
         self.is_ordered
     }
 
+    /// Returns a DictionaryArray referencing the same data
+    /// with the [DictionaryArray::is_ordered] flag set to `true`.
+    /// Note that this does not actually reorder the values in the dictionary.
+    pub fn as_ordered(&self) -> Self {
+        Self {
+            data: self.data.clone(),
+            values: self.values.clone(),
+            keys: PrimitiveArray::<K>::from(self.keys.data().clone()),
+            is_ordered: true,
+        }
+    }
+
+    pub fn make_ordered(&self) -> Result<Self> {
+        let values = self.values();
+        if self.is_ordered || values.is_empty() {
+            Ok(self.as_ordered())
+        } else {
+            // validate up front that we can do conversions from/to usize for 
the whole range of keys
+            // this allows using faster unchecked conversions below
+            K::Native::from_usize(values.len())

Review Comment:
   👍 



##########
arrow/src/compute/kernels/sort.rs:
##########
@@ -151,8 +152,16 @@ fn partition_validity(array: &ArrayRef) -> (Vec<u32>, 
Vec<u32>) {
         // faster path
         0 => ((0..(array.len() as u32)).collect(), vec![]),
         _ => {
-            let indices = 0..(array.len() as u32);
-            indices.partition(|index| array.is_valid(*index as usize))
+            let validity = array.data().null_buffer().unwrap();

Review Comment:
   @tustvold  / @viirya  can you help verify this change?



##########
arrow/src/array/ord.rs:
##########
@@ -92,13 +93,22 @@ where
     let left_values = StringArray::from(left.values().data().clone());
     let right_values = StringArray::from(right.values().data().clone());
 
-    Box::new(move |i: usize, j: usize| {
-        let key_left = left_keys.value(i).to_usize().unwrap();
-        let key_right = right_keys.value(j).to_usize().unwrap();
-        let left = left_values.value(key_left);
-        let right = right_values.value(key_right);
-        left.cmp(right)
-    })
+    // only compare by keys if both arrays actually point to the same value 
buffers
+    if left.is_ordered() && ArrayData::ptr_eq(left_values.data(), 
right_values.data()) {

Review Comment:
   I don't understand why we also need to check `left.is_ordered()` if we know 
the arrays actually point to the same underlying values.



##########
arrow/src/compute/kernels/sort.rs:
##########
@@ -1210,10 +1236,16 @@ mod tests {
     fn test_sort_string_dict_arrays<T: ArrowDictionaryKeyType>(
         data: Vec<Option<&str>>,
         options: Option<SortOptions>,
+        ordered: bool,
         limit: Option<usize>,
         expected_data: Vec<Option<&str>>,
     ) {
-        let array = data.into_iter().collect::<DictionaryArray<T>>();
+        let mut array = data.into_iter().collect::<DictionaryArray<T>>();
+
+        if ordered {
+            array = array.as_ordered();

Review Comment:
   Is this supposed to call `array.make_ordered()`? I don't understand why the 
test would pass in a dictionary whose values appear to be unsorted and then 
mark them `as_sorted`
   
   Sorry if I am missing something obvious



##########
arrow/src/array/array_dictionary.rs:
##########
@@ -163,6 +165,92 @@ impl<'a, K: ArrowPrimitiveType> DictionaryArray<K> {
         self.is_ordered
     }
 
+    /// Returns a DictionaryArray referencing the same data
+    /// with the [DictionaryArray::is_ordered] flag set to `true`.
+    /// Note that this does not actually reorder the values in the dictionary.
+    pub fn as_ordered(&self) -> Self {
+        Self {
+            data: self.data.clone(),
+            values: self.values.clone(),
+            keys: PrimitiveArray::<K>::from(self.keys.data().clone()),
+            is_ordered: true,
+        }
+    }
+
+    pub fn make_ordered(&self) -> Result<Self> {
+        let values = self.values();
+        if self.is_ordered || values.is_empty() {
+            Ok(self.as_ordered())
+        } else {
+            // validate up front that we can do conversions from/to usize for 
the whole range of keys
+            // this allows using faster unchecked conversions below
+            K::Native::from_usize(values.len())
+                .ok_or(ArrowError::DictionaryKeyOverflowError)?;
+            // sort indices are u32 so we cannot sort larger dictionaries
+            u32::try_from(values.len())
+                .map_err(|_| ArrowError::DictionaryKeyOverflowError)?;
+
+            // sort the dictionary values
+            let sort_indices = sort_to_indices(values, None, None)?;
+            let sorted_dictionary = take(
+                values.as_ref(),
+                &sort_indices,
+                Some(TakeOptions {
+                    check_bounds: false,
+                }),
+            )?;
+
+            // build a lookup table from old to new key
+            let mut lookup = vec![0; sort_indices.len()];
+            sort_indices
+                .values()
+                .iter()
+                .enumerate()
+                .for_each(|(i, idx)| {
+                    lookup[*idx as usize] = i;
+                });
+
+            let mapped_keys_iter = self.keys_iter().map(|opt_key| {
+                if let Some(key) = opt_key {
+                    // Safety:
+                    // lookup has the same length as the dictionary values
+                    // so if the keys were valid for values they will be valid 
indices into lookup
+                    unsafe {
+                        debug_assert!(key < lookup.len());
+                        let new_key = *lookup.get_unchecked(key);
+                        debug_assert!(new_key < values.len());
+                        K::Native::from_usize(new_key).unwrap_unchecked()
+                    }
+                } else {
+                    K::default_value()
+                }
+            });
+
+            // Safety:
+            // PrimitiveIter has a trusted len
+            let new_key_buffer =
+                unsafe { Buffer::from_trusted_len_iter(mapped_keys_iter) };
+
+            // Safety:
+            // after remapping the keys will be in the same range as before
+            let new_data = unsafe {
+                ArrayData::new_unchecked(
+                    self.data_type().clone(),
+                    self.len(),
+                    Some(self.data.null_count()),

Review Comment:
   It makes sense the nullness is the same before and after sorting. I assume 
it was faster to do this than to create the new values buffer directly from an 
iterator of `Option<K::Native>`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-rs] alamb commented on a diff in pull request #1048: Optimized sorting for ordered dictionaries

Reply via email to