pierrebelzile opened a new issue, #10160:
URL: https://github.com/apache/arrow-rs/issues/10160

   ### Describe the bug
   
   When we concatenate 2 or more batches where some columns are dictionaries, 
the dictionaries are concatenated instead of being merged.
   
   The dictionary may end up with duplicates. For example, if both batches 
contain the string "alpha", the concatenated batch will have 2 dictionary 
entries for that string. The result is strictly correct (all indices point to 
their original value). However, any library that operates directly on the 
indices (e.g., an aggregation) will obtain a wrong result.
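
To make the aggregation problem concrete, here is a minimal sketch that models a dictionary column as plain vectors rather than arrow types. The helper names `count_by_key` and `count_by_value` are hypothetical, and the example assumes that plain concatenation appends the second dictionary and offsets its keys by the length of the first.

```rust
use std::collections::BTreeMap;

/// Count rows per dictionary key, as an aggregation that groups on the
/// indices (without looking at the values) would do.
fn count_by_key(keys: &[usize]) -> BTreeMap<usize, usize> {
    let mut counts = BTreeMap::new();
    for &k in keys {
        *counts.entry(k).or_insert(0) += 1;
    }
    counts
}

/// Count rows per decoded value, which is the answer a user expects.
fn count_by_value<'a>(values: &[&'a str], keys: &[usize]) -> BTreeMap<&'a str, usize> {
    let mut counts = BTreeMap::new();
    for &k in keys {
        *counts.entry(values[k]).or_insert(0) += 1;
    }
    counts
}
```

With values `["alpha", "beta", "gamma", "gamma", "alpha", "beta"]` and keys `[0, 1, 2, 0, 5, 4, 3, 5]`, grouping on keys yields six groups, while grouping on decoded values yields three.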
   
   Perhaps more direct: pandas will reject the batch because it validates 
uniqueness:
   ```
   lib/python3.11/site-packages/pandas/core/dtypes/dtypes.py", line 570, in validate_categories
       raise ValueError("Categorical categories must be unique")
   ```
   
   Pandas does have a function (`union_categoricals`) to combine categorical 
columns that have different dictionaries, but it is not intended to 
deduplicate the dictionary of a single column.
   
   ### To Reproduce
   
   ```
   //! Concatenation tests for dictionary arrays.
   use std::sync::Arc;
   
   use arrow::{
       array::{Array, ArrayRef, AsArray, DictionaryArray, Int32Array, 
RecordBatch, StringArray},
       compute::concat_batches,
       datatypes::{DataType, Field, Int32Type, Schema},
   };
   
   /// Build a dictionary array with explicit dictionary value order and key values.
   fn dictionary_array(dictionary_values: Vec<&str>, keys: Vec<i32>) -> 
ArrayRef {
       Arc::new(
           DictionaryArray::<Int32Type>::try_new(
               Int32Array::from(keys),
               Arc::new(StringArray::from(dictionary_values)),
           )
           .expect("dictionary array"),
       )
   }
   
   /// Build a one-column record batch containing a dictionary array.
   fn dictionary_batch(
       schema: Arc<Schema>,
       dictionary_values: Vec<&str>,
       keys: Vec<i32>,
   ) -> RecordBatch {
       RecordBatch::try_new(schema, vec![dictionary_array(dictionary_values, 
keys)])
           .expect("record batch")
   }
   
   /// This test will start to fail once concat merges dictionaries instead of appending them.
   #[test]
   fn concat_then_normalize_deduplicates_dictionary_values_and_remaps_keys() {
       let schema = Arc::new(Schema::new(vec![Field::new(
           "symbol",
           DataType::Dictionary(Box::new(DataType::Int32), 
Box::new(DataType::Utf8)),
           false,
       )]));
   
       let batch_0 = dictionary_batch(
           schema.clone(),
           vec!["alpha", "beta", "gamma"],
           vec![0, 1, 2, 0],
       );
       let batch_1 = dictionary_batch(
           schema.clone(),
           vec!["gamma", "alpha", "beta"],
           vec![2, 1, 0, 2],
       );
   
       let raw_concatenated = concat_batches(&schema, &[batch_0, 
batch_1]).expect("concat batches");
       let raw_column = raw_concatenated.column(0).as_dictionary::<Int32Type>();
       // this should be 3 because both batches had the same values
       assert_eq!(raw_column.values().len(), 6);
   }
   
   ```
   
   ### Expected behavior
   
   The dictionary should only contain unique entries.
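
As an illustration of the expected result, the following sketch merges dictionaries while concatenating. It uses a toy `DictColumn` struct and a `concat_merged` function, both hypothetical and unrelated to the actual arrow-rs types or its `concat` implementation; it only shows the remapping that a merge implies.

```rust
use std::collections::HashMap;

/// Toy dictionary-encoded column: `keys[i]` indexes into `values`.
/// Hypothetical stand-in for a dictionary array, not an arrow-rs type.
#[derive(Debug, Clone, PartialEq)]
struct DictColumn {
    values: Vec<String>,
    keys: Vec<usize>,
}

/// Concatenate columns while merging their dictionaries, so that each
/// distinct value is stored once and every key is remapped to it.
fn concat_merged(columns: &[DictColumn]) -> DictColumn {
    let mut values: Vec<String> = Vec::new();
    let mut index_of: HashMap<String, usize> = HashMap::new();
    let mut keys: Vec<usize> = Vec::new();

    for column in columns {
        // For each old key of this column, find (or create) its key in
        // the merged dictionary.
        let remap: Vec<usize> = column
            .values
            .iter()
            .map(|v| {
                *index_of.entry(v.clone()).or_insert_with(|| {
                    values.push(v.clone());
                    values.len() - 1
                })
            })
            .collect();
        // Rewrite the column's keys through the remap table.
        keys.extend(column.keys.iter().map(|&k| remap[k]));
    }

    DictColumn { values, keys }
}
```

Applied to the two batches from the reproduction, this produces the dictionary `["alpha", "beta", "gamma"]` and the keys `[0, 1, 2, 0, 1, 0, 2, 1]`, which decode to the same rows as the unmerged result.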
   
   ### Additional context
   
   _No response_

