pierrebelzile opened a new issue, #10160:
URL: https://github.com/apache/arrow-rs/issues/10160
### Describe the bug
When we concatenate two or more batches where some columns are dictionaries,
the dictionaries are concatenated instead of being merged.
The resulting dictionary may end up with duplicates. For example, if both
batches have the string "alpha", the concatenated batch will have two
dictionary entries for that string. The result is strictly correct (all
indices still point to their original values). However, any library that
operates directly on the indices (e.g., an aggregation) will obtain a wrong
result.
Perhaps more direct: pandas will reject the batch because it validates
uniqueness:
```
lib/python3.11/site-packages/pandas/core/dtypes/dtypes.py", line 570, in validate_categories
    raise ValueError("Categorical categories must be unique")
```
pandas does have a function (`union_categoricals`) to combine categoricals
with different dictionaries, but it is not intended to deduplicate the
dictionary of a single dataframe.
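
To illustrate the aggregation problem described above, here is a minimal sketch that uses plain slices in place of arrow arrays. The helper names `count_by_key` and `count_by_value` are hypothetical, and the example assumes the second batch's keys are simply offset by the length of the first dictionary when the dictionaries are concatenated.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: count rows per dictionary key, as a consumer that
/// trusts dictionary entries to be unique might do.
fn count_by_key(keys: &[i32]) -> BTreeMap<i32, usize> {
    let mut counts = BTreeMap::new();
    for &k in keys {
        *counts.entry(k).or_insert(0) += 1;
    }
    counts
}

/// Hypothetical helper: count rows per decoded value, which is the grouping
/// a user would expect.
fn count_by_value(values: &[&str], keys: &[i32]) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for &k in keys {
        *counts.entry(values[k as usize].to_string()).or_insert(0) += 1;
    }
    counts
}
```

With a concatenated dictionary `["alpha", "beta", "gamma", "gamma", "alpha", "beta"]` and keys `[0, 1, 2, 0, 5, 4, 3, 5]`, grouping by key yields six groups, while grouping by value yields the three groups a user would expect.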
### To Reproduce
```
//! Concatenation tests for dictionary arrays.
use std::sync::Arc;

use arrow::{
    array::{Array, ArrayRef, AsArray, DictionaryArray, Int32Array, RecordBatch, StringArray},
    compute::concat_batches,
    datatypes::{DataType, Field, Int32Type, Schema},
};

/// Build a dictionary array with explicit dictionary value order and key values.
fn dictionary_array(dictionary_values: Vec<&str>, keys: Vec<i32>) -> ArrayRef {
    Arc::new(
        DictionaryArray::<Int32Type>::try_new(
            Int32Array::from(keys),
            Arc::new(StringArray::from(dictionary_values)),
        )
        .expect("dictionary array"),
    )
}

/// Build a one-column record batch containing a dictionary array.
fn dictionary_batch(
    schema: Arc<Schema>,
    dictionary_values: Vec<&str>,
    keys: Vec<i32>,
) -> RecordBatch {
    RecordBatch::try_new(schema, vec![dictionary_array(dictionary_values, keys)])
        .expect("record batch")
}

/// This test will start to fail when arrow dictionary concat is supported.
#[test]
fn concat_then_normalize_deduplicates_dictionary_values_and_remaps_keys() {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "symbol",
        DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)),
        false,
    )]));
    let batch_0 = dictionary_batch(
        schema.clone(),
        vec!["alpha", "beta", "gamma"],
        vec![0, 1, 2, 0],
    );
    let batch_1 = dictionary_batch(
        schema.clone(),
        vec!["gamma", "alpha", "beta"],
        vec![2, 1, 0, 2],
    );
    let raw_concatenated =
        concat_batches(&schema, &[batch_0, batch_1]).expect("concat batches");
    let raw_column = raw_concatenated.column(0).as_dictionary::<Int32Type>();
    // This should be 3 because both batches had the same values.
    assert_eq!(raw_column.values().len(), 6);
}
```
### Expected behavior
The dictionary should only contain unique entries.
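
As a rough sketch of the expected merge semantics (not the arrow-rs implementation), the following uses plain vectors as stand-ins for dictionary values and keys. The function `merge_dictionaries` is a hypothetical name, and the sketch assumes non-null keys that are valid indices into their own batch's dictionary.

```rust
use std::collections::HashMap;

/// Hypothetical helper: merge per-batch dictionaries into one dictionary with
/// unique values. Each input is (dictionary values, keys); keys index into
/// that batch's values. Returns the merged dictionary and the concatenated,
/// remapped keys. Nulls are not handled in this sketch.
fn merge_dictionaries(batches: &[(Vec<&str>, Vec<i32>)]) -> (Vec<String>, Vec<i32>) {
    let mut merged_values: Vec<String> = Vec::new();
    let mut index_of: HashMap<String, i32> = HashMap::new();
    let mut merged_keys: Vec<i32> = Vec::new();

    for (values, keys) in batches {
        // For each old key in this batch, find (or assign) its key in the
        // merged dictionary.
        let remap: Vec<i32> = values
            .iter()
            .map(|v| {
                *index_of.entry(v.to_string()).or_insert_with(|| {
                    merged_values.push(v.to_string());
                    (merged_values.len() - 1) as i32
                })
            })
            .collect();
        // Rewrite the batch's keys so they point into the merged dictionary.
        merged_keys.extend(keys.iter().map(|&k| remap[k as usize]));
    }
    (merged_values, merged_keys)
}
```

Applied to the two batches from the reproduction, this yields a three-entry dictionary, and every remapped key still decodes to its original string.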
### Additional context
_No response_