Andrew Lamb created ARROW-10159:
-----------------------------------

             Summary: [Rust][DataFusion] Add support for Dictionary types in data fusion
                 Key: ARROW-10159
                 URL: https://issues.apache.org/jira/browse/ARROW-10159
             Project: Apache Arrow
          Issue Type: New Feature
            Reporter: Andrew Lamb

We have a system that needs to process low-cardinality string data (that is, there are only a few distinct values, but many millions of rows). Using a `StringArray` is very expensive because the same string value is copied over and over again.

The `DictionaryArray` was designed for exactly this situation: rather than repeating each string, the data stores indexes into a dictionary and thus repeats only integer values.

Sadly, DataFusion does not support processing `DictionaryArray` types, for several reasons. This test (to be added to `arrow/rust/datafusion/tests/sql.rs`) shows what I would like to be possible:

{code}
#[tokio::test]
async fn query_on_string_dictionary() -> Result<()> {
    // ensure that data fusion can operate on dictionary types
    // Use StringDictionary (32 bit indexes = keys)
    let field_type = DataType::Dictionary(
        Box::new(DataType::Int32),
        Box::new(DataType::Utf8),
    );
    let schema = Arc::new(Schema::new(vec![Field::new("d1", field_type, true)]));

    let keys_builder = PrimitiveBuilder::<Int32Type>::new(10);
    let values_builder = StringBuilder::new(10);
    let mut builder = StringDictionaryBuilder::new(
        keys_builder, values_builder
    );

    builder.append("one")?;
    builder.append_null()?;
    builder.append("three")?;
    let array = Arc::new(builder.finish());

    let data = RecordBatch::try_new(
        schema.clone(),
        vec![array],
    )?;

    let table = MemTable::new(schema, vec![vec![data]])?;
    let mut ctx = ExecutionContext::new();
    ctx.register_table("test", Box::new(table));

    // Basic SELECT
    let sql = "SELECT * FROM test";
    let actual = execute(&mut ctx, sql).await.join("\n");
    let expected = "\"one\"\nNULL\n\"three\"".to_string();
    assert_eq!(expected, actual);

    // basic filtering
    let sql = "SELECT * FROM test WHERE d1 IS NOT NULL";
    let actual = execute(&mut ctx, sql).await.join("\n");
    let expected = "\"one\"\n\"three\"".to_string();
    assert_eq!(expected, actual);

    // filtering with constant
    let sql = "SELECT * FROM test WHERE d1 = 'three'";
    let actual = execute(&mut ctx, sql).await.join("\n");
    let expected = "\"three\"".to_string();
    assert_eq!(expected, actual);

    // Expression evaluation
    let sql = "SELECT concat(d1, '-foo') FROM test";
    let actual = execute(&mut ctx, sql).await.join("\n");
    let expected = "\"one-foo\"\nNULL\n\"three-foo\"".to_string();
    assert_eq!(expected, actual);

    // aggregation
    let sql = "SELECT COUNT(d1) FROM test";
    let actual = execute(&mut ctx, sql).await.join("\n");
    let expected = "2".to_string();
    assert_eq!(expected, actual);

    Ok(())
}
{code}

However, it errors immediately:

{code}
---- query_on_string_dictionary stdout ----
thread 'query_on_string_dictionary' panicked at 'assertion failed: `(left == right)`
  left: `"\"one\"\nNULL\n\"three\""`,
 right: `"???\nNULL\n???"`', datafusion/tests/sql.rs:989:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
{code}

This ticket tracks adding proper support for Dictionary types to DataFusion. I will break the work down into several smaller subtasks.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
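
For readers unfamiliar with dictionary encoding, the idea described in this issue (store each distinct string once and repeat integer keys per row) can be sketched in plain Rust. This is only an illustration under stated assumptions: `DictColumn`, `append`, `append_null`, `get`, and `build_example` are hypothetical names invented here, and the code uses only the standard library; it is not Arrow's `DictionaryArray` or `StringDictionaryBuilder`.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for a dictionary-encoded string column.
/// It only illustrates the layout; it is not Arrow's `DictionaryArray`.
#[derive(Debug, Default)]
struct DictColumn {
    /// Each distinct string is stored exactly once.
    values: Vec<String>,
    /// One key per row, indexing into `values`; `None` marks a NULL row.
    keys: Vec<Option<i32>>,
    /// Reverse lookup (string -> key), used only while building.
    lookup: HashMap<String, i32>,
}

impl DictColumn {
    /// Append a row, reusing the existing key if the string was seen before.
    fn append(&mut self, s: &str) {
        let key = match self.lookup.get(s) {
            Some(&k) => k,
            None => {
                let k = self.values.len() as i32;
                self.values.push(s.to_string());
                self.lookup.insert(s.to_string(), k);
                k
            }
        };
        self.keys.push(Some(key));
    }

    /// Append a NULL row; the dictionary itself is untouched.
    fn append_null(&mut self) {
        self.keys.push(None);
    }

    /// Decode row `i` back to its string, or `None` for a NULL row.
    fn get(&self, i: usize) -> Option<&str> {
        self.keys[i].map(|k| self.values[k as usize].as_str())
    }
}

/// Build a five-row column that contains only two distinct strings.
fn build_example() -> DictColumn {
    let mut col = DictColumn::default();
    for s in ["one", "three", "one", "one"] {
        col.append(s);
    }
    col.append_null();
    col
}
```

In this sketch the five rows share two stored strings, so the per-row cost is one integer key rather than a full copy of the string, which is the saving the issue is after for low-cardinality data.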