andygrove opened a new issue, #4228: URL: https://github.com/apache/datafusion-comet/issues/4228
### What is the problem the feature request solves?

Comet flattens Parquet dictionary-encoded string columns at the read boundary in `native/core/src/parquet/parquet_support.rs:170` (the `take(values, keys, None)` branch in `parquet_convert_array`). Because `QueryPlanSerde` serializes Spark `StringType` as Arrow `Utf8`, the requested `to_type` is never `Dictionary`, so this branch always fires. Every native expression therefore sees a flat `Utf8` array, even when the source column has very low cardinality, and the benefit of dictionary encoding is lost.

For low-cardinality columns (URL hosts, country codes, status flags, enum-shaped strings), an expression such as `unhex`, `length`, `lower`, `upper`, or `regexp_replace` evaluates per row instead of per unique value, multiplying the work by `N / M`, where `N` is the row count and `M` is the dictionary size.

### Describe the potential solution

Three layers, ordered by feasibility:

1. **Per-expression dict-aware paths.** Many expressions (`substring`, `rlike`, `cast`, the `temporal` kernels) already operate on the dictionary values and reassemble the result. Audit the rest and apply the pattern from `string_funcs/substring.rs:54-60`. Document the pattern in `adding_a_new_expression.md`.
2. **Plan-aware read boundary.** Make `parquet_convert_array`'s flatten-vs-preserve decision a function of whether the immediate downstream expression accepts dictionary input, not just of the requested `to_type`. This needs a small planner pass to thread that hint back to the scan.
3. **Spark-boundary materialization stays.** Anywhere a batch leaves native execution (shuffle, fallback, scan output to the JVM), flatten as today; Spark's row layout has no dictionary shape.

Until (2) lands, the per-expression work in (1) only helps when a dictionary is produced mid-plan, which is rare. Both changes are needed for measurable wins.

### Additional context

Concrete starting point: pick one common expression with high expected savings (e.g. `length` or `lower`), add the dict path, and microbenchmark it on a 1M-row column with ~100 unique values.
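To make the per-unique-value pattern concrete, here is a minimal, dependency-free sketch of the dict-aware path versus the flattened path. The `DictArray` struct and method names are hypothetical stand-ins for Arrow's `DictionaryArray`, not Comet's actual types; the point is only that the dict path calls the string kernel `M` times while the flat path calls it `N` times.

```rust
// Hypothetical stand-in for an Arrow DictionaryArray: `keys` index into
// the deduplicated `values` vector. Names are illustrative, not Comet's.
struct DictArray {
    keys: Vec<usize>,
    values: Vec<String>,
}

impl DictArray {
    // Dict-aware path: evaluate `f` once per unique value, reuse the keys.
    fn map_values(&self, f: impl Fn(&str) -> String) -> DictArray {
        DictArray {
            keys: self.keys.clone(),
            values: self.values.iter().map(|v| f(v)).collect(),
        }
    }

    // Flattened path (what the read boundary produces today): per-row work.
    fn map_flat(&self, f: impl Fn(&str) -> String) -> Vec<String> {
        self.keys.iter().map(|&k| f(&self.values[k])).collect()
    }

    // Expand keys + values back into a flat column for comparison.
    fn materialize(&self) -> Vec<String> {
        self.keys.iter().map(|&k| self.values[k].clone()).collect()
    }
}

fn main() {
    // 12 rows, 3 unique values: the dict path runs `lower` 3 times,
    // the flat path runs it 12 times; the results are identical.
    let arr = DictArray {
        keys: vec![0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
        values: vec!["US".into(), "DE".into(), "JP".into()],
    };
    let dict_result = arr.map_values(|s| s.to_lowercase()).materialize();
    let flat_result = arr.map_flat(|s| s.to_lowercase());
    assert_eq!(dict_result, flat_result);
    println!("{:?}", &dict_result[..3]); // prints ["us", "de", "jp"]
}
```

In Comet itself the same reassembly is done with Arrow's typed arrays (as in `string_funcs/substring.rs`), but the shape of the optimization is the one above.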
If the speedup is in the 5-15% range expected for string-heavy TPC-DS-style queries, that is enough signal to justify the planner work in (2).

Came up while reviewing #4222.
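A throwaway harness for the suggested microbenchmark could look like the sketch below (this is not Comet's benchmark infrastructure; a real measurement should use criterion and Arrow arrays). It builds a 1M-row column with 100 unique values and times `lower` on the flat path (1M calls) against the dict path (100 calls plus a gather).

```rust
use std::time::Instant;

fn main() {
    // ~1M rows, ~100 unique values, mirroring the suggested benchmark shape.
    let n = 1_000_000;
    let m = 100;
    let values: Vec<String> = (0..m).map(|i| format!("HOST-{i}.EXAMPLE.COM")).collect();
    let keys: Vec<usize> = (0..n).map(|i| i % m).collect();

    // Flat path: `lower` evaluated per row (N calls).
    let t = Instant::now();
    let flat: Vec<String> = keys.iter().map(|&k| values[k].to_lowercase()).collect();
    let flat_time = t.elapsed();

    // Dict path: `lower` evaluated per unique value (M calls), then
    // reassembled by indexing with the unchanged keys.
    let t = Instant::now();
    let mapped: Vec<String> = values.iter().map(|v| v.to_lowercase()).collect();
    let dict: Vec<&str> = keys.iter().map(|&k| mapped[k].as_str()).collect();
    let dict_time = t.elapsed();

    // Both paths must produce the same column.
    assert!(flat.iter().zip(dict.iter()).all(|(a, b)| a == b));
    println!("flat: {flat_time:?}, dict: {dict_time:?}");
}
```

The dict path's cost is dominated by the final gather, so the gap widens with the per-value cost of the expression (`regexp_replace` much more than `length`).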
