andygrove opened a new issue, #4228:
URL: https://github.com/apache/datafusion-comet/issues/4228

   ### What is the problem the feature request solves?
   
   Comet flattens Parquet dictionary-encoded string columns at the read 
boundary in `native/core/src/parquet/parquet_support.rs:170` (the `take(values, 
keys, None)` branch in `parquet_convert_array`). Because `QueryPlanSerde` 
serializes Spark `StringType` as Arrow `Utf8`, the requested `to_type` is never 
`Dictionary`, so this branch always fires. Every native expression sees a flat 
`Utf8` array, even when the source column has very low cardinality.
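   A minimal sketch of that read-boundary behavior (not the actual Comet source; the key type is hard-coded to `Int32` for brevity):

```rust
use arrow::array::{ArrayRef, DictionaryArray};
use arrow::compute::take;
use arrow::datatypes::Int32Type;
use arrow::error::Result;

/// Gather one output row per key, materializing a flat array of N rows.
/// All dictionary compression is gone before any expression runs.
fn flatten_dictionary(dict: &DictionaryArray<Int32Type>) -> Result<ArrayRef> {
    take(dict.values().as_ref(), dict.keys(), None)
}
```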
   
   Flattening like this forfeits the dictionary advantage. For low-cardinality columns (URL hosts, country codes, status flags, enum-shaped strings), an expression like `unhex`, `length`, `lower`, `upper`, or `regexp_replace` evaluates per row instead of per unique value, multiplying the work by `N / M`, where `N` is the row count and `M` is the dictionary size. For 1M rows over 100 distinct values, that is 10,000x more kernel invocations than necessary.
   
   ### Describe the potential solution
   
   Three layers, ordered by feasibility:
   
   1. **Per-expression dict-aware paths.** Many expressions (`substring`, `rlike`, `cast`, the `temporal` kernels) already operate on the dictionary values and reassemble the result with the original keys. Audit the rest and apply the pattern from `string_funcs/substring.rs:54-60` (a generic sketch of the pattern follows after this list). Document the pattern in `adding_a_new_expression.md`.
   
   2. **Plan-aware read boundary.** Make `parquet_convert_array`'s flatten-vs-preserve decision a function of whether the immediate downstream expression accepts dictionary input, not just of the requested `to_type`. This needs a small planner pass to thread that hint back to the scan.
   
   3. **Spark-boundary materialization stays.** Wherever a batch leaves native execution (shuffle, fallback to Spark, scan output to the JVM), keep flattening as today; Spark's row layout has no dictionary representation.
   
   Until (2) lands, the per-expression work in (1) only helps when a dictionary array is produced mid-plan, which is rare. Both layers are needed for measurable wins.
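   A minimal sketch of the dict-aware pattern from (1), assuming a hypothetical `apply_dict_aware` helper (the name, the closure-based kernel, and the hard-coded `Int32` key type are all illustrative, not Comet's actual API):

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, DictionaryArray};
use arrow::datatypes::{DataType, Int32Type};
use arrow::error::Result;

/// Hypothetical helper: run `kernel` once over the dictionary's M unique
/// values and reassemble with the original keys, instead of flattening to
/// N rows first.
fn apply_dict_aware<F>(array: &ArrayRef, kernel: F) -> Result<ArrayRef>
where
    F: Fn(&dyn Array) -> Result<ArrayRef>,
{
    match array.data_type() {
        DataType::Dictionary(key_type, _) if **key_type == DataType::Int32 => {
            let dict = array
                .as_any()
                .downcast_ref::<DictionaryArray<Int32Type>>()
                .expect("checked key type above");
            // Per-unique-value evaluation: M kernel rows instead of N.
            let new_values = kernel(dict.values().as_ref())?;
            let rebuilt = DictionaryArray::try_new(dict.keys().clone(), new_values)?;
            Ok(Arc::new(rebuilt))
        }
        // Other key types omitted for brevity; non-dictionary input is
        // evaluated per row as before.
        _ => kernel(array.as_ref()),
    }
}
```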
   
   ### Additional context
   
   Concrete starting point: pick one common expression with a large potential saving on low-cardinality input (e.g. `length` or `lower`), add the dict path, and microbenchmark on a 1M-row column with ~100 unique values. If the query-level speedup lands in the 5-15% range expected for string-heavy TPC-DS-style queries, that is enough signal to justify the planner work in (2).
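   A rough sketch of that microbenchmark in plain arrow-rs, with `length` as the expression under test (`std::time` instead of a proper harness like criterion, purely for illustration):

```rust
use std::sync::Arc;
use std::time::Instant;
use arrow::array::{DictionaryArray, Int32Array, StringArray};
use arrow::compute::kernels::length::length;
use arrow::compute::take;
use arrow::datatypes::Int32Type;

fn main() -> arrow::error::Result<()> {
    // 1M rows drawn from ~100 unique values.
    let values = StringArray::from_iter_values((0..100).map(|i| format!("value-{i:03}")));
    let keys = Int32Array::from_iter_values((0..1_000_000).map(|i| i % 100));
    let dict = DictionaryArray::<Int32Type>::try_new(keys, Arc::new(values))?;

    // Flat path (today's behavior): materialize 1M strings, then run the
    // kernel over every row.
    let flat = take(dict.values().as_ref(), dict.keys(), None)?;
    let start = Instant::now();
    let _flat_lengths = length(flat.as_ref())?;
    println!("flat path: {:?}", start.elapsed());

    // Dict path: run the kernel over the ~100 values and reuse the keys.
    let start = Instant::now();
    let value_lengths = length(dict.values().as_ref())?;
    let _dict_lengths = DictionaryArray::try_new(dict.keys().clone(), value_lengths)?;
    println!("dict path: {:?}", start.elapsed());

    Ok(())
}
```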
   
   Came up while reviewing #4222.

