[I] Document / Add an example of preserving dictionary encoding when reading parquet [arrow-rs]

via GitHub Mon, 05 Jan 2026 04:25:34 -0800


alamb opened a new issue, #9095:
URL: https://github.com/apache/arrow-rs/issues/9095


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   This has come up several times, most recently on the arrow mailing list:
   
   https://lists.apache.org/thread/5kg3q0y4cqzl16q6vrvkxlw0yxmk4241
   
   > Discussing how to expose dictionary data may lead to multiple overlapping
   considerations, long discussions and perhaps format and API changes. So we
   hope that there could be some loopholes or small change that could
   potentially unblock such optimization without going into a large design/API
   space. For instance:
   
   > 1. Can we introduce a hint to ParquetReader which will produce
   > DictionaryArray for the given column instead of a concrete array
   > (StringViewArray in our case)?
   > 2. When doing late materialization, maybe we can extend ArrowPredicate,
   > so that it first instructs Parquet reader that it wants to get encoded
   > dictionaries first, and once they are supplied, return another predicate
   > that will be applied to encoded data. E.g., "x = some_value" translates to
   > "x_encoded = index".
   
   @tustvold  pointed out:
   
   > What you are requesting is already supported in parquet-rs. In
   > particular if you request a UTF8 or Binary DictionaryArray for the
   > column it will decode the column preserving the dictionary encoding. You
   >  can override the embedded arrow schema, if any, using
   > ArrowReaderOptions::with_schema [1]. Provided you don't read RecordBatch
   > across row groups and therefore across dictionaries, which the async
   > reader doesn't, this should never materialize the dictionary. FWIW the
   > ViewArray decoders will also preserve the dictionary encoding,
   > however, the dictionary encoded nature is less explicit in the resulting
   > arrays.
   
   The API documentation does have an example, but it shows how to read an i32 
column as a timestamp rather than how to preserve dictionary encoding:
   
https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderOptions.html#method.with_schema
   
   **Describe the solution you'd like**
   I would like these features to be better documented:
   1. An example showing how to override the schema of the parquet reader to 
keep the dictionary encoding
   
   The example should mention that the dictionary encoding is preserved even 
when the original data was not dictionary encoded
   
   It woul
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
