kumarUjjawal commented on issue #20011:
URL: https://github.com/apache/datafusion/issues/20011#issuecomment-3809196861
@Jefffrey PR #14227 intentionally drops dict_id from DataFusion’s protobuf
schema (Arrow deprecated it and it isn’t stable/meaningful schema metadata).
I looked further into other ways to resolve this in
`ScalarNestedValue` (ScalarValue list/struct/map), where we serialize via Arrow
IPC: IPC still needs dict IDs, but they’re assigned during schema encoding and
aren’t carried in our protobuf Schema.
I was thinking we keep proto free of `dict_id` and treat dict IDs as an
internal IPC detail:
1. when encoding nested scalars, seed DictionaryTracker by encoding the
schema first, then encode the batch;
2. when decoding, reconstruct an IPC schema from the protobuf schema
(round-trip through arrow-ipc) and use `arrow_ipc::reader::read_dictionary` to
build `dict_by_id` before reading the record batch.
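
To illustrate why both steps can agree without `dict_id` in proto, here is a
toy model in plain Rust. It is not arrow-ipc code: `ToyField`,
`assign_dict_ids` and `sample_schema` are hypothetical names. It assumes dict
IDs are handed out sequentially in depth-first field order during schema
encoding; whether arrow-ipc uses exactly that order is not verified here.

```rust
// Toy model only: these types are hypothetical stand-ins, not arrow-rs APIs.

/// A simplified schema field: a plain leaf, a dictionary-encoded leaf,
/// or a nested type (struct/list/map) with child fields.
#[derive(Debug, Clone, PartialEq)]
pub enum ToyField {
    Plain(String),
    Dictionary(String),
    Nested(String, Vec<ToyField>),
}

/// Walk the fields depth-first and hand out sequential dictionary IDs.
/// Returns (field name, dict id) pairs in traversal order.
/// Assumption: the real encoder assigns IDs in a similarly deterministic way.
pub fn assign_dict_ids(fields: &[ToyField]) -> Vec<(String, i64)> {
    fn walk(field: &ToyField, next_id: &mut i64, out: &mut Vec<(String, i64)>) {
        match field {
            // Plain leaves never get a dictionary ID.
            ToyField::Plain(_) => {}
            // Each dictionary field takes the next free ID.
            ToyField::Dictionary(name) => {
                out.push((name.clone(), *next_id));
                *next_id += 1;
            }
            // Nested fields only recurse into their children.
            ToyField::Nested(_, children) => {
                for child in children {
                    walk(child, next_id, out);
                }
            }
        }
    }

    let mut next_id = 0;
    let mut out = Vec::new();
    for field in fields {
        walk(field, &mut next_id, &mut out);
    }
    out
}

/// Hypothetical sample: a struct with one dictionary child, then a
/// top-level dictionary field.
pub fn sample_schema() -> Vec<ToyField> {
    vec![
        ToyField::Plain("id".to_string()),
        ToyField::Nested(
            "attrs".to_string(),
            vec![
                ToyField::Plain("count".to_string()),
                ToyField::Dictionary("tag".to_string()),
            ],
        ),
        ToyField::Dictionary("label".to_string()),
    ]
}
```

In this model, calling `assign_dict_ids` on the writer's schema and again on
the schema rebuilt by the reader gives the same mapping, so the IDs never need
to travel in the protobuf message. The real change would depend on arrow-ipc
behaving the same way on both sides.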
Do you have any thoughts on this?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]