Hi,

I’ve also had quite a few thoughts on this, as it is somewhat strange at the moment (within the context of Acero at least) that e.g. IntegerDictionary is not the same type as Integer, meaning that we have to either cast manually between the two or reject any operation that mixes them. I was mucking around with implementing my own Arrow type system (just to see how I’d design it myself) and came up with a “three-level” type system.
Specifically we have:

- Logical type: is it an int, float, decimal, timestamp, struct, utf8, etc. A schema only specifies the logical types of fields.
- Physical type: a physical instantiation of a logical type. This would parameterize the logical type with things like bit widths, precision, timestamp units, offset size, etc. Every element within an array must have the same physical type, but batches with different physical types may conform to the same schema.
- Array type: the physical layout of the arrays themselves, i.e. how many buffers, what each buffer represents, etc. This also includes stuff like RLE and Dictionary array types, and this is where other encodings would go. RLE, for example, just has a `run_lengths` buffer and a child array. Dictionary is similar.

This is a very powerful way of composing encodings, as you could now have an RLE array with a child dictionary array which itself has a child RLE array (or something like that). The warts appear when some logical types are amenable to some encodings and others are not: bit packing, delta, and FOR, for example, work for integers but not strings, but I think this can be worked out fairly easily.

Now, Arrow currently specifies physical types directly in the schema, which is fine; I think that’s how most database type systems work. However, given that Dict<Int8> is a different type from Int8, it seems that Arrow conflates the bottom two levels. I think what we really need to do is refactor the type system to separate out the array type from the physical type. Schemas will deal in physical types, and actual materialized batches will have an associated array type.

Let me know if I need to clarify anything, that was a lot of text :)

Sasha Krassovsky

> On Jul 29, 2022, at 4:18 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>
> of the implementation when it comes to the IPC format and the C
> interface.
>