Hi,
I’ve also had quite a few thoughts on this, as it is somewhat strange at the 
moment (within the context of Acero at least) that e.g. IntegerDictionary is 
not the same type as an Integer, meaning that we have to manually cast between 
the two or reject any operation that mixes them. I was mucking around with 
implementing my own Arrow type system (just to see how I’d design it myself) and 
came up with a “three-level” type system. 

Specifically we have: 
- Logical type: is it an int, float, decimal, timestamp, struct, utf8, etc. A 
schema only specifies the logical types of fields.
- Physical type: a physical instantiation of a logical type. This would 
parameterize the logical type with things like bit widths, precision, timestamp 
units, offset size, etc. Every element within an array must have the same 
physical type, but batches with different physical types may conform to the 
same schema. 
- Array type: the physical layout of the arrays themselves, i.e. how many 
buffers, what each buffer represents, etc. This also includes stuff like RLE 
and Dictionary array types, and this is where other encodings would go. RLE for 
example just has a `run_lengths` buffer and a child array. Dictionary is 
similar. This is a very powerful way of composing encodings, as you could 
now have an RLE array with a child dictionary array which itself has a child 
RLE array (or something like that).
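
To make the three levels concrete, here is a minimal sketch in Python. All 
class and function names are hypothetical and are not Arrow's API; values are 
restricted to integers, and "buffers" are plain Python lists, just to keep the 
example short. It shows an RLE array wrapping a dictionary array, and how the 
physical type stays the same regardless of the encoding stacked on top.

```python
from dataclasses import dataclass
from typing import List, Union


# Level 1: logical type (hypothetical, not an Arrow class).
@dataclass(frozen=True)
class LogicalType:
    name: str  # e.g. "int", "utf8"


# Level 2: physical type = logical type plus parameters such as bit width.
@dataclass(frozen=True)
class PhysicalType:
    logical: LogicalType
    bit_width: int


# Level 3: array types describe how the values are laid out or encoded.
@dataclass
class PlainArray:
    physical: PhysicalType
    values: List[int]


@dataclass
class DictionaryArray:
    indices: "Array"      # positions into the dictionary
    dictionary: "Array"   # the distinct values


@dataclass
class RunLengthArray:
    run_lengths: List[int]
    child: "Array"        # one child value per run


Array = Union[PlainArray, DictionaryArray, RunLengthArray]


def physical_type(arr: Array) -> PhysicalType:
    """The physical type of the values, independent of the encoding."""
    if isinstance(arr, PlainArray):
        return arr.physical
    if isinstance(arr, DictionaryArray):
        return physical_type(arr.dictionary)
    if isinstance(arr, RunLengthArray):
        return physical_type(arr.child)
    raise TypeError(f"unknown array type: {type(arr).__name__}")


def decode(arr: Array) -> List[int]:
    """Recursively expand any composition of encodings to plain values."""
    if isinstance(arr, PlainArray):
        return list(arr.values)
    if isinstance(arr, DictionaryArray):
        dictionary = decode(arr.dictionary)
        return [dictionary[i] for i in decode(arr.indices)]
    if isinstance(arr, RunLengthArray):
        values = decode(arr.child)
        if len(values) != len(arr.run_lengths):
            raise ValueError("need exactly one child value per run")
        out: List[int] = []
        for length, value in zip(arr.run_lengths, values):
            out.extend([value] * length)
        return out
    raise TypeError(f"unknown array type: {type(arr).__name__}")


INT = LogicalType("int")
INT8 = PhysicalType(INT, 8)
INT32 = PhysicalType(INT, 32)

# RLE on top of a dictionary array whose values are 32-bit ints.
dict_arr = DictionaryArray(
    indices=PlainArray(INT8, [2, 0]),
    dictionary=PlainArray(INT32, [100, 200, 300]),
)
rle = RunLengthArray(run_lengths=[3, 2], child=dict_arr)

print(decode(rle))         # [300, 300, 300, 100, 100]
print(physical_type(rle))  # same PhysicalType as a plain INT32 array
```

The point of the sketch is that `physical_type` sees through the encodings, so 
a kernel could accept `rle` anywhere a plain 32-bit integer array is expected.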

The warts appear when some logical types are amenable to some encodings and 
others are not: bit packing, delta, and frame-of-reference (FOR) for example 
work for integers but not strings, but I think this can be worked around fairly 
easily.  
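
One possible way to handle this is a lookup from encoding to the logical types 
it supports. The sketch below is only an illustration: the table, the names, 
and the choice to treat RLE and dictionary as applicable to every logical type 
are my assumptions, not anything Arrow defines.

```python
# Sentinel meaning "no restriction on the logical type" (an assumption here).
ANY = None

# Hypothetical table: which logical types each encoding can be applied to.
# Only "int" is listed for the integer encodings because integers vs. strings
# is the only distinction made above; other types are deliberately left out.
ENCODING_SUPPORT = {
    "rle": ANY,
    "dictionary": ANY,
    "bit_packing": frozenset({"int"}),
    "delta": frozenset({"int"}),
    "frame_of_reference": frozenset({"int"}),
}


def encoding_allowed(encoding: str, logical_type: str) -> bool:
    """Return True if `encoding` may wrap values of `logical_type`.

    Raises KeyError for an encoding that is not in the table.
    """
    allowed = ENCODING_SUPPORT[encoding]
    return allowed is ANY or logical_type in allowed


print(encoding_allowed("delta", "int"))    # True
print(encoding_allowed("delta", "utf8"))   # False
print(encoding_allowed("rle", "utf8"))     # True
```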

Now, Arrow currently specifies physical types directly in the schema, which is 
fine; I think that’s how most database type systems work. However, given that 
Dict<Int8> is a different type from Int8, it seems that Arrow conflates the 
bottom two levels. I think what we really need to do is refactor the type 
system to separate out the array type from the physical type. Schemas will deal 
in physical types and actual materialized batches will have an associated array 
type. 
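
As a rough sketch of what that separation could look like (hypothetical names, 
with types written as plain strings for brevity, and not how Arrow works 
today): the schema only records a physical type per field, each materialized 
column additionally carries an array type, and schema conformance ignores the 
array type entirely.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Column:
    physical: str    # e.g. "int8"; this is what the schema talks about
    array_type: str  # e.g. "plain", "dictionary", "rle"; per-batch detail


def conforms(schema: Dict[str, str], batch: Dict[str, Column]) -> bool:
    """A batch conforms if field names and physical types match the schema.

    The array type (encoding) of each column is deliberately not checked.
    """
    if schema.keys() != batch.keys():
        return False
    return all(batch[name].physical == phys for name, phys in schema.items())


schema = {"x": "int8"}

plain_batch = {"x": Column(physical="int8", array_type="plain")}
dict_batch = {"x": Column(physical="int8", array_type="dictionary")}
wide_batch = {"x": Column(physical="int16", array_type="plain")}

print(conforms(schema, plain_batch))  # True
print(conforms(schema, dict_batch))   # True: Dict<Int8> matches an Int8 field
print(conforms(schema, wide_batch))   # False: different physical type
```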

Let me know if I need to clarify anything, that was a lot of text :) 

Sasha Krassovsky

> On Jul 29, 2022, at 4:18 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> 
> of the implementation when it comes to the IPC format and the C
> interface.
> 
