jorgecarleitao opened a new pull request #9506: URL: https://github.com/apache/arrow/pull/9506
Be warned, this is one of the largest PRs I am proposing thus far. I am placing here as a placeholder as it significantly affects many. This is WIP and the whole justification is found below. The code does not compile as there are many changes needed for this to work. I also temporarily removed some code until I can place it again. # Background Most formats have two main "planes", the logical and the physical, that represent two fundamentally different aspects of a format. Typically, a physical type, such as `i64`, represents many logical types (`Int64`, `Date64`, `Timestamp()`, etc.), and a logical type is uniquely represented by a physical type. From the memory's perspective, the physical plane is the only plane that matters. From the representations' point of view (e.g. as a `string`), the logical plane matters. Allocate, transmute, the exact reg call all pertain to the physical plane. Operations such as `toString`, which physical operation `a + b` represents, etc, pertain to the logical plane. Generics are Rust's mechanism to write less code that relies on different physical representations. Unfortunately, our dear arrow crate currently uses the type `trait ArrowPrimitiveType` to differentiate _logical_ representations. This trait is implemented for zero-sized empty `struct`s that contain both the logical and physical representation. For example, ```rust #[derive(Debug)] pub struct Date32 {} impl ArrowPrimitiveType for Date32 { type Native = i32; const DATA_TYPE: DataType = DataType::Date32; } ``` We have such structs for 25 logical types (e.g. `Date32`, `Timestamp(TimestampMillisecondType, _)`), and these are used to connect a logical type with a physical type. However, this has many undesirable consequences, the two most notable ones being: 1. we can't write generics that are valid for different logical types with the same physical representation. 2. we do not know what to rely on to get an array's logical data type ### 1. In almost all kernels that support a Primitive type, we write something like the following: ```rust match data_type { ... DataType::Int32 => downcast_take!(Int32Type, values, indices), DataType::Date32 => downcast_take!(Date32Type, values, indices), DataType::Date64 => downcast_take!(Date64Type, values, indices), DataType::Time32(Second) => downcast_take!(Time32SecondType, values, indices), DataType::Time32(Millisecond) => downcast_take!(Time32MillisecondType, values, indices), DataType::Time64(Microsecond) => downcast_take!(Time64MillisecondType, values, indices), DataType::Timestamp(Millisecond, _) => downcast_take!(TimestampMillisecondType, values, indices), ... } ``` Note how there are two physical representations of all the types above: `i32` and `i64`. I.e. from the physical representations' point of view, there are only two branches there. Note also that for `Timestamp`, we don't support timestamps with timezones, as they do not have a corresponding Rust type. What happens beneath the `downcast_take` and many other downcasts is that we downcast to the logical type (e.g. `Date32`, and then perform the operation on it using the same physical type as any other physical type. The main problem with these is that there is a proliferation of match cases consequent of us using Rust's physical type system to represent logical types. This becomes really painful in any kernel that matches two DataTypes, as we must outline `NxN` of the valid types. The most dramatic example is `cast`, but any sufficiently generic kernel suffers from this. ### 2. Currently, we have two ways of getting an arrays' datatype inside a generic that consumes `array: PrimitiveArray<T: ArrowPrimitiveType>`: `T::DATA_TYPE` and `array.data_type`. The core issue is that because `DataType` is a _value_, not a type (because it is only known at runtime), there can be a mismatch between them. We even abuse it to DRY with things like ```rust DataType::Timestamp(_, _) => downcast_something!(Int64Type, data_type) ``` because we know that the physical representation is correct and thus do not bother writing all cases. The problem with all of this is that because `ArrayData::data_type()` is used to transmute byte buffers, e.g. whenever we pass it to `make_array()` to create an `Arc<dyn Array>`, we expose ourselves to a large risk of unsound code. # This PR This PR proposes that we remove `ArrowPrimitiveArray` altogether and have as the only types passed to `PrimitiveArray<T>` rust's primitive types, such as `i32`, that implement `ArrowNativeType`. This way, we stop having to downcast every single logical type, and instead only have to downcast physical types. This also makes it obvious that `<T>` corresponds solely to a physical type, and that logical types are purely decided by `ArrayData::data_type()` at runtime. This will also allow us to more easily perform runtime checks to whether the DataType passed to a physical type is correct, which will allow us to write `unsafe` free Arrays (more details [here](https://github.com/jorgecarleitao/arrow2/blob/57a129dd8778a4a4bf3d949840aa5fe278f5af4d/README.md)). ## backward incompatibility There are two main consequences of this change: ## collect needs re-write We will no longer be able to write ```rust let array = iter.collect::<PrimitiveArray<Int32Type>>(); ``` This is because `Int32Type` will no longer exist and an iterator of `Option<T>` no longer has a Logical type associated to it. My proposal to address this is to use a transient struct for this, that I already concluded addresses it. The corresponding code for the above would be ```rust let array = iter.collect::<Primitive<i32>>().to(DataType::Int32); ``` Here, `Primitive<i32>` is a struct with a `validity` and `values` buffer but without a `Datatype`, and `.to` converts it to a `PrimitiveArray<i32>` (physical type) with a logical type `DataType::Int32` (that panics if the `DataType` does not fit the physical type). This has a bit more characters to write. OTOH, it enables full support for `Timestamp` with timezones, as they can be created via ```rust let array = iter.collect::<Primitive<i64>>().to(DataType::Timestamp(...)); ``` and thus all our iterators' APIs start to work with it out of the box (they currently do not). ## match cases can be trimmed down All match cases whose physical representation is the same and whose logical representation induces no semantic change to the kernel can be collapse in a single case. They also become a bit easier to understand: ```rust DataType::Int32 | DataType::Date32 | DataType::Time32(_) => downcast_take!(i32, values, indices), DataType::Int64 | DataType::Date64 | DataType::Time64(_) | DataType::Timestamp(_) => downcast_take!(i64, values, indices), ``` for kernels whose logic depends on the logical type, they are also greatly simplified. For example, what our current master currently accomplishes in 16 cases (we use an `unsafe` trick to reduce it down to 3 cases that does not work if we would do the proper safety checks) can be accomplished as follows: ```rust (Timestamp(from_unit, None), Timestamp(to_unit, None)) => { let array = array .as_any() .downcast_ref::<PrimitiveArray<i64>>() .unwrap(); let from_size = time_unit_multiple(&from_unit); let to_size = time_unit_multiple(&to_unit); // we either divide or multiply, depending on size of each unit let array = if from_size >= to_size { unary::<_, _, i64>(&array, |x| (x / (from_size / to_size)), to_type) } else { unary::<_, _, i64>(&array, |x| (x * (to_size / from_size)), to_type) }; Ok(Arc::new(array)) } ``` Note how because the physical type is the same for all logical types, we perform 1 downcast to `i64` (physical representation), and handle logical variations at runtime (the factor to multiply/divide). ## Replace `IntXXType` by `iX`, `UIntXXType` by `uX` With this change, we no longer need Rust types to represent array types. We may want to keep `pub type Int32Type = i32;` for the sake of a smaller number of changes on downstream dependencies, but the only downcasts only happen when there is a different physical, not logical, representation of the data. This would significantly reduce the number of branches in all DataFusion's kernels, as well as reduce the risk of errors and unsound code resulting from those errors. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org