jorgecarleitao opened a new pull request #9506:
URL: https://github.com/apache/arrow/pull/9506


   Be warned, this is one of the largest PRs I am proposing thus far. I am 
placing here as a placeholder as it significantly affects many.
   
   This is WIP and the whole justification is found below. The code does not 
compile as there are many changes needed for this to work. I also temporarily 
removed some code until I can place it again.
   
   # Background
   
   Most formats have two main "planes", the logical and the physical, that 
represent two fundamentally different aspects of a format. Typically, a 
physical type, such as `i64`, represents many logical types (`Int64`, `Date64`, 
`Timestamp()`, etc.), and a logical type is uniquely represented by a physical 
type.
   
   From the memory's perspective, the physical plane is the only plane that 
matters. From the representations' point of view (e.g. as a `string`), the 
logical plane matters. Allocate, transmute, the exact reg call all pertain to 
the physical plane. Operations such as `toString`, which physical operation `a 
+ b` represents, etc, pertain to the logical plane.
   
   Generics are Rust's mechanism to write less code that relies on different 
physical representations.
   
   Unfortunately, our dear arrow crate currently uses the type `trait 
ArrowPrimitiveType` to differentiate _logical_ representations.
   
   This trait is implemented for zero-sized empty `struct`s that contain both 
the logical and physical representation. For example,
   
   ```rust
   #[derive(Debug)]
   pub struct Date32 {}
   
   impl ArrowPrimitiveType for Date32 {
       type Native = i32;
       const DATA_TYPE: DataType = DataType::Date32;
   }
   ```
   
   We have such structs for 25 logical types (e.g. `Date32`, 
`Timestamp(TimestampMillisecondType, _)`), and these are used to connect a 
logical type with a physical type.
   
   However, this has many undesirable consequences, the two most notable ones 
being:
   
   1. we can't write generics that are valid for different logical types with 
the same physical representation.
   2. we do not know what to rely on to get an array's logical data type
   
   ### 1. 
   
   In almost all kernels that support a Primitive type, we write something like 
the following:
   
   ```rust
   match data_type {
           ...
           DataType::Int32 => downcast_take!(Int32Type, values, indices),
           DataType::Date32 => downcast_take!(Date32Type, values, indices),
           DataType::Date64 => downcast_take!(Date64Type, values, indices),
           DataType::Time32(Second) => downcast_take!(Time32SecondType, values, 
indices),
           DataType::Time32(Millisecond) => 
downcast_take!(Time32MillisecondType, values, indices),
           DataType::Time64(Microsecond) => 
downcast_take!(Time64MillisecondType, values, indices),
           DataType::Timestamp(Millisecond, _) => 
downcast_take!(TimestampMillisecondType, values, indices),
           ...
   }
   ```
   
   Note how there are two physical representations of all the types above: 
`i32` and `i64`.  I.e. from the physical representations' point of view, there 
are only two branches there.
   
   Note also that for `Timestamp`, we don't support timestamps with timezones, 
as they do not have a corresponding Rust type.
   
   What happens beneath the `downcast_take` and many other downcasts is that we 
downcast to the logical type (e.g. `Date32`, and then perform the operation on 
it using the same physical type as any other physical type.
   
   The main problem with these is that there is a proliferation of match cases 
consequent of us using Rust's physical type system to represent logical types. 
This becomes really painful in any kernel that matches two DataTypes, as we 
must outline `NxN` of the valid types. The most dramatic example is `cast`, but 
any sufficiently generic kernel suffers from this.
   
   ### 2.
   
   Currently, we have two ways of getting an arrays' datatype inside a generic 
that consumes `array: PrimitiveArray<T: ArrowPrimitiveType>`: `T::DATA_TYPE` 
and `array.data_type`. The core issue is that because `DataType` is a _value_, 
not a type (because it is only known at runtime), there can be a mismatch 
between them. We even abuse it to DRY with things like
   
   ```rust
           DataType::Timestamp(_, _) => downcast_something!(Int64Type, 
data_type)
   ```
   
   because we know that the physical representation is correct and thus do not 
bother writing all cases.
   
   The problem with all of this is that because `ArrayData::data_type()` is 
used to transmute byte buffers, e.g. whenever we pass it to `make_array()` to 
create an `Arc<dyn Array>`, we expose ourselves to a large risk of unsound code.
   
   # This PR
   
   This PR proposes that we remove `ArrowPrimitiveArray` altogether and have as 
the only types passed to `PrimitiveArray<T>` rust's primitive types, such as 
`i32`, that implement `ArrowNativeType`.
   
   This way, we stop having to downcast every single logical type, and instead 
only have to downcast physical types. This also makes it obvious that `<T>` 
corresponds solely to a physical type, and that logical types are purely 
decided by `ArrayData::data_type()` at runtime.
   
   This will also allow us to more easily perform runtime checks to whether the 
DataType passed to a physical type is correct, which will allow us to write 
`unsafe` free Arrays (more details 
[here](https://github.com/jorgecarleitao/arrow2/blob/57a129dd8778a4a4bf3d949840aa5fe278f5af4d/README.md)).
   
   ## backward incompatibility
   
   There are two main consequences of this change: 
   
   ## collect needs re-write
   
   We will no longer be able to write 
   
   ```rust
   let array = iter.collect::<PrimitiveArray<Int32Type>>();
   ```
   This is because `Int32Type` will no longer exist and an iterator of 
`Option<T>` no longer has a Logical type associated to it.
   
   My proposal to address this is to use a transient struct for this, that I 
already concluded addresses it. The corresponding code for the above would be
   
   ```rust
   let array = iter.collect::<Primitive<i32>>().to(DataType::Int32);
   ```
   
   Here, `Primitive<i32>` is a struct with a `validity` and `values`  buffer 
but without a `Datatype`, and `.to` converts it to a `PrimitiveArray<i32>` 
(physical type) with a logical type `DataType::Int32` (that panics if the 
`DataType` does not fit the physical type).
   
   This has a bit more characters to write. OTOH, it enables full support for 
`Timestamp` with timezones, as they can be created via
   
   ```rust
   let array = iter.collect::<Primitive<i64>>().to(DataType::Timestamp(...));
   ```
   
   and thus all our iterators' APIs start to work with it out of the box (they 
currently do not).
   
   ## match cases can be trimmed down
   
   All match cases whose physical representation is the same and whose logical 
representation induces no semantic change to the kernel can be collapse in a 
single case. They also become a bit easier to understand:
   
   ```rust
       DataType::Int32 | DataType::Date32 | DataType::Time32(_) => 
downcast_take!(i32, values, indices),
       DataType::Int64 | DataType::Date64 | DataType::Time64(_) | 
DataType::Timestamp(_) => downcast_take!(i64, values, indices),
   ```
   
   for kernels whose logic depends on the logical type, they are also greatly 
simplified. For example, what our current master currently accomplishes in 16 
cases (we use an `unsafe` trick to reduce it down to 3 cases that does not work 
if we would do the proper safety checks) can be accomplished as follows:
   
   ```rust
       (Timestamp(from_unit, None), Timestamp(to_unit, None)) => {
               let array = array
                   .as_any()
                   .downcast_ref::<PrimitiveArray<i64>>()
                   .unwrap();
   
               let from_size = time_unit_multiple(&from_unit);
               let to_size = time_unit_multiple(&to_unit);
               // we either divide or multiply, depending on size of each unit
               let array = if from_size >= to_size {
                   unary::<_, _, i64>(&array, |x| (x / (from_size / to_size)), 
to_type)
               } else {
                   unary::<_, _, i64>(&array, |x| (x * (to_size / from_size)), 
to_type)
               };
               Ok(Arc::new(array))
           }
   ```
   
   Note how because the physical type is the same for all logical types, we 
perform 1 downcast to `i64` (physical representation), and handle logical 
variations at runtime (the factor to multiply/divide).
   
   ## Replace `IntXXType` by `iX`, `UIntXXType` by `uX`
   
   With this change, we no longer need Rust types to represent array types. We 
may want to keep `pub type Int32Type = i32;` for the sake of a smaller number 
of changes on downstream dependencies, but the only downcasts only happen when 
there is a different physical, not logical, representation of the data.
   
   This would significantly reduce the number of branches in all DataFusion's 
kernels, as well as reduce the risk of errors and unsound code resulting from 
those errors.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to