I think this is a great idea. There is a significant backlog of types which 
should be added, e.g. TIMESTAMP(9), TIMESTAMP WITH TIME ZONE, TIME WITH 
TIME ZONE, and some sort of big decimal, to name a few.
Making these more "plug and play" is goodness.

+1

On Sep 10, 2025, at 1:22 PM, Max Gekk <max.g...@gmail.com> wrote:

Hi All,

I would like to propose a refactoring of internal operations over Catalyst's data 
types. In the current implementation, data types are handled in an ad hoc 
manner, and processing logic is dispersed across the entire code base. There 
are more than 100 places where every data type is pattern matched. For example, 
formatting of type values (converting them to strings) is implemented in the same 
way in ToStringBase and in toString (literals.scala). This leads to a few 
issues:

1. If you change the handling in one place, you might miss other places. The 
compiler won't help you in such cases.
2. Adding a new data type has a constant and significant overhead. Based on our 
experience of adding new data types: ANSI intervals 
(https://issues.apache.org/jira/browse/SPARK-27790) took > 1.5 years, 
TIMESTAMP_NTZ (https://issues.apache.org/jira/browse/SPARK-35662) took > 1 
year, and TIME (https://issues.apache.org/jira/browse/SPARK-51162) has not been 
finished yet, although we have spent more than half a year on it so far.

I propose to define a set of interfaces and operation classes for every data 
type. The operation classes (Ops) should implement the subset of interfaces that 
is suitable for a particular data type.
For example, TimeType will have the companion class TimeTypeOps which 
implements the following operations:
- Operations over the underlying physical type
- Literal-related operations
- Formatting of type values to strings
- Converting to/from external Java type: java.time.LocalTime in the case of 
TimeType
- Hashing data type values
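To make the shape of the proposal concrete, here is a minimal sketch of how such capability interfaces and a per-type Ops object could look. Only FormatTypeOps and its toSQLValue method come from the example in this email; JavaConversionOps, TimeTypeOps, sqlString, and the choice of nanoseconds-since-midnight as the physical representation are illustrative assumptions, not the prototype's actual API.

```scala
import java.time.LocalTime

// Capability interface for formatting type values as SQL literals.
// The name matches the example in this email; the rest is hypothetical.
trait FormatTypeOps {
  def toSQLValue(value: Any): String
}

// Hypothetical capability interface for converting between the internal
// physical representation and the external Java type.
trait JavaConversionOps[J] {
  def toJava(internal: Any): J
  def fromJava(external: J): Any
}

// Ops companion for a TIME type, assuming the internal physical value
// is a Long of nanoseconds since midnight (an illustrative assumption).
object TimeTypeOps extends FormatTypeOps with JavaConversionOps[LocalTime] {
  override def toSQLValue(value: Any): String = s"TIME '${toJava(value)}'"
  override def toJava(internal: Any): LocalTime =
    LocalTime.ofNanoOfDay(internal.asInstanceOf[Long])
  override def fromJava(external: LocalTime): Any = external.toNanoOfDay
}

// Callers dispatch on the capability, not on the concrete data type.
def sqlString(ops: Any, value: Any): String = ops match {
  case fops: FormatTypeOps => fops.toSQLValue(value)
  case _                   => value.toString
}
```

With this shape, `sqlString(TimeTypeOps, 45045000000000L)` yields `TIME '12:30:45'`, and adding a new data type only requires writing one Ops object rather than touching every pattern match in the code base.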

On the handling side, we won't need to examine every data type. We can check 
that a data type's ops instance supports a required interface, and invoke 
the needed method. For example:
---
  override def sql: String = dataTypeOps match {
    case fops: FormatTypeOps => fops.toSQLValue(value)
    case _ => value.toString
  }
---
Here is the prototype of the proposal: 
https://github.com/apache/spark/pull/51467

Your comments and feedback would be greatly appreciated.

Yours faithfully,
Max Gekk
