Hello Dongjoon,

> can we do this migration safely in a step-by-step manner over multiple Apache Spark versions without blocking any Apache Spark releases?
Sure, we can start from the TIME type, and refactor the existing pattern matchings. After that, I would support new features of TIME using the framework (highly likely we will need to add new interfaces). This is not risky since the type hasn't been released yet. After the 4.1.0 release, we could refactor some of the existing data types, for example TIMESTAMP and/or DATE.

Yours faithfully,
Max Gekk

On Thu, Sep 11, 2025 at 5:01 PM Dongjoon Hyun <[email protected]> wrote:

> Thank you for sharing the direction, Max.
>
> Since this is internal refactoring, can we do this migration safely in a step-by-step manner over multiple Apache Spark versions without blocking any Apache Spark releases?
>
> The proposed direction itself looks reasonable and doable for me.
>
> Thanks,
> Dongjoon.
>
> On 2025/09/10 13:44:45 "serge rielau.com" wrote:
> > I think this is a great idea. There is a significant backlog of types which should be added: e.g. TIMESTAMP(9), TIMESTAMP WITH TIME ZONE, TIME WITH TIME ZONE, and some sort of big decimal (like DECFLOAT), to name a few. Making these more "plug and play" is goodness.
> >
> > +1
> >
> > On Sep 10, 2025, at 1:22 PM, Max Gekk <[email protected]> wrote:
> >
> > Hi All,
> >
> > I would like to propose a refactoring of internal operations over Catalyst's data types. In the current implementation, data types are handled in an ad hoc manner, and the processing logic is dispersed across the entire code base. There are more than 100 places where every data type is pattern matched. For example, formatting of type values (converting them to strings) is implemented in the same way in ToStringBase and in toString (literals.scala). This leads to a few issues:
> >
> > 1. If you change the handling in one place, you might miss other places. The compiler won't help you in such cases.
> > 2. Adding a new data type has constant and significant overhead. Based on our experience of adding new data types: ANSI intervals (https://issues.apache.org/jira/browse/SPARK-27790) took more than 1.5 years, TIMESTAMP_NTZ (https://issues.apache.org/jira/browse/SPARK-35662) took more than 1 year, and TIME (https://issues.apache.org/jira/browse/SPARK-51162) has not been finished yet, although we have spent more than half a year on it so far.
> >
> > I propose to define a set of interfaces, and operation classes for every data type. The operation classes (Ops) should implement the subsets of interfaces that are suitable for a particular data type.
> > For example, TimeType will have the companion class TimeTypeOps which implements the following operations:
> > - Operations over the underlying physical type
> > - Literal related operations
> > - Formatting of type values to strings
> > - Converting to/from the external Java type: java.time.LocalTime in the case of TimeType
> > - Hashing data type values
> >
> > On the handling side, we won't need to examine every data type. We can check that a data type and its ops instance support a required interface, and invoke the needed method. For example:
> > ---
> > override def sql: String = dataTypeOps match {
> >   case fops: FormatTypeOps => fops.toSQLValue(value)
> >   case _ => value.toString
> > }
> > ---
> > Here is the prototype of the proposal: https://github.com/apache/spark/pull/51467
> >
> > Your comments and feedback would be greatly appreciated.
> >
> > Yours faithfully,
> > Max Gekk
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: [email protected]
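
A minimal sketch of how the proposed capability interfaces and a per-type Ops class could fit together, based only on the description above. FormatTypeOps, TimeTypeOps, toSQLValue and java.time.LocalTime come from the proposal itself; the other trait and method names, and the assumption that TIME values are stored physically as nanoseconds of day in a Long, are illustrative and may differ from the prototype PR:

---
import java.time.LocalTime

// Capability interfaces; a data type's Ops class mixes in only the subset
// it supports. DataTypeOps and JavaConversionOps are hypothetical names here.
trait DataTypeOps

trait FormatTypeOps extends DataTypeOps {
  // Render an internal value as a SQL literal string.
  def toSQLValue(value: Any): String
}

trait JavaConversionOps extends DataTypeOps {
  // Convert between the internal physical value and the external Java type.
  def toJava(internal: Any): Any
  def fromJava(external: Any): Any
}

// Ops companion for TimeType: the external Java type is java.time.LocalTime;
// the physical value is assumed here to be nanoseconds of day in a Long.
object TimeTypeOps extends FormatTypeOps with JavaConversionOps {
  override def toSQLValue(value: Any): String = value match {
    case nanos: Long => s"TIME '${LocalTime.ofNanoOfDay(nanos)}'"
    case other       => other.toString
  }
  override def toJava(internal: Any): Any =
    LocalTime.ofNanoOfDay(internal.asInstanceOf[Long])
  override def fromJava(external: Any): Any =
    external.asInstanceOf[LocalTime].toNanoOfDay
}
---

With such a hierarchy, the handling-side example in the proposal matches only on the capability trait (FormatTypeOps) rather than on every concrete data type, so adding a new type does not require touching that call site.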
