+1

Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her
On Thu, Sep 11, 2025 at 8:56 PM Dongjoon Hyun <[email protected]> wrote:

> Sounds like a great plan! Thank you.
>
> +1 for the refactoring.
>
> Dongjoon.
>
> On Thu, Sep 11, 2025 at 1:04 PM Max Gekk <[email protected]> wrote:
>
>> Hello Dongjoon,
>>
>> > can we do this migration safely in a step-by-step manner over multiple
>> Apache Spark versions without blocking any Apache Spark releases?
>>
>> Sure, we can start with the TIME type and refactor the existing pattern
>> matchings. After that, I would support new features of TIME using the
>> framework (highly likely we will need to add new interfaces). This is not
>> risky since the type hasn't been released yet. After the 4.1.0 release, we
>> could refactor some of the existing data types, for example TIMESTAMP
>> and/or DATE.
>>
>> Yours faithfully,
>> Max Gekk
>>
>>
>> On Thu, Sep 11, 2025 at 5:01 PM Dongjoon Hyun <[email protected]> wrote:
>>
>>> Thank you for sharing the direction, Max.
>>>
>>> Since this is internal refactoring, can we do this migration safely in a
>>> step-by-step manner over multiple Apache Spark versions without blocking
>>> any Apache Spark releases?
>>>
>>> The proposed direction itself looks reasonable and doable to me.
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>> On 2025/09/10 13:44:45 "serge rielau.com" wrote:
>>> > I think this is a great idea. There is a significant backlog of types
>>> which should be added, e.g. TIMESTAMP(9), TIMESTAMP WITH TIME ZONE, TIME
>>> WITH TIME ZONE, and some sort of big decimal (like DECFLOAT), to name a
>>> few.
>>> > Making these more "plug and play" is goodness.
>>> >
>>> > +1
>>> >
>>> > On Sep 10, 2025, at 1:22 PM, Max Gekk <[email protected]> wrote:
>>> >
>>> > Hi All,
>>> >
>>> > I would like to propose a refactoring of internal operations over
>>> Catalyst's data types. In the current implementation, data types are
>>> handled in an ad hoc manner, and processing logic is dispersed across the
>>> entire code base. There are more than 100 places where every data type is
>>> pattern matched. For example, formatting of type values (converting them
>>> to strings) is implemented in the same way in ToStringBase and in
>>> toString (literals.scala). This leads to a few issues:
>>> >
>>> > 1. If you change the handling in one place, you might miss other
>>> places. The compiler won't help you in such cases.
>>> > 2. Adding a new data type has constant and significant overhead. Based
>>> on our experience of adding new data types: ANSI intervals
>>> (https://issues.apache.org/jira/browse/SPARK-27790) took > 1.5 years,
>>> TIMESTAMP_NTZ (https://issues.apache.org/jira/browse/SPARK-35662) took
>>> > 1 year, and TIME (https://issues.apache.org/jira/browse/SPARK-51162)
>>> has not been finished yet, but we have spent more than half a year so far.
>>> >
>>> > I propose to define a set of interfaces, and operation classes for
>>> every data type. The operation classes (Ops) should implement the subsets
>>> of interfaces that are suitable for a particular data type.
>>> > For example, TimeType will have the companion class TimeTypeOps, which
>>> implements the following operations:
>>> > - Operations over the underlying physical type
>>> > - Literal-related operations
>>> > - Formatting of type values to strings
>>> > - Converting to/from an external Java type: java.time.LocalTime in the
>>> case of TimeType
>>> > - Hashing data type values
>>> >
>>> > On the handling side, we won't need to examine every data type. We can
>>> check that a data type and its ops instance support a required interface,
>>> and invoke the needed method. For example:
>>> > ---
>>> > override def sql: String = dataTypeOps match {
>>> >   case fops: FormatTypeOps => fops.toSQLValue(value)
>>> >   case _ => value.toString
>>> > }
>>> > ---
>>> > Here is the prototype of the proposal:
>>> https://github.com/apache/spark/pull/51467
>>> >
>>> > Your comments and feedback would be greatly appreciated.
>>> >
>>> > Yours faithfully,
>>> > Max Gekk
>>> >
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: [email protected]
>>>
>>>
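For readers following along, the capability-based dispatch Max describes can be sketched in standalone Scala roughly as follows. The trait names besides FormatTypeOps (TypeOps, ExternalTypeOps), the method signatures, and the choice of a Long nanoseconds-of-day physical representation for TIME are illustrative assumptions, not the actual code in the prototype PR:

```scala
import java.time.LocalTime
import java.time.format.DateTimeFormatter

// Marker trait for all per-type operation bundles (hypothetical name).
trait TypeOps

// Capability: types whose internal values can be formatted as SQL literals.
trait FormatTypeOps extends TypeOps {
  def toSQLValue(value: Any): String
}

// Capability: types convertible to/from an external Java type (hypothetical).
trait ExternalTypeOps[J] extends TypeOps {
  def toExternal(value: Any): J
  def fromExternal(external: J): Any
}

// Illustrative ops bundle for TIME, assuming the physical type is a Long
// holding nanoseconds since midnight.
object TimeTypeOps extends FormatTypeOps with ExternalTypeOps[LocalTime] {
  override def toSQLValue(value: Any): String =
    s"TIME '${toExternal(value).format(DateTimeFormatter.ISO_LOCAL_TIME)}'"
  override def toExternal(value: Any): LocalTime =
    LocalTime.ofNanoOfDay(value.asInstanceOf[Long])
  override def fromExternal(external: LocalTime): Any =
    external.toNanoOfDay
}

// A call site dispatches on the capability, not on every concrete data type,
// mirroring the `sql` example from the proposal.
def sqlString(ops: TypeOps, value: Any): String = ops match {
  case fops: FormatTypeOps => fops.toSQLValue(value)
  case _                   => value.toString
}
```

The point of the design is visible here: adding a new data type means writing one Ops object and picking the capability traits it supports, rather than extending 100+ scattered pattern matches, and a type that lacks a capability falls through to a safe default at every call site.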
