+1

Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her
On Thu, Sep 11, 2025 at 8:56 PM Dongjoon Hyun <[email protected]> wrote:

> Sounds like a great plan! Thank you.
>
> +1 for the refactoring.
>
> Dongjoon.
>
> On Thu, Sep 11, 2025 at 1:04 PM Max Gekk <[email protected]> wrote:
>
>> Hello Dongjoon,
>>
>> > can we do this migration safely in a step-by-step manner over multiple
>> Apache Spark versions without blocking any Apache Spark releases?
>>
>> Sure, we can start with the TIME type and refactor the existing pattern
>> matchings. After that, I would support new features of TIME using the
>> framework (highly likely we will need to add new interfaces). This is not
>> risky since the type hasn't been released yet. After the 4.1.0 release, we
>> could refactor some of the existing data types, for example TIMESTAMP
>> and/or DATE.
>>
>> Yours faithfully,
>> Max Gekk
>>
>>
>> On Thu, Sep 11, 2025 at 5:01 PM Dongjoon Hyun <[email protected]> wrote:
>>
>>> Thank you for sharing the direction, Max.
>>>
>>> Since this is internal refactoring, can we do this migration safely in a
>>> step-by-step manner over multiple Apache Spark versions without blocking
>>> any Apache Spark releases?
>>>
>>> The proposed direction itself looks reasonable and doable to me.
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>> On 2025/09/10 13:44:45 "serge rielau.com" wrote:
>>> > I think this is a great idea. There is a significant backlog of types
>>> which should be added, e.g. TIMESTAMP(9), TIMESTAMP WITH TIME ZONE, TIME
>>> WITH TIME ZONE, and some sort of big decimal (like DECFLOAT), to name a
>>> few.
>>> > Making these more "plug and play" is goodness.
>>> >
>>> > +1
>>> >
>>> > On Sep 10, 2025, at 1:22 PM, Max Gekk <[email protected]> wrote:
>>> >
>>> > Hi All,
>>> >
>>> > I would like to propose a refactoring of internal operations over
>>> Catalyst's data types. In the current implementation, data types are
>>> handled in an ad hoc manner, and processing logic is dispersed across the
>>> entire code base. There are more than 100 places where every data type is
>>> pattern matched. For example, formatting of type values (converting them
>>> to strings) is implemented in the same way in ToStringBase and in
>>> toString (literals.scala). This leads to a few issues:
>>> >
>>> > 1. If you change the handling in one place, you might miss other
>>> places. The compiler won't help you in such cases.
>>> > 2. Adding a new data type has constant and significant overhead. Based
>>> on our experience of adding new data types: ANSI intervals
>>> (https://issues.apache.org/jira/browse/SPARK-27790) took > 1.5 years,
>>> TIMESTAMP_NTZ (https://issues.apache.org/jira/browse/SPARK-35662) took
>>> > 1 year, and TIME (https://issues.apache.org/jira/browse/SPARK-51162)
>>> has not been finished yet, but we have spent more than half a year so far.
>>> >
>>> > I propose to define a set of interfaces, and operation classes for
>>> every data type. The operation classes (Ops) should implement the subsets
>>> of interfaces that are suitable for a particular data type.
>>> > For example, TimeType will have the companion class TimeTypeOps, which
>>> implements the following operations:
>>> > - Operations over the underlying physical type
>>> > - Literal-related operations
>>> > - Formatting of type values to strings
>>> > - Converting to/from an external Java type: java.time.LocalTime in the
>>> case of TimeType
>>> > - Hashing data type values
>>> >
>>> > On the handling side, we won't need to examine every data type. We can
>>> check that a data type and its ops instance support a required interface,
>>> and invoke the needed method. For example:
>>> > ---
>>> > override def sql: String = dataTypeOps match {
>>> >   case fops: FormatTypeOps => fops.toSQLValue(value)
>>> >   case _ => value.toString
>>> > }
>>> > ---
>>> > Here is the prototype of the proposal:
>>> https://github.com/apache/spark/pull/51467
>>> >
>>> > Your comments and feedback would be greatly appreciated.
>>> >
>>> > Yours faithfully,
>>> > Max Gekk
>>> >
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: [email protected]
>>>
>>>
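For readers following along, the capability-based dispatch Max describes can be sketched in standalone Scala roughly as follows. The trait names besides FormatTypeOps (TypeOps, ExternalTypeOps), the method signatures, and the choice of a Long nanoseconds-of-day physical representation for TIME are illustrative assumptions, not the actual code in the prototype PR:

```scala
import java.time.LocalTime
import java.time.format.DateTimeFormatter

// Marker trait for all per-type operation bundles (hypothetical name).
trait TypeOps

// Capability: types whose internal values can be formatted as SQL literals.
trait FormatTypeOps extends TypeOps {
  def toSQLValue(value: Any): String
}

// Capability: types convertible to/from an external Java type (hypothetical).
trait ExternalTypeOps[J] extends TypeOps {
  def toExternal(value: Any): J
  def fromExternal(external: J): Any
}

// Illustrative ops bundle for TIME, assuming the physical type is a Long
// holding nanoseconds since midnight.
object TimeTypeOps extends FormatTypeOps with ExternalTypeOps[LocalTime] {
  override def toSQLValue(value: Any): String =
    s"TIME '${toExternal(value).format(DateTimeFormatter.ISO_LOCAL_TIME)}'"
  override def toExternal(value: Any): LocalTime =
    LocalTime.ofNanoOfDay(value.asInstanceOf[Long])
  override def fromExternal(external: LocalTime): Any =
    external.toNanoOfDay
}

// A call site dispatches on the capability, not on every concrete data type,
// mirroring the `sql` example from the proposal.
def sqlString(ops: TypeOps, value: Any): String = ops match {
  case fops: FormatTypeOps => fops.toSQLValue(value)
  case _                   => value.toString
}
```

The point of the design is visible here: adding a new data type means writing one Ops object and picking the capability traits it supports, rather than extending 100+ scattered pattern matches, and a type that lacks a capability falls through to a safe default at every call site.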
