Re: [Discussion] Follow up discussion for Interval Type

Raphael Taylor-Davies Thu, 03 Jul 2025 09:22:42 -0700

Hi,

I may be misunderstanding something, but it appears that the motivationfor this effort is that the current interval type represents a supersetof the functionality required by ANSI SQL, and therefore can causecompatibility problems for some databases that only support the minimumrequired by ANSI SQL?

In particular the parquet Interval (along with Arrow's MonthDayNano andpostgres' interval) encode a number of months, days and some seconds ina single unified interval type, whereas the SQL standard instead definestwo interval types - one consisting of a combination of years andmonths, and another days and some seconds quantity.

Whilst I can see that bifurcating the parquet interval types to matchthe minimum required by the SQL standard might help those systems withsuch bifurcated interval types, it will be a net negative for thosesystems that instead have a unified interval type. I therefore suspectthis will end up being a repeat of the ongoing Int96 saga, especially asIMO having a unified interval representation is a significant UXimprovement whilst also providing strictly more functionality. I wonderif instead it might be better to define some mapping that can be used bysystems with bifurcated interval types - perhaps as a struct columnconsisting of the two components? This would avoid an interminablemigration period, whilst also not representing a loss of functionalityor ergonomics for existing workloads.

Putting this to one side, I think the proposed DayTimeInterval stillneeds a separate day component in order to be SQL compliant, as days maynot have 24 hours due to daylight savings time. I think what is beingsuggested might be better described as a duration type.


Kind Regards,

Raphael

On 03/07/2025 16:01, Felipe Oliveira Carvalho wrote:

Hi,

I recently took the time to understand MonthDayNano better and it's not as
crazy as my first impression of it.

I think we could solve many of the problems you're trying to solve by
embracing it. Replies inline:

On Tue, Jul 1, 2025 at 2:56 AM yun zou <[email protected]> wrote:

Hi Team,

Resending the previous email about the Interval Type discussion from the
last Parquet community sync, with some formatting adjustments.

The primary focus of the conversation is the proposed INTERVAL type's
*compatibility
with Apache Arrow*. Several key issues have been raised:

1. *Is there a more descriptive name for DayTimeInterval?*
While the name DayTimeInterval closely follows the SQL standard and matches
naming conventions used by most engines, some suggest that a name
emphasizing precision—such as *DayNanoInterval*—might provide better
clarity.

2. *Should we consider representing DayTimeInterval using Arrow's
MonthDayNano?*
Mapping DayTimeInterval to Arrow's MonthDayNano type is problematic due to
semantic differences:

    -  MonthDayNano combines both calendar-based and duration-based
    components, whereas DayTimeInterval represents a pure duration.

Month and Day components actually allow the SQL engine to delay caring
about calendars when parsing interval literals:

     INTERVAL '2-11 28' YEAR TO DAY

month = 2 * 12 + 11. (there are no leap months, so converting from years to
month is easy)
day = 28
nanos = 0

If you get an interval/duration that is a single integer (as in the number
of nanoseconds), you can leave month and day set to 0. The annoying
limitation is that a 64-bit nanos field can't represent 10K years. How else
would you turn that weird SQL interval literal into a single integer
duration? Would you assume a year is 365 days?

This is the internal representation of intervals in PostgreSQL. It follows
the same structure as MonthDayNano: a 64-bit offset and 32-bit days and
months. This has the nice feature of aligning nicely to 8 bytes without
padding and still not requiring 128-bit math for operations.

/*
  * Storage format for type interval.
  */
typedef struct {
     TimeOffset time; /* all time units other than days, months and years */

      int32 day; /* days, after time for alignment */
      int32 month; /* months and years, after time for alignment */
} Interval;

    - MonthDayNano allows mixed signs across components (e.g., positive

    months and negative days), which complicates comparison and evaluation.

Yes, but adding more types to Arrow also complicates engines and
implementations.

Adding month and days to a timestamp wouldn't be so hard if timestamps also
stored the number of months and days since an epoch, but they are usually a
single value (millis, micros, nanos) since an epoch. Couldn't an argument
be made that any SQL system already needs to convert between single-integer
timestamps and YYYY-MM-DD HH:MM:SS literals, so this operation is not that
hard to support?

Complicated expressions involving timestamps and intervals could be
evaluated by converting timestamps to MonthDayNano durations since epoch
making every value become a duration, then at the end convert the duration
since epoch to a single-integer timestamp. Given how SQL literals are, I
don't see how operations on these values can be correct without these
concerns being taken into account.

Given these differences, MonthDayNano is not a suitable candidate for
representing DayTimeInterval and *we recommend not mapping DayTimeInterval
to Arrow's MonthDayNano*.

3.* Memory Footprint: Is 16 bytes necessary for DayTimeInterval? *

- Some engines (e.g., Spark, Trino) represent DayTimeInterval using only
8 bytes, while others (like Oracle and Snowflake) support a wider range,
potentially requiring more than 8 bytes. Additionally, there is
interest in
future support for higher precision, such as picoseconds, which would
also
demand a larger footprint.
- One proposal is to parameterize the size or precision, allowing
engines to define their own representations. However, this approach
introduces complexity and makes standardization difficult. A fixed-size
format that provides enough range for most use cases is considered more
robust.
- Several alternative strategies have been proposed:
1. Use a 10-byte array, which is likely sufficient for all current
engine requirements.
2. Use a 16-byte array now, with the option to evolve it into a
standardized int128 in the future.
3. Start with an int64 representation, and plan for a future transition
to int128, updating related types such as timestamps and intervals in
parallel.

Looking forward to hearing your thoughts on the above questions!

Link to the proposal:

https://docs.google.com/document/d/12ghQxWxyAhSQeZyy0IWiwJ02gTqFOgfYm8x851HZFLk/edit?tab=t.0

Link to the PR: https://github.com/apache/parquet-format/pull/496/files

Best Regards,
Yun

So, can you expand on why MonthDayNano wouldn't work well? Yes, it's
complicated, but operating on Timestamps and Durations in SQL engines is a
complicated problem and after thinking about these types in the context of
a SQL engine I think the complexity of MonthDayNano is essential, not
accidental.

--
Felipe

Re: [Discussion] Follow up discussion for Interval Type

Reply via email to