I'm in favor of the CalendarDuration and TimeDuration types as better names for what we are trying to express here. I also think going forward with Int64 for now probably makes sense, while we also start the work of getting an official int128 type in. I don't have a problem with FLBA(10), but I would hope we could do some better encoding tricks with an Int128. I'm relatively new to this area, so take that with a grain of salt.
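
To put rough numbers on the Int64 versus FLBA(10) trade-off, here is a minimal Python sketch. It assumes the duration is stored as a signed count of nanoseconds and that a fixed-width value is encoded as big-endian two's complement (the convention Parquet already uses for FLBA-backed decimals). The thread does not settle either point, and the helper names are hypothetical.

```python
# Sketch only: these helpers are illustrative, not part of Parquet or Arrow.
NANOS_PER_DAY = 24 * 60 * 60 * 1_000_000_000


def max_whole_days(bits: int) -> int:
    """Largest whole number of days that fits in a signed `bits`-bit nanosecond count."""
    return (2 ** (bits - 1) - 1) // NANOS_PER_DAY


def encode_fixed(nanos: int, width: int) -> bytes:
    """Encode as big-endian two's complement (assumed layout); raises OverflowError if too wide."""
    return nanos.to_bytes(width, byteorder="big", signed=True)


def decode_fixed(raw: bytes) -> int:
    """Inverse of encode_fixed."""
    return int.from_bytes(raw, byteorder="big", signed=True)


if __name__ == "__main__":
    print(max_whole_days(64))   # int64 nanoseconds: 106751 days, about 292 years
    print(max_whole_days(80))   # 10-byte value: far larger range
    too_big_for_int64 = 2 ** 63
    print(decode_fixed(encode_fixed(too_big_for_int64, 10)) == too_big_for_int64)
```

Under these assumptions a signed 64-bit nanosecond count tops out at 106751 whole days, which is the overflow case mentioned in the quoted thread, while a 10-byte value round-trips counts that no longer fit in int64.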
On Mon, Jul 7, 2025 at 10:39 PM Micah Kornfield <[email protected]> wrote:

> > However, the reverse is not guaranteed: a MonthDayNano value cannot
> > reliably be converted back into a DayTimeInterval. This is because
> > there's no way to determine whether the calendar component is used
> > without looking into the data, which introduces ambiguity. This
> > ambiguity can negatively impact interoperability across different
> > engines and systems.
>
> Ultimately, this is something that systems will need to deal with at some
> point but this can be delayed until someone has the bandwidth to have a
> formal proposal for persisting MonthDayNano in parquet (and it would still
> be up to the consuming system on how to do the translation so I'm not
> clear that defining the translation is strictly necessary).
>
> > Regarding whether we should use FLBA(16) or INT128, while INT128 does
> > have a natural fitting for ordering, I think one concern I had is if
> > that type will only be used by the Day Time Interval.
>
> I think there are a few use-cases that have at least been mentioned where
> it would be useful to have int128:
>
> 1. A replacement for int96 timestamp that can handle the full range of
>    ANSI SQL Nanoseconds.
> 2. Picoseconds has at least been mentioned in passing and that would
>    require int128.
>
> If we don't model it as a 128 we should minimize the range to reflect what
> ANSI SQL requires (i.e. FLBA(10) I believe). We should probably allow the
> logical type to annotate both int64 and FLBA(10), since int64 is a common
> representation for nanoseconds (this is similar to what we already do for
> Decimal values).
>
> > Regarding the name for DayTimeInterval, if we all agree that "Duration"
> > provides better clarity, I'm fully on board with using that instead.
>
> +1, IIUC I think this addresses the majority of concerns. If others in the
> community want to define a parquet representation for MonthDayNanos arrow
> interval that would be welcome as well. I think the main question then
> becomes on Arrow side if we want to define the new type or deal with the
> unlikely case of overflow for the duration type.
>
> On Mon, Jul 7, 2025 at 4:38 PM yun zou <[email protected]> wrote:
>
> > Hi,
> >
> > Thanks all for the valuable feedback!
> >
> > Regarding the MonthDayNano type, one important point that may not be
> > explicitly stated is the lack of true interoperability between
> > YearMonthInterval, DayTimeInterval, and MonthDayNano.
> >
> > While YearMonthInterval and DayTimeInterval are not directly
> > interoperable with each other, they can both be converted into
> > MonthDayNano by setting certain components to zero. However, the
> > reverse is not guaranteed: a MonthDayNano value cannot reliably be
> > converted back into a DayTimeInterval. This is because there's no way
> > to determine whether the calendar component is used without looking
> > into the data, which introduces ambiguity. This ambiguity can
> > negatively impact interoperability across different engines and
> > systems.
> >
> > > Doesn't capture semantics for engines that treat day as a calendar
> > > type.
> >
> > I don't actually see the above as a drawback of introducing two
> > separate interval types, since when the day is used as a calendar type,
> > it can be mapped to the MonthDayNano type. In fact, I believe all three
> > types are necessary to fully support the range of use cases. What's
> > important is that we clearly define the interoperability rules between
> > them to ensure consistent behavior across systems.
> >
> > > While I understand the desire to be able to represent all values
> > > allowable in ANSI SQL, I really don't understand why our types should
> > > not be allowed to represent any values *outside* of the range allowed
> > > in ANSI SQL.
> >
> > I completely agree: if there are valid use cases beyond ANSI SQL, we
> > should absolutely support them. It makes sense to leave range
> > validation to the engine or client implementation, as they are best
> > suited to handle their own specific requirements.
> >
> > Regarding whether we should use FLBA(16) or INT128, while INT128 does
> > have a natural fitting for ordering, I think one concern I had is if
> > that type will only be used by the Day Time Interval.
> >
> > Regarding the name for DayTimeInterval, if we all agree that "Duration"
> > provides better clarity, I'm fully on board with using that instead.
> >
> > Best Regards,
> > Yun

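
The conversion asymmetry described in the quoted message can be sketched in Python as follows. This is an illustration under stated assumptions, not a proposed specification. The three-field layout mirrors Arrow's MonthDayNano interval (months, days, nanoseconds), the year-month interval is modeled as a month count, and the day-time value is modeled as a plain nanosecond count. All function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MonthDayNano:
    months: int
    days: int
    nanos: int


def from_year_month(months: int) -> MonthDayNano:
    """Widening is always possible: zero out the unused components."""
    return MonthDayNano(months=months, days=0, nanos=0)


def from_duration(nanos: int) -> MonthDayNano:
    """Assumes the day-time value is a pure nanosecond count."""
    return MonthDayNano(months=0, days=0, nanos=nanos)


def to_duration(value: MonthDayNano) -> Optional[int]:
    """Narrowing depends on the data: it fails if any calendar component is set."""
    if value.months != 0 or value.days != 0:
        return None  # calendar component in use, no unambiguous duration
    return value.nanos


if __name__ == "__main__":
    print(to_duration(from_duration(5_000_000_000)))  # round-trips
    print(to_duration(from_year_month(14)))           # None: cannot narrow
```

The forward direction needs only the type, while the reverse direction has to inspect each value, which is the ambiguity the message points to.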