I think it is plausible that we use Arrow structs to create a synthetic interval type for DataFusion (I don't have a compelling usecase to store the intervals themselves, or to expose them outside of DataFusion).
However it seems a little unfortunate that there is now way to represent a "common" interval like "1 week and 1 hour" with native arrow types On Fri, Apr 2, 2021 at 4:38 PM Micah Kornfield <emkornfi...@gmail.com> wrote: > The real usecase I have is "postgres compatibility" > > > Yeah, I'm a little conflicted on this. A broader analysis might be > necessary and I'd welcome others thoughts, but at what point should we > mostly consider the type system closed? Should we be aiming for full > parity with ANSI SQL/Postgres SQL or something else? > > >> I have no known need for the actual postgres timestamp internal >> representation. > > > I suppose there is an edge case that the seconds range is larger for > microseconds compared to nanoseconds with the simple representation. But > that seems minor. > > On Fri, Apr 2, 2021 at 1:25 PM Andrew Lamb <al...@influxdata.com> wrote: > >> The real usecase I have is "postgres compatibility" - in the sense that >> we can write SQL queries / expressions that use postgres interval type [1] >> and corresponding expressions with the full postgres interval range. I have >> no known need for the actual postgres timestamp internal representation. >> >> A more vague usecase I would like (but do not strictly need) is for >> expressing intervals up to the same precision as IOx's underlying >> timestamps (which are stored as nanosecond precision). >> >> [1] >> https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-INTERVAL-INPUT >> >> "Internally interval values are stored as months, days, and seconds. >> This is done because the number of days in a month varies, and a day can >> have 23 or 25 hours if a daylight savings time adjustment is involved. The >> months and days fields are integers while the seconds field can store >> fractions. Because intervals are usually created from constant strings or >> timestamp subtraction, this storage method works well in most cases, >> but can cause unexpected results: >> " >> >> On Fri, Apr 2, 2021 at 3:53 PM Micah Kornfield <emkornfi...@gmail.com> >> wrote: >> >>> Andrew is the use-case you have simply postgres compatibility or is it >>> more >>> extensive? >>> >>> One potential problem with combining Month and Day fields, is that the >>> type >>> no longer has a defined sort order (the existing Day-Millisecond type >>> without assumptions, in particular because I don't think today there is >>> an >>> explicit constraint on the bounds for the millisecond component). >>> >>> -Micah >>> >>> >>> >>> On Wed, Mar 31, 2021 at 9:03 AM Antoine Pitrou <anto...@python.org> >>> wrote: >>> >>> > >>> > Le 31/03/2021 à 17:55, Micah Kornfield a écrit : >>> > > Thanks for the feedback. A couple of points here and some responses >>> > below. >>> > > >>> > > * One other question is whether the Nanoseconds should actually be >>> > > configurable (i.e. use milliseconds or microseconds). I would lean >>> > towards >>> > > no. >>> > >>> > Same for me. >>> > >>> > > * I'm also still not 100% convinced we need this as a first class >>> type in >>> > > arrow or if we should be looking more closely at the Struct (in the >>> Arrow >>> > > sense) based implementation. In the future where alternative >>> encodings >>> > are >>> > > supported, this could allow for much smaller footprints for this >>> type. >>> > >>> > Having a "packed" first class type allows for better locality when >>> > accessing data. It doesn't sound very likely that you'd access only >>> one >>> > component of the interval. >>> > >>> > But I have no idea how important this is, and temporal datetypes are >>> > generally cumbersome to add support for (conversions, arithmetic, >>> etc.), >>> > so it would be nice to avoid adding too many of them :-) >>> > >>> > Regards >>> > >>> > Antoine. >>> > >>> > >>> > >>> > > >>> > > The 3 >>> > >> field implementation doesn't seem to have any way to represent >>> integral >>> > >> days, so I am also not sure about that one. >>> > > >>> > > >>> > > Sorry this was an email gaffe. I intended Month (32 bit int), Day >>> (32 >>> > bit >>> > > int), Nanosecond (64 bit int). >>> > > >>> > > OTOH I don't really understand the point of supporting "the most >>> > >> reasonable ranges for Year, Month and Nanoseconds independently". >>> What >>> > >> does it bring to encode more than one month in the nanoseconds >>> field? >>> > > >>> > > >>> > > I'm happy with simplicity. In the past there has been some >>> reference to >>> > > people wanting to store very large timestamps (fall out of >>> Nanoseconds >>> > max >>> > > representable value) but we've concluded that this wasn't something >>> that >>> > we >>> > > wanted to really support. >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > On Wed, Mar 31, 2021 at 4:49 AM Antoine Pitrou <anto...@python.org> >>> > wrote: >>> > > >>> > >> >>> > >> I would favour the following characteristics : >>> > >> - support for nanoseconds (especially as other Arrow temporal types >>> > >> support it) >>> > >> - easy to handle (which excludes the ZetaSQL representtaion IMHO) >>> > >> >>> > >> OTOH I don't really understand the point of supporting "the most >>> > >> reasonable ranges for Year, Month and Nanoseconds independently". >>> What >>> > >> does it bring to encode more than one month in the nanoseconds >>> field? >>> > >> You can already use the Duration type for that. >>> > >> >>> > >> Regards >>> > >> >>> > >> Antoine. >>> > >> >>> > >> >>> > >> Le 31/03/2021 à 05:48, Micah Kornfield a écrit : >>> > >>> To follow-up on this conversation I did some analysis on interval >>> > types: >>> > >>> >>> > >>> >>> > >> >>> > >>> https://docs.google.com/document/d/1i1E_fdQ_xODZcAhsV11Pfq27O50k679OYHXFJpm9NS0/edit >>> > >> Please feel free to add more details/systems I missed. >>> > >>> >>> > >>> Given the disparate requirements of different systems I think the >>> > >> following might make sense for official types (if there isn't >>> > consensus, I >>> > >> might try to contributation extension Array implementations for >>> them to >>> > >> Java and C++/Python separately). >>> > >>> >>> > >>> 1. 3 fields: Year (32 bit), Month (32 bit), Nanoseconds (64 bit) >>> all >>> > >> signed. >>> > >>> 2. Postgres representation (Downside is it doesn't support >>> > Nanoseconds, >>> > >> only microseconds). >>> > >>> 3. ZetaSQL implementation (Requires some bit manipulation) but >>> > supports >>> > >> the most reasonable ranges for Year, Month and Nanoseconds >>> > independently. >>> > >>> >>> > >>> Thoughts? >>> > >>> >>> > >>> Micah >>> > >>> >>> > >>> On 2021/02/18 04:30:55 Micah Kornfield wrote: >>> > >>>>> >>> > >>>>> I didn’t find any page/documentation on how to do RFC in Arrow >>> > >> protocol, >>> > >>>>> so can anyone point me to it or PR with email will be enough? >>> > >>>> >>> > >>>> That is enough to start discussion. Before formal acceptance and >>> > >> merging >>> > >>>> of the PR there needs to be a Java and C++ implementations for the >>> > type >>> > >>>> that pass integration tests. At the time this guideline was >>> > instituted >>> > >>>> Java and C++ were considered the "reference" implementations (I >>> think >>> > >> they >>> > >>>> still have the most complete integration test coverage). >>> > >>>> >>> > >>>> My understanding is that the current modelling of intervals >>> mimics SQL >>> > >>>> standards (e.g. SQL Server [1]). So it would also be good to step >>> > back >>> > >> and >>> > >>>> understand what problem DF is trying to solve and how it differs >>> from >>> > >> other >>> > >>>> SQL implementations. I'd be hesitant to accept COMPLEX as a new >>> type >>> > >>>> without a much deeper analysis into calendar representations >>> within >>> > >> Arrow >>> > >>>> and how they relate to other existing systems (e.g. Hive and some >>> > >>>> assortment of existing SQL databases). For instance the current >>> > >> modelling >>> > >>>> of timestamps does not lend itself to constructing a COMPLEX >>> interval >>> > >> type >>> > >>>> particularly well. (Duration was introduced for this reason). >>> > >>>> >>> > >>>> I think both Wes's suggestion of FixedSizeBinary and Andrew's of >>> > >> composing >>> > >>>> the with a struct are good stop-gaps. These obviously have >>> different >>> > >>>> trade-offs. Ultimately, it would be good to define common >>> extension >>> > >> types >>> > >>>> that can represent this use-case if there really is demand for it >>> (if >>> > it >>> > >>>> doesn't become a top level type). >>> > >>>> >>> > >>>> [1] >>> > >>>> >>> > >> >>> > >>> https://docs.microsoft.com/en-us/sql/odbc/reference/appendixes/interval-data-types?view=sql-server-ver15 >>> > >>>> >>> > >>>> -Micah >>> > >>>> >>> > >>>> On Wed, Feb 17, 2021 at 2:05 PM Andrew Lamb <al...@influxdata.com >>> > >>> > >> wrote: >>> > >>>> >>> > >>>>> That is a great suggestion Wes, thank you. >>> > >>>>> >>> > >>>>> I wonder if we could get away with a 128 bit representation that >>> is >>> > the >>> > >>>>> concatenation of the two existing interval types >>> > (YearMonth)(DayTime). >>> > >> Or >>> > >>>>> maybe even define a `struct` type with those fields that is used >>> by >>> > >>>>> DataFusion. >>> > >>>>> >>> > >>>>> Basically, given our reading of the Arrow spec[1], it is >>> currently >>> > not >>> > >>>>> possible to precisely represent an interval that has both >>> monthly and >>> > >>>>> sub-montly granularity. >>> > >>>>> >>> > >>>>> As Dmtry says, if you have an interval seemingly simple like 1 >>> > month, >>> > >> 1 >>> > >>>>> day >>> > >>>>> >>> > >>>>> Using IntervalUnit(YEAR_MONTH) can't represent the 1 day >>> > >>>>> Using IntervalUnit(DAY_TIME) can't represent the month as >>> different >>> > >> months >>> > >>>>> have different numbers of days >>> > >>>>> >>> > >>>>> [1] >>> > >>>>> >>> > >> >>> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L249-L260 >>> > >>>>> >>> > >>>>> >>> > >>>>> On Wed, Feb 17, 2021 at 5:01 PM Wes McKinney < >>> wesmck...@gmail.com> >>> > >> wrote: >>> > >>>>> >>> > >>>>>> On Wed, Feb 17, 2021 at 3:46 PM <t...@dmtry.me> wrote: >>> > >>>>>>> >>> > >>>>>>>> It's unclear to me that this needs to be introduced into the >>> > >>>>> top-level >>> > >>>>>>> >>> > >>>>>>> Similar thing to columnar format, How to store interval like 1 >>> > month >>> > >> 1 >>> > >>>>>> day 1 hour? It’s not possible to do it without converting 1 >>> month to >>> > >> 30 >>> > >>>>>> days, which is a bad way. >>> > >>>>>>> >>> > >>>>>> >>> > >>>>>> Presumably you can represent a complex interval in a fixed >>> number of >>> > >>>>>> bytes, and then embed the data in a FixedSizeBinary type. You >>> can >>> > >>>>>> adorn this type with extension type metadata so that DataFusion >>> can >>> > >>>>>> then apply Interval semantics to it. This could also serve as an >>> > >>>>>> interim strategy for you to proceed with implementation while >>> > >>>>>> proposing a top-level type to the Arrow format (which may or >>> may not >>> > >>>>>> be accepting) so you aren't blocked on acceptance of changes >>> into >>> > >>>>>> Schema.fbs. >>> > >>>>>> >>> > >>>>>>>> On 17 Feb 2021, at 21:02, Wes McKinney <wesmck...@gmail.com> >>> > wrote: >>> > >>>>>>>> >>> > >>>>>>>> It's unclear to me that this needs to be introduced into the >>> > >>>>> top-level >>> > >>>>>>>> columnar format without more analysis — have you considered >>> > >>>>>>>> implementing this for DataFusion as an extension type for the >>> time >>> > >>>>>>>> being? >>> > >>>>>>>> >>> > >>>>>>>> On Wed, Feb 17, 2021 at 11:59 AM t...@dmtry.me <mailto: >>> > >> t...@dmtry.me >>> > >>>>>> >>> > >>>>>> <t...@dmtry.me <mailto:t...@dmtry.me>> wrote: >>> > >>>>>>>>> >>> > >>>>>>>>> Hi, >>> > >>>>>>>>> >>> > >>>>>>>>> For now, There are only two types of IntervalUnit inside >>> Arrow: >>> > >>>>>>>>> >>> > >>>>>>>>> - YearMonth - month stored as int32 >>> > >>>>>>>>> - DayTime - days as int32 and time in milliseconds as in32. >>> > Total >>> > >>>>>> (64 bites) >>> > >>>>>>>>> >>> > >>>>>>>>> Since DF is using Arrow, It’s not possible to store “Complex” >>> > >>>>>> intervals such 1 MONTH 1 DAY 1 HOUR. >>> > >>>>>>>>> I think, the best way to understand the problem will be to >>> read a >>> > >>>>>> comment from DF codebase: >>> > >>>>>> >>> > >>>>> >>> > >> >>> > >>> https://github.com/apache/arrow/blob/bca7d2fe84ccd8fc1129cb4d85448eb0779c52c3/rust/datafusion/src/sql/planner.rs#L1148 >>> > >>>>>>>>> >>> > >>>>>>>>> // Interval is tricky thing >>> > >>>>>>>>> // 1 day is not 24 hours because timezones, 1 year >>> != >>> > >>>>> 365/364! >>> > >>>>>> 30 days != 1 month >>> > >>>>>>>>> // The true way to store and calculate intervals is >>> to >>> > >> store >>> > >>>>>> it as it defined >>> > >>>>>>>>> // Due the fact that Arrow supports only two types >>> > >> YearMonth >>> > >>>>>> (month) and DayTime (day, time) >>> > >>>>>>>>> // It's not possible to store complex intervals >>> > >>>>>>>>> // It's possible to do select (NOW() + INTERVAL '1 >>> > year') + >>> > >>>>>> INTERVAL '1 day'; as workaround >>> > >>>>>>>>> if result_month != 0 && (result_days != 0 || >>> > result_millis >>> > >> != >>> > >>>>>> 0) { >>> > >>>>>>>>> return >>> Err(DataFusionError::NotImplemented(format!( >>> > >>>>>>>>> "DF does not support intervals that have >>> both a >>> > >>>>>> Year/Month part as well as Days/Hours/Mins/Seconds: {:?}. Hint: >>> try >>> > >>>>>> breaking the interval into two parts, one with Year/Month and >>> the >>> > >> other >>> > >>>>>> with Days/Hours/Mins/Seconds - e.g. (NOW() + INTERVAL '1 year') >>> + >>> > >>>>> INTERVAL >>> > >>>>>> '1 day'", >>> > >>>>>>>>> value >>> > >>>>>>>>> ))); >>> > >>>>>>>>> } >>> > >>>>>>>>> >>> > >>>>>>>>> >>> > >>>>>>>>> >>> > >>>>>>>>> I prepared a PR >>> https://github.com/apache/arrow/pull/9516/files >>> > < >>> > >>>>>> https://github.com/apache/arrow/pull/9516/files> < >>> > >>>>>> https://github.com/apache/arrow/pull/9516/files < >>> > >>>>>> https://github.com/apache/arrow/pull/9516/files>> that >>> introduce a >>> > >> new >>> > >>>>>> type for IntervalUnit called Complex, that store both YearMonth >>> and >>> > >>>>> DayTime >>> > >>>>>> to support complex interval. >>> > >>>>>>>>> I didn’t find any page/documentation on how to do RFC in >>> Arrow >>> > >>>>>> protocol, so can anyone point me to it or PR with email will be >>> > >> enough? >>> > >>>>>>>>> >>> > >>>>>>>>> Thanks. >>> > >>>>>>> >>> > >>>>>> >>> > >>>>> >>> > >>>> >>> > >> >>> > > >>> > >>> >>