Andrew is the use-case you have simply postgres compatibility or is it more extensive?
One potential problem with combining Month and Day fields, is that the type no longer has a defined sort order (the existing Day-Millisecond type without assumptions, in particular because I don't think today there is an explicit constraint on the bounds for the millisecond component). -Micah On Wed, Mar 31, 2021 at 9:03 AM Antoine Pitrou <anto...@python.org> wrote: > > Le 31/03/2021 à 17:55, Micah Kornfield a écrit : > > Thanks for the feedback. A couple of points here and some responses > below. > > > > * One other question is whether the Nanoseconds should actually be > > configurable (i.e. use milliseconds or microseconds). I would lean > towards > > no. > > Same for me. > > > * I'm also still not 100% convinced we need this as a first class type in > > arrow or if we should be looking more closely at the Struct (in the Arrow > > sense) based implementation. In the future where alternative encodings > are > > supported, this could allow for much smaller footprints for this type. > > Having a "packed" first class type allows for better locality when > accessing data. It doesn't sound very likely that you'd access only one > component of the interval. > > But I have no idea how important this is, and temporal datetypes are > generally cumbersome to add support for (conversions, arithmetic, etc.), > so it would be nice to avoid adding too many of them :-) > > Regards > > Antoine. > > > > > > > The 3 > >> field implementation doesn't seem to have any way to represent integral > >> days, so I am also not sure about that one. > > > > > > Sorry this was an email gaffe. I intended Month (32 bit int), Day (32 > bit > > int), Nanosecond (64 bit int). > > > > OTOH I don't really understand the point of supporting "the most > >> reasonable ranges for Year, Month and Nanoseconds independently". What > >> does it bring to encode more than one month in the nanoseconds field? > > > > > > I'm happy with simplicity. In the past there has been some reference to > > people wanting to store very large timestamps (fall out of Nanoseconds > max > > representable value) but we've concluded that this wasn't something that > we > > wanted to really support. > > > > > > > > > > > > > > On Wed, Mar 31, 2021 at 4:49 AM Antoine Pitrou <anto...@python.org> > wrote: > > > >> > >> I would favour the following characteristics : > >> - support for nanoseconds (especially as other Arrow temporal types > >> support it) > >> - easy to handle (which excludes the ZetaSQL representtaion IMHO) > >> > >> OTOH I don't really understand the point of supporting "the most > >> reasonable ranges for Year, Month and Nanoseconds independently". What > >> does it bring to encode more than one month in the nanoseconds field? > >> You can already use the Duration type for that. > >> > >> Regards > >> > >> Antoine. > >> > >> > >> Le 31/03/2021 à 05:48, Micah Kornfield a écrit : > >>> To follow-up on this conversation I did some analysis on interval > types: > >>> > >>> > >> > https://docs.google.com/document/d/1i1E_fdQ_xODZcAhsV11Pfq27O50k679OYHXFJpm9NS0/edit > >> Please feel free to add more details/systems I missed. > >>> > >>> Given the disparate requirements of different systems I think the > >> following might make sense for official types (if there isn't > consensus, I > >> might try to contributation extension Array implementations for them to > >> Java and C++/Python separately). > >>> > >>> 1. 3 fields: Year (32 bit), Month (32 bit), Nanoseconds (64 bit) all > >> signed. > >>> 2. Postgres representation (Downside is it doesn't support > Nanoseconds, > >> only microseconds). > >>> 3. ZetaSQL implementation (Requires some bit manipulation) but > supports > >> the most reasonable ranges for Year, Month and Nanoseconds > independently. > >>> > >>> Thoughts? > >>> > >>> Micah > >>> > >>> On 2021/02/18 04:30:55 Micah Kornfield wrote: > >>>>> > >>>>> I didn’t find any page/documentation on how to do RFC in Arrow > >> protocol, > >>>>> so can anyone point me to it or PR with email will be enough? > >>>> > >>>> That is enough to start discussion. Before formal acceptance and > >> merging > >>>> of the PR there needs to be a Java and C++ implementations for the > type > >>>> that pass integration tests. At the time this guideline was > instituted > >>>> Java and C++ were considered the "reference" implementations (I think > >> they > >>>> still have the most complete integration test coverage). > >>>> > >>>> My understanding is that the current modelling of intervals mimics SQL > >>>> standards (e.g. SQL Server [1]). So it would also be good to step > back > >> and > >>>> understand what problem DF is trying to solve and how it differs from > >> other > >>>> SQL implementations. I'd be hesitant to accept COMPLEX as a new type > >>>> without a much deeper analysis into calendar representations within > >> Arrow > >>>> and how they relate to other existing systems (e.g. Hive and some > >>>> assortment of existing SQL databases). For instance the current > >> modelling > >>>> of timestamps does not lend itself to constructing a COMPLEX interval > >> type > >>>> particularly well. (Duration was introduced for this reason). > >>>> > >>>> I think both Wes's suggestion of FixedSizeBinary and Andrew's of > >> composing > >>>> the with a struct are good stop-gaps. These obviously have different > >>>> trade-offs. Ultimately, it would be good to define common extension > >> types > >>>> that can represent this use-case if there really is demand for it (if > it > >>>> doesn't become a top level type). > >>>> > >>>> [1] > >>>> > >> > https://docs.microsoft.com/en-us/sql/odbc/reference/appendixes/interval-data-types?view=sql-server-ver15 > >>>> > >>>> -Micah > >>>> > >>>> On Wed, Feb 17, 2021 at 2:05 PM Andrew Lamb <al...@influxdata.com> > >> wrote: > >>>> > >>>>> That is a great suggestion Wes, thank you. > >>>>> > >>>>> I wonder if we could get away with a 128 bit representation that is > the > >>>>> concatenation of the two existing interval types > (YearMonth)(DayTime). > >> Or > >>>>> maybe even define a `struct` type with those fields that is used by > >>>>> DataFusion. > >>>>> > >>>>> Basically, given our reading of the Arrow spec[1], it is currently > not > >>>>> possible to precisely represent an interval that has both monthly and > >>>>> sub-montly granularity. > >>>>> > >>>>> As Dmtry says, if you have an interval seemingly simple like 1 > month, > >> 1 > >>>>> day > >>>>> > >>>>> Using IntervalUnit(YEAR_MONTH) can't represent the 1 day > >>>>> Using IntervalUnit(DAY_TIME) can't represent the month as different > >> months > >>>>> have different numbers of days > >>>>> > >>>>> [1] > >>>>> > >> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L249-L260 > >>>>> > >>>>> > >>>>> On Wed, Feb 17, 2021 at 5:01 PM Wes McKinney <wesmck...@gmail.com> > >> wrote: > >>>>> > >>>>>> On Wed, Feb 17, 2021 at 3:46 PM <t...@dmtry.me> wrote: > >>>>>>> > >>>>>>>> It's unclear to me that this needs to be introduced into the > >>>>> top-level > >>>>>>> > >>>>>>> Similar thing to columnar format, How to store interval like 1 > month > >> 1 > >>>>>> day 1 hour? It’s not possible to do it without converting 1 month to > >> 30 > >>>>>> days, which is a bad way. > >>>>>>> > >>>>>> > >>>>>> Presumably you can represent a complex interval in a fixed number of > >>>>>> bytes, and then embed the data in a FixedSizeBinary type. You can > >>>>>> adorn this type with extension type metadata so that DataFusion can > >>>>>> then apply Interval semantics to it. This could also serve as an > >>>>>> interim strategy for you to proceed with implementation while > >>>>>> proposing a top-level type to the Arrow format (which may or may not > >>>>>> be accepting) so you aren't blocked on acceptance of changes into > >>>>>> Schema.fbs. > >>>>>> > >>>>>>>> On 17 Feb 2021, at 21:02, Wes McKinney <wesmck...@gmail.com> > wrote: > >>>>>>>> > >>>>>>>> It's unclear to me that this needs to be introduced into the > >>>>> top-level > >>>>>>>> columnar format without more analysis — have you considered > >>>>>>>> implementing this for DataFusion as an extension type for the time > >>>>>>>> being? > >>>>>>>> > >>>>>>>> On Wed, Feb 17, 2021 at 11:59 AM t...@dmtry.me <mailto: > >> t...@dmtry.me > >>>>>> > >>>>>> <t...@dmtry.me <mailto:t...@dmtry.me>> wrote: > >>>>>>>>> > >>>>>>>>> Hi, > >>>>>>>>> > >>>>>>>>> For now, There are only two types of IntervalUnit inside Arrow: > >>>>>>>>> > >>>>>>>>> - YearMonth - month stored as int32 > >>>>>>>>> - DayTime - days as int32 and time in milliseconds as in32. > Total > >>>>>> (64 bites) > >>>>>>>>> > >>>>>>>>> Since DF is using Arrow, It’s not possible to store “Complex” > >>>>>> intervals such 1 MONTH 1 DAY 1 HOUR. > >>>>>>>>> I think, the best way to understand the problem will be to read a > >>>>>> comment from DF codebase: > >>>>>> > >>>>> > >> > https://github.com/apache/arrow/blob/bca7d2fe84ccd8fc1129cb4d85448eb0779c52c3/rust/datafusion/src/sql/planner.rs#L1148 > >>>>>>>>> > >>>>>>>>> // Interval is tricky thing > >>>>>>>>> // 1 day is not 24 hours because timezones, 1 year != > >>>>> 365/364! > >>>>>> 30 days != 1 month > >>>>>>>>> // The true way to store and calculate intervals is to > >> store > >>>>>> it as it defined > >>>>>>>>> // Due the fact that Arrow supports only two types > >> YearMonth > >>>>>> (month) and DayTime (day, time) > >>>>>>>>> // It's not possible to store complex intervals > >>>>>>>>> // It's possible to do select (NOW() + INTERVAL '1 > year') + > >>>>>> INTERVAL '1 day'; as workaround > >>>>>>>>> if result_month != 0 && (result_days != 0 || > result_millis > >> != > >>>>>> 0) { > >>>>>>>>> return Err(DataFusionError::NotImplemented(format!( > >>>>>>>>> "DF does not support intervals that have both a > >>>>>> Year/Month part as well as Days/Hours/Mins/Seconds: {:?}. Hint: try > >>>>>> breaking the interval into two parts, one with Year/Month and the > >> other > >>>>>> with Days/Hours/Mins/Seconds - e.g. (NOW() + INTERVAL '1 year') + > >>>>> INTERVAL > >>>>>> '1 day'", > >>>>>>>>> value > >>>>>>>>> ))); > >>>>>>>>> } > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> I prepared a PR https://github.com/apache/arrow/pull/9516/files > < > >>>>>> https://github.com/apache/arrow/pull/9516/files> < > >>>>>> https://github.com/apache/arrow/pull/9516/files < > >>>>>> https://github.com/apache/arrow/pull/9516/files>> that introduce a > >> new > >>>>>> type for IntervalUnit called Complex, that store both YearMonth and > >>>>> DayTime > >>>>>> to support complex interval. > >>>>>>>>> I didn’t find any page/documentation on how to do RFC in Arrow > >>>>>> protocol, so can anyone point me to it or PR with email will be > >> enough? > >>>>>>>>> > >>>>>>>>> Thanks. > >>>>>>> > >>>>>> > >>>>> > >>>> > >> > > >