Hi Jacek, This seems like reasonable functionality. I think the probably comes in two parts: 1. This might be a good candidate for a "Well Known"/Officially supported Extension type. I can think of a few different representations but I would guess something like Struct[start: T, struct: end]] with well defined extension metadata to define open/closed on start and end might be the best (we should probably spin this off into a separate discussion thread). 2. Adding the right computation Kernels to work with the type.
Do you want to start a new thread or open up some JIRAs to track this work? Thanks, Micah On Mon, May 3, 2021 at 5:32 AM Jacek Pliszka <jacek.plis...@gmail.com> wrote: > Sorry, my mistake. > > You are right - I meant anchored intervals as in pandas - ones with > defined start and end - and I think many future users will make the > same mistake. > > I would love to be able to do fast overlap joins on arrow level. > > Best Regards, > > Jacek > > > > > niedz., 2 maj 2021 o 23:06 Wes McKinney <wesmck...@gmail.com> napisał(a): > > > > I also don't understand the comment about closed / open / semi-open > > intervals. Perhaps there is a confusion, since "interval" as we mean > > it here is called a "time delta" in some other projects. An interval > > here does not refer to a time span with a distinct start and end point > > (I understand this might be confusing to a pandas user since pandas > > has an interval data type where each value is a tuple of arbitrary > > start/end). > > > > On Sun, May 2, 2021 at 3:46 PM Micah Kornfield <emkornfi...@gmail.com> > wrote: > > > > > > Hi Jacek, > > > I'm not sure I fully understand the proposal, could you elaborate with > more > > > examples/details? For instance DAY_TIME isn't just a UINT64, it > actually > > > contains 2 seperate fields (days and milliseconds). > > > > > > In terms of closed vs half-open, in my limited understanding, that is > more > > > a concern of functions using interval types rather than the type > itself. > > > For instance a quick search of postgres [1] docs only talks about > half-open > > > in relation to the "Overlaps" operator > > > > > > Thanks, > > > -Micah > > > > > > [1] https://www.postgresql.org/docs/9.1/functions-datetime.html > > > > > > > > > > > > On Sun, May 2, 2021 at 12:25 AM Jacek Pliszka <jacek.plis...@gmail.com > > > > > wrote: > > > > > > > Hi! > > > > > > > > I wonder if it were possible to have generic interval with integers > of > > > > specified size just to have common base for interval arithmetic. > > > > > > > > Then user can convert their period to ordinals and use the arithmetic > > > > (joining, deoverlapping, common parts, explosion etc.). > > > > > > > > So YEAR_MONTH and DAY_TIME would be just special cases of > > > > INTERVAL_UINT32 and INTERVAL_UINT64 > > > > > > > > Also I believe it is worth to state whether there are only closed > > > > intervals or open/semi-open ones are allowed as well. > > > > > > > > I believe I am just one of many reinventing the wheel here and > writing > > > > own versions of the above. > > > > > > > > BR, > > > > > > > > Jacek > > > > > > > > > > > > pt., 2 kwi 2021 o 21:53 Micah Kornfield <emkornfi...@gmail.com> > > > > napisał(a): > > > > > > > > > > Andrew is the use-case you have simply postgres compatibility or > is it > > > > more > > > > > extensive? > > > > > > > > > > One potential problem with combining Month and Day fields, is that > the > > > > type > > > > > no longer has a defined sort order (the existing Day-Millisecond > type > > > > > without assumptions, in particular because I don't think today > there is > > > > an > > > > > explicit constraint on the bounds for the millisecond component). > > > > > > > > > > -Micah > > > > > > > > > > > > > > > > > > > > On Wed, Mar 31, 2021 at 9:03 AM Antoine Pitrou <anto...@python.org > > > > > > wrote: > > > > > > > > > > > > > > > > > Le 31/03/2021 à 17:55, Micah Kornfield a écrit : > > > > > > > Thanks for the feedback. A couple of points here and some > responses > > > > > > below. > > > > > > > > > > > > > > * One other question is whether the Nanoseconds should > actually be > > > > > > > configurable (i.e. use milliseconds or microseconds). I would > lean > > > > > > towards > > > > > > > no. > > > > > > > > > > > > Same for me. > > > > > > > > > > > > > * I'm also still not 100% convinced we need this as a first > class > > > > type in > > > > > > > arrow or if we should be looking more closely at the Struct > (in the > > > > Arrow > > > > > > > sense) based implementation. In the future where alternative > > > > encodings > > > > > > are > > > > > > > supported, this could allow for much smaller footprints for > this > > > > type. > > > > > > > > > > > > Having a "packed" first class type allows for better locality > when > > > > > > accessing data. It doesn't sound very likely that you'd access > only > > > > one > > > > > > component of the interval. > > > > > > > > > > > > But I have no idea how important this is, and temporal datetypes > are > > > > > > generally cumbersome to add support for (conversions, arithmetic, > > > > etc.), > > > > > > so it would be nice to avoid adding too many of them :-) > > > > > > > > > > > > Regards > > > > > > > > > > > > Antoine. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The 3 > > > > > > >> field implementation doesn't seem to have any way to represent > > > > integral > > > > > > >> days, so I am also not sure about that one. > > > > > > > > > > > > > > > > > > > > > Sorry this was an email gaffe. I intended Month (32 bit int), > Day > > > > (32 > > > > > > bit > > > > > > > int), Nanosecond (64 bit int). > > > > > > > > > > > > > > OTOH I don't really understand the point of supporting "the > most > > > > > > >> reasonable ranges for Year, Month and Nanoseconds > independently". > > > > What > > > > > > >> does it bring to encode more than one month in the nanoseconds > > > > field? > > > > > > > > > > > > > > > > > > > > > I'm happy with simplicity. In the past there has been some > > > > reference to > > > > > > > people wanting to store very large timestamps (fall out of > > > > Nanoseconds > > > > > > max > > > > > > > representable value) but we've concluded that this wasn't > something > > > > that > > > > > > we > > > > > > > wanted to really support. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Mar 31, 2021 at 4:49 AM Antoine Pitrou < > anto...@python.org> > > > > > > wrote: > > > > > > > > > > > > > >> > > > > > > >> I would favour the following characteristics : > > > > > > >> - support for nanoseconds (especially as other Arrow temporal > types > > > > > > >> support it) > > > > > > >> - easy to handle (which excludes the ZetaSQL representtaion > IMHO) > > > > > > >> > > > > > > >> OTOH I don't really understand the point of supporting "the > most > > > > > > >> reasonable ranges for Year, Month and Nanoseconds > independently". > > > > What > > > > > > >> does it bring to encode more than one month in the nanoseconds > > > > field? > > > > > > >> You can already use the Duration type for that. > > > > > > >> > > > > > > >> Regards > > > > > > >> > > > > > > >> Antoine. > > > > > > >> > > > > > > >> > > > > > > >> Le 31/03/2021 à 05:48, Micah Kornfield a écrit : > > > > > > >>> To follow-up on this conversation I did some analysis on > interval > > > > > > types: > > > > > > >>> > > > > > > >>> > > > > > > >> > > > > > > > > > > > https://docs.google.com/document/d/1i1E_fdQ_xODZcAhsV11Pfq27O50k679OYHXFJpm9NS0/edit > > > > > > >> Please feel free to add more details/systems I missed. > > > > > > >>> > > > > > > >>> Given the disparate requirements of different systems I > think the > > > > > > >> following might make sense for official types (if there isn't > > > > > > consensus, I > > > > > > >> might try to contributation extension Array implementations > for > > > > them to > > > > > > >> Java and C++/Python separately). > > > > > > >>> > > > > > > >>> 1. 3 fields: Year (32 bit), Month (32 bit), Nanoseconds (64 > bit) > > > > all > > > > > > >> signed. > > > > > > >>> 2. Postgres representation (Downside is it doesn't support > > > > > > Nanoseconds, > > > > > > >> only microseconds). > > > > > > >>> 3. ZetaSQL implementation (Requires some bit manipulation) > but > > > > > > supports > > > > > > >> the most reasonable ranges for Year, Month and Nanoseconds > > > > > > independently. > > > > > > >>> > > > > > > >>> Thoughts? > > > > > > >>> > > > > > > >>> Micah > > > > > > >>> > > > > > > >>> On 2021/02/18 04:30:55 Micah Kornfield wrote: > > > > > > >>>>> > > > > > > >>>>> I didn’t find any page/documentation on how to do RFC in > Arrow > > > > > > >> protocol, > > > > > > >>>>> so can anyone point me to it or PR with email will be > enough? > > > > > > >>>> > > > > > > >>>> That is enough to start discussion. Before formal > acceptance and > > > > > > >> merging > > > > > > >>>> of the PR there needs to be a Java and C++ implementations > for the > > > > > > type > > > > > > >>>> that pass integration tests. At the time this guideline was > > > > > > instituted > > > > > > >>>> Java and C++ were considered the "reference" > implementations (I > > > > think > > > > > > >> they > > > > > > >>>> still have the most complete integration test coverage). > > > > > > >>>> > > > > > > >>>> My understanding is that the current modelling of intervals > > > > mimics SQL > > > > > > >>>> standards (e.g. SQL Server [1]). So it would also be good > to step > > > > > > back > > > > > > >> and > > > > > > >>>> understand what problem DF is trying to solve and how it > differs > > > > from > > > > > > >> other > > > > > > >>>> SQL implementations. I'd be hesitant to accept COMPLEX as > a new > > > > type > > > > > > >>>> without a much deeper analysis into calendar representations > > > > within > > > > > > >> Arrow > > > > > > >>>> and how they relate to other existing systems (e.g. Hive > and some > > > > > > >>>> assortment of existing SQL databases). For instance the > current > > > > > > >> modelling > > > > > > >>>> of timestamps does not lend itself to constructing a COMPLEX > > > > interval > > > > > > >> type > > > > > > >>>> particularly well. (Duration was introduced for this > reason). > > > > > > >>>> > > > > > > >>>> I think both Wes's suggestion of FixedSizeBinary and > Andrew's of > > > > > > >> composing > > > > > > >>>> the with a struct are good stop-gaps. These obviously have > > > > different > > > > > > >>>> trade-offs. Ultimately, it would be good to define common > > > > extension > > > > > > >> types > > > > > > >>>> that can represent this use-case if there really is demand > for it > > > > (if > > > > > > it > > > > > > >>>> doesn't become a top level type). > > > > > > >>>> > > > > > > >>>> [1] > > > > > > >>>> > > > > > > >> > > > > > > > > > > > https://docs.microsoft.com/en-us/sql/odbc/reference/appendixes/interval-data-types?view=sql-server-ver15 > > > > > > >>>> > > > > > > >>>> -Micah > > > > > > >>>> > > > > > > >>>> On Wed, Feb 17, 2021 at 2:05 PM Andrew Lamb < > al...@influxdata.com > > > > > > > > > > > >> wrote: > > > > > > >>>> > > > > > > >>>>> That is a great suggestion Wes, thank you. > > > > > > >>>>> > > > > > > >>>>> I wonder if we could get away with a 128 bit > representation that > > > > is > > > > > > the > > > > > > >>>>> concatenation of the two existing interval types > > > > > > (YearMonth)(DayTime). > > > > > > >> Or > > > > > > >>>>> maybe even define a `struct` type with those fields that > is used > > > > by > > > > > > >>>>> DataFusion. > > > > > > >>>>> > > > > > > >>>>> Basically, given our reading of the Arrow spec[1], it is > > > > currently > > > > > > not > > > > > > >>>>> possible to precisely represent an interval that has both > > > > monthly and > > > > > > >>>>> sub-montly granularity. > > > > > > >>>>> > > > > > > >>>>> As Dmtry says, if you have an interval seemingly simple > like 1 > > > > > > month, > > > > > > >> 1 > > > > > > >>>>> day > > > > > > >>>>> > > > > > > >>>>> Using IntervalUnit(YEAR_MONTH) can't represent the 1 day > > > > > > >>>>> Using IntervalUnit(DAY_TIME) can't represent the month as > > > > different > > > > > > >> months > > > > > > >>>>> have different numbers of days > > > > > > >>>>> > > > > > > >>>>> [1] > > > > > > >>>>> > > > > > > >> > > > > > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L249-L260 > > > > > > >>>>> > > > > > > >>>>> > > > > > > >>>>> On Wed, Feb 17, 2021 at 5:01 PM Wes McKinney < > > > > wesmck...@gmail.com> > > > > > > >> wrote: > > > > > > >>>>> > > > > > > >>>>>> On Wed, Feb 17, 2021 at 3:46 PM <t...@dmtry.me> wrote: > > > > > > >>>>>>> > > > > > > >>>>>>>> It's unclear to me that this needs to be introduced > into the > > > > > > >>>>> top-level > > > > > > >>>>>>> > > > > > > >>>>>>> Similar thing to columnar format, How to store interval > like 1 > > > > > > month > > > > > > >> 1 > > > > > > >>>>>> day 1 hour? It’s not possible to do it without converting > 1 > > > > month to > > > > > > >> 30 > > > > > > >>>>>> days, which is a bad way. > > > > > > >>>>>>> > > > > > > >>>>>> > > > > > > >>>>>> Presumably you can represent a complex interval in a fixed > > > > number of > > > > > > >>>>>> bytes, and then embed the data in a FixedSizeBinary type. > You > > > > can > > > > > > >>>>>> adorn this type with extension type metadata so that > DataFusion > > > > can > > > > > > >>>>>> then apply Interval semantics to it. This could also > serve as an > > > > > > >>>>>> interim strategy for you to proceed with implementation > while > > > > > > >>>>>> proposing a top-level type to the Arrow format (which may > or > > > > may not > > > > > > >>>>>> be accepting) so you aren't blocked on acceptance of > changes > > > > into > > > > > > >>>>>> Schema.fbs. > > > > > > >>>>>> > > > > > > >>>>>>>> On 17 Feb 2021, at 21:02, Wes McKinney < > wesmck...@gmail.com> > > > > > > wrote: > > > > > > >>>>>>>> > > > > > > >>>>>>>> It's unclear to me that this needs to be introduced > into the > > > > > > >>>>> top-level > > > > > > >>>>>>>> columnar format without more analysis — have you > considered > > > > > > >>>>>>>> implementing this for DataFusion as an extension type > for the > > > > time > > > > > > >>>>>>>> being? > > > > > > >>>>>>>> > > > > > > >>>>>>>> On Wed, Feb 17, 2021 at 11:59 AM t...@dmtry.me <mailto: > > > > > > >> t...@dmtry.me > > > > > > >>>>>> > > > > > > >>>>>> <t...@dmtry.me <mailto:t...@dmtry.me>> wrote: > > > > > > >>>>>>>>> > > > > > > >>>>>>>>> Hi, > > > > > > >>>>>>>>> > > > > > > >>>>>>>>> For now, There are only two types of IntervalUnit > inside > > > > Arrow: > > > > > > >>>>>>>>> > > > > > > >>>>>>>>> - YearMonth - month stored as int32 > > > > > > >>>>>>>>> - DayTime - days as int32 and time in milliseconds as > in32. > > > > > > Total > > > > > > >>>>>> (64 bites) > > > > > > >>>>>>>>> > > > > > > >>>>>>>>> Since DF is using Arrow, It’s not possible to store > “Complex” > > > > > > >>>>>> intervals such 1 MONTH 1 DAY 1 HOUR. > > > > > > >>>>>>>>> I think, the best way to understand the problem will > be to > > > > read a > > > > > > >>>>>> comment from DF codebase: > > > > > > >>>>>> > > > > > > >>>>> > > > > > > >> > > > > > > > > > > > https://github.com/apache/arrow/blob/bca7d2fe84ccd8fc1129cb4d85448eb0779c52c3/rust/datafusion/src/sql/planner.rs#L1148 > > > > > > >>>>>>>>> > > > > > > >>>>>>>>> // Interval is tricky thing > > > > > > >>>>>>>>> // 1 day is not 24 hours because timezones, 1 > year > > > > != > > > > > > >>>>> 365/364! > > > > > > >>>>>> 30 days != 1 month > > > > > > >>>>>>>>> // The true way to store and calculate > intervals is > > > > to > > > > > > >> store > > > > > > >>>>>> it as it defined > > > > > > >>>>>>>>> // Due the fact that Arrow supports only two > types > > > > > > >> YearMonth > > > > > > >>>>>> (month) and DayTime (day, time) > > > > > > >>>>>>>>> // It's not possible to store complex > intervals > > > > > > >>>>>>>>> // It's possible to do select (NOW() + > INTERVAL '1 > > > > > > year') + > > > > > > >>>>>> INTERVAL '1 day'; as workaround > > > > > > >>>>>>>>> if result_month != 0 && (result_days != 0 || > > > > > > result_millis > > > > > > >> != > > > > > > >>>>>> 0) { > > > > > > >>>>>>>>> return > > > > Err(DataFusionError::NotImplemented(format!( > > > > > > >>>>>>>>> "DF does not support intervals that > have > > > > both a > > > > > > >>>>>> Year/Month part as well as Days/Hours/Mins/Seconds: {:?}. > Hint: > > > > try > > > > > > >>>>>> breaking the interval into two parts, one with Year/Month > and > > > > the > > > > > > >> other > > > > > > >>>>>> with Days/Hours/Mins/Seconds - e.g. (NOW() + INTERVAL '1 > year') > > > > + > > > > > > >>>>> INTERVAL > > > > > > >>>>>> '1 day'", > > > > > > >>>>>>>>> value > > > > > > >>>>>>>>> ))); > > > > > > >>>>>>>>> } > > > > > > >>>>>>>>> > > > > > > >>>>>>>>> > > > > > > >>>>>>>>> > > > > > > >>>>>>>>> I prepared a PR > > > > https://github.com/apache/arrow/pull/9516/files > > > > > > < > > > > > > >>>>>> https://github.com/apache/arrow/pull/9516/files> < > > > > > > >>>>>> https://github.com/apache/arrow/pull/9516/files < > > > > > > >>>>>> https://github.com/apache/arrow/pull/9516/files>> that > > > > introduce a > > > > > > >> new > > > > > > >>>>>> type for IntervalUnit called Complex, that store both > YearMonth > > > > and > > > > > > >>>>> DayTime > > > > > > >>>>>> to support complex interval. > > > > > > >>>>>>>>> I didn’t find any page/documentation on how to do RFC > in > > > > Arrow > > > > > > >>>>>> protocol, so can anyone point me to it or PR with email > will be > > > > > > >> enough? > > > > > > >>>>>>>>> > > > > > > >>>>>>>>> Thanks. > > > > > > >>>>>>> > > > > > > >>>>>> > > > > > > >>>>> > > > > > > >>>> > > > > > > >> > > > > > > > > > > > > > > > > > >