I'll address the feedback. I think in the past we've waited for implementations in java and c++ with integration tests before formally voting. If there is no more feedback I can start looking at implementations (happy to have help)
On Thursday, May 6, 2021, Wes McKinney <wesmck...@gmail.com> wrote: > The PR looks good. I just left some comments about typos. I would say > it's probably about time to call a vote. Anywhere else where we should > be soliciting feedback? > > On Mon, May 3, 2021 at 2:17 PM Jacek Pliszka <jacek.plis...@gmail.com> > wrote: > > > > Good idea, I've created JIRA issue: > > > > https://issues.apache.org/jira/browse/ARROW-12637 > > > > And named it range to avoid confusion with intervals... > > Though confusion will stay as it is called interval in Pandas and in > > logic (Allen's interval algebra) > > > > BR, > > > > Jacek > > > > pon., 3 maj 2021 o 18:05 Micah Kornfield <emkornfi...@gmail.com> > napisał(a): > > > > > > Hi Jacek, > > > This seems like reasonable functionality. I think the probably comes > in > > > two parts: > > > 1. This might be a good candidate for a "Well Known"/Officially > supported > > > Extension type. I can think of a few different representations but I > would > > > guess something like Struct[start: T, struct: end]] with well defined > > > extension metadata to define open/closed on start and end might be the > best > > > (we should probably spin this off into a separate discussion thread). > > > 2. Adding the right computation Kernels to work with the type. > > > > > > Do you want to start a new thread or open up some JIRAs to track this > work? > > > > > > Thanks, > > > Micah > > > > > > On Mon, May 3, 2021 at 5:32 AM Jacek Pliszka <jacek.plis...@gmail.com> > > > wrote: > > > > > > > Sorry, my mistake. > > > > > > > > You are right - I meant anchored intervals as in pandas - ones with > > > > defined start and end - and I think many future users will make the > > > > same mistake. > > > > > > > > I would love to be able to do fast overlap joins on arrow level. > > > > > > > > Best Regards, > > > > > > > > Jacek > > > > > > > > > > > > > > > > > > > > niedz., 2 maj 2021 o 23:06 Wes McKinney <wesmck...@gmail.com> > napisał(a): > > > > > > > > > > I also don't understand the comment about closed / open / semi-open > > > > > intervals. Perhaps there is a confusion, since "interval" as we > mean > > > > > it here is called a "time delta" in some other projects. An > interval > > > > > here does not refer to a time span with a distinct start and end > point > > > > > (I understand this might be confusing to a pandas user since pandas > > > > > has an interval data type where each value is a tuple of arbitrary > > > > > start/end). > > > > > > > > > > On Sun, May 2, 2021 at 3:46 PM Micah Kornfield < > emkornfi...@gmail.com> > > > > wrote: > > > > > > > > > > > > Hi Jacek, > > > > > > I'm not sure I fully understand the proposal, could you > elaborate with > > > > more > > > > > > examples/details? For instance DAY_TIME isn't just a UINT64, it > > > > actually > > > > > > contains 2 seperate fields (days and milliseconds). > > > > > > > > > > > > In terms of closed vs half-open, in my limited understanding, > that is > > > > more > > > > > > a concern of functions using interval types rather than the type > > > > itself. > > > > > > For instance a quick search of postgres [1] docs only talks about > > > > half-open > > > > > > in relation to the "Overlaps" operator > > > > > > > > > > > > Thanks, > > > > > > -Micah > > > > > > > > > > > > [1] https://www.postgresql.org/docs/9.1/functions-datetime.html > > > > > > > > > > > > > > > > > > > > > > > > On Sun, May 2, 2021 at 12:25 AM Jacek Pliszka < > jacek.plis...@gmail.com > > > > > > > > > > > wrote: > > > > > > > > > > > > > Hi! > > > > > > > > > > > > > > I wonder if it were possible to have generic interval with > integers > > > > of > > > > > > > specified size just to have common base for interval > arithmetic. > > > > > > > > > > > > > > Then user can convert their period to ordinals and use the > arithmetic > > > > > > > (joining, deoverlapping, common parts, explosion etc.). > > > > > > > > > > > > > > So YEAR_MONTH and DAY_TIME would be just special cases of > > > > > > > INTERVAL_UINT32 and INTERVAL_UINT64 > > > > > > > > > > > > > > Also I believe it is worth to state whether there are only > closed > > > > > > > intervals or open/semi-open ones are allowed as well. > > > > > > > > > > > > > > I believe I am just one of many reinventing the wheel here and > > > > writing > > > > > > > own versions of the above. > > > > > > > > > > > > > > BR, > > > > > > > > > > > > > > Jacek > > > > > > > > > > > > > > > > > > > > > pt., 2 kwi 2021 o 21:53 Micah Kornfield <emkornfi...@gmail.com > > > > > > > > > napisał(a): > > > > > > > > > > > > > > > > Andrew is the use-case you have simply postgres > compatibility or > > > > is it > > > > > > > more > > > > > > > > extensive? > > > > > > > > > > > > > > > > One potential problem with combining Month and Day fields, > is that > > > > the > > > > > > > type > > > > > > > > no longer has a defined sort order (the existing > Day-Millisecond > > > > type > > > > > > > > without assumptions, in particular because I don't think > today > > > > there is > > > > > > > an > > > > > > > > explicit constraint on the bounds for the millisecond > component). > > > > > > > > > > > > > > > > -Micah > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Mar 31, 2021 at 9:03 AM Antoine Pitrou < > anto...@python.org > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > Le 31/03/2021 à 17:55, Micah Kornfield a écrit : > > > > > > > > > > Thanks for the feedback. A couple of points here and > some > > > > responses > > > > > > > > > below. > > > > > > > > > > > > > > > > > > > > * One other question is whether the Nanoseconds should > > > > actually be > > > > > > > > > > configurable (i.e. use milliseconds or microseconds). I > would > > > > lean > > > > > > > > > towards > > > > > > > > > > no. > > > > > > > > > > > > > > > > > > Same for me. > > > > > > > > > > > > > > > > > > > * I'm also still not 100% convinced we need this as a > first > > > > class > > > > > > > type in > > > > > > > > > > arrow or if we should be looking more closely at the > Struct > > > > (in the > > > > > > > Arrow > > > > > > > > > > sense) based implementation. In the future where > alternative > > > > > > > encodings > > > > > > > > > are > > > > > > > > > > supported, this could allow for much smaller footprints > for > > > > this > > > > > > > type. > > > > > > > > > > > > > > > > > > Having a "packed" first class type allows for better > locality > > > > when > > > > > > > > > accessing data. It doesn't sound very likely that you'd > access > > > > only > > > > > > > one > > > > > > > > > component of the interval. > > > > > > > > > > > > > > > > > > But I have no idea how important this is, and temporal > datetypes > > > > are > > > > > > > > > generally cumbersome to add support for (conversions, > arithmetic, > > > > > > > etc.), > > > > > > > > > so it would be nice to avoid adding too many of them :-) > > > > > > > > > > > > > > > > > > Regards > > > > > > > > > > > > > > > > > > Antoine. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The 3 > > > > > > > > > >> field implementation doesn't seem to have any way to > represent > > > > > > > integral > > > > > > > > > >> days, so I am also not sure about that one. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sorry this was an email gaffe. I intended Month (32 bit > int), > > > > Day > > > > > > > (32 > > > > > > > > > bit > > > > > > > > > > int), Nanosecond (64 bit int). > > > > > > > > > > > > > > > > > > > > OTOH I don't really understand the point of supporting > "the > > > > most > > > > > > > > > >> reasonable ranges for Year, Month and Nanoseconds > > > > independently". > > > > > > > What > > > > > > > > > >> does it bring to encode more than one month in the > nanoseconds > > > > > > > field? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'm happy with simplicity. In the past there has been > some > > > > > > > reference to > > > > > > > > > > people wanting to store very large timestamps (fall out > of > > > > > > > Nanoseconds > > > > > > > > > max > > > > > > > > > > representable value) but we've concluded that this wasn't > > > > something > > > > > > > that > > > > > > > > > we > > > > > > > > > > wanted to really support. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Mar 31, 2021 at 4:49 AM Antoine Pitrou < > > > > anto...@python.org> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > >> I would favour the following characteristics : > > > > > > > > > >> - support for nanoseconds (especially as other Arrow > temporal > > > > types > > > > > > > > > >> support it) > > > > > > > > > >> - easy to handle (which excludes the ZetaSQL > representtaion > > > > IMHO) > > > > > > > > > >> > > > > > > > > > >> OTOH I don't really understand the point of supporting > "the > > > > most > > > > > > > > > >> reasonable ranges for Year, Month and Nanoseconds > > > > independently". > > > > > > > What > > > > > > > > > >> does it bring to encode more than one month in the > nanoseconds > > > > > > > field? > > > > > > > > > >> You can already use the Duration type for that. > > > > > > > > > >> > > > > > > > > > >> Regards > > > > > > > > > >> > > > > > > > > > >> Antoine. > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> Le 31/03/2021 à 05:48, Micah Kornfield a écrit : > > > > > > > > > >>> To follow-up on this conversation I did some analysis > on > > > > interval > > > > > > > > > types: > > > > > > > > > >>> > > > > > > > > > >>> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > https://docs.google.com/document/d/1i1E_fdQ_ > xODZcAhsV11Pfq27O50k679OYHXFJpm9NS0/edit > > > > > > > > > >> Please feel free to add more details/systems I missed. > > > > > > > > > >>> > > > > > > > > > >>> Given the disparate requirements of different systems I > > > > think the > > > > > > > > > >> following might make sense for official types (if there > isn't > > > > > > > > > consensus, I > > > > > > > > > >> might try to contributation extension Array > implementations > > > > for > > > > > > > them to > > > > > > > > > >> Java and C++/Python separately). > > > > > > > > > >>> > > > > > > > > > >>> 1. 3 fields: Year (32 bit), Month (32 bit), > Nanoseconds (64 > > > > bit) > > > > > > > all > > > > > > > > > >> signed. > > > > > > > > > >>> 2. Postgres representation (Downside is it doesn't > support > > > > > > > > > Nanoseconds, > > > > > > > > > >> only microseconds). > > > > > > > > > >>> 3. ZetaSQL implementation (Requires some bit > manipulation) > > > > but > > > > > > > > > supports > > > > > > > > > >> the most reasonable ranges for Year, Month and > Nanoseconds > > > > > > > > > independently. > > > > > > > > > >>> > > > > > > > > > >>> Thoughts? > > > > > > > > > >>> > > > > > > > > > >>> Micah > > > > > > > > > >>> > > > > > > > > > >>> On 2021/02/18 04:30:55 Micah Kornfield wrote: > > > > > > > > > >>>>> > > > > > > > > > >>>>> I didn’t find any page/documentation on how to do > RFC in > > > > Arrow > > > > > > > > > >> protocol, > > > > > > > > > >>>>> so can anyone point me to it or PR with email will be > > > > enough? > > > > > > > > > >>>> > > > > > > > > > >>>> That is enough to start discussion. Before formal > > > > acceptance and > > > > > > > > > >> merging > > > > > > > > > >>>> of the PR there needs to be a Java and C++ > implementations > > > > for the > > > > > > > > > type > > > > > > > > > >>>> that pass integration tests. At the time this > guideline was > > > > > > > > > instituted > > > > > > > > > >>>> Java and C++ were considered the "reference" > > > > implementations (I > > > > > > > think > > > > > > > > > >> they > > > > > > > > > >>>> still have the most complete integration test > coverage). > > > > > > > > > >>>> > > > > > > > > > >>>> My understanding is that the current modelling of > intervals > > > > > > > mimics SQL > > > > > > > > > >>>> standards (e.g. SQL Server [1]). So it would also be > good > > > > to step > > > > > > > > > back > > > > > > > > > >> and > > > > > > > > > >>>> understand what problem DF is trying to solve and how > it > > > > differs > > > > > > > from > > > > > > > > > >> other > > > > > > > > > >>>> SQL implementations. I'd be hesitant to accept > COMPLEX as > > > > a new > > > > > > > type > > > > > > > > > >>>> without a much deeper analysis into calendar > representations > > > > > > > within > > > > > > > > > >> Arrow > > > > > > > > > >>>> and how they relate to other existing systems (e.g. > Hive > > > > and some > > > > > > > > > >>>> assortment of existing SQL databases). For instance > the > > > > current > > > > > > > > > >> modelling > > > > > > > > > >>>> of timestamps does not lend itself to constructing a > COMPLEX > > > > > > > interval > > > > > > > > > >> type > > > > > > > > > >>>> particularly well. (Duration was introduced for this > > > > reason). > > > > > > > > > >>>> > > > > > > > > > >>>> I think both Wes's suggestion of FixedSizeBinary and > > > > Andrew's of > > > > > > > > > >> composing > > > > > > > > > >>>> the with a struct are good stop-gaps. These > obviously have > > > > > > > different > > > > > > > > > >>>> trade-offs. Ultimately, it would be good to define > common > > > > > > > extension > > > > > > > > > >> types > > > > > > > > > >>>> that can represent this use-case if there really is > demand > > > > for it > > > > > > > (if > > > > > > > > > it > > > > > > > > > >>>> doesn't become a top level type). > > > > > > > > > >>>> > > > > > > > > > >>>> [1] > > > > > > > > > >>>> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > https://docs.microsoft.com/en-us/sql/odbc/reference/ > appendixes/interval-data-types?view=sql-server-ver15 > > > > > > > > > >>>> > > > > > > > > > >>>> -Micah > > > > > > > > > >>>> > > > > > > > > > >>>> On Wed, Feb 17, 2021 at 2:05 PM Andrew Lamb < > > > > al...@influxdata.com > > > > > > > > > > > > > > > > > >> wrote: > > > > > > > > > >>>> > > > > > > > > > >>>>> That is a great suggestion Wes, thank you. > > > > > > > > > >>>>> > > > > > > > > > >>>>> I wonder if we could get away with a 128 bit > > > > representation that > > > > > > > is > > > > > > > > > the > > > > > > > > > >>>>> concatenation of the two existing interval types > > > > > > > > > (YearMonth)(DayTime). > > > > > > > > > >> Or > > > > > > > > > >>>>> maybe even define a `struct` type with those fields > that > > > > is used > > > > > > > by > > > > > > > > > >>>>> DataFusion. > > > > > > > > > >>>>> > > > > > > > > > >>>>> Basically, given our reading of the Arrow spec[1], > it is > > > > > > > currently > > > > > > > > > not > > > > > > > > > >>>>> possible to precisely represent an interval that has > both > > > > > > > monthly and > > > > > > > > > >>>>> sub-montly granularity. > > > > > > > > > >>>>> > > > > > > > > > >>>>> As Dmtry says, if you have an interval seemingly > simple > > > > like 1 > > > > > > > > > month, > > > > > > > > > >> 1 > > > > > > > > > >>>>> day > > > > > > > > > >>>>> > > > > > > > > > >>>>> Using IntervalUnit(YEAR_MONTH) can't represent the 1 > day > > > > > > > > > >>>>> Using IntervalUnit(DAY_TIME) can't represent the > month as > > > > > > > different > > > > > > > > > >> months > > > > > > > > > >>>>> have different numbers of days > > > > > > > > > >>>>> > > > > > > > > > >>>>> [1] > > > > > > > > > >>>>> > > > > > > > > > >> > > > > > > > > > > > https://github.com/apache/arrow/blob/master/format/ > Schema.fbs#L249-L260 > > > > > > > > > >>>>> > > > > > > > > > >>>>> > > > > > > > > > >>>>> On Wed, Feb 17, 2021 at 5:01 PM Wes McKinney < > > > > > > > wesmck...@gmail.com> > > > > > > > > > >> wrote: > > > > > > > > > >>>>> > > > > > > > > > >>>>>> On Wed, Feb 17, 2021 at 3:46 PM <t...@dmtry.me> > wrote: > > > > > > > > > >>>>>>> > > > > > > > > > >>>>>>>> It's unclear to me that this needs to be > introduced > > > > into the > > > > > > > > > >>>>> top-level > > > > > > > > > >>>>>>> > > > > > > > > > >>>>>>> Similar thing to columnar format, How to store > interval > > > > like 1 > > > > > > > > > month > > > > > > > > > >> 1 > > > > > > > > > >>>>>> day 1 hour? It’s not possible to do it without > converting > > > > 1 > > > > > > > month to > > > > > > > > > >> 30 > > > > > > > > > >>>>>> days, which is a bad way. > > > > > > > > > >>>>>>> > > > > > > > > > >>>>>> > > > > > > > > > >>>>>> Presumably you can represent a complex interval in > a fixed > > > > > > > number of > > > > > > > > > >>>>>> bytes, and then embed the data in a FixedSizeBinary > type. > > > > You > > > > > > > can > > > > > > > > > >>>>>> adorn this type with extension type metadata so that > > > > DataFusion > > > > > > > can > > > > > > > > > >>>>>> then apply Interval semantics to it. This could also > > > > serve as an > > > > > > > > > >>>>>> interim strategy for you to proceed with > implementation > > > > while > > > > > > > > > >>>>>> proposing a top-level type to the Arrow format > (which may > > > > or > > > > > > > may not > > > > > > > > > >>>>>> be accepting) so you aren't blocked on acceptance of > > > > changes > > > > > > > into > > > > > > > > > >>>>>> Schema.fbs. > > > > > > > > > >>>>>> > > > > > > > > > >>>>>>>> On 17 Feb 2021, at 21:02, Wes McKinney < > > > > wesmck...@gmail.com> > > > > > > > > > wrote: > > > > > > > > > >>>>>>>> > > > > > > > > > >>>>>>>> It's unclear to me that this needs to be > introduced > > > > into the > > > > > > > > > >>>>> top-level > > > > > > > > > >>>>>>>> columnar format without more analysis — have you > > > > considered > > > > > > > > > >>>>>>>> implementing this for DataFusion as an extension > type > > > > for the > > > > > > > time > > > > > > > > > >>>>>>>> being? > > > > > > > > > >>>>>>>> > > > > > > > > > >>>>>>>> On Wed, Feb 17, 2021 at 11:59 AM t...@dmtry.me > <mailto: > > > > > > > > > >> t...@dmtry.me > > > > > > > > > >>>>>> > > > > > > > > > >>>>>> <t...@dmtry.me <mailto:t...@dmtry.me>> wrote: > > > > > > > > > >>>>>>>>> > > > > > > > > > >>>>>>>>> Hi, > > > > > > > > > >>>>>>>>> > > > > > > > > > >>>>>>>>> For now, There are only two types of IntervalUnit > > > > inside > > > > > > > Arrow: > > > > > > > > > >>>>>>>>> > > > > > > > > > >>>>>>>>> - YearMonth - month stored as int32 > > > > > > > > > >>>>>>>>> - DayTime - days as int32 and time in > milliseconds as > > > > in32. > > > > > > > > > Total > > > > > > > > > >>>>>> (64 bites) > > > > > > > > > >>>>>>>>> > > > > > > > > > >>>>>>>>> Since DF is using Arrow, It’s not possible to > store > > > > “Complex” > > > > > > > > > >>>>>> intervals such 1 MONTH 1 DAY 1 HOUR. > > > > > > > > > >>>>>>>>> I think, the best way to understand the problem > will > > > > be to > > > > > > > read a > > > > > > > > > >>>>>> comment from DF codebase: > > > > > > > > > >>>>>> > > > > > > > > > >>>>> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > https://github.com/apache/arrow/blob/bca7d2fe84ccd8fc1129cb4d85448e > b0779c52c3/rust/datafusion/src/sql/planner.rs#L1148 > > > > > > > > > >>>>>>>>> > > > > > > > > > >>>>>>>>> // Interval is tricky thing > > > > > > > > > >>>>>>>>> // 1 day is not 24 hours because > timezones, 1 > > > > year > > > > > > > != > > > > > > > > > >>>>> 365/364! > > > > > > > > > >>>>>> 30 days != 1 month > > > > > > > > > >>>>>>>>> // The true way to store and calculate > > > > intervals is > > > > > > > to > > > > > > > > > >> store > > > > > > > > > >>>>>> it as it defined > > > > > > > > > >>>>>>>>> // Due the fact that Arrow supports > only two > > > > types > > > > > > > > > >> YearMonth > > > > > > > > > >>>>>> (month) and DayTime (day, time) > > > > > > > > > >>>>>>>>> // It's not possible to store complex > > > > intervals > > > > > > > > > >>>>>>>>> // It's possible to do select (NOW() + > > > > INTERVAL '1 > > > > > > > > > year') + > > > > > > > > > >>>>>> INTERVAL '1 day'; as workaround > > > > > > > > > >>>>>>>>> if result_month != 0 && (result_days != > 0 || > > > > > > > > > result_millis > > > > > > > > > >> != > > > > > > > > > >>>>>> 0) { > > > > > > > > > >>>>>>>>> return > > > > > > > Err(DataFusionError::NotImplemented(format!( > > > > > > > > > >>>>>>>>> "DF does not support intervals > that > > > > have > > > > > > > both a > > > > > > > > > >>>>>> Year/Month part as well as Days/Hours/Mins/Seconds: > {:?}. > > > > Hint: > > > > > > > try > > > > > > > > > >>>>>> breaking the interval into two parts, one with > Year/Month > > > > and > > > > > > > the > > > > > > > > > >> other > > > > > > > > > >>>>>> with Days/Hours/Mins/Seconds - e.g. (NOW() + > INTERVAL '1 > > > > year') > > > > > > > + > > > > > > > > > >>>>> INTERVAL > > > > > > > > > >>>>>> '1 day'", > > > > > > > > > >>>>>>>>> value > > > > > > > > > >>>>>>>>> ))); > > > > > > > > > >>>>>>>>> } > > > > > > > > > >>>>>>>>> > > > > > > > > > >>>>>>>>> > > > > > > > > > >>>>>>>>> > > > > > > > > > >>>>>>>>> I prepared a PR > > > > > > > https://github.com/apache/arrow/pull/9516/files > > > > > > > > > < > > > > > > > > > >>>>>> https://github.com/apache/arrow/pull/9516/files> < > > > > > > > > > >>>>>> https://github.com/apache/arrow/pull/9516/files < > > > > > > > > > >>>>>> https://github.com/apache/arrow/pull/9516/files>> > that > > > > > > > introduce a > > > > > > > > > >> new > > > > > > > > > >>>>>> type for IntervalUnit called Complex, that store > both > > > > YearMonth > > > > > > > and > > > > > > > > > >>>>> DayTime > > > > > > > > > >>>>>> to support complex interval. > > > > > > > > > >>>>>>>>> I didn’t find any page/documentation on how to > do RFC > > > > in > > > > > > > Arrow > > > > > > > > > >>>>>> protocol, so can anyone point me to it or PR with > email > > > > will be > > > > > > > > > >> enough? > > > > > > > > > >>>>>>>>> > > > > > > > > > >>>>>>>>> Thanks. > > > > > > > > > >>>>>>> > > > > > > > > > >>>>>> > > > > > > > > > >>>>> > > > > > > > > > >>>> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >