Hi Jacek,
This seems like reasonable functionality.  I think the probably comes in
two parts:
1.  This might be a good candidate for a "Well Known"/Officially supported
Extension type. I can think of a few different representations but I would
guess something like Struct[start: T, struct: end]] with well defined
extension metadata to define open/closed on start and end might be the best
(we should probably spin this off into a separate discussion thread).
2.  Adding the right computation Kernels to work with the type.

Do you want to start a new thread or open up some JIRAs to track this work?

Thanks,
Micah

On Mon, May 3, 2021 at 5:32 AM Jacek Pliszka <jacek.plis...@gmail.com>
wrote:

> Sorry, my mistake.
>
> You are right - I meant anchored intervals as in pandas - ones with
> defined start and end - and I think many future users will make the
> same mistake.
>
> I would love to be able to do fast overlap joins on arrow level.
>
> Best Regards,
>
> Jacek
>
>
>
>
> niedz., 2 maj 2021 o 23:06 Wes McKinney <wesmck...@gmail.com> napisał(a):
> >
> > I also don't understand the comment about closed / open / semi-open
> > intervals. Perhaps there is a confusion, since "interval" as we mean
> > it here is called a "time delta" in some other projects. An interval
> > here does not refer to a time span with a distinct start and end point
> > (I understand this might be confusing to a pandas user since pandas
> > has an interval data type where each value is a tuple of arbitrary
> > start/end).
> >
> > On Sun, May 2, 2021 at 3:46 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> > >
> > > Hi Jacek,
> > > I'm not sure I fully understand the proposal, could you elaborate with
> more
> > > examples/details?  For instance DAY_TIME isn't just a UINT64, it
> actually
> > > contains 2 seperate fields (days and milliseconds).
> > >
> > > In terms of closed vs half-open, in my limited understanding, that is
> more
> > > a concern of functions using interval types rather than the type
> itself.
> > > For instance a quick search of postgres [1] docs only talks about
> half-open
> > > in relation to the "Overlaps" operator
> > >
> > > Thanks,
> > > -Micah
> > >
> > > [1] https://www.postgresql.org/docs/9.1/functions-datetime.html
> > >
> > >
> > >
> > > On Sun, May 2, 2021 at 12:25 AM Jacek Pliszka <jacek.plis...@gmail.com
> >
> > > wrote:
> > >
> > > > Hi!
> > > >
> > > > I wonder if it were possible to have generic interval with integers
> of
> > > > specified size just to have common base for interval arithmetic.
> > > >
> > > > Then user can convert their period to ordinals and use the arithmetic
> > > > (joining, deoverlapping, common parts, explosion etc.).
> > > >
> > > > So YEAR_MONTH and DAY_TIME would be just special cases of
> > > > INTERVAL_UINT32 and INTERVAL_UINT64
> > > >
> > > > Also I believe it is worth to state whether there are only closed
> > > > intervals or open/semi-open ones are allowed as well.
> > > >
> > > > I believe I am just one of many reinventing the wheel here and
> writing
> > > > own versions of the above.
> > > >
> > > > BR,
> > > >
> > > > Jacek
> > > >
> > > >
> > > > pt., 2 kwi 2021 o 21:53 Micah Kornfield <emkornfi...@gmail.com>
> > > > napisał(a):
> > > > >
> > > > > Andrew is the use-case you have simply postgres compatibility or
> is it
> > > > more
> > > > > extensive?
> > > > >
> > > > > One potential problem with combining Month and Day fields, is that
> the
> > > > type
> > > > > no longer has a defined sort order (the existing Day-Millisecond
> type
> > > > > without assumptions, in particular because I don't think today
> there is
> > > > an
> > > > > explicit constraint on the bounds for the millisecond component).
> > > > >
> > > > > -Micah
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Mar 31, 2021 at 9:03 AM Antoine Pitrou <anto...@python.org
> >
> > > > wrote:
> > > > >
> > > > > >
> > > > > > Le 31/03/2021 à 17:55, Micah Kornfield a écrit :
> > > > > > > Thanks for the feedback.  A couple of points here and some
> responses
> > > > > > below.
> > > > > > >
> > > > > > > * One other question is whether the Nanoseconds should
> actually be
> > > > > > > configurable (i.e. use milliseconds or microseconds).  I would
> lean
> > > > > > towards
> > > > > > > no.
> > > > > >
> > > > > > Same for me.
> > > > > >
> > > > > > > * I'm also still not 100% convinced we need this as a first
> class
> > > > type in
> > > > > > > arrow or if we should be looking more closely at the Struct
> (in the
> > > > Arrow
> > > > > > > sense) based implementation.  In the future where alternative
> > > > encodings
> > > > > > are
> > > > > > > supported, this could allow for much smaller footprints for
> this
> > > > type.
> > > > > >
> > > > > > Having a "packed" first class type allows for better locality
> when
> > > > > > accessing data.  It doesn't sound very likely that you'd access
> only
> > > > one
> > > > > > component of the interval.
> > > > > >
> > > > > > But I have no idea how important this is, and temporal datetypes
> are
> > > > > > generally cumbersome to add support for (conversions, arithmetic,
> > > > etc.),
> > > > > > so it would be nice to avoid adding too many of them :-)
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
> > > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > The 3
> > > > > > >> field implementation doesn't seem to have any way to represent
> > > > integral
> > > > > > >> days, so I am also not sure about that one.
> > > > > > >
> > > > > > >
> > > > > > > Sorry this was an email gaffe.  I intended Month (32 bit int),
> Day
> > > > (32
> > > > > > bit
> > > > > > > int), Nanosecond (64 bit int).
> > > > > > >
> > > > > > > OTOH I don't really understand the point of supporting "the
> most
> > > > > > >> reasonable ranges for Year, Month and Nanoseconds
> independently".
> > > > What
> > > > > > >> does it bring to encode more than one month in the nanoseconds
> > > > field?
> > > > > > >
> > > > > > >
> > > > > > > I'm happy with simplicity.   In the past there has been some
> > > > reference to
> > > > > > > people wanting to store very large timestamps (fall out of
> > > > Nanoseconds
> > > > > > max
> > > > > > > representable value) but we've concluded that this wasn't
> something
> > > > that
> > > > > > we
> > > > > > > wanted to really support.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Mar 31, 2021 at 4:49 AM Antoine Pitrou <
> anto...@python.org>
> > > > > > wrote:
> > > > > > >
> > > > > > >>
> > > > > > >> I would favour the following characteristics :
> > > > > > >> - support for nanoseconds (especially as other Arrow temporal
> types
> > > > > > >> support it)
> > > > > > >> - easy to handle (which excludes the ZetaSQL representtaion
> IMHO)
> > > > > > >>
> > > > > > >> OTOH I don't really understand the point of supporting "the
> most
> > > > > > >> reasonable ranges for Year, Month and Nanoseconds
> independently".
> > > > What
> > > > > > >> does it bring to encode more than one month in the nanoseconds
> > > > field?
> > > > > > >> You can already use the Duration type for that.
> > > > > > >>
> > > > > > >> Regards
> > > > > > >>
> > > > > > >> Antoine.
> > > > > > >>
> > > > > > >>
> > > > > > >> Le 31/03/2021 à 05:48, Micah Kornfield a écrit :
> > > > > > >>> To follow-up on this conversation I did some analysis on
> interval
> > > > > > types:
> > > > > > >>>
> > > > > > >>>
> > > > > > >>
> > > > > >
> > > >
> https://docs.google.com/document/d/1i1E_fdQ_xODZcAhsV11Pfq27O50k679OYHXFJpm9NS0/edit
> > > > > > >> Please feel free to add more details/systems I missed.
> > > > > > >>>
> > > > > > >>> Given the disparate requirements of different systems I
> think the
> > > > > > >> following might make sense for official types (if there isn't
> > > > > > consensus, I
> > > > > > >> might try to contributation extension Array implementations
> for
> > > > them to
> > > > > > >> Java and C++/Python separately).
> > > > > > >>>
> > > > > > >>> 1.  3 fields: Year (32 bit), Month (32 bit), Nanoseconds (64
> bit)
> > > > all
> > > > > > >> signed.
> > > > > > >>> 2.  Postgres representation (Downside is it doesn't support
> > > > > > Nanoseconds,
> > > > > > >> only microseconds).
> > > > > > >>> 3.  ZetaSQL implementation (Requires some bit manipulation)
> but
> > > > > > supports
> > > > > > >> the most reasonable ranges for Year, Month and Nanoseconds
> > > > > > independently.
> > > > > > >>>
> > > > > > >>> Thoughts?
> > > > > > >>>
> > > > > > >>> Micah
> > > > > > >>>
> > > > > > >>> On 2021/02/18 04:30:55 Micah Kornfield wrote:
> > > > > > >>>>>
> > > > > > >>>>> I didn’t find any page/documentation on how to do RFC in
> Arrow
> > > > > > >> protocol,
> > > > > > >>>>> so can anyone point me to it or PR with email will be
> enough?
> > > > > > >>>>
> > > > > > >>>> That is enough to start discussion.  Before formal
> acceptance and
> > > > > > >> merging
> > > > > > >>>> of the PR there needs to be a Java and C++ implementations
> for the
> > > > > > type
> > > > > > >>>> that pass integration tests.  At the time this guideline was
> > > > > > instituted
> > > > > > >>>> Java and C++ were considered the "reference"
> implementations (I
> > > > think
> > > > > > >> they
> > > > > > >>>> still have the most complete integration test coverage).
> > > > > > >>>>
> > > > > > >>>> My understanding is that the current modelling of intervals
> > > > mimics SQL
> > > > > > >>>> standards (e.g. SQL Server [1]).  So it would also be good
> to step
> > > > > > back
> > > > > > >> and
> > > > > > >>>> understand what problem DF is trying to solve and how it
> differs
> > > > from
> > > > > > >> other
> > > > > > >>>> SQL implementations.  I'd be hesitant to accept COMPLEX as
> a new
> > > > type
> > > > > > >>>> without a much deeper analysis into calendar representations
> > > > within
> > > > > > >> Arrow
> > > > > > >>>> and how they relate to other existing systems (e.g. Hive
> and some
> > > > > > >>>> assortment of existing SQL databases).  For instance the
> current
> > > > > > >> modelling
> > > > > > >>>> of timestamps does not lend itself to constructing a COMPLEX
> > > > interval
> > > > > > >> type
> > > > > > >>>> particularly well. (Duration was introduced for this
> reason).
> > > > > > >>>>
> > > > > > >>>> I think both Wes's suggestion of FixedSizeBinary and
> Andrew's of
> > > > > > >> composing
> > > > > > >>>> the with a struct are good stop-gaps.  These obviously have
> > > > different
> > > > > > >>>> trade-offs.  Ultimately, it would be good to define common
> > > > extension
> > > > > > >> types
> > > > > > >>>> that can represent this use-case if there really is demand
> for it
> > > > (if
> > > > > > it
> > > > > > >>>> doesn't become a top level type).
> > > > > > >>>>
> > > > > > >>>> [1]
> > > > > > >>>>
> > > > > > >>
> > > > > >
> > > >
> https://docs.microsoft.com/en-us/sql/odbc/reference/appendixes/interval-data-types?view=sql-server-ver15
> > > > > > >>>>
> > > > > > >>>> -Micah
> > > > > > >>>>
> > > > > > >>>> On Wed, Feb 17, 2021 at 2:05 PM Andrew Lamb <
> al...@influxdata.com
> > > > >
> > > > > > >> wrote:
> > > > > > >>>>
> > > > > > >>>>> That is a great suggestion Wes, thank you.
> > > > > > >>>>>
> > > > > > >>>>> I wonder if we could get away with a 128 bit
> representation that
> > > > is
> > > > > > the
> > > > > > >>>>> concatenation of the two existing interval types
> > > > > > (YearMonth)(DayTime).
> > > > > > >> Or
> > > > > > >>>>> maybe even define a `struct` type with those fields that
> is used
> > > > by
> > > > > > >>>>> DataFusion.
> > > > > > >>>>>
> > > > > > >>>>> Basically, given our reading of the Arrow spec[1], it is
> > > > currently
> > > > > > not
> > > > > > >>>>> possible to precisely represent an interval that has both
> > > > monthly and
> > > > > > >>>>> sub-montly granularity.
> > > > > > >>>>>
> > > > > > >>>>> As Dmtry says, if you have an interval seemingly simple
> like  1
> > > > > > month,
> > > > > > >> 1
> > > > > > >>>>> day
> > > > > > >>>>>
> > > > > > >>>>> Using IntervalUnit(YEAR_MONTH) can't represent the 1 day
> > > > > > >>>>> Using IntervalUnit(DAY_TIME) can't represent the month as
> > > > different
> > > > > > >> months
> > > > > > >>>>> have different numbers of days
> > > > > > >>>>>
> > > > > > >>>>> [1]
> > > > > > >>>>>
> > > > > > >>
> > > >
> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L249-L260
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>> On Wed, Feb 17, 2021 at 5:01 PM Wes McKinney <
> > > > wesmck...@gmail.com>
> > > > > > >> wrote:
> > > > > > >>>>>
> > > > > > >>>>>> On Wed, Feb 17, 2021 at 3:46 PM <t...@dmtry.me> wrote:
> > > > > > >>>>>>>
> > > > > > >>>>>>>> It's unclear to me that this needs to be introduced
> into the
> > > > > > >>>>> top-level
> > > > > > >>>>>>>
> > > > > > >>>>>>> Similar thing to columnar format, How to store interval
> like 1
> > > > > > month
> > > > > > >> 1
> > > > > > >>>>>> day 1 hour? It’s not possible to do it without converting
> 1
> > > > month to
> > > > > > >> 30
> > > > > > >>>>>> days, which is a bad way.
> > > > > > >>>>>>>
> > > > > > >>>>>>
> > > > > > >>>>>> Presumably you can represent a complex interval in a fixed
> > > > number of
> > > > > > >>>>>> bytes, and then embed the data in a FixedSizeBinary type.
> You
> > > > can
> > > > > > >>>>>> adorn this type with extension type metadata so that
> DataFusion
> > > > can
> > > > > > >>>>>> then apply Interval semantics to it. This could also
> serve as an
> > > > > > >>>>>> interim strategy for you to proceed with implementation
> while
> > > > > > >>>>>> proposing a top-level type to the Arrow format (which may
> or
> > > > may not
> > > > > > >>>>>> be accepting) so you aren't blocked on acceptance of
> changes
> > > > into
> > > > > > >>>>>> Schema.fbs.
> > > > > > >>>>>>
> > > > > > >>>>>>>> On 17 Feb 2021, at 21:02, Wes McKinney <
> wesmck...@gmail.com>
> > > > > > wrote:
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> It's unclear to me that this needs to be introduced
> into the
> > > > > > >>>>> top-level
> > > > > > >>>>>>>> columnar format without more analysis — have you
> considered
> > > > > > >>>>>>>> implementing this for DataFusion as an extension type
> for the
> > > > time
> > > > > > >>>>>>>> being?
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> On Wed, Feb 17, 2021 at 11:59 AM t...@dmtry.me <mailto:
> > > > > > >> t...@dmtry.me
> > > > > > >>>>>>
> > > > > > >>>>>> <t...@dmtry.me <mailto:t...@dmtry.me>> wrote:
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> Hi,
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> For now, There are only two types of IntervalUnit
> inside
> > > > Arrow:
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> - YearMonth - month stored as int32
> > > > > > >>>>>>>>> - DayTime - days as int32 and time in milliseconds  as
> in32.
> > > > > > Total
> > > > > > >>>>>> (64 bites)
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> Since DF is using Arrow, It’s not possible to store
> “Complex”
> > > > > > >>>>>> intervals such 1 MONTH 1 DAY 1 HOUR.
> > > > > > >>>>>>>>> I think, the best way to understand the problem will
> be to
> > > > read a
> > > > > > >>>>>> comment from DF codebase:
> > > > > > >>>>>>
> > > > > > >>>>>
> > > > > > >>
> > > > > >
> > > >
> https://github.com/apache/arrow/blob/bca7d2fe84ccd8fc1129cb4d85448eb0779c52c3/rust/datafusion/src/sql/planner.rs#L1148
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>          // Interval is tricky thing
> > > > > > >>>>>>>>>          // 1 day is not 24 hours because timezones, 1
> year
> > > > !=
> > > > > > >>>>> 365/364!
> > > > > > >>>>>> 30 days != 1 month
> > > > > > >>>>>>>>>          // The true way to store and calculate
> intervals is
> > > > to
> > > > > > >> store
> > > > > > >>>>>> it as it defined
> > > > > > >>>>>>>>>          // Due the fact that Arrow supports only two
> types
> > > > > > >> YearMonth
> > > > > > >>>>>> (month) and DayTime (day, time)
> > > > > > >>>>>>>>>          // It's not possible to store complex
> intervals
> > > > > > >>>>>>>>>          // It's possible to do select (NOW() +
> INTERVAL '1
> > > > > > year') +
> > > > > > >>>>>> INTERVAL '1 day'; as workaround
> > > > > > >>>>>>>>>          if result_month != 0 && (result_days != 0 ||
> > > > > > result_millis
> > > > > > >> !=
> > > > > > >>>>>> 0) {
> > > > > > >>>>>>>>>              return
> > > > Err(DataFusionError::NotImplemented(format!(
> > > > > > >>>>>>>>>                  "DF does not support intervals that
> have
> > > > both a
> > > > > > >>>>>> Year/Month part as well as Days/Hours/Mins/Seconds: {:?}.
> Hint:
> > > > try
> > > > > > >>>>>> breaking the interval into two parts, one with Year/Month
> and
> > > > the
> > > > > > >> other
> > > > > > >>>>>> with Days/Hours/Mins/Seconds - e.g. (NOW() + INTERVAL '1
> year')
> > > > +
> > > > > > >>>>> INTERVAL
> > > > > > >>>>>> '1 day'",
> > > > > > >>>>>>>>>                  value
> > > > > > >>>>>>>>>              )));
> > > > > > >>>>>>>>>          }
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> I prepared a PR
> > > > https://github.com/apache/arrow/pull/9516/files
> > > > > > <
> > > > > > >>>>>> https://github.com/apache/arrow/pull/9516/files> <
> > > > > > >>>>>> https://github.com/apache/arrow/pull/9516/files <
> > > > > > >>>>>> https://github.com/apache/arrow/pull/9516/files>> that
> > > > introduce a
> > > > > > >> new
> > > > > > >>>>>> type for IntervalUnit called Complex, that store both
> YearMonth
> > > > and
> > > > > > >>>>> DayTime
> > > > > > >>>>>> to support complex interval.
> > > > > > >>>>>>>>> I didn’t find any page/documentation on how to do RFC
> in
> > > > Arrow
> > > > > > >>>>>> protocol, so can anyone point me to it or PR with email
> will be
> > > > > > >> enough?
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> Thanks.
> > > > > > >>>>>>>
> > > > > > >>>>>>
> > > > > > >>>>>
> > > > > > >>>>
> > > > > > >>
> > > > > > >
> > > > > >
> > > >
>

Reply via email to