Hi Jacek,
I'm not sure I fully understand the proposal, could you elaborate with more
examples/details?  For instance DAY_TIME isn't just a UINT64, it actually
contains 2 seperate fields (days and milliseconds).

In terms of closed vs half-open, in my limited understanding, that is more
a concern of functions using interval types rather than the type itself.
For instance a quick search of postgres [1] docs only talks about half-open
in relation to the "Overlaps" operator

Thanks,
-Micah

[1] https://www.postgresql.org/docs/9.1/functions-datetime.html



On Sun, May 2, 2021 at 12:25 AM Jacek Pliszka <jacek.plis...@gmail.com>
wrote:

> Hi!
>
> I wonder if it were possible to have generic interval with integers of
> specified size just to have common base for interval arithmetic.
>
> Then user can convert their period to ordinals and use the arithmetic
> (joining, deoverlapping, common parts, explosion etc.).
>
> So YEAR_MONTH and DAY_TIME would be just special cases of
> INTERVAL_UINT32 and INTERVAL_UINT64
>
> Also I believe it is worth to state whether there are only closed
> intervals or open/semi-open ones are allowed as well.
>
> I believe I am just one of many reinventing the wheel here and writing
> own versions of the above.
>
> BR,
>
> Jacek
>
>
> pt., 2 kwi 2021 o 21:53 Micah Kornfield <emkornfi...@gmail.com>
> napisał(a):
> >
> > Andrew is the use-case you have simply postgres compatibility or is it
> more
> > extensive?
> >
> > One potential problem with combining Month and Day fields, is that the
> type
> > no longer has a defined sort order (the existing Day-Millisecond type
> > without assumptions, in particular because I don't think today there is
> an
> > explicit constraint on the bounds for the millisecond component).
> >
> > -Micah
> >
> >
> >
> > On Wed, Mar 31, 2021 at 9:03 AM Antoine Pitrou <anto...@python.org>
> wrote:
> >
> > >
> > > Le 31/03/2021 à 17:55, Micah Kornfield a écrit :
> > > > Thanks for the feedback.  A couple of points here and some responses
> > > below.
> > > >
> > > > * One other question is whether the Nanoseconds should actually be
> > > > configurable (i.e. use milliseconds or microseconds).  I would lean
> > > towards
> > > > no.
> > >
> > > Same for me.
> > >
> > > > * I'm also still not 100% convinced we need this as a first class
> type in
> > > > arrow or if we should be looking more closely at the Struct (in the
> Arrow
> > > > sense) based implementation.  In the future where alternative
> encodings
> > > are
> > > > supported, this could allow for much smaller footprints for this
> type.
> > >
> > > Having a "packed" first class type allows for better locality when
> > > accessing data.  It doesn't sound very likely that you'd access only
> one
> > > component of the interval.
> > >
> > > But I have no idea how important this is, and temporal datetypes are
> > > generally cumbersome to add support for (conversions, arithmetic,
> etc.),
> > > so it would be nice to avoid adding too many of them :-)
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > >
> > > >
> > > > The 3
> > > >> field implementation doesn't seem to have any way to represent
> integral
> > > >> days, so I am also not sure about that one.
> > > >
> > > >
> > > > Sorry this was an email gaffe.  I intended Month (32 bit int), Day
> (32
> > > bit
> > > > int), Nanosecond (64 bit int).
> > > >
> > > > OTOH I don't really understand the point of supporting "the most
> > > >> reasonable ranges for Year, Month and Nanoseconds independently".
> What
> > > >> does it bring to encode more than one month in the nanoseconds
> field?
> > > >
> > > >
> > > > I'm happy with simplicity.   In the past there has been some
> reference to
> > > > people wanting to store very large timestamps (fall out of
> Nanoseconds
> > > max
> > > > representable value) but we've concluded that this wasn't something
> that
> > > we
> > > > wanted to really support.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Mar 31, 2021 at 4:49 AM Antoine Pitrou <anto...@python.org>
> > > wrote:
> > > >
> > > >>
> > > >> I would favour the following characteristics :
> > > >> - support for nanoseconds (especially as other Arrow temporal types
> > > >> support it)
> > > >> - easy to handle (which excludes the ZetaSQL representtaion IMHO)
> > > >>
> > > >> OTOH I don't really understand the point of supporting "the most
> > > >> reasonable ranges for Year, Month and Nanoseconds independently".
> What
> > > >> does it bring to encode more than one month in the nanoseconds
> field?
> > > >> You can already use the Duration type for that.
> > > >>
> > > >> Regards
> > > >>
> > > >> Antoine.
> > > >>
> > > >>
> > > >> Le 31/03/2021 à 05:48, Micah Kornfield a écrit :
> > > >>> To follow-up on this conversation I did some analysis on interval
> > > types:
> > > >>>
> > > >>>
> > > >>
> > >
> https://docs.google.com/document/d/1i1E_fdQ_xODZcAhsV11Pfq27O50k679OYHXFJpm9NS0/edit
> > > >> Please feel free to add more details/systems I missed.
> > > >>>
> > > >>> Given the disparate requirements of different systems I think the
> > > >> following might make sense for official types (if there isn't
> > > consensus, I
> > > >> might try to contributation extension Array implementations for
> them to
> > > >> Java and C++/Python separately).
> > > >>>
> > > >>> 1.  3 fields: Year (32 bit), Month (32 bit), Nanoseconds (64 bit)
> all
> > > >> signed.
> > > >>> 2.  Postgres representation (Downside is it doesn't support
> > > Nanoseconds,
> > > >> only microseconds).
> > > >>> 3.  ZetaSQL implementation (Requires some bit manipulation) but
> > > supports
> > > >> the most reasonable ranges for Year, Month and Nanoseconds
> > > independently.
> > > >>>
> > > >>> Thoughts?
> > > >>>
> > > >>> Micah
> > > >>>
> > > >>> On 2021/02/18 04:30:55 Micah Kornfield wrote:
> > > >>>>>
> > > >>>>> I didn’t find any page/documentation on how to do RFC in Arrow
> > > >> protocol,
> > > >>>>> so can anyone point me to it or PR with email will be enough?
> > > >>>>
> > > >>>> That is enough to start discussion.  Before formal acceptance and
> > > >> merging
> > > >>>> of the PR there needs to be a Java and C++ implementations for the
> > > type
> > > >>>> that pass integration tests.  At the time this guideline was
> > > instituted
> > > >>>> Java and C++ were considered the "reference" implementations (I
> think
> > > >> they
> > > >>>> still have the most complete integration test coverage).
> > > >>>>
> > > >>>> My understanding is that the current modelling of intervals
> mimics SQL
> > > >>>> standards (e.g. SQL Server [1]).  So it would also be good to step
> > > back
> > > >> and
> > > >>>> understand what problem DF is trying to solve and how it differs
> from
> > > >> other
> > > >>>> SQL implementations.  I'd be hesitant to accept COMPLEX as a new
> type
> > > >>>> without a much deeper analysis into calendar representations
> within
> > > >> Arrow
> > > >>>> and how they relate to other existing systems (e.g. Hive and some
> > > >>>> assortment of existing SQL databases).  For instance the current
> > > >> modelling
> > > >>>> of timestamps does not lend itself to constructing a COMPLEX
> interval
> > > >> type
> > > >>>> particularly well. (Duration was introduced for this reason).
> > > >>>>
> > > >>>> I think both Wes's suggestion of FixedSizeBinary and Andrew's of
> > > >> composing
> > > >>>> the with a struct are good stop-gaps.  These obviously have
> different
> > > >>>> trade-offs.  Ultimately, it would be good to define common
> extension
> > > >> types
> > > >>>> that can represent this use-case if there really is demand for it
> (if
> > > it
> > > >>>> doesn't become a top level type).
> > > >>>>
> > > >>>> [1]
> > > >>>>
> > > >>
> > >
> https://docs.microsoft.com/en-us/sql/odbc/reference/appendixes/interval-data-types?view=sql-server-ver15
> > > >>>>
> > > >>>> -Micah
> > > >>>>
> > > >>>> On Wed, Feb 17, 2021 at 2:05 PM Andrew Lamb <al...@influxdata.com
> >
> > > >> wrote:
> > > >>>>
> > > >>>>> That is a great suggestion Wes, thank you.
> > > >>>>>
> > > >>>>> I wonder if we could get away with a 128 bit representation that
> is
> > > the
> > > >>>>> concatenation of the two existing interval types
> > > (YearMonth)(DayTime).
> > > >> Or
> > > >>>>> maybe even define a `struct` type with those fields that is used
> by
> > > >>>>> DataFusion.
> > > >>>>>
> > > >>>>> Basically, given our reading of the Arrow spec[1], it is
> currently
> > > not
> > > >>>>> possible to precisely represent an interval that has both
> monthly and
> > > >>>>> sub-montly granularity.
> > > >>>>>
> > > >>>>> As Dmtry says, if you have an interval seemingly simple like  1
> > > month,
> > > >> 1
> > > >>>>> day
> > > >>>>>
> > > >>>>> Using IntervalUnit(YEAR_MONTH) can't represent the 1 day
> > > >>>>> Using IntervalUnit(DAY_TIME) can't represent the month as
> different
> > > >> months
> > > >>>>> have different numbers of days
> > > >>>>>
> > > >>>>> [1]
> > > >>>>>
> > > >>
> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L249-L260
> > > >>>>>
> > > >>>>>
> > > >>>>> On Wed, Feb 17, 2021 at 5:01 PM Wes McKinney <
> wesmck...@gmail.com>
> > > >> wrote:
> > > >>>>>
> > > >>>>>> On Wed, Feb 17, 2021 at 3:46 PM <t...@dmtry.me> wrote:
> > > >>>>>>>
> > > >>>>>>>> It's unclear to me that this needs to be introduced into the
> > > >>>>> top-level
> > > >>>>>>>
> > > >>>>>>> Similar thing to columnar format, How to store interval like 1
> > > month
> > > >> 1
> > > >>>>>> day 1 hour? It’s not possible to do it without converting 1
> month to
> > > >> 30
> > > >>>>>> days, which is a bad way.
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>> Presumably you can represent a complex interval in a fixed
> number of
> > > >>>>>> bytes, and then embed the data in a FixedSizeBinary type. You
> can
> > > >>>>>> adorn this type with extension type metadata so that DataFusion
> can
> > > >>>>>> then apply Interval semantics to it. This could also serve as an
> > > >>>>>> interim strategy for you to proceed with implementation while
> > > >>>>>> proposing a top-level type to the Arrow format (which may or
> may not
> > > >>>>>> be accepting) so you aren't blocked on acceptance of changes
> into
> > > >>>>>> Schema.fbs.
> > > >>>>>>
> > > >>>>>>>> On 17 Feb 2021, at 21:02, Wes McKinney <wesmck...@gmail.com>
> > > wrote:
> > > >>>>>>>>
> > > >>>>>>>> It's unclear to me that this needs to be introduced into the
> > > >>>>> top-level
> > > >>>>>>>> columnar format without more analysis — have you considered
> > > >>>>>>>> implementing this for DataFusion as an extension type for the
> time
> > > >>>>>>>> being?
> > > >>>>>>>>
> > > >>>>>>>> On Wed, Feb 17, 2021 at 11:59 AM t...@dmtry.me <mailto:
> > > >> t...@dmtry.me
> > > >>>>>>
> > > >>>>>> <t...@dmtry.me <mailto:t...@dmtry.me>> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>> Hi,
> > > >>>>>>>>>
> > > >>>>>>>>> For now, There are only two types of IntervalUnit inside
> Arrow:
> > > >>>>>>>>>
> > > >>>>>>>>> - YearMonth - month stored as int32
> > > >>>>>>>>> - DayTime - days as int32 and time in milliseconds  as in32.
> > > Total
> > > >>>>>> (64 bites)
> > > >>>>>>>>>
> > > >>>>>>>>> Since DF is using Arrow, It’s not possible to store “Complex”
> > > >>>>>> intervals such 1 MONTH 1 DAY 1 HOUR.
> > > >>>>>>>>> I think, the best way to understand the problem will be to
> read a
> > > >>>>>> comment from DF codebase:
> > > >>>>>>
> > > >>>>>
> > > >>
> > >
> https://github.com/apache/arrow/blob/bca7d2fe84ccd8fc1129cb4d85448eb0779c52c3/rust/datafusion/src/sql/planner.rs#L1148
> > > >>>>>>>>>
> > > >>>>>>>>>          // Interval is tricky thing
> > > >>>>>>>>>          // 1 day is not 24 hours because timezones, 1 year
> !=
> > > >>>>> 365/364!
> > > >>>>>> 30 days != 1 month
> > > >>>>>>>>>          // The true way to store and calculate intervals is
> to
> > > >> store
> > > >>>>>> it as it defined
> > > >>>>>>>>>          // Due the fact that Arrow supports only two types
> > > >> YearMonth
> > > >>>>>> (month) and DayTime (day, time)
> > > >>>>>>>>>          // It's not possible to store complex intervals
> > > >>>>>>>>>          // It's possible to do select (NOW() + INTERVAL '1
> > > year') +
> > > >>>>>> INTERVAL '1 day'; as workaround
> > > >>>>>>>>>          if result_month != 0 && (result_days != 0 ||
> > > result_millis
> > > >> !=
> > > >>>>>> 0) {
> > > >>>>>>>>>              return
> Err(DataFusionError::NotImplemented(format!(
> > > >>>>>>>>>                  "DF does not support intervals that have
> both a
> > > >>>>>> Year/Month part as well as Days/Hours/Mins/Seconds: {:?}. Hint:
> try
> > > >>>>>> breaking the interval into two parts, one with Year/Month and
> the
> > > >> other
> > > >>>>>> with Days/Hours/Mins/Seconds - e.g. (NOW() + INTERVAL '1 year')
> +
> > > >>>>> INTERVAL
> > > >>>>>> '1 day'",
> > > >>>>>>>>>                  value
> > > >>>>>>>>>              )));
> > > >>>>>>>>>          }
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> I prepared a PR
> https://github.com/apache/arrow/pull/9516/files
> > > <
> > > >>>>>> https://github.com/apache/arrow/pull/9516/files> <
> > > >>>>>> https://github.com/apache/arrow/pull/9516/files <
> > > >>>>>> https://github.com/apache/arrow/pull/9516/files>> that
> introduce a
> > > >> new
> > > >>>>>> type for IntervalUnit called Complex, that store both YearMonth
> and
> > > >>>>> DayTime
> > > >>>>>> to support complex interval.
> > > >>>>>>>>> I didn’t find any page/documentation on how to do RFC in
> Arrow
> > > >>>>>> protocol, so can anyone point me to it or PR with email will be
> > > >> enough?
> > > >>>>>>>>>
> > > >>>>>>>>> Thanks.
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>
> > > >
> > >
>

Reply via email to