I think it is plausible that we use Arrow structs to create a synthetic
interval type for DataFusion (I don't have a compelling usecase to store
the intervals themselves, or to expose them outside of DataFusion).

However it seems a little unfortunate that there is now way to represent a
"common" interval like "1 week and 1 hour" with native arrow types



On Fri, Apr 2, 2021 at 4:38 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> The real usecase I have is "postgres compatibility"
>
>
> Yeah, I'm a little conflicted on this.  A broader analysis might be
> necessary and I'd welcome others thoughts, but at what point should we
> mostly consider the type system closed?  Should we be aiming for full
> parity with ANSI SQL/Postgres SQL or something else?
>
>
>> I have no known need for the actual postgres timestamp internal
>> representation.
>
>
> I suppose there is an edge case that the seconds range is larger for
> microseconds compared to nanoseconds with the simple representation.  But
> that seems minor.
>
> On Fri, Apr 2, 2021 at 1:25 PM Andrew Lamb <al...@influxdata.com> wrote:
>
>> The real usecase I have is "postgres compatibility" - in the sense that
>> we can write SQL queries / expressions that use postgres interval type [1]
>> and corresponding expressions with the full postgres interval range. I have
>> no known need for the actual postgres timestamp internal representation.
>>
>> A more vague usecase I would like (but do not strictly need) is for
>> expressing intervals up to the same precision as IOx's underlying
>> timestamps (which are stored as nanosecond precision).
>>
>> [1]
>> https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-INTERVAL-INPUT
>>
>> "Internally interval values are stored as months, days, and seconds.
>> This is done because the number of days in a month varies, and a day can
>> have 23 or 25 hours if a daylight savings time adjustment is involved. The
>> months and days fields are integers while the seconds field can store
>> fractions. Because intervals are usually created from constant strings or
>>  timestamp subtraction, this storage method works well in most cases,
>> but can cause unexpected results:
>> "
>>
>> On Fri, Apr 2, 2021 at 3:53 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> Andrew is the use-case you have simply postgres compatibility or is it
>>> more
>>> extensive?
>>>
>>> One potential problem with combining Month and Day fields, is that the
>>> type
>>> no longer has a defined sort order (the existing Day-Millisecond type
>>> without assumptions, in particular because I don't think today there is
>>> an
>>> explicit constraint on the bounds for the millisecond component).
>>>
>>> -Micah
>>>
>>>
>>>
>>> On Wed, Mar 31, 2021 at 9:03 AM Antoine Pitrou <anto...@python.org>
>>> wrote:
>>>
>>> >
>>> > Le 31/03/2021 à 17:55, Micah Kornfield a écrit :
>>> > > Thanks for the feedback.  A couple of points here and some responses
>>> > below.
>>> > >
>>> > > * One other question is whether the Nanoseconds should actually be
>>> > > configurable (i.e. use milliseconds or microseconds).  I would lean
>>> > towards
>>> > > no.
>>> >
>>> > Same for me.
>>> >
>>> > > * I'm also still not 100% convinced we need this as a first class
>>> type in
>>> > > arrow or if we should be looking more closely at the Struct (in the
>>> Arrow
>>> > > sense) based implementation.  In the future where alternative
>>> encodings
>>> > are
>>> > > supported, this could allow for much smaller footprints for this
>>> type.
>>> >
>>> > Having a "packed" first class type allows for better locality when
>>> > accessing data.  It doesn't sound very likely that you'd access only
>>> one
>>> > component of the interval.
>>> >
>>> > But I have no idea how important this is, and temporal datetypes are
>>> > generally cumbersome to add support for (conversions, arithmetic,
>>> etc.),
>>> > so it would be nice to avoid adding too many of them :-)
>>> >
>>> > Regards
>>> >
>>> > Antoine.
>>> >
>>> >
>>> >
>>> > >
>>> > > The 3
>>> > >> field implementation doesn't seem to have any way to represent
>>> integral
>>> > >> days, so I am also not sure about that one.
>>> > >
>>> > >
>>> > > Sorry this was an email gaffe.  I intended Month (32 bit int), Day
>>> (32
>>> > bit
>>> > > int), Nanosecond (64 bit int).
>>> > >
>>> > > OTOH I don't really understand the point of supporting "the most
>>> > >> reasonable ranges for Year, Month and Nanoseconds independently".
>>> What
>>> > >> does it bring to encode more than one month in the nanoseconds
>>> field?
>>> > >
>>> > >
>>> > > I'm happy with simplicity.   In the past there has been some
>>> reference to
>>> > > people wanting to store very large timestamps (fall out of
>>> Nanoseconds
>>> > max
>>> > > representable value) but we've concluded that this wasn't something
>>> that
>>> > we
>>> > > wanted to really support.
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > On Wed, Mar 31, 2021 at 4:49 AM Antoine Pitrou <anto...@python.org>
>>> > wrote:
>>> > >
>>> > >>
>>> > >> I would favour the following characteristics :
>>> > >> - support for nanoseconds (especially as other Arrow temporal types
>>> > >> support it)
>>> > >> - easy to handle (which excludes the ZetaSQL representtaion IMHO)
>>> > >>
>>> > >> OTOH I don't really understand the point of supporting "the most
>>> > >> reasonable ranges for Year, Month and Nanoseconds independently".
>>> What
>>> > >> does it bring to encode more than one month in the nanoseconds
>>> field?
>>> > >> You can already use the Duration type for that.
>>> > >>
>>> > >> Regards
>>> > >>
>>> > >> Antoine.
>>> > >>
>>> > >>
>>> > >> Le 31/03/2021 à 05:48, Micah Kornfield a écrit :
>>> > >>> To follow-up on this conversation I did some analysis on interval
>>> > types:
>>> > >>>
>>> > >>>
>>> > >>
>>> >
>>> https://docs.google.com/document/d/1i1E_fdQ_xODZcAhsV11Pfq27O50k679OYHXFJpm9NS0/edit
>>> > >> Please feel free to add more details/systems I missed.
>>> > >>>
>>> > >>> Given the disparate requirements of different systems I think the
>>> > >> following might make sense for official types (if there isn't
>>> > consensus, I
>>> > >> might try to contributation extension Array implementations for
>>> them to
>>> > >> Java and C++/Python separately).
>>> > >>>
>>> > >>> 1.  3 fields: Year (32 bit), Month (32 bit), Nanoseconds (64 bit)
>>> all
>>> > >> signed.
>>> > >>> 2.  Postgres representation (Downside is it doesn't support
>>> > Nanoseconds,
>>> > >> only microseconds).
>>> > >>> 3.  ZetaSQL implementation (Requires some bit manipulation) but
>>> > supports
>>> > >> the most reasonable ranges for Year, Month and Nanoseconds
>>> > independently.
>>> > >>>
>>> > >>> Thoughts?
>>> > >>>
>>> > >>> Micah
>>> > >>>
>>> > >>> On 2021/02/18 04:30:55 Micah Kornfield wrote:
>>> > >>>>>
>>> > >>>>> I didn’t find any page/documentation on how to do RFC in Arrow
>>> > >> protocol,
>>> > >>>>> so can anyone point me to it or PR with email will be enough?
>>> > >>>>
>>> > >>>> That is enough to start discussion.  Before formal acceptance and
>>> > >> merging
>>> > >>>> of the PR there needs to be a Java and C++ implementations for the
>>> > type
>>> > >>>> that pass integration tests.  At the time this guideline was
>>> > instituted
>>> > >>>> Java and C++ were considered the "reference" implementations (I
>>> think
>>> > >> they
>>> > >>>> still have the most complete integration test coverage).
>>> > >>>>
>>> > >>>> My understanding is that the current modelling of intervals
>>> mimics SQL
>>> > >>>> standards (e.g. SQL Server [1]).  So it would also be good to step
>>> > back
>>> > >> and
>>> > >>>> understand what problem DF is trying to solve and how it differs
>>> from
>>> > >> other
>>> > >>>> SQL implementations.  I'd be hesitant to accept COMPLEX as a new
>>> type
>>> > >>>> without a much deeper analysis into calendar representations
>>> within
>>> > >> Arrow
>>> > >>>> and how they relate to other existing systems (e.g. Hive and some
>>> > >>>> assortment of existing SQL databases).  For instance the current
>>> > >> modelling
>>> > >>>> of timestamps does not lend itself to constructing a COMPLEX
>>> interval
>>> > >> type
>>> > >>>> particularly well. (Duration was introduced for this reason).
>>> > >>>>
>>> > >>>> I think both Wes's suggestion of FixedSizeBinary and Andrew's of
>>> > >> composing
>>> > >>>> the with a struct are good stop-gaps.  These obviously have
>>> different
>>> > >>>> trade-offs.  Ultimately, it would be good to define common
>>> extension
>>> > >> types
>>> > >>>> that can represent this use-case if there really is demand for it
>>> (if
>>> > it
>>> > >>>> doesn't become a top level type).
>>> > >>>>
>>> > >>>> [1]
>>> > >>>>
>>> > >>
>>> >
>>> https://docs.microsoft.com/en-us/sql/odbc/reference/appendixes/interval-data-types?view=sql-server-ver15
>>> > >>>>
>>> > >>>> -Micah
>>> > >>>>
>>> > >>>> On Wed, Feb 17, 2021 at 2:05 PM Andrew Lamb <al...@influxdata.com
>>> >
>>> > >> wrote:
>>> > >>>>
>>> > >>>>> That is a great suggestion Wes, thank you.
>>> > >>>>>
>>> > >>>>> I wonder if we could get away with a 128 bit representation that
>>> is
>>> > the
>>> > >>>>> concatenation of the two existing interval types
>>> > (YearMonth)(DayTime).
>>> > >> Or
>>> > >>>>> maybe even define a `struct` type with those fields that is used
>>> by
>>> > >>>>> DataFusion.
>>> > >>>>>
>>> > >>>>> Basically, given our reading of the Arrow spec[1], it is
>>> currently
>>> > not
>>> > >>>>> possible to precisely represent an interval that has both
>>> monthly and
>>> > >>>>> sub-montly granularity.
>>> > >>>>>
>>> > >>>>> As Dmtry says, if you have an interval seemingly simple like  1
>>> > month,
>>> > >> 1
>>> > >>>>> day
>>> > >>>>>
>>> > >>>>> Using IntervalUnit(YEAR_MONTH) can't represent the 1 day
>>> > >>>>> Using IntervalUnit(DAY_TIME) can't represent the month as
>>> different
>>> > >> months
>>> > >>>>> have different numbers of days
>>> > >>>>>
>>> > >>>>> [1]
>>> > >>>>>
>>> > >>
>>> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L249-L260
>>> > >>>>>
>>> > >>>>>
>>> > >>>>> On Wed, Feb 17, 2021 at 5:01 PM Wes McKinney <
>>> wesmck...@gmail.com>
>>> > >> wrote:
>>> > >>>>>
>>> > >>>>>> On Wed, Feb 17, 2021 at 3:46 PM <t...@dmtry.me> wrote:
>>> > >>>>>>>
>>> > >>>>>>>> It's unclear to me that this needs to be introduced into the
>>> > >>>>> top-level
>>> > >>>>>>>
>>> > >>>>>>> Similar thing to columnar format, How to store interval like 1
>>> > month
>>> > >> 1
>>> > >>>>>> day 1 hour? It’s not possible to do it without converting 1
>>> month to
>>> > >> 30
>>> > >>>>>> days, which is a bad way.
>>> > >>>>>>>
>>> > >>>>>>
>>> > >>>>>> Presumably you can represent a complex interval in a fixed
>>> number of
>>> > >>>>>> bytes, and then embed the data in a FixedSizeBinary type. You
>>> can
>>> > >>>>>> adorn this type with extension type metadata so that DataFusion
>>> can
>>> > >>>>>> then apply Interval semantics to it. This could also serve as an
>>> > >>>>>> interim strategy for you to proceed with implementation while
>>> > >>>>>> proposing a top-level type to the Arrow format (which may or
>>> may not
>>> > >>>>>> be accepting) so you aren't blocked on acceptance of changes
>>> into
>>> > >>>>>> Schema.fbs.
>>> > >>>>>>
>>> > >>>>>>>> On 17 Feb 2021, at 21:02, Wes McKinney <wesmck...@gmail.com>
>>> > wrote:
>>> > >>>>>>>>
>>> > >>>>>>>> It's unclear to me that this needs to be introduced into the
>>> > >>>>> top-level
>>> > >>>>>>>> columnar format without more analysis — have you considered
>>> > >>>>>>>> implementing this for DataFusion as an extension type for the
>>> time
>>> > >>>>>>>> being?
>>> > >>>>>>>>
>>> > >>>>>>>> On Wed, Feb 17, 2021 at 11:59 AM t...@dmtry.me <mailto:
>>> > >> t...@dmtry.me
>>> > >>>>>>
>>> > >>>>>> <t...@dmtry.me <mailto:t...@dmtry.me>> wrote:
>>> > >>>>>>>>>
>>> > >>>>>>>>> Hi,
>>> > >>>>>>>>>
>>> > >>>>>>>>> For now, There are only two types of IntervalUnit inside
>>> Arrow:
>>> > >>>>>>>>>
>>> > >>>>>>>>> - YearMonth - month stored as int32
>>> > >>>>>>>>> - DayTime - days as int32 and time in milliseconds  as in32.
>>> > Total
>>> > >>>>>> (64 bites)
>>> > >>>>>>>>>
>>> > >>>>>>>>> Since DF is using Arrow, It’s not possible to store “Complex”
>>> > >>>>>> intervals such 1 MONTH 1 DAY 1 HOUR.
>>> > >>>>>>>>> I think, the best way to understand the problem will be to
>>> read a
>>> > >>>>>> comment from DF codebase:
>>> > >>>>>>
>>> > >>>>>
>>> > >>
>>> >
>>> https://github.com/apache/arrow/blob/bca7d2fe84ccd8fc1129cb4d85448eb0779c52c3/rust/datafusion/src/sql/planner.rs#L1148
>>> > >>>>>>>>>
>>> > >>>>>>>>>          // Interval is tricky thing
>>> > >>>>>>>>>          // 1 day is not 24 hours because timezones, 1 year
>>> !=
>>> > >>>>> 365/364!
>>> > >>>>>> 30 days != 1 month
>>> > >>>>>>>>>          // The true way to store and calculate intervals is
>>> to
>>> > >> store
>>> > >>>>>> it as it defined
>>> > >>>>>>>>>          // Due the fact that Arrow supports only two types
>>> > >> YearMonth
>>> > >>>>>> (month) and DayTime (day, time)
>>> > >>>>>>>>>          // It's not possible to store complex intervals
>>> > >>>>>>>>>          // It's possible to do select (NOW() + INTERVAL '1
>>> > year') +
>>> > >>>>>> INTERVAL '1 day'; as workaround
>>> > >>>>>>>>>          if result_month != 0 && (result_days != 0 ||
>>> > result_millis
>>> > >> !=
>>> > >>>>>> 0) {
>>> > >>>>>>>>>              return
>>> Err(DataFusionError::NotImplemented(format!(
>>> > >>>>>>>>>                  "DF does not support intervals that have
>>> both a
>>> > >>>>>> Year/Month part as well as Days/Hours/Mins/Seconds: {:?}. Hint:
>>> try
>>> > >>>>>> breaking the interval into two parts, one with Year/Month and
>>> the
>>> > >> other
>>> > >>>>>> with Days/Hours/Mins/Seconds - e.g. (NOW() + INTERVAL '1 year')
>>> +
>>> > >>>>> INTERVAL
>>> > >>>>>> '1 day'",
>>> > >>>>>>>>>                  value
>>> > >>>>>>>>>              )));
>>> > >>>>>>>>>          }
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>> I prepared a PR
>>> https://github.com/apache/arrow/pull/9516/files
>>> > <
>>> > >>>>>> https://github.com/apache/arrow/pull/9516/files> <
>>> > >>>>>> https://github.com/apache/arrow/pull/9516/files <
>>> > >>>>>> https://github.com/apache/arrow/pull/9516/files>> that
>>> introduce a
>>> > >> new
>>> > >>>>>> type for IntervalUnit called Complex, that store both YearMonth
>>> and
>>> > >>>>> DayTime
>>> > >>>>>> to support complex interval.
>>> > >>>>>>>>> I didn’t find any page/documentation on how to do RFC in
>>> Arrow
>>> > >>>>>> protocol, so can anyone point me to it or PR with email will be
>>> > >> enough?
>>> > >>>>>>>>>
>>> > >>>>>>>>> Thanks.
>>> > >>>>>>>
>>> > >>>>>>
>>> > >>>>>
>>> > >>>>
>>> > >>
>>> > >
>>> >
>>>
>>

Reply via email to