>
> However it seems a little unfortunate that there is now way to represent a
> "common" interval like "1 week and 1 hour" with native arrow types


I might have misunderstood,but at least in postgres, I thought this boils
down to "0 months, 7 days, 3600 seconds".  Since months is 0, this seems
like it fits squarely in the existing interval type Days_Mills.

I thought what can't be represented today is "1 Year 1 Hour".  It seems
like none of the proposals so far cover weeks as an explicit type?

On Fri, Apr 2, 2021 at 2:42 PM Andrew Lamb <al...@influxdata.com> wrote:

> I think it is plausible that we use Arrow structs to create a synthetic
> interval type for DataFusion (I don't have a compelling usecase to store
> the intervals themselves, or to expose them outside of DataFusion).
>
> However it seems a little unfortunate that there is now way to represent a
> "common" interval like "1 week and 1 hour" with native arrow types
>
>
>
> On Fri, Apr 2, 2021 at 4:38 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> The real usecase I have is "postgres compatibility"
>>
>>
>> Yeah, I'm a little conflicted on this.  A broader analysis might be
>> necessary and I'd welcome others thoughts, but at what point should we
>> mostly consider the type system closed?  Should we be aiming for full
>> parity with ANSI SQL/Postgres SQL or something else?
>>
>>
>>> I have no known need for the actual postgres timestamp internal
>>> representation.
>>
>>
>> I suppose there is an edge case that the seconds range is larger for
>> microseconds compared to nanoseconds with the simple representation.  But
>> that seems minor.
>>
>> On Fri, Apr 2, 2021 at 1:25 PM Andrew Lamb <al...@influxdata.com> wrote:
>>
>>> The real usecase I have is "postgres compatibility" - in the sense that
>>> we can write SQL queries / expressions that use postgres interval type [1]
>>> and corresponding expressions with the full postgres interval range. I have
>>> no known need for the actual postgres timestamp internal representation.
>>>
>>> A more vague usecase I would like (but do not strictly need) is for
>>> expressing intervals up to the same precision as IOx's underlying
>>> timestamps (which are stored as nanosecond precision).
>>>
>>> [1]
>>> https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-INTERVAL-INPUT
>>>
>>> "Internally interval values are stored as months, days, and seconds.
>>> This is done because the number of days in a month varies, and a day can
>>> have 23 or 25 hours if a daylight savings time adjustment is involved. The
>>> months and days fields are integers while the seconds field can store
>>> fractions. Because intervals are usually created from constant strings or
>>>  timestamp subtraction, this storage method works well in most cases,
>>> but can cause unexpected results:
>>> "
>>>
>>> On Fri, Apr 2, 2021 at 3:53 PM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>>> Andrew is the use-case you have simply postgres compatibility or is it
>>>> more
>>>> extensive?
>>>>
>>>> One potential problem with combining Month and Day fields, is that the
>>>> type
>>>> no longer has a defined sort order (the existing Day-Millisecond type
>>>> without assumptions, in particular because I don't think today there is
>>>> an
>>>> explicit constraint on the bounds for the millisecond component).
>>>>
>>>> -Micah
>>>>
>>>>
>>>>
>>>> On Wed, Mar 31, 2021 at 9:03 AM Antoine Pitrou <anto...@python.org>
>>>> wrote:
>>>>
>>>> >
>>>> > Le 31/03/2021 à 17:55, Micah Kornfield a écrit :
>>>> > > Thanks for the feedback.  A couple of points here and some responses
>>>> > below.
>>>> > >
>>>> > > * One other question is whether the Nanoseconds should actually be
>>>> > > configurable (i.e. use milliseconds or microseconds).  I would lean
>>>> > towards
>>>> > > no.
>>>> >
>>>> > Same for me.
>>>> >
>>>> > > * I'm also still not 100% convinced we need this as a first class
>>>> type in
>>>> > > arrow or if we should be looking more closely at the Struct (in the
>>>> Arrow
>>>> > > sense) based implementation.  In the future where alternative
>>>> encodings
>>>> > are
>>>> > > supported, this could allow for much smaller footprints for this
>>>> type.
>>>> >
>>>> > Having a "packed" first class type allows for better locality when
>>>> > accessing data.  It doesn't sound very likely that you'd access only
>>>> one
>>>> > component of the interval.
>>>> >
>>>> > But I have no idea how important this is, and temporal datetypes are
>>>> > generally cumbersome to add support for (conversions, arithmetic,
>>>> etc.),
>>>> > so it would be nice to avoid adding too many of them :-)
>>>> >
>>>> > Regards
>>>> >
>>>> > Antoine.
>>>> >
>>>> >
>>>> >
>>>> > >
>>>> > > The 3
>>>> > >> field implementation doesn't seem to have any way to represent
>>>> integral
>>>> > >> days, so I am also not sure about that one.
>>>> > >
>>>> > >
>>>> > > Sorry this was an email gaffe.  I intended Month (32 bit int), Day
>>>> (32
>>>> > bit
>>>> > > int), Nanosecond (64 bit int).
>>>> > >
>>>> > > OTOH I don't really understand the point of supporting "the most
>>>> > >> reasonable ranges for Year, Month and Nanoseconds independently".
>>>> What
>>>> > >> does it bring to encode more than one month in the nanoseconds
>>>> field?
>>>> > >
>>>> > >
>>>> > > I'm happy with simplicity.   In the past there has been some
>>>> reference to
>>>> > > people wanting to store very large timestamps (fall out of
>>>> Nanoseconds
>>>> > max
>>>> > > representable value) but we've concluded that this wasn't something
>>>> that
>>>> > we
>>>> > > wanted to really support.
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> > > On Wed, Mar 31, 2021 at 4:49 AM Antoine Pitrou <anto...@python.org>
>>>> > wrote:
>>>> > >
>>>> > >>
>>>> > >> I would favour the following characteristics :
>>>> > >> - support for nanoseconds (especially as other Arrow temporal types
>>>> > >> support it)
>>>> > >> - easy to handle (which excludes the ZetaSQL representtaion IMHO)
>>>> > >>
>>>> > >> OTOH I don't really understand the point of supporting "the most
>>>> > >> reasonable ranges for Year, Month and Nanoseconds independently".
>>>> What
>>>> > >> does it bring to encode more than one month in the nanoseconds
>>>> field?
>>>> > >> You can already use the Duration type for that.
>>>> > >>
>>>> > >> Regards
>>>> > >>
>>>> > >> Antoine.
>>>> > >>
>>>> > >>
>>>> > >> Le 31/03/2021 à 05:48, Micah Kornfield a écrit :
>>>> > >>> To follow-up on this conversation I did some analysis on interval
>>>> > types:
>>>> > >>>
>>>> > >>>
>>>> > >>
>>>> >
>>>> https://docs.google.com/document/d/1i1E_fdQ_xODZcAhsV11Pfq27O50k679OYHXFJpm9NS0/edit
>>>> > >> Please feel free to add more details/systems I missed.
>>>> > >>>
>>>> > >>> Given the disparate requirements of different systems I think the
>>>> > >> following might make sense for official types (if there isn't
>>>> > consensus, I
>>>> > >> might try to contributation extension Array implementations for
>>>> them to
>>>> > >> Java and C++/Python separately).
>>>> > >>>
>>>> > >>> 1.  3 fields: Year (32 bit), Month (32 bit), Nanoseconds (64 bit)
>>>> all
>>>> > >> signed.
>>>> > >>> 2.  Postgres representation (Downside is it doesn't support
>>>> > Nanoseconds,
>>>> > >> only microseconds).
>>>> > >>> 3.  ZetaSQL implementation (Requires some bit manipulation) but
>>>> > supports
>>>> > >> the most reasonable ranges for Year, Month and Nanoseconds
>>>> > independently.
>>>> > >>>
>>>> > >>> Thoughts?
>>>> > >>>
>>>> > >>> Micah
>>>> > >>>
>>>> > >>> On 2021/02/18 04:30:55 Micah Kornfield wrote:
>>>> > >>>>>
>>>> > >>>>> I didn’t find any page/documentation on how to do RFC in Arrow
>>>> > >> protocol,
>>>> > >>>>> so can anyone point me to it or PR with email will be enough?
>>>> > >>>>
>>>> > >>>> That is enough to start discussion.  Before formal acceptance and
>>>> > >> merging
>>>> > >>>> of the PR there needs to be a Java and C++ implementations for
>>>> the
>>>> > type
>>>> > >>>> that pass integration tests.  At the time this guideline was
>>>> > instituted
>>>> > >>>> Java and C++ were considered the "reference" implementations (I
>>>> think
>>>> > >> they
>>>> > >>>> still have the most complete integration test coverage).
>>>> > >>>>
>>>> > >>>> My understanding is that the current modelling of intervals
>>>> mimics SQL
>>>> > >>>> standards (e.g. SQL Server [1]).  So it would also be good to
>>>> step
>>>> > back
>>>> > >> and
>>>> > >>>> understand what problem DF is trying to solve and how it differs
>>>> from
>>>> > >> other
>>>> > >>>> SQL implementations.  I'd be hesitant to accept COMPLEX as a new
>>>> type
>>>> > >>>> without a much deeper analysis into calendar representations
>>>> within
>>>> > >> Arrow
>>>> > >>>> and how they relate to other existing systems (e.g. Hive and some
>>>> > >>>> assortment of existing SQL databases).  For instance the current
>>>> > >> modelling
>>>> > >>>> of timestamps does not lend itself to constructing a COMPLEX
>>>> interval
>>>> > >> type
>>>> > >>>> particularly well. (Duration was introduced for this reason).
>>>> > >>>>
>>>> > >>>> I think both Wes's suggestion of FixedSizeBinary and Andrew's of
>>>> > >> composing
>>>> > >>>> the with a struct are good stop-gaps.  These obviously have
>>>> different
>>>> > >>>> trade-offs.  Ultimately, it would be good to define common
>>>> extension
>>>> > >> types
>>>> > >>>> that can represent this use-case if there really is demand for
>>>> it (if
>>>> > it
>>>> > >>>> doesn't become a top level type).
>>>> > >>>>
>>>> > >>>> [1]
>>>> > >>>>
>>>> > >>
>>>> >
>>>> https://docs.microsoft.com/en-us/sql/odbc/reference/appendixes/interval-data-types?view=sql-server-ver15
>>>> > >>>>
>>>> > >>>> -Micah
>>>> > >>>>
>>>> > >>>> On Wed, Feb 17, 2021 at 2:05 PM Andrew Lamb <
>>>> al...@influxdata.com>
>>>> > >> wrote:
>>>> > >>>>
>>>> > >>>>> That is a great suggestion Wes, thank you.
>>>> > >>>>>
>>>> > >>>>> I wonder if we could get away with a 128 bit representation
>>>> that is
>>>> > the
>>>> > >>>>> concatenation of the two existing interval types
>>>> > (YearMonth)(DayTime).
>>>> > >> Or
>>>> > >>>>> maybe even define a `struct` type with those fields that is
>>>> used by
>>>> > >>>>> DataFusion.
>>>> > >>>>>
>>>> > >>>>> Basically, given our reading of the Arrow spec[1], it is
>>>> currently
>>>> > not
>>>> > >>>>> possible to precisely represent an interval that has both
>>>> monthly and
>>>> > >>>>> sub-montly granularity.
>>>> > >>>>>
>>>> > >>>>> As Dmtry says, if you have an interval seemingly simple like  1
>>>> > month,
>>>> > >> 1
>>>> > >>>>> day
>>>> > >>>>>
>>>> > >>>>> Using IntervalUnit(YEAR_MONTH) can't represent the 1 day
>>>> > >>>>> Using IntervalUnit(DAY_TIME) can't represent the month as
>>>> different
>>>> > >> months
>>>> > >>>>> have different numbers of days
>>>> > >>>>>
>>>> > >>>>> [1]
>>>> > >>>>>
>>>> > >>
>>>> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L249-L260
>>>> > >>>>>
>>>> > >>>>>
>>>> > >>>>> On Wed, Feb 17, 2021 at 5:01 PM Wes McKinney <
>>>> wesmck...@gmail.com>
>>>> > >> wrote:
>>>> > >>>>>
>>>> > >>>>>> On Wed, Feb 17, 2021 at 3:46 PM <t...@dmtry.me> wrote:
>>>> > >>>>>>>
>>>> > >>>>>>>> It's unclear to me that this needs to be introduced into the
>>>> > >>>>> top-level
>>>> > >>>>>>>
>>>> > >>>>>>> Similar thing to columnar format, How to store interval like 1
>>>> > month
>>>> > >> 1
>>>> > >>>>>> day 1 hour? It’s not possible to do it without converting 1
>>>> month to
>>>> > >> 30
>>>> > >>>>>> days, which is a bad way.
>>>> > >>>>>>>
>>>> > >>>>>>
>>>> > >>>>>> Presumably you can represent a complex interval in a fixed
>>>> number of
>>>> > >>>>>> bytes, and then embed the data in a FixedSizeBinary type. You
>>>> can
>>>> > >>>>>> adorn this type with extension type metadata so that
>>>> DataFusion can
>>>> > >>>>>> then apply Interval semantics to it. This could also serve as
>>>> an
>>>> > >>>>>> interim strategy for you to proceed with implementation while
>>>> > >>>>>> proposing a top-level type to the Arrow format (which may or
>>>> may not
>>>> > >>>>>> be accepting) so you aren't blocked on acceptance of changes
>>>> into
>>>> > >>>>>> Schema.fbs.
>>>> > >>>>>>
>>>> > >>>>>>>> On 17 Feb 2021, at 21:02, Wes McKinney <wesmck...@gmail.com>
>>>> > wrote:
>>>> > >>>>>>>>
>>>> > >>>>>>>> It's unclear to me that this needs to be introduced into the
>>>> > >>>>> top-level
>>>> > >>>>>>>> columnar format without more analysis — have you considered
>>>> > >>>>>>>> implementing this for DataFusion as an extension type for
>>>> the time
>>>> > >>>>>>>> being?
>>>> > >>>>>>>>
>>>> > >>>>>>>> On Wed, Feb 17, 2021 at 11:59 AM t...@dmtry.me <mailto:
>>>> > >> t...@dmtry.me
>>>> > >>>>>>
>>>> > >>>>>> <t...@dmtry.me <mailto:t...@dmtry.me>> wrote:
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> Hi,
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> For now, There are only two types of IntervalUnit inside
>>>> Arrow:
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> - YearMonth - month stored as int32
>>>> > >>>>>>>>> - DayTime - days as int32 and time in milliseconds  as in32.
>>>> > Total
>>>> > >>>>>> (64 bites)
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> Since DF is using Arrow, It’s not possible to store
>>>> “Complex”
>>>> > >>>>>> intervals such 1 MONTH 1 DAY 1 HOUR.
>>>> > >>>>>>>>> I think, the best way to understand the problem will be to
>>>> read a
>>>> > >>>>>> comment from DF codebase:
>>>> > >>>>>>
>>>> > >>>>>
>>>> > >>
>>>> >
>>>> https://github.com/apache/arrow/blob/bca7d2fe84ccd8fc1129cb4d85448eb0779c52c3/rust/datafusion/src/sql/planner.rs#L1148
>>>> > >>>>>>>>>
>>>> > >>>>>>>>>          // Interval is tricky thing
>>>> > >>>>>>>>>          // 1 day is not 24 hours because timezones, 1 year
>>>> !=
>>>> > >>>>> 365/364!
>>>> > >>>>>> 30 days != 1 month
>>>> > >>>>>>>>>          // The true way to store and calculate intervals
>>>> is to
>>>> > >> store
>>>> > >>>>>> it as it defined
>>>> > >>>>>>>>>          // Due the fact that Arrow supports only two types
>>>> > >> YearMonth
>>>> > >>>>>> (month) and DayTime (day, time)
>>>> > >>>>>>>>>          // It's not possible to store complex intervals
>>>> > >>>>>>>>>          // It's possible to do select (NOW() + INTERVAL '1
>>>> > year') +
>>>> > >>>>>> INTERVAL '1 day'; as workaround
>>>> > >>>>>>>>>          if result_month != 0 && (result_days != 0 ||
>>>> > result_millis
>>>> > >> !=
>>>> > >>>>>> 0) {
>>>> > >>>>>>>>>              return
>>>> Err(DataFusionError::NotImplemented(format!(
>>>> > >>>>>>>>>                  "DF does not support intervals that have
>>>> both a
>>>> > >>>>>> Year/Month part as well as Days/Hours/Mins/Seconds: {:?}.
>>>> Hint: try
>>>> > >>>>>> breaking the interval into two parts, one with Year/Month and
>>>> the
>>>> > >> other
>>>> > >>>>>> with Days/Hours/Mins/Seconds - e.g. (NOW() + INTERVAL '1
>>>> year') +
>>>> > >>>>> INTERVAL
>>>> > >>>>>> '1 day'",
>>>> > >>>>>>>>>                  value
>>>> > >>>>>>>>>              )));
>>>> > >>>>>>>>>          }
>>>> > >>>>>>>>>
>>>> > >>>>>>>>>
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> I prepared a PR
>>>> https://github.com/apache/arrow/pull/9516/files
>>>> > <
>>>> > >>>>>> https://github.com/apache/arrow/pull/9516/files> <
>>>> > >>>>>> https://github.com/apache/arrow/pull/9516/files <
>>>> > >>>>>> https://github.com/apache/arrow/pull/9516/files>> that
>>>> introduce a
>>>> > >> new
>>>> > >>>>>> type for IntervalUnit called Complex, that store both
>>>> YearMonth and
>>>> > >>>>> DayTime
>>>> > >>>>>> to support complex interval.
>>>> > >>>>>>>>> I didn’t find any page/documentation on how to do RFC in
>>>> Arrow
>>>> > >>>>>> protocol, so can anyone point me to it or PR with email will be
>>>> > >> enough?
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> Thanks.
>>>> > >>>>>>>
>>>> > >>>>>>
>>>> > >>>>>
>>>> > >>>>
>>>> > >>
>>>> > >
>>>> >
>>>>
>>>

Reply via email to