Re: Timestamps with different precision / Timedeltas

Jacques Nadeau Tue, 12 Jul 2016 17:47:08 -0700

Julian has some experience with the Oracle internals where the perfect
numeric type solves many problems...  :D




On Tue, Jul 12, 2016 at 5:43 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> As one data point, none of the systems I work with use decimals for
> representing timestamps (UNIX timestamps at some resolution, second /
> milli / nano, is the most common), so having decimal as the default
> storage class would cause a computational hardship. We may consider
> incorporating the Timestamp storage type into the canonical metadata.
>
> - Wes
>
> On Tue, Jul 5, 2016 at 4:21 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> > Is it worth doing a review of different file formats and database
> > systems to decide on a timestamp implementation (int64 or int96 with
> > some resolution seems to be quite popular as well)? At least in the
> > Arrow C++ codebase, we need to add decimal handling logic anyway.
> >
> > On Mon, Jun 27, 2016 at 5:20 PM, Julian Hyde <jh...@apache.org> wrote:
> >> SQL allows timestamps to be stored with any precision (i.e. number of
> digits after the decimal point) between 0 and 9. That strongly indicates to
> me that the right implementation of timestamps is as (fixed point) decimal
> values.
> >>
> >> Then devote your efforts to getting the decimal type working correctly.
> >>
> >>
> >>> On Jun 27, 2016, at 3:16 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> >>>
> >>> hi Uwe,
> >>>
> >>> Thanks for bringing this up. So far we've largely been skirting the
> >>> "Logical Types Rabbit Hole", but it would be good to start a document
> >>> collecting requirements for various logical types (e.g. timestamps) so
> >>> that we can attempt to achieve good solutions on the first try based
> >>> on the experiences (good and bad) of other projects.
> >>>
> >>> In the IPC flatbuffers metadata spec that we drafted for discussion /
> >>> prototype implementation earlier this year [1], we do have a Timestamp
> >>> logical type containing only a timezone optional field [2]. If you
> >>> contrast this with Feather (which uses Arrow's physical memory layout,
> >>> but custom metadata to suit Python/R needs), that has both a unit and
> >>> timezone [3].
> >>>
> >>> Since there is little consensus in the units of timestamps (more
> >>> consensus around the UNIX 1970-01-01 epoch, but not even 100%
> >>> uniformity), I believe the best route would be to add a unit to the
> >>> metadata to indicates second through nanosecond resolution. Same goes
> >>> for a Time type.
> >>>
> >>> For example, Parquet has both milliseconds and microseconds (in
> >>> Parquet 2.0). But earlier versions of Parquet don't have this at all
> >>> [4]. Other systems like Hive and Impala are relying on their own table
> >>> metadata to convert back and forth (e.g. embedding timestamps of
> >>> whatever resolution in int64 or int96).
> >>>
> >>> For Python pandas that want to use Parquet files (via Arrow) in their
> >>> workflow, we're stuck with a couple options:
> >>>
> >>> 1) Drop sub-microsecond nanos and store timestamps as TIMESTAMP_MICROS
> >>> (or MILLIS? Not all Parquet readers may be aware of the new
> >>> microsecond ConvertedType)
> >>> 2) Store nanosecond timestamps as INT64 and add a bespoke entry to
> >>> ColumnMetaData::key_value_metadata (it's better than nothing?).
> >>>
> >>> I see use cases for both of these -- for Option 1, you may care about
> >>> interoperability with another system that uses Parquet. For Option 2,
> >>> you may care about preserving the fidelity of your pandas data.
> >>> Realistically, #1 seems like the best default option. It makes sense
> >>> to offer #2 as an option.
> >>>
> >>> I don't think addressing time zones in the first pass is strictly
> >>> necessary, but as long as we store timestamps as UTC, we can also put
> >>> the time zone in the KeyValue metadata.
> >>>
> >>> I'm not sure about the Interval type -- let's create a JIRA and tackle
> >>> that in a separate discussion. I agree that it merits inclusion as a
> >>> logical type, but I'm not sure what storage representation makes the
> >>> most sense (e.g. is is not clear to me why Parquet does not store the
> >>> interval as an absolute number of milliseconds; perhaps to accommodate
> >>> month-based intervals which may have different absolute lengths
> >>> depending on where you start).
> >>>
> >>> Let me know what you think, and if others have thoughts I'd be
> interested too.
> >>>
> >>> thanks,
> >>> Wes
> >>>
> >>> [1]: https://github.com/apache/arrow/blob/master/format/Message.fbs
> >>> [2] :
> https://github.com/apache/arrow/blob/master/format/Message.fbs#L51
> >>> [3]:
> https://github.com/wesm/feather/blob/master/cpp/src/feather/metadata.fbs#L78
> >>> [4]:
> https://github.com/apache/parquet-format/blob/parquet-format-2.0.0/src/thrift/parquet.thrift
> >>>
> >>> On Tue, Jun 21, 2016 at 1:40 PM, Uwe Korn <uw...@xhochy.com> wrote:
> >>>> Hello,
> >>>>
> >>>> in addition to categoricals, we also miss at the moment a conversion
> from
> >>>> Timestamps in Pandas/NumPy to Arrow. Currently we only have two
> (exact)
> >>>> resolutions for them: DATE for days and TIMESTAMP for milliseconds. As
> >>>> https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html
> notes there
> >>>> are several more. We do not need to cater for all but at least some
> of them.
> >>>> Therefore I have the following questions which I like to have solved
> in some
> >>>> form before implementing:
> >>>>
> >>>> * Do we want to cater for other resolutions?
> >>>> * If we do not provide, e.g. nanosecond resolution (sadly the default
> >>>>   in Pandas), do we cast with precision loss to the nearest match? Or
> >>>>   should we force the user to do it?
> >>>> * Not so important for me at the moment: Do we want to support time
> zones?
> >>>>
> >>>> My current objective is to have them for Parquet file writing. Sadly
> this
> >>>> has the same limitations. So the two main options seem to be
> >>>>
> >>>> * "roundtrip will only yield correct timezone and logical type if we
> >>>>   read with Arrow/Pandas again (as we use "proprietary" metadata to
> >>>>   encode it)"
> >>>> * "we restrict us to milliseconds and days as resolution" (for the
> >>>>   latter option, we need to decide how graceful we want to be in the
> >>>>   Pandas<->Arrow conversion).
> >>>>
> >>>> Further datatype we have not yet in Arrow but partly in Parquet is
> timedelta
> >>>> (or INTERVAL in Parquet). Probably we need to add another logical
> type to
> >>>> Arrow to implement them. Open for suggestions here, too.
> >>>>
> >>>> Also in the Arrow spec there is TIME which seems to be the same as
> TIMESTAMP
> >>>> (as far as the comments in the C++ code goes). Is there maybe some
> >>>> distinction I'm missing?
> >>>>
> >>>> Cheers
> >>>>
> >>>> Uwe
> >>>>
> >>
>

Re: Timestamps with different precision / Timedeltas

Reply via email to