As one data point, none of the systems I work with use decimals for
representing timestamps (UNIX timestamps at some resolution, second /
milli / nano, is the most common), so having decimal as the default
storage class would cause a computational hardship. We may consider
incorporating the Timestamp storage type into the canonical metadata.

- Wes

On Tue, Jul 5, 2016 at 4:21 PM, Wes McKinney <[email protected]> wrote:
> Is it worth doing a review of different file formats and database
> systems to decide on a timestamp implementation (int64 or int96 with
> some resolution seems to be quite popular as well)? At least in the
> Arrow C++ codebase, we need to add decimal handling logic anyway.
>
> On Mon, Jun 27, 2016 at 5:20 PM, Julian Hyde <[email protected]> wrote:
>> SQL allows timestamps to be stored with any precision (i.e. number of digits 
>> after the decimal point) between 0 and 9. That strongly indicates to me that 
>> the right implementation of timestamps is as (fixed point) decimal values.
>>
>> Then devote your efforts to getting the decimal type working correctly.
>>
>>
>>> On Jun 27, 2016, at 3:16 PM, Wes McKinney <[email protected]> wrote:
>>>
>>> hi Uwe,
>>>
>>> Thanks for bringing this up. So far we've largely been skirting the
>>> "Logical Types Rabbit Hole", but it would be good to start a document
>>> collecting requirements for various logical types (e.g. timestamps) so
>>> that we can attempt to achieve good solutions on the first try based
>>> on the experiences (good and bad) of other projects.
>>>
>>> In the IPC flatbuffers metadata spec that we drafted for discussion /
>>> prototype implementation earlier this year [1], we do have a Timestamp
>>> logical type containing only a timezone optional field [2]. If you
>>> contrast this with Feather (which uses Arrow's physical memory layout,
>>> but custom metadata to suit Python/R needs), that has both a unit and
>>> timezone [3].
>>>
>>> Since there is little consensus in the units of timestamps (more
>>> consensus around the UNIX 1970-01-01 epoch, but not even 100%
>>> uniformity), I believe the best route would be to add a unit to the
>>> metadata to indicates second through nanosecond resolution. Same goes
>>> for a Time type.
>>>
>>> For example, Parquet has both milliseconds and microseconds (in
>>> Parquet 2.0). But earlier versions of Parquet don't have this at all
>>> [4]. Other systems like Hive and Impala are relying on their own table
>>> metadata to convert back and forth (e.g. embedding timestamps of
>>> whatever resolution in int64 or int96).
>>>
>>> For Python pandas that want to use Parquet files (via Arrow) in their
>>> workflow, we're stuck with a couple options:
>>>
>>> 1) Drop sub-microsecond nanos and store timestamps as TIMESTAMP_MICROS
>>> (or MILLIS? Not all Parquet readers may be aware of the new
>>> microsecond ConvertedType)
>>> 2) Store nanosecond timestamps as INT64 and add a bespoke entry to
>>> ColumnMetaData::key_value_metadata (it's better than nothing?).
>>>
>>> I see use cases for both of these -- for Option 1, you may care about
>>> interoperability with another system that uses Parquet. For Option 2,
>>> you may care about preserving the fidelity of your pandas data.
>>> Realistically, #1 seems like the best default option. It makes sense
>>> to offer #2 as an option.
>>>
>>> I don't think addressing time zones in the first pass is strictly
>>> necessary, but as long as we store timestamps as UTC, we can also put
>>> the time zone in the KeyValue metadata.
>>>
>>> I'm not sure about the Interval type -- let's create a JIRA and tackle
>>> that in a separate discussion. I agree that it merits inclusion as a
>>> logical type, but I'm not sure what storage representation makes the
>>> most sense (e.g. is is not clear to me why Parquet does not store the
>>> interval as an absolute number of milliseconds; perhaps to accommodate
>>> month-based intervals which may have different absolute lengths
>>> depending on where you start).
>>>
>>> Let me know what you think, and if others have thoughts I'd be interested 
>>> too.
>>>
>>> thanks,
>>> Wes
>>>
>>> [1]: https://github.com/apache/arrow/blob/master/format/Message.fbs
>>> [2] : https://github.com/apache/arrow/blob/master/format/Message.fbs#L51
>>> [3]: 
>>> https://github.com/wesm/feather/blob/master/cpp/src/feather/metadata.fbs#L78
>>> [4]: 
>>> https://github.com/apache/parquet-format/blob/parquet-format-2.0.0/src/thrift/parquet.thrift
>>>
>>> On Tue, Jun 21, 2016 at 1:40 PM, Uwe Korn <[email protected]> wrote:
>>>> Hello,
>>>>
>>>> in addition to categoricals, we also miss at the moment a conversion from
>>>> Timestamps in Pandas/NumPy to Arrow. Currently we only have two (exact)
>>>> resolutions for them: DATE for days and TIMESTAMP for milliseconds. As
>>>> https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html notes there
>>>> are several more. We do not need to cater for all but at least some of 
>>>> them.
>>>> Therefore I have the following questions which I like to have solved in 
>>>> some
>>>> form before implementing:
>>>>
>>>> * Do we want to cater for other resolutions?
>>>> * If we do not provide, e.g. nanosecond resolution (sadly the default
>>>>   in Pandas), do we cast with precision loss to the nearest match? Or
>>>>   should we force the user to do it?
>>>> * Not so important for me at the moment: Do we want to support time zones?
>>>>
>>>> My current objective is to have them for Parquet file writing. Sadly this
>>>> has the same limitations. So the two main options seem to be
>>>>
>>>> * "roundtrip will only yield correct timezone and logical type if we
>>>>   read with Arrow/Pandas again (as we use "proprietary" metadata to
>>>>   encode it)"
>>>> * "we restrict us to milliseconds and days as resolution" (for the
>>>>   latter option, we need to decide how graceful we want to be in the
>>>>   Pandas<->Arrow conversion).
>>>>
>>>> Further datatype we have not yet in Arrow but partly in Parquet is 
>>>> timedelta
>>>> (or INTERVAL in Parquet). Probably we need to add another logical type to
>>>> Arrow to implement them. Open for suggestions here, too.
>>>>
>>>> Also in the Arrow spec there is TIME which seems to be the same as 
>>>> TIMESTAMP
>>>> (as far as the comments in the C++ code goes). Is there maybe some
>>>> distinction I'm missing?
>>>>
>>>> Cheers
>>>>
>>>> Uwe
>>>>
>>

Reply via email to