I agree that having a Decimal type for timestamps is a nice definition. Having your time encoded as seconds or nanoseconds should be the same as having a scale of the respective amount. But I would rather avoid having a separate decimal physical type, so I'd prefer the Parquet approach where decimal is only a logical type, backed by either a byte array, int32 or int64.

Thus a more general timestamp could look like:

* Decimals are logical types; the physical types are the same as defined in Parquet [1].
* The base unit for timestamps is seconds; you get milliseconds and nanoseconds by using a different scale (see the sketch below). Note that these resolutions are all powers of ten apart, thus matching the specification of decimal scale really well.
* Timestamp is just another logical type that refers to Decimal (and optionally may have a timezone), signalling that we have a time and not just a "simple" decimal.
* For a first iteration, I would assume no timezone (or UTC) and not include a metadata field. Once we're sure the implementation works, we can add metadata about it.
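
To make the scale idea concrete, here is a minimal Python sketch (the
function name to_seconds is mine, not from any spec), using Parquet's
decimal semantics where the logical value is unscaled * 10**-scale:

    # Sketch only: a timestamp is an int64 plus a decimal scale, with
    # seconds as the base unit.
    # scale 0 -> seconds, scale 3 -> milliseconds, scale 9 -> nanoseconds
    def to_seconds(unscaled, scale):
        # logical value = unscaled * 10**-scale (Parquet decimal rule)
        return unscaled / 10**scale

    # The same instant stored at three different scales:
    print(to_seconds(1467390060, 0))           # 1467390060.0
    print(to_seconds(1467390060123, 3))        # 1467390060.123
    print(to_seconds(1467390060123456789, 9))  # ~1467390060.123 (float limits)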

Timedeltas could be addressed in a similar way, just without the need for a timezone.

For my usage, I don't have a use case for timestamps larger than int64, and I would like to have them stored exactly as int64 in my computations, hence my preference for the Parquet way.

Uwe

[1] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

On 13.07.16 03:06, Julian Hyde wrote:
I'm talking about a fixed decimal type, not floating decimal. (Oracle
numbers are floating decimal. They have a few nice properties, but
they are variable width and can get quite large. I've seen one or two
systems that started with binary floating point numbers, which are
much worse for business computing, and then changed to Java BigDecimal,
which gives the right answer but is horribly inefficient.)

A fixed decimal type has virtually zero computational overhead. It
just has a piece of metadata saying something like "every value in
this field is multiplied by 1 million" and leaves it to the client
program to do that multiplying.
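
As a rough illustration (my sketch, not Oracle's or Arrow's actual
code), with scale 6 playing the "multiplied by 1 million" role and
Python's decimal module standing in for the client-side scaling:

    from decimal import Decimal

    SCALE = 6                    # field metadata: stored = logical * 10**6
    stored = [1000000, 2500000]  # plain integers, no per-value overhead

    # Only the client applies the scale; storage stays fixed-width ints.
    logical = [Decimal(v).scaleb(-SCALE) for v in stored]
    print(logical)               # [Decimal('1.000000'), Decimal('2.500000')]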

My advice is to create a good fixed decimal type and lean on it heavily.

Julian


On Tue, Jul 12, 2016 at 5:46 PM, Jacques Nadeau <jacq...@apache.org> wrote:
Julian has some experience with the Oracle internals where the perfect
numeric type solves many problems...  :D



On Tue, Jul 12, 2016 at 5:43 PM, Wes McKinney <wesmck...@gmail.com> wrote:

As one data point, none of the systems I work with use decimals for
representing timestamps (UNIX timestamps at some resolution, second /
milli / nano, are the most common), so having decimal as the default
storage class would cause a computational hardship. We may consider
incorporating the Timestamp storage type into the canonical metadata.

- Wes

On Tue, Jul 5, 2016 at 4:21 PM, Wes McKinney <wesmck...@gmail.com> wrote:
Is it worth doing a review of different file formats and database
systems to decide on a timestamp implementation (int64 or int96 with
some resolution seems to be quite popular as well)? At least in the
Arrow C++ codebase, we need to add decimal handling logic anyway.

On Mon, Jun 27, 2016 at 5:20 PM, Julian Hyde <jh...@apache.org> wrote:
SQL allows timestamps to be stored with any precision (i.e. number of
digits after the decimal point) between 0 and 9. That strongly indicates to
me that the right implementation of timestamps is as (fixed point) decimal
values.
Then devote your efforts to getting the decimal type working correctly.


On Jun 27, 2016, at 3:16 PM, Wes McKinney <wesmck...@gmail.com> wrote:

hi Uwe,

Thanks for bringing this up. So far we've largely been skirting the
"Logical Types Rabbit Hole", but it would be good to start a document
collecting requirements for various logical types (e.g. timestamps) so
that we can attempt to achieve good solutions on the first try based
on the experiences (good and bad) of other projects.

In the IPC flatbuffers metadata spec that we drafted for discussion /
prototype implementation earlier this year [1], we do have a Timestamp
logical type containing only a timezone optional field [2]. If you
contrast this with Feather (which uses Arrow's physical memory layout,
but custom metadata to suit Python/R needs), that has both a unit and
timezone [3].

Since there is little consensus in the units of timestamps (more
consensus around the UNIX 1970-01-01 epoch, but not even 100%
uniformity), I believe the best route would be to add a unit to the
metadata that indicates second through nanosecond resolution. The same goes
for a Time type.

For example, Parquet has both milliseconds and microseconds (in
Parquet 2.0). But earlier versions of Parquet don't have this at all
[4]. Other systems like Hive and Impala are relying on their own table
metadata to convert back and forth (e.g. embedding timestamps of
whatever resolution in int64 or int96).

For pandas users who want to use Parquet files (via Arrow) in their
workflow, we're stuck with a couple of options:

1) Drop sub-microsecond nanos and store timestamps as TIMESTAMP_MICROS
(or MILLIS? Not all Parquet readers may be aware of the new
microsecond ConvertedType)
2) Store nanosecond timestamps as INT64 and add a bespoke entry to
ColumnMetaData::key_value_metadata (it's better than nothing?).

I see use cases for both of these -- for Option 1, you may care about
interoperability with another system that uses Parquet. For Option 2,
you may care about preserving the fidelity of your pandas data.
Realistically, #1 seems like the best default option. It makes sense
to offer #2 as an option.
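
As a sketch of Option 1 with plain pandas/NumPy casting (not a settled
pyarrow API), dropping the sub-microsecond part would look like:

    import pandas as pd

    s = pd.Series(pd.to_datetime(['2016-07-12 17:46:00.123456789']))
    # pandas holds datetime64[ns]; truncate to microseconds before
    # writing TIMESTAMP_MICROS -- the trailing nanoseconds are lost
    micros = s.values.astype('datetime64[us]')
    print(micros)  # ['2016-07-12T17:46:00.123456']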

I don't think addressing time zones in the first pass is strictly
necessary, but as long as we store timestamps as UTC, we can also put
the time zone in the KeyValue metadata.
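
A small pandas sketch of that convention (the "timezone" key below is
hypothetical, not a defined metadata field):

    import pandas as pd

    local = pd.Series(pd.to_datetime(['2016-07-13 03:06:00']))
    aware = local.dt.tz_localize('US/Pacific')
    utc = aware.dt.tz_convert('UTC')       # store these values
    metadata = {'timezone': 'US/Pacific'}  # e.g. in the KeyValue metadata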

I'm not sure about the Interval type -- let's create a JIRA and tackle
that in a separate discussion. I agree that it merits inclusion as a
logical type, but I'm not sure what storage representation makes the
most sense (e.g. it is not clear to me why Parquet does not store the
interval as an absolute number of milliseconds; perhaps to accommodate
month-based intervals which may have different absolute lengths
depending on where you start).
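
To illustrate why month-based intervals resist an absolute-milliseconds
encoding:

    from datetime import date

    # "one month later" is not a fixed span of time:
    print(date(2016, 2, 15) - date(2016, 1, 15))  # 31 days
    print(date(2016, 3, 15) - date(2016, 2, 15))  # 29 days (leap year)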

Let me know what you think, and if others have thoughts I'd be
interested too.
thanks,
Wes

[1]: https://github.com/apache/arrow/blob/master/format/Message.fbs
[2]: https://github.com/apache/arrow/blob/master/format/Message.fbs#L51
[3]: https://github.com/wesm/feather/blob/master/cpp/src/feather/metadata.fbs#L78
[4]: https://github.com/apache/parquet-format/blob/parquet-format-2.0.0/src/thrift/parquet.thrift

On Tue, Jun 21, 2016 at 1:40 PM, Uwe Korn <uw...@xhochy.com> wrote:
Hello,

in addition to categoricals, we are also currently missing a conversion
from timestamps in Pandas/NumPy to Arrow. At the moment we only have two
(exact) resolutions for them: DATE for days and TIMESTAMP for
milliseconds. As
https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html notes,
there are several more. We do not need to cater for all of them, but at
least for some. Therefore I have the following questions, which I would
like to have resolved in some form before implementing:

* Do we want to cater for other resolutions?
* If we do not provide, e.g., nanosecond resolution (sadly the default
   in Pandas), do we cast with precision loss to the nearest match (see
   the example after this list)? Or should we force the user to do it?
* Not so important for me at the moment: do we want to support time
   zones?
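
For example (plain NumPy, just to show what the precision loss looks
like):

    import numpy as np

    ts = np.datetime64('2016-06-21 13:40:00.123456789')  # nanoseconds
    ms = ts.astype('datetime64[ms]')
    print(ms)  # 2016-06-21T13:40:00.123 -- sub-millisecond digits are gone
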
My current objective is to have them for Parquet file writing. Sadly
this has the same limitations. So the two main options seem to be:

* "a roundtrip will only yield the correct timezone and logical type if
   we read with Arrow/Pandas again (as we use "proprietary" metadata to
   encode it)", or
* "we restrict ourselves to milliseconds and days as resolutions" (for
   the latter option, we need to decide how graceful we want to be in
   the Pandas<->Arrow conversion).

A further datatype that we do not yet have in Arrow, but which exists
partly in Parquet, is timedelta (INTERVAL in Parquet). We probably need
to add another logical type to Arrow to implement it. I'm open to
suggestions here, too.

Also, in the Arrow spec there is TIME, which seems to be the same as
TIMESTAMP (as far as the comments in the C++ code go). Is there maybe
some distinction I'm missing?

Cheers

Uwe

