Is it worth doing a review of different file formats and database systems before deciding on a timestamp implementation? (int64 or int96 with some associated resolution seems to be quite popular as well.) At least in the Arrow C++ codebase, we need to add decimal handling logic anyway.
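To make the fixed-point idea from Julian's reply below concrete, here is a minimal sketch in plain Python (the class and method names are hypothetical, not Arrow's API): a timestamp is an unscaled integer plus a scale of 0-9 fractional digits of a second, so the same instant can be carried at any resolution without changing the type.

    # Hypothetical sketch: a timestamp as a fixed-point decimal, i.e. an
    # unscaled integer equal to seconds-since-epoch times 10**scale, with
    # scale (digits after the decimal point) between 0 and 9.
    class DecimalTimestamp:
        def __init__(self, unscaled, scale):
            if not 0 <= scale <= 9:
                raise ValueError("scale must be between 0 and 9")
            self.unscaled = unscaled
            self.scale = scale

        def seconds_since_epoch(self):
            return self.unscaled / 10 ** self.scale

    # The same instant at second, millisecond, and nanosecond precision:
    print(DecimalTimestamp(1467065760, 0).seconds_since_epoch())
    print(DecimalTimestamp(1467065760000, 3).seconds_since_epoch())
    print(DecimalTimestamp(1467065760000000000, 9).seconds_since_epoch())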
On Mon, Jun 27, 2016 at 5:20 PM, Julian Hyde <jh...@apache.org> wrote:
> SQL allows timestamps to be stored with any precision (i.e. number of
> digits after the decimal point) between 0 and 9. That strongly
> indicates to me that the right implementation of timestamps is as
> (fixed point) decimal values.
>
> Then devote your efforts to getting the decimal type working correctly.
>
>
>> On Jun 27, 2016, at 3:16 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> hi Uwe,
>>
>> Thanks for bringing this up. So far we've largely been skirting the
>> "Logical Types Rabbit Hole", but it would be good to start a document
>> collecting requirements for various logical types (e.g. timestamps) so
>> that we can attempt to achieve good solutions on the first try, based
>> on the experiences (good and bad) of other projects.
>>
>> In the IPC flatbuffers metadata spec that we drafted for discussion /
>> prototype implementation earlier this year [1], we do have a Timestamp
>> logical type containing only an optional timezone field [2]. Contrast
>> this with Feather (which uses Arrow's physical memory layout, but
>> custom metadata to suit Python/R needs), which has both a unit and a
>> timezone [3].
>>
>> Since there is little consensus on the units of timestamps (there is
>> more consensus around the UNIX 1970-01-01 epoch, but not even 100%
>> uniformity), I believe the best route would be to add a unit to the
>> metadata that indicates second through nanosecond resolution. The same
>> goes for a Time type.
>>
>> For example, Parquet has both milliseconds and microseconds (in
>> Parquet 2.0), but earlier versions of Parquet don't have this at all
>> [4]. Other systems like Hive and Impala rely on their own table
>> metadata to convert back and forth (e.g. embedding timestamps of
>> whatever resolution in int64 or int96).
>>
>> For pandas users who want to use Parquet files (via Arrow) in their
>> workflow, we're stuck with a couple of options:
>>
>> 1) Drop sub-microsecond nanos and store timestamps as TIMESTAMP_MICROS
>> (or MILLIS? Not all Parquet readers may be aware of the new
>> microsecond ConvertedType).
>> 2) Store nanosecond timestamps as INT64 and add a bespoke entry to
>> ColumnMetaData::key_value_metadata (it's better than nothing?).
>>
>> I see use cases for both of these -- for Option 1, you may care about
>> interoperability with another system that uses Parquet; for Option 2,
>> you may care about preserving the fidelity of your pandas data.
>> Realistically, #1 seems like the best default, and it makes sense to
>> offer #2 as an option.
>>
>> I don't think addressing time zones in the first pass is strictly
>> necessary, but as long as we store timestamps as UTC, we can also put
>> the time zone in the KeyValue metadata.
>>
>> I'm not sure about the Interval type -- let's create a JIRA and tackle
>> that in a separate discussion. I agree that it merits inclusion as a
>> logical type, but I'm not sure what storage representation makes the
>> most sense (e.g. it is not clear to me why Parquet does not store the
>> interval as an absolute number of milliseconds; perhaps to accommodate
>> month-based intervals, which may have different absolute lengths
>> depending on where you start).
>>
>> Let me know what you think, and if others have thoughts I'd be
>> interested too.
>>
>> thanks,
>> Wes
>>
>> [1]: https://github.com/apache/arrow/blob/master/format/Message.fbs
>> [2]: https://github.com/apache/arrow/blob/master/format/Message.fbs#L51
>> [3]: https://github.com/wesm/feather/blob/master/cpp/src/feather/metadata.fbs#L78
>> [4]: https://github.com/apache/parquet-format/blob/parquet-format-2.0.0/src/thrift/parquet.thrift
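To illustrate the two options Wes describes above, a rough sketch in Python/NumPy (the metadata keys are illustrative only; a real writer would put such entries into Parquet's ColumnMetaData::key_value_metadata):

    import numpy as np
    import pandas as pd

    ts = pd.Series(pd.to_datetime(["2016-06-27 15:16:00.123456789"]))

    # Option 1: cast datetime64[ns] down to microseconds before writing
    # TIMESTAMP_MICROS; sub-microsecond digits are truncated, so check
    # whether the round trip is lossless before doing it silently.
    micros = ts.values.astype("datetime64[us]")
    lossless = bool((micros.astype("datetime64[ns]") == ts.values).all())

    # Option 2: store the raw nanosecond int64 values and record the unit
    # (and, per the time zone discussion, UTC) in bespoke key-value
    # metadata so that an Arrow/pandas reader can reconstruct the column.
    raw_int64 = ts.values.view("int64")
    key_value_metadata = {"unit": "ns", "timezone": "UTC"}  # illustrative keys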
>> On Tue, Jun 21, 2016 at 1:40 PM, Uwe Korn <uw...@xhochy.com> wrote:
>>> Hello,
>>>
>>> in addition to categoricals, we are also missing a conversion from
>>> timestamps in Pandas/NumPy to Arrow at the moment. Currently we only
>>> have two (exact) resolutions for them: DATE for days and TIMESTAMP
>>> for milliseconds. As
>>> https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html notes,
>>> there are several more. We do not need to cater for all of them, but
>>> at least for some. Therefore I have the following questions, which I
>>> would like to have resolved in some form before implementing:
>>>
>>> * Do we want to cater for other resolutions?
>>> * If we do not provide, e.g., nanosecond resolution (sadly the
>>> default in Pandas), do we cast with precision loss to the nearest
>>> match? Or should we force the user to do it?
>>> * Not so important for me at the moment: do we want to support time
>>> zones?
>>>
>>> My current objective is to have them for Parquet file writing. Sadly,
>>> Parquet has the same limitations. So the two main options seem to be:
>>>
>>> * "a roundtrip will only yield the correct timezone and logical type
>>> if we read with Arrow/Pandas again (as we use 'proprietary' metadata
>>> to encode it)", or
>>> * "we restrict ourselves to milliseconds and days as resolutions"
>>> (for the latter option, we need to decide how graceful we want to be
>>> in the Pandas<->Arrow conversion).
>>>
>>> A further datatype that we do not yet have in Arrow, but which partly
>>> exists in Parquet, is timedelta (INTERVAL in Parquet). Probably we
>>> need to add another logical type to Arrow to implement it. I'm open
>>> for suggestions here, too.
>>>
>>> Also, in the Arrow spec there is TIME, which seems to be the same as
>>> TIMESTAMP (as far as the comments in the C++ code go). Is there maybe
>>> some distinction I'm missing?
>>>
>>> Cheers
>>>
>>> Uwe
>>>
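One way Uwe's second question above ("do we cast with precision loss ... or should we force the user?") could be handled is to refuse a lossy cast unless the caller explicitly opts in. A sketch with a hypothetical helper, not an actual Arrow function:

    import numpy as np

    def cast_timestamps(values, unit, allow_precision_loss=False):
        # Cast a datetime64 array to a coarser unit; raise instead of
        # silently truncating unless the caller explicitly opts in.
        target = values.astype("datetime64[%s]" % unit)
        if not allow_precision_loss:
            if not (target.astype(values.dtype) == values).all():
                raise ValueError("cast to '%s' would lose precision" % unit)
        return target

    ns = np.array(["2016-06-21T13:40:00.000000001"], dtype="datetime64[ns]")
    cast_timestamps(ns, "ms", allow_precision_loss=True)  # truncates the 1 ns
    # cast_timestamps(ns, "ms")  # raises ValueError: would lose precision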