hi Uwe, Thanks for bringing this up. So far we've largely been skirting the "Logical Types Rabbit Hole", but it would be good to start a document collecting requirements for various logical types (e.g. timestamps) so that we can attempt to achieve good solutions on the first try based on the experiences (good and bad) of other projects.
In the IPC flatbuffers metadata spec that we drafted for discussion / prototype implementation earlier this year [1], we do have a Timestamp logical type containing only an optional timezone field [2]. Contrast this with Feather (which uses Arrow's physical memory layout, but custom metadata to suit Python/R needs), which has both a unit and a timezone [3].

Since there is little consensus on the units of timestamps (there is more consensus around the UNIX 1970-01-01 epoch, but not even 100% uniformity), I believe the best route would be to add a unit to the metadata to indicate second through nanosecond resolution. The same goes for a Time type. For example, Parquet 2.0 has both milliseconds and microseconds, but earlier versions of Parquet don't have this at all [4]. Other systems like Hive and Impala rely on their own table metadata to convert back and forth (e.g. embedding timestamps of whatever resolution in int64 or int96).

For pandas users who want to use Parquet files (via Arrow) in their workflow, we're stuck with a couple of options:

1) Drop sub-microsecond nanos and store timestamps as TIMESTAMP_MICROS (or MILLIS? Not all Parquet readers may be aware of the new microsecond ConvertedType)

2) Store nanosecond timestamps as INT64 and add a bespoke entry to ColumnMetaData::key_value_metadata (it's better than nothing?)

I see use cases for both of these -- for Option 1, you may care about interoperability with another system that uses Parquet; for Option 2, you may care about preserving the fidelity of your pandas data. Realistically, #1 seems like the best default option, and it makes sense to offer #2 as an option. (A rough sketch of what each option would mean for nanosecond pandas data follows below the quoted message.)

I don't think addressing time zones in the first pass is strictly necessary, but as long as we store timestamps as UTC, we can also put the time zone in the KeyValue metadata.

I'm not sure about the Interval type -- let's create a JIRA and tackle that in a separate discussion. I agree that it merits inclusion as a logical type, but I'm not sure what storage representation makes the most sense (e.g. it is not clear to me why Parquet does not store the interval as an absolute number of milliseconds; perhaps to accommodate month-based intervals, which may have different absolute lengths depending on where they start).

Let me know what you think, and if others have thoughts I'd be interested too.

thanks,
Wes

[1]: https://github.com/apache/arrow/blob/master/format/Message.fbs
[2]: https://github.com/apache/arrow/blob/master/format/Message.fbs#L51
[3]: https://github.com/wesm/feather/blob/master/cpp/src/feather/metadata.fbs#L78
[4]: https://github.com/apache/parquet-format/blob/parquet-format-2.0.0/src/thrift/parquet.thrift

On Tue, Jun 21, 2016 at 1:40 PM, Uwe Korn <[email protected]> wrote:
> Hello,
>
> in addition to categoricals, we are also missing at the moment a conversion from
> timestamps in Pandas/NumPy to Arrow. Currently we only have two (exact)
> resolutions for them: DATE for days and TIMESTAMP for milliseconds. As
> https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html notes, there
> are several more. We do not need to cater for all of them, but at least for some.
> Therefore I have the following questions which I would like to have solved in
> some form before implementing:
>
> * Do we want to cater for other resolutions?
> * If we do not provide, e.g., nanosecond resolution (sadly the default
>   in Pandas), do we cast with precision loss to the nearest match? Or
>   should we force the user to do it?
> * Not so important for me at the moment: Do we want to support time zones?
>
> My current objective is to have them for Parquet file writing. Sadly this
> has the same limitations, so the two main options seem to be:
>
> * "roundtrip will only yield the correct timezone and logical type if we
>   read with Arrow/Pandas again (as we use "proprietary" metadata to
>   encode it)"
> * "we restrict ourselves to milliseconds and days as resolutions" (for the
>   latter option, we need to decide how graceful we want to be in the
>   Pandas<->Arrow conversion).
>
> A further datatype that we do not yet have in Arrow, but which partly exists in
> Parquet, is timedelta (INTERVAL in Parquet). Probably we need to add another
> logical type to Arrow to implement it. Open for suggestions here, too.
>
> Also, in the Arrow spec there is TIME, which seems to be the same as TIMESTAMP
> (as far as the comments in the C++ code go). Is there maybe some
> distinction I'm missing?
>
> Cheers
>
> Uwe
>
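P.S. To make the trade-off above concrete, here is a rough, illustrative sketch (in plain NumPy/pandas terms, not Arrow or Parquet code) of what the two options would mean for nanosecond data coming from pandas. The key/value metadata key name is made up purely for illustration:

    import numpy as np
    import pandas as pd

    # nanosecond-resolution timestamps, the pandas default
    ts = pd.to_datetime(['2016-06-21 13:40:00.123456789',
                         '2016-06-21 13:40:00.123456000'])
    ns = np.asarray(ts)                        # dtype: datetime64[ns]

    # Option 1: truncate to microseconds before writing TIMESTAMP_MICROS,
    # then check whether the cast actually discarded any information
    us = ns.astype('datetime64[us]')
    lossless = bool((us.astype('datetime64[ns]') == ns).all())   # False here

    # Option 2: keep the raw int64 nanoseconds since the 1970-01-01 UTC epoch
    # and record their meaning in the file's key/value metadata
    raw = ns.astype('int64')
    key_value_metadata = {'pandas_type': 'datetime64[ns]'}       # illustrative key only

Option 1 is what a default writer could do (perhaps warning or raising when lossless is False), while Option 2 preserves fidelity but is only meaningful to readers that know to look for the extra metadata entry.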
