I've been looking at the proposed date and time logical types and I have
a few questions. Here are the proposed logical types:
Date and Time:
* date: truncated julian day (int32)
* time_milli: (milliseconds since midnight) int32
* time_micro: (microseconds since midnight) int64
* interval (proposed 12 byte, Nong to review)
Timestamps: always stored as epoch time (units since UTC Jan 1, 1970).
Can also be annotated with ISO time zone string in footer.
* timestamp_milli int64 (milliseconds)
* timestamp_micro int64 (microseconds)
Could we use the same epoch for both date and the timestamps? I think it
will get confusing for implementations if we use the Julian epoch for dates
and the Unix epoch for timestamps. I like that the proposed timestamp_milli
is familiar to most Java developers because both java.util.Date and
Joda's Instant are backed by it. Would the Unix epoch work for date?
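
To make the epoch question concrete, here is a rough Python sketch of my
own (not part of the proposal). It assumes "julian day" means the integer
Julian Day Number, which is 2440588 for Jan 1, 1970; if the proposal means
a different truncated variant, the offset changes. All function names are
made up for illustration.

```python
from datetime import date, timedelta

MILLIS_PER_DAY = 86_400_000
UNIX_EPOCH = date(1970, 1, 1)
# Assumption: integer Julian Day Number of 1970-01-01.
JDN_OF_UNIX_EPOCH = 2_440_588


def unix_days_to_date(days):
    """Interpret an int32 'date' value as days since the Unix epoch."""
    return UNIX_EPOCH + timedelta(days=days)


def timestamp_milli_to_unix_days(timestamp_milli):
    """With a shared epoch, a date is just a floor division of a timestamp."""
    return timestamp_milli // MILLIS_PER_DAY


def unix_days_to_julian_day(days):
    """With a Julian-based date, every conversion needs this extra offset."""
    return days + JDN_OF_UNIX_EPOCH


def julian_day_to_unix_days(julian_day):
    return julian_day - JDN_OF_UNIX_EPOCH
```

With a Unix-epoch date, going from timestamp_milli to date is a single
floor division; with a Julian-based date, readers and writers also have
to agree on the offset constant.
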
Why is the maximum precision microseconds? Both previous proposals
used nanoseconds instead. The gain seems to be that timestamp_micro fits
in an int64, but it also means that time_micro needs only 37 bits, so it
uses just 5 bits of the extra 4 bytes it takes over time_milli.
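
The bit counts can be checked with a few lines of Python. This is only
back-of-the-envelope arithmetic; the names are mine and the year length
of 365.25 days is an approximation.

```python
SECONDS_PER_DAY = 86_400


def bits_needed(max_value):
    """Number of bits needed to store a non-negative integer."""
    return max_value.bit_length()


# Largest time-of-day value at each precision (one unit before midnight).
millis_bits = bits_needed(SECONDS_PER_DAY * 1_000 - 1)          # 27 bits
micros_bits = bits_needed(SECONDS_PER_DAY * 1_000_000 - 1)      # 37 bits
nanos_bits = bits_needed(SECONDS_PER_DAY * 1_000_000_000 - 1)   # 47 bits

# Bits of the second 4-byte word that time_micro actually uses.
extra_bits_used = micros_bits - 32                               # 5 bits

# Approximate range of a signed int64 count of nanoseconds, in years
# on either side of the epoch.
int64_nanos_years = 2**63 / 1e9 / SECONDS_PER_DAY / 365.25
```

So a time of day in nanoseconds would still fit in an int64 (47 bits),
while a timestamp in nanoseconds stored as a signed int64 would cover
roughly 292 years on either side of 1970.
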
One solution I'd like to consider is what Apache Phoenix does. Phoenix
stores a nanosecond offset within the millisecond (20 bits) in a separate
4 bytes. This would let us ignore the nanoseconds in some cases, such as
most comparisons in filters. A time with this offset would take no more
space than the time_micro type; the timestamp equivalent would need
another 4 bytes (12 in total), but you'd get nanosecond precision.
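
Here is a minimal sketch of what such a split representation could look
like. It is an illustration under my own assumptions, not Phoenix's
actual encoding: the byte order, field order, and function names are all
hypothetical.

```python
import struct

NANOS_PER_MILLI = 1_000_000

# Hypothetical layouts: big-endian, value first, then the nano offset.
TIMESTAMP_FORMAT = ">qI"  # int64 millis + uint32 nano offset = 12 bytes
TIME_FORMAT = ">iI"       # int32 millis + uint32 nano offset = 8 bytes


def split_nanos(total_nanos):
    """Split a nanosecond count into (millis, nano offset in [0, 1e6))."""
    return divmod(total_nanos, NANOS_PER_MILLI)


def pack_timestamp(millis, nano_offset):
    if not 0 <= nano_offset < NANOS_PER_MILLI:
        raise ValueError("nano offset must be within one millisecond")
    return struct.pack(TIMESTAMP_FORMAT, millis, nano_offset)


def unpack_timestamp(buf):
    return struct.unpack(TIMESTAMP_FORMAT, buf)


def timestamp_millis_only(buf):
    """Read just the millisecond part, e.g. for a coarse filter."""
    return struct.unpack_from(">q", buf)[0]


def pack_time(millis_of_day, nano_offset):
    if not 0 <= nano_offset < NANOS_PER_MILLI:
        raise ValueError("nano offset must be within one millisecond")
    return struct.pack(TIME_FORMAT, millis_of_day, nano_offset)
```

A filter that only cares about millisecond granularity can call
something like timestamp_millis_only and never touch the last 4 bytes;
only an exact comparison needs to look at the offset.
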
Another win for the 4-byte nanosecond offset is that we are more likely
to be able to use the same representation in the HBase type spec, since
Phoenix already has data using this type and will almost certainly
require nanosecond precision.
Thanks for putting together this proposal. It's great to see how close
it is to being finished.
rb
--
Ryan Blue
Software Engineer
Cloudera, Inc.