Re: Timestamps with different precision / Timedeltas

Donald Foss Fri, 30 Sep 2016 12:21:06 -0700

+1 for nano or milli, or something else? 

TL;DR;


epochMilli++

—

Wes, the hierarchy is eminently reasonable, so +1 from me for that.  Regarding 
your aside, I am also a fan of the 
http://speleotrove.com/decimal/decarith.html specification, though I must 
admit I am biased simply because it addresses the Rexx Lost Digits condition.

The most commonly used timestamps I see are stored as epoch milliseconds, or 
epochMillis.  They may not be canonical, but many billions of devices and 
software applications use them.

To support extremely fine-grained DateTime representations, particularly in 
common scientific applications, I’m for _epochNano_, with logical casting to 
work with existing datasets that are in epochMilli instead.  A signed int64 of 
nanoseconds rolls over about 292 years after the epoch (around the year 2262), 
so we can deal with that later; microseconds would last roughly 300k years.
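
For a rough sense of the numbers, here is a minimal Python sketch of the 
epochMilli-to-epochNano cast and of how far a signed int64 reaches at each 
resolution.  The helper names (millis_to_nanos, int64_range_years) are made up 
for illustration and are not part of any Arrow API; the year length assumed is 
the 365.2425-day Gregorian average.

```python
INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)

# Ticks per second for each candidate resolution.
TICKS_PER_SECOND = {"second": 1, "milli": 10**3, "micro": 10**6, "nano": 10**9}

SECONDS_PER_YEAR = 365.2425 * 86400  # assumption: average Gregorian year


def millis_to_nanos(epoch_milli: int) -> int:
    """Cast epochMilli to epochNano, refusing values that overflow int64."""
    epoch_nano = epoch_milli * 10**6
    if not INT64_MIN <= epoch_nano <= INT64_MAX:
        raise OverflowError(f"{epoch_milli} ms does not fit in int64 nanoseconds")
    return epoch_nano


def int64_range_years(unit: str) -> float:
    """Years representable on either side of the epoch for a signed int64."""
    return INT64_MAX / TICKS_PER_SECOND[unit] / SECONDS_PER_YEAR


if __name__ == "__main__":
    # An example epochMilli value from late September 2016.
    print(millis_to_nanos(1475263266000))
    for unit in TICKS_PER_SECOND:
        print(f"{unit:>6}: about {int64_range_years(unit):.3g} years")
```

By this arithmetic a signed int64 covers about 292 years either side of 1970 
at nanosecond resolution and about 292 thousand years at microsecond 
resolution.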

While I personally would prefer assigning 0 as 2000-01-01T00:00:00.00Z, I doubt 
it will ever happen. No, I’m not a millennial.

My only concern is the use of a 64-bit logical DateTime at the small-scale 
physics level.  For that use case, UT2 is more appropriate; measurements are 
frequently in fractions of nanoseconds.  Perhaps there could be a way to 
logically cast a signed int96, which is supported by Parquet.

Timestamp [logical type]
extends FixedDecimal [logical type] (int64)
extends FixedWidth [physical type] byteArray[8]

Timestamp96 [logical type]
extends FixedDecimal [logical type] (int96)
extends FixedWidth [physical type] byteArray[12]
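
As a sketch of what the 12-byte physical layout could look like, assuming a 
plain two's-complement signed 96-bit integer in little-endian byte order.  
Both the byte order and the helper names (pack_int96, unpack_int96) are 
assumptions for illustration; nothing in this thread fixes them.

```python
INT96_MIN = -(2**95)
INT96_MAX = 2**95 - 1


def pack_int96(value: int) -> bytes:
    """Encode a signed 96-bit integer as 12 little-endian bytes."""
    if not INT96_MIN <= value <= INT96_MAX:
        raise OverflowError(f"{value} does not fit in a signed int96")
    return value.to_bytes(12, byteorder="little", signed=True)


def unpack_int96(raw: bytes) -> int:
    """Decode 12 little-endian bytes back into a signed integer."""
    if len(raw) != 12:
        raise ValueError(f"expected 12 bytes, got {len(raw)}")
    return int.from_bytes(raw, byteorder="little", signed=True)


if __name__ == "__main__":
    # A tick count well beyond what int64 can hold.
    ticks = 2**70 + 12345
    raw = pack_int96(ticks)
    print(len(raw), unpack_int96(raw) == ticks)
```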

—

Although inappurtenant to this specific discussion, I would like to see a 
standardized DateTime specification that uses a signed int64 as the decimal 
epochSecond and an unsigned int96 as the fractional representation of a second.

TimestampHiggs [logical type]
extends FixedDecimal [logical type] [(int64), (uint96)] :: join()ing of 2 
columns, the fixed decimal epochSecond and the fractional second as (n/2^96).
extends FixedWidth [physical type] byteArray[8], byteArray[12]
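
To make the two-column idea concrete, here is a minimal sketch that joins a 
signed int64 epochSecond with an unsigned 96-bit numerator n into the exact 
value epochSecond + n/2^96.  The function names and the little-endian byte 
order are assumptions for illustration only, and the fraction is taken to 
always count forward from the whole second (so -1 plus one half is -0.5).

```python
from fractions import Fraction
from typing import Tuple

FRACTION_DENOMINATOR = 2**96  # the fractional column counts units of 1/2^96 s


def higgs_to_fraction(epoch_second: int, frac: int) -> Fraction:
    """Join the two columns into one exact rational number of seconds."""
    if not -(2**63) <= epoch_second < 2**63:
        raise OverflowError("epochSecond does not fit in a signed int64")
    if not 0 <= frac < FRACTION_DENOMINATOR:
        raise OverflowError("fraction does not fit in an unsigned int96")
    return Fraction(epoch_second) + Fraction(frac, FRACTION_DENOMINATOR)


def higgs_pack(epoch_second: int, frac: int) -> Tuple[bytes, bytes]:
    """Encode the pair as byteArray[8] and byteArray[12] (little-endian)."""
    return (
        epoch_second.to_bytes(8, byteorder="little", signed=True),
        frac.to_bytes(12, byteorder="little", signed=False),
    )


if __name__ == "__main__":
    # One and a half seconds after the epoch: n = 2^95 is exactly one half.
    print(higgs_to_fraction(1, 2**95))
    sec_bytes, frac_bytes = higgs_pack(1, 2**95)
    print(len(sec_bytes), len(frac_bytes))
```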

—Donald

> On Sep 29, 2016, at 7:07 PM, Jacques Nadeau <jacq...@apache.org> wrote:
> 
> +1
> 
> On Thu, Sep 29, 2016 at 3:19 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> 
>> hello,
>> 
>> For the current iteration of Arrow, can we agree to support int64 UNIX
>> timestamps with a particular resolution (second through nanosecond),
>> as these are reasonably common representations? We can look to expand
>> later if it is needed.
>> 
>> Thanks
>> Wes
>> 
>> On Mon, Aug 15, 2016 at 4:12 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>>> Bumping this discussion. As part of finalizing a v1 Arrow spec (for
>>> purposes of moving data between systems, at minimum) we should propose
>>> timestamp metadata and physical memory representation that maximizes
>>> interoperability with other systems. It seems like a fixed decimal
>>> would meet this requirement as UNIX-like timestamps at some resolution
>>> could pass unmodified with appropriate metadata.
>>> 
>>> We will also need decimal types in Arrow (at least to accommodate
>>> common database representations and file formats like Parquet), so
>>> this seems like a reasonable potential hierarchy of types:
>>> 
>>> Timestamp [logical type]
>>> extends FixedDecimal [logical type]
>>> extends FixedWidth [physical type]
>>> 
>>> I did a bit of internet searching but did not find a canonical
>>> reference or implementation of fixed decimals; that would be helpful.
>>> 
>>> As an aside: for floating decimal numbers for numerical data we could
>>> utilize an implementation like http://www.bytereef.org/mpdecimal/
>>> which implements the spec described at
>>> http://speleotrove.com/decimal/decarith.html
>>> 
>>> Thanks
>>> Wes
>>> 
>>> On Thu, Jul 14, 2016 at 8:18 AM, Alex Samuel <a...@alexsamuel.net> wrote:
>>>> Hi all,
>>>> 
>>>> May I suggest that instead of fixed-point decimals, you consider a more
>>>> general fixed-denominator rational representation, for times and other
>>>> purposes? Powers of ten are convenient for humans, but powers of two are
>>>> more efficient. For some applications, the efficiency of bit operations
>>>> over divmod is more useful than an exact representation of integral
>>>> nanoseconds.
>>>> 
>>>> std::chrono takes this approach. I'll also humbly point you at my own
>>>> date/time library, https://github.com/alexhsamuel/cron (incomplete but
>>>> basically working), which may provide ideas or useful code. It was
>>>> intended for precisely this sort of application.
>>>> 
>>>> Regards,
>>>> Alex
>>>> 
>>>> 
>>>> On Thu, Jul 14, 2016 at 10:27 AM Uwe Korn <uw...@xhochy.com> wrote:
>>>> 
>>>>> I agree that having a Decimal type for timestamps is a nice
>>>>> definition. Having your time encoded as seconds or nanoseconds should be
>>>>> the same as having a scale of the respective amount. But I would rather
>>>>> avoid having a separate decimal physical type. Therefore I'd prefer the
>>>>> Parquet approach where decimal is only a logical type and backed by
>>>>> either a bytearray, int32 or int64.
>>>>> 
>>>>> Thus a more general timestamp could look like:
>>>>> 
>>>>> * Decimals are logical types, physical types are the same as defined in
>>>>> Parquet [1]
>>>>> * Base unit for timestamps is seconds; you can get milliseconds and
>>>>> nanoseconds by using a different scale. (Note that seconds and so on
>>>>> are all powers of ten, thus matching the specification of decimal scale
>>>>> really well.)
>>>>> * Timestamp is just another logical type that is referring to Decimal
>>>>> (and optionally may have a timezone) and signalling that we have a Time
>>>>> and not just a "simple" decimal.
>>>>> * For a first iteration, I would assume no timezone or UTC but not
>>>>> include a metadata field. Once we're sure the implementation works, we
>>>>> can add metadata about it.
>>>>> 
>>>>> Timedeltas could be addressed in a similar way, just without the need
>>>>> for a timezone.
>>>>> 
>>>>> For my usages, I don't have the use-case for a larger than int64
>>>>> timestamp and would like to have it exactly as such in my computation,
>>>>> thus my preference for the Parquet way.
>>>>> 
>>>>> Uwe
>>>>> 
>>>>> [1]
>>>>> 
>>>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
>>>>> 
>>>>> On 13.07.16 03:06, Julian Hyde wrote:
>>>>>> I'm talking about a fixed decimal type, not floating decimal. (Oracle
>>>>>> numbers are floating decimal. They have a few nice properties, but
>>>>>> they are variable width and can get quite large. I've seen one or two
>>>>>> systems that started with binary floating point numbers, which are
>>>>>> much worse for business computing, and then changed to Java BigDecimal,
>>>>>> which gives the right answer but is horribly inefficient.)
>>>>>> 
>>>>>> A fixed decimal type has virtually zero computational overhead. It
>>>>>> just has a piece of metadata saying something like "every value in
>>>>>> this field is multiplied by 1 million" and leaves it to the client
>>>>>> program to do that multiplying.
>>>>>> 
>>>>>> My advice is to create a good fixed decimal type and lean on it
>>>>>> heavily.
>>>>>> 
>>>>>> Julian
>>>>>> 
>>>>> 
>>>>> 
>>
