Here is a PR for the change in timestamp:
https://github.com/apache/arrow/pull/156

We should also clarify Date:
 https://issues.apache.org/jira/browse/ARROW-316

On Mon, Oct 3, 2016 at 3:23 PM, Julien Le Dem <jul...@dremio.com> wrote:

> I created a JIRA for the Timestamp type if you want to comment in it:
> https://issues.apache.org/jira/browse/ARROW-315
>
> On Mon, Oct 3, 2016 at 3:16 PM, Julien Le Dem <jul...@dremio.com> wrote:
>
>> consistency with Parquet is a +
>> Parquet supports timestamp millis and micros (no nanos)
>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#datetime-types
>>
>> currently Arrow timestamps have a timezone field.
>> https://github.com/apache/arrow/blob/master/format/Message.fbs#L67
>> Wes: regarding your suggestion do we want to change timestamp as follows?
>> - remove "timezone" field and say it's UTC
>> - add unit field (MICROS | MILLIS)
>>
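A minimal Python sketch of what the proposed change could look like: a timestamp type that carries only a unit, with values read as int64 ticks since the UNIX epoch in UTC. The names TimeUnit, TimestampType, to_datetime and convert are hypothetical and chosen for illustration; they are not taken from Message.fbs or any Arrow API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class TimeUnit(Enum):
    # Ticks per second for each unit named in the proposal.
    MILLIS = 1_000
    MICROS = 1_000_000


@dataclass(frozen=True)
class TimestampType:
    """Hypothetical metadata: int64 ticks since the UNIX epoch, always UTC."""
    unit: TimeUnit


EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_datetime(value: int, ts_type: TimestampType) -> datetime:
    """Interpret a raw int64 value according to the unit of its type."""
    micros_per_tick = 1_000_000 // ts_type.unit.value
    return EPOCH + timedelta(microseconds=value * micros_per_tick)


def convert(value: int, src: TimeUnit, dst: TimeUnit) -> int:
    """Rescale a raw value between units (floors when precision is lost)."""
    return value * dst.value // src.value
```

With no timezone field, two values with different units compare correctly once both are rescaled to the finer unit with convert.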
>>
>>
>> On Fri, Sep 30, 2016 at 12:20 PM, Donald Foss <donald.f...@gmail.com>
>> wrote:
>>
>>> +1 for nano or milli, or something else?
>>>
>>> TL;DR;
>>>
>>> epochMilli++
>>>
>>> —
>>>
>>> Wes, the hierarchy is eminently reasonable, so +1 from me for that.
>>> Regarding your aside, I am also a fan of the
>>> http://speleotrove.com/decimal/decarith.html specification, though I
>>> must admit I am biased simply because it addresses the Rexx Lost Digits
>>> condition.
>>>
>>> The most commonly used timestamps I see are stored as epoch
>>> milliseconds, or epochMillis.  It may not be canonical, however there are
>>> many billions of devices and software applications utilizing it.
>>>
>>> To support extremely fine grained DateTime representations, particularly
>>> in common scientific applications, I’m for _epochNano_, with logical
>>> casting to work with existing datasets that are in epochMilli instead.  We
>>> can deal with the rollover in 300k years.
>>>
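As a rough check on the rollover question, the range of a signed int64 at each resolution can be computed directly. This is only a sketch: the mean Gregorian year length is an approximation, and years_of_range is a hypothetical helper name. Under these assumptions an int64 count of nanoseconds covers roughly 292 years on either side of the epoch, while a figure near 300k years corresponds to microseconds.

```python
INT64_MAX = 2**63 - 1
SECONDS_PER_YEAR = 365.2425 * 24 * 60 * 60  # mean Gregorian year (approximation)

TICKS_PER_SECOND = {
    "second": 1,
    "milli": 10**3,
    "micro": 10**6,
    "nano": 10**9,
}


def years_of_range(ticks_per_second: int) -> float:
    """Approximate number of years representable on each side of the epoch."""
    return INT64_MAX / ticks_per_second / SECONDS_PER_YEAR


for name, tps in TICKS_PER_SECOND.items():
    print(f"{name:>6}: about {years_of_range(tps):.3g} years either side of 1970")
```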
>>> While I personally would prefer assigning 0 as 2000-01-01T00:00:00.00Z,
>>> I doubt it will ever happen. No, I’m not a millennial.
>>>
>>> My only concern is for use of 64-bit logical DateTime at the small
>>> Physics level.  For that use case, UT2 is more appropriate; measurements
>>> are frequently in fractions of nanoseconds.  Perhaps there could be a way
>>> to logically cast a signed int96, which is supported by Parquet.
>>>
>>> Timestamp [logical type]
>>> extends FixedDecimal [logical type] (int64)
>>> extends FixedWidth [physical type] byteArray[8]
>>>
>>> Timestamp96 [logical type]
>>> extends FixedDecimal [logical type] (int96)
>>> extends FixedWidth [physical type] byteArray[12]
>>>
>>> —
>>>
>>> Although inappurtenant to this specific discussion, I would like to see
>>> a standardized DateTime specification that uses a signed int64 as the
>>> decimal epochSecond and an unsigned int96 as the fractional representation
>>> of a second.
>>>
>>> TimestampHiggs [logical type]
>>> extends FixedDecimal [logical type] [(int64), (uint96)] :: join()ing of
>>> 2 columns, the fixed decimal epochSecond and the fractional second as
>>> (n/2^96).
>>> extends FixedWidth [physical type] byteArray[8], byteArray[12]
>>>
>>> —Donald
>>>
>>> > On Sep 29, 2016, at 7:07 PM, Jacques Nadeau <jacq...@apache.org>
>>> wrote:
>>> >
>>> > +1
>>> >
>>> > On Thu, Sep 29, 2016 at 3:19 PM, Wes McKinney <wesmck...@gmail.com>
>>> wrote:
>>> >
>>> >> hello,
>>> >>
>>> >> For the current iteration of Arrow, can we agree to support int64 UNIX
>>> >> timestamps with a particular resolution (second through nanosecond),
>>> >> as these are reasonably common representations? We can look to expand
>>> >> later if it is needed.
>>> >>
>>> >> Thanks
>>> >> Wes
>>> >>
>>> >> On Mon, Aug 15, 2016 at 4:12 AM, Wes McKinney <wesmck...@gmail.com>
>>> wrote:
>>> >>> Bumping this discussion. As part of finalizing a v1 Arrow spec (for
>>> >>> purposes of moving data between systems, at minimum) we should propose
>>> >>> timestamp metadata and physical memory representation that maximizes
>>> >>> interoperability with other systems. It seems like a fixed decimal
>>> >>> would meet this requirement as UNIX-like timestamps at some resolution
>>> >>> could pass unmodified with appropriate metadata.
>>> >>>
>>> >>> We will also need decimal types in Arrow (at least to accommodate
>>> >>> common database representations and file formats like Parquet), so
>>> >>> this seems like a reasonable potential hierarchy of types:
>>> >>>
>>> >>> Timestamp [logical type]
>>> >>> extends FixedDecimal [logical type]
>>> >>> extends FixedWidth [physical type]
>>> >>>
>>> >>> I did a bit of internet searching but did not find a canonical
>>> >>> reference or implementation of fixed decimals; that would be helpful.
>>> >>>
>>> >>> As an aside: for floating decimal numbers for numerical data we could
>>> >>> utilize an implementation like http://www.bytereef.org/mpdecimal/
>>> >>> which implements the spec described at
>>> >>> http://speleotrove.com/decimal/decarith.html
>>> >>>
>>> >>> Thanks
>>> >>> Wes
>>> >>>
>>> >>> On Thu, Jul 14, 2016 at 8:18 AM, Alex Samuel <a...@alexsamuel.net>
>>> >> wrote:
>>> >>>> Hi all,
>>> >>>>
>>> >>>> May I suggest that instead of fixed-point decimals, you consider a more
>>> >>>> general fixed-denominator rational representation, for times and other
>>> >>>> purposes? Powers of ten are convenient for humans, but powers of two are
>>> >>>> more efficient. For some applications, the efficiency of bit operations
>>> >>>> over divmod is more useful than an exact representation of integral
>>> >>>> nanoseconds.
>>> >>>>
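The trade-off can be sketched in Python by splitting a tick count into whole seconds and a fraction under each kind of denominator. The 30-bit fraction width is an arbitrary choice for illustration, and all function names here are hypothetical; neither std::chrono nor the cron library is being quoted.

```python
from typing import Tuple

NANOS_PER_SECOND = 10**9                  # decimal denominator
FRACTION_BITS = 30                        # binary denominator 2**30 (assumed width)
FRACTION_MASK = (1 << FRACTION_BITS) - 1


def split_decimal(ticks: int) -> Tuple[int, int]:
    """Seconds and nanoseconds: needs a division and a remainder."""
    return divmod(ticks, NANOS_PER_SECOND)


def split_binary(ticks: int) -> Tuple[int, int]:
    """Seconds and 1/2**30 fractions: a shift and a mask suffice."""
    return ticks >> FRACTION_BITS, ticks & FRACTION_MASK


def nanos_to_binary(nanos: int) -> int:
    """Convert a nanosecond fraction to 1/2**30 ticks, rounding to nearest."""
    return (nanos * (1 << FRACTION_BITS) + NANOS_PER_SECOND // 2) // NANOS_PER_SECOND
```

One tick of 1/2**30 s is about 0.93 ns, so most whole-nanosecond values are only approximated in the binary form; that is the exactness being traded for cheaper arithmetic.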
>>> >>>> std::chrono takes this approach. I'll also humbly point you at my own
>>> >>>> date/time library, https://github.com/alexhsamuel/cron (incomplete but
>>> >>>> basically working), which may provide ideas or useful code. It was
>>> >>>> intended for precisely this sort of application.
>>> >>>>
>>> >>>> Regards,
>>> >>>> Alex
>>> >>>>
>>> >>>>
>>> >>>> On Thu, Jul 14, 2016 at 10:27 AM Uwe Korn <uw...@xhochy.com> wrote:
>>> >>>>
>>> >>>>> I agree that having a Decimal type for timestamps is a nice
>>> >>>>> definition. Having your time encoded as seconds or nanoseconds should be
>>> >>>>> the same as having a scale of the respective amount. But I would rather
>>> >>>>> avoid having a separate decimal physical type. Therefore I'd prefer the
>>> >>>>> parquet approach where decimal is only a logical type and backed by
>>> >>>>> either a bytearray, int32 or int64.
>>> >>>>>
>>> >>>>> Thus a more general timestamp could look like:
>>> >>>>>
>>> >>>>> * Decimals are logical types, physical types are the same as defined in
>>> >>>>> Parquet [1]
>>> >>>>> * Base unit for timestamps is seconds, you can get milliseconds and
>>> >>>>> nanoseconds by using a different scale. (Note that seconds and so on
>>> >>>>> are all powers of ten, thus matching the specification of decimal scale
>>> >>>>> really well.)
>>> >>>>> * Timestamp is just another logical type that is referring to Decimal
>>> >>>>> (and optionally may have a timezone) and signalling that we have a Time
>>> >>>>> and not just a "simple" decimal.
>>> >>>>> * For a first iteration, I would assume no timezone or UTC but not
>>> >>>>> include a metadata field. Once we're sure the implementation works, we
>>> >>>>> can add metadata about it.
>>> >>>>>
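The "seconds plus a decimal scale" idea in the list above can be sketched as follows. The scale constants and the rescale helper are hypothetical names for illustration; the only assumption carried over from the list is that the base unit is seconds and the scale counts decimal digits after the point.

```python
# Decimal scale for each resolution when the base unit is seconds.
SCALE_SECONDS, SCALE_MILLIS, SCALE_MICROS, SCALE_NANOS = 0, 3, 6, 9


def rescale(unscaled: int, src_scale: int, dst_scale: int) -> int:
    """Re-express a decimal count of seconds at another scale."""
    if dst_scale >= src_scale:
        # Gaining digits is exact: multiply by a power of ten.
        return unscaled * 10 ** (dst_scale - src_scale)
    # Dropping digits floors the value and loses precision.
    return unscaled // 10 ** (src_scale - dst_scale)
```

The stored integer stays a plain int32 or int64; only the scale in the logical type says how to read it.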
>>> >>>>> Timedeltas could be addressed in a similar way, just without the need
>>> >>>>> for a timezone.
>>> >>>>>
>>> >>>>> For my usages, I don't have the use-case for a larger than int64
>>> >>>>> timestamp and would like to have it exactly as such in my computation,
>>> >>>>> thus my preference for the Parquet way.
>>> >>>>>
>>> >>>>> Uwe
>>> >>>>>
>>> >>>>> [1]
>>> >>>>>
>>> >>>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
>>> >>>>>
>>> >>>>> On 13.07.16 03:06, Julian Hyde wrote:
>>> >>>>>> I'm talking about a fixed decimal type, not floating decimal. (Oracle
>>> >>>>>> numbers are floating decimal. They have a few nice properties, but
>>> >>>>>> they are variable width and can get quite large. I've seen one or two
>>> >>>>>> systems that started with binary floating point numbers, which are
>>> >>>>>> much worse for business computing, and then changed to Java BigDecimal,
>>> >>>>>> which gives the right answer but is horribly inefficient.)
>>> >>>>>>
>>> >>>>>> A fixed decimal type has virtually zero computational overhead. It
>>> >>>>>> just has a piece of metadata saying something like "every value in
>>> >>>>>> this field is multiplied by 1 million" and leaves it to the client
>>> >>>>>> program to do that multiplying.
>>> >>>>>>
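A small Python sketch of that idea: the column stores plain integers, and a single piece of metadata (the scale) tells the client how to interpret them. FixedDecimalColumn and its methods are hypothetical names, and a scale of 6 stands in for the "multiplied by 1 million" case.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List


@dataclass(frozen=True)
class FixedDecimalColumn:
    """Hypothetical column: integer storage plus one scale shared by all values."""
    unscaled: List[int]
    scale: int  # number of decimal digits after the point

    def value(self, i: int) -> Decimal:
        # Shifting the decimal exponent avoids any binary floating point step.
        return Decimal(self.unscaled[i]).scaleb(-self.scale)

    def total(self) -> Decimal:
        # Sum in integer space; apply the scale once at the end.
        return Decimal(sum(self.unscaled)).scaleb(-self.scale)
```

For example, FixedDecimalColumn([1_500_000, 250_000], scale=6) holds the values 1.5 and 0.25, and the sum is computed with one integer addition before the scale is applied.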
>>> >>>>>> My advice is to create a good fixed decimal type and lean on it heavily.
>>> >>>>>>
>>> >>>>>> Julian
>>> >>>>>>
>>> >>>>>
>>> >>>>>
>>> >>
>>>
>>>
>>
>>
>> --
>> Julien
>>
>
>
>
> --
> Julien
>



-- 
Julien
