Here is a PR for the change in timestamp: https://github.com/apache/arrow/pull/156
We should also clarify Date: https://issues.apache.org/jira/browse/ARROW-316

On Mon, Oct 3, 2016 at 3:23 PM, Julien Le Dem <jul...@dremio.com> wrote:

> I created a JIRA for the Timestamp type if you want to comment in it:
> https://issues.apache.org/jira/browse/ARROW-315
>
> On Mon, Oct 3, 2016 at 3:16 PM, Julien Le Dem <jul...@dremio.com> wrote:
>
>> Consistency with Parquet is a +.
>> Parquet supports timestamp millis and micros (no nanos):
>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#datetime-types
>>
>> Currently Arrow timestamps have a timezone field:
>> https://github.com/apache/arrow/blob/master/format/Message.fbs#L67
>> Wes: regarding your suggestion, do we want to change timestamp as follows?
>> - remove the "timezone" field and say it's UTC
>> - add a unit field (MICROS | MILLIS)
>>
>> On Fri, Sep 30, 2016 at 12:20 PM, Donald Foss <donald.f...@gmail.com> wrote:
>>
>>> +1 for nano or milli, or something else?
>>>
>>> TL;DR;
>>>
>>> epochMilli++
>>>
>>> —
>>>
>>> Wes, the hierarchy is eminently reasonable, so +1 from me for that.
>>> Regarding your aside, I am also a fan of the
>>> http://speleotrove.com/decimal/decarith.html specification, though I
>>> must admit I am biased simply because it addresses the Rexx Lost Digits
>>> condition.
>>>
>>> The most commonly used timestamps I see are stored as epoch
>>> milliseconds, or epochMillis. It may not be canonical, however there are
>>> many billions of devices and software applications utilizing it.
>>>
>>> To support extremely fine grained DateTime representations, particularly
>>> in common scientific applications, I’m for _epochNano_, with logical
>>> casting to work with existing datasets that are in epochMilli instead. We
>>> can deal with the rollover in 300k years.
>>>
>>> While I personally would prefer assigning 0 as 2000-01-01T00:00:00.00Z,
>>> I doubt it will ever happen. No, I’m not a millennial.
>>>
>>> My only concern is for use of 64-bit logical DateTime at the small
>>> Physics level. For that use case, UT2 is more appropriate; measurements
>>> are frequently in fractions of nanoseconds. Perhaps there could be a way
>>> to logically cast a signed int96, which is supported by Parquet.
>>>
>>> Timestamp [logical type]
>>> extends FixedDecimal [logical type] (int64)
>>> extends FixedWidth [physical type] byteArray[8]
>>>
>>> Timestamp96 [logical type]
>>> extends FixedDecimal [logical type] (int96)
>>> extends FixedWidth [physical type] byteArray[12]
>>>
>>> —
>>>
>>> Although inappurtenant to this specific discussion, I would like to see
>>> a standardized DateTime specification that uses a signed int64 as the
>>> decimal epochSecond and an unsigned int96 as the fractional representation
>>> of a second.
>>>
>>> TimestampHiggs [logical type]
>>> extends FixedDecimal [logical type] [(int64), (uint96)] :: join()ing of
>>> 2 columns, the fixed decimal epochSecond and the fractional second as
>>> (n/2^96).
>>> extends FixedWidth [physical type] byteArray[8], byteArray[12]
>>>
>>> —Donald
>>>
>>> > On Sep 29, 2016, at 7:07 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>>> >
>>> > +1
>>> >
>>> > On Thu, Sep 29, 2016 at 3:19 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>> >
>>> >> hello,
>>> >>
>>> >> For the current iteration of Arrow, can we agree to support int64 UNIX
>>> >> timestamps with a particular resolution (second through nanosecond),
>>> >> as these are reasonably common representations? We can look to expand
>>> >> later if it is needed.
>>> >>
>>> >> Thanks
>>> >> Wes
>>> >>
>>> >> On Mon, Aug 15, 2016 at 4:12 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>>> >>> Bumping this discussion. As part of finalizing a v1 Arrow spec (for
>>> >>> purposes of moving data between systems, at minimum) we should propose
>>> >>> timestamp metadata and physical memory representation that maximizes
>>> >>> interoperability with other systems.
>>> >>> It seems like a fixed decimal would meet this requirement as
>>> >>> UNIX-like timestamps at some resolution could pass unmodified with
>>> >>> appropriate metadata.
>>> >>>
>>> >>> We will also need decimal types in Arrow (at least to accommodate
>>> >>> common database representations and file formats like Parquet), so
>>> >>> this seems like a reasonable potential hierarchy of types:
>>> >>>
>>> >>> Timestamp [logical type]
>>> >>> extends FixedDecimal [logical type]
>>> >>> extends FixedWidth [physical type]
>>> >>>
>>> >>> I did a bit of internet searching but did not find a canonical
>>> >>> reference or implementation of fixed decimals; that would be helpful.
>>> >>>
>>> >>> As an aside: for floating decimal numbers for numerical data we could
>>> >>> utilize an implementation like http://www.bytereef.org/mpdecimal/
>>> >>> which implements the spec described at
>>> >>> http://speleotrove.com/decimal/decarith.html
>>> >>>
>>> >>> Thanks
>>> >>> Wes
>>> >>>
>>> >>> On Thu, Jul 14, 2016 at 8:18 AM, Alex Samuel <a...@alexsamuel.net> wrote:
>>> >>>> Hi all,
>>> >>>>
>>> >>>> May I suggest that instead of fixed-point decimals, you consider a more
>>> >>>> general fixed-denominator rational representation, for times and other
>>> >>>> purposes? Powers of ten are convenient for humans, but powers of two
>>> >>>> more efficient. For some applications, the efficiency of bit operations
>>> >>>> over divmod is more useful than an exact representation of integral
>>> >>>> nanoseconds.
>>> >>>>
>>> >>>> std::chrono takes this approach. I'll also humbly point you at my own
>>> >>>> date/time library, https://github.com/alexhsamuel/cron (incomplete but
>>> >>>> basically working), which may provide ideas or useful code. It was
>>> >>>> intended for precisely this sort of application.
>>> >>>>
>>> >>>> Regards,
>>> >>>> Alex
>>> >>>>
>>> >>>> On Thu, Jul 14, 2016 at 10:27 AM Uwe Korn <uw...@xhochy.com> wrote:
>>> >>>>
>>> >>>>> I agree that having a Decimal type for timestamps is a nice
>>> >>>>> definition. Having your time encoded as seconds or nanoseconds should
>>> >>>>> be the same as having a scale of the respective amount. But I would
>>> >>>>> rather avoid having a separate decimal physical type. Therefore I'd
>>> >>>>> prefer the Parquet approach where decimal is only a logical type and
>>> >>>>> backed by either a bytearray, int32 or int64.
>>> >>>>>
>>> >>>>> Thus a more general timestamp could look like:
>>> >>>>>
>>> >>>>> * Decimals are logical types, physical types are the same as defined
>>> >>>>> in Parquet [1]
>>> >>>>> * Base unit for timestamps is seconds, you can get milliseconds and
>>> >>>>> nanoseconds by using a different scale. (Note that seconds and so on
>>> >>>>> are all powers of ten, thus matching the specification of decimal
>>> >>>>> scale really well.)
>>> >>>>> * Timestamp is just another logical type that is referring to Decimal
>>> >>>>> (and optionally may have a timezone) and signalling that we have a
>>> >>>>> Time and not just a "simple" decimal.
>>> >>>>> * For a first iteration, I would assume no timezone or UTC but not
>>> >>>>> include a metadata field. Once we're sure the implementation works,
>>> >>>>> we can add metadata about it.
>>> >>>>>
>>> >>>>> Timedeltas could be addressed in a similar way, just without the need
>>> >>>>> for a timezone.
>>> >>>>>
>>> >>>>> For my usages, I don't have the use-case for a larger than int64
>>> >>>>> timestamp and would like to have it exactly as such in my computation,
>>> >>>>> thus my preference for the Parquet way.
>>> >>>>>
>>> >>>>> Uwe
>>> >>>>>
>>> >>>>> [1]
>>> >>>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
>>> >>>>>
>>> >>>>> On 13.07.16 03:06, Julian Hyde wrote:
>>> >>>>>> I'm talking about a fixed decimal type, not floating decimal. (Oracle
>>> >>>>>> numbers are floating decimal. They have a few nice properties, but
>>> >>>>>> they are variable width and can get quite large. I've seen one or two
>>> >>>>>> systems that started with binary floating point numbers, which are
>>> >>>>>> much worse for business computing, and then change to Java
>>> >>>>>> BigDecimal, which gives the right answer but is horribly inefficient.)
>>> >>>>>>
>>> >>>>>> A fixed decimal type has virtually zero computational overhead. It
>>> >>>>>> just has a piece of metadata saying something like "every value in
>>> >>>>>> this field is multiplied by 1 million" and leaves it to the client
>>> >>>>>> program to do that multiplying.
>>> >>>>>>
>>> >>>>>> My advice is to create a good fixed decimal type and lean on it
>>> >>>>>> heavily.
>>> >>>>>>
>>> >>>>>> Julian
>>> >>>>>>
>>
>> --
>> Julien
>
> --
> Julien

--
Julien
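
For anyone reading the thread later, here is a minimal sketch of the idea the quoted messages converge on: an int64 offset from the UNIX epoch plus a unit recorded as metadata, which is Julian's fixed decimal with a power-of-ten scale. The names `TimeUnit` and `convert` and the explicit int64 bounds check are assumptions made up for illustration; this is not Arrow's API or the content of the PR above.

```python
from enum import Enum


class TimeUnit(Enum):
    # Hypothetical unit metadata: value = ticks per second (a power of ten).
    SECOND = 1
    MILLI = 1_000
    MICRO = 1_000_000
    NANO = 1_000_000_000


INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def convert(ticks: int, src: TimeUnit, dst: TimeUnit) -> int:
    """Re-express an epoch offset counted in `src` ticks as `dst` ticks."""
    if dst.value >= src.value:
        # Finer (or equal) unit: exact, just multiply by the ratio.
        out = ticks * (dst.value // src.value)
    else:
        # Coarser unit: floor division, so pre-epoch values round toward -inf.
        out = ticks // (src.value // dst.value)
    if not INT64_MIN <= out <= INT64_MAX:
        raise OverflowError("value does not fit in int64 at the target unit")
    return out


if __name__ == "__main__":
    # Seconds to milliseconds is lossless; milliseconds to seconds truncates.
    print(convert(1475533380, TimeUnit.SECOND, TimeUnit.MILLI))
    print(convert(1500, TimeUnit.MILLI, TimeUnit.SECOND))
```

The stored values never change unless a consumer asks for a different unit, which is the "virtually zero computational overhead" property mentioned above. One thing the bounds check makes visible: an int64 count of nanoseconds covers only about 292 years on either side of the epoch, so the choice of unit also decides the representable range.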