I created a JIRA for the Timestamp type, if you want to comment on it: https://issues.apache.org/jira/browse/ARROW-315
On Mon, Oct 3, 2016 at 3:16 PM, Julien Le Dem <jul...@dremio.com> wrote:
> consistency with Parquet is a +
> Parquet supports timestamp millis and micros (no nanos)
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#datetime-types
>
> currently Arrow timestamps have a timezone field.
> https://github.com/apache/arrow/blob/master/format/Message.fbs#L67
> Wes: regarding your suggestion, do we want to change timestamp as follows?
> - remove the "timezone" field and say it's UTC
> - add a unit field (MICROS | MILLIS)
>
> On Fri, Sep 30, 2016 at 12:20 PM, Donald Foss <donald.f...@gmail.com> wrote:
>
>> +1 for nano or milli, or something else?
>>
>> TL;DR;
>>
>> epochMilli++
>>
>> --
>>
>> Wes, the hierarchy is eminently reasonable, so +1 from me for that.
>> Regarding your aside, I am also a fan of the
>> http://speleotrove.com/decimal/decarith.html specification, though I
>> must admit I am biased simply because it addresses the Rexx Lost Digits
>> condition.
>>
>> The most commonly used timestamps I see are stored as epoch milliseconds,
>> or epochMillis. It may not be canonical, however there are many billions
>> of devices and software applications utilizing it.
>>
>> To support extremely fine grained DateTime representations, particularly
>> in common scientific applications, I’m for _epochNano_, with logical
>> casting to work with existing datasets that are in epochMilli instead. We
>> can deal with the rollover in 300k years.
>>
>> While I personally would prefer assigning 0 as 2000-01-01T00:00:00.00Z, I
>> doubt it will ever happen. No, I’m not a millennial.
>>
>> My only concern is for use of 64-bit logical DateTime at the small
>> Physics level. For that use case, UT2 is more appropriate; measurements
>> are frequently in fractions of nanoseconds. Perhaps there could be a way
>> to logically cast a signed int96, which is supported by Parquet.
>>
>> Timestamp [logical type]
>>   extends FixedDecimal [logical type] (int64)
>>   extends FixedWidth [physical type] byteArray[8]
>>
>> Timestamp96 [logical type]
>>   extends FixedDecimal [logical type] (int96)
>>   extends FixedWidth [physical type] byteArray[12]
>>
>> --
>>
>> Although inappurtenant to this specific discussion, I would like to see a
>> standardized DateTime specification that uses a signed int64 as the decimal
>> epochSecond and an unsigned int96 as the fractional representation of a
>> second.
>>
>> TimestampHiggs [logical type]
>>   extends FixedDecimal [logical type] [(int64), (uint96)] :: join()ing of 2
>>   columns, the fixed decimal epochSecond and the fractional second as
>>   (n/2^96).
>>   extends FixedWidth [physical type] byteArray[8], byteArray[12]
>>
>> -- Donald
>>
>> > On Sep 29, 2016, at 7:07 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>> >
>> > +1
>> >
>> > On Thu, Sep 29, 2016 at 3:19 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>> >
>> >> hello,
>> >>
>> >> For the current iteration of Arrow, can we agree to support int64 UNIX
>> >> timestamps with a particular resolution (second through nanosecond),
>> >> as these are reasonably common representations? We can look to expand
>> >> later if it is needed.
>> >>
>> >> Thanks
>> >> Wes
>> >>
>> >> On Mon, Aug 15, 2016 at 4:12 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>> >>> Bumping this discussion. As part of finalizing a v1 Arrow spec (for
>> >>> purposes of moving data between systems, at minimum) we should propose
>> >>> timestamp metadata and physical memory representation that maximizes
>> >>> interoperability with other systems. It seems like a fixed decimal
>> >>> would meet this requirement as UNIX-like timestamps at some resolution
>> >>> could pass unmodified with appropriate metadata.
>> >>>
>> >>> We will also need decimal types in Arrow (at least to accommodate
>> >>> common database representations and file formats like Parquet), so
>> >>> this seems like a reasonable potential hierarchy of types:
>> >>>
>> >>> Timestamp [logical type]
>> >>>   extends FixedDecimal [logical type]
>> >>>   extends FixedWidth [physical type]
>> >>>
>> >>> I did a bit of internet searching but did not find a canonical
>> >>> reference or implementation of fixed decimals; that would be helpful.
>> >>>
>> >>> As an aside: for floating decimal numbers for numerical data we could
>> >>> utilize an implementation like http://www.bytereef.org/mpdecimal/
>> >>> which implements the spec described at
>> >>> http://speleotrove.com/decimal/decarith.html
>> >>>
>> >>> Thanks
>> >>> Wes
>> >>>
>> >>> On Thu, Jul 14, 2016 at 8:18 AM, Alex Samuel <a...@alexsamuel.net> wrote:
>> >>>> Hi all,
>> >>>>
>> >>>> May I suggest that instead of fixed-point decimals, you consider a more
>> >>>> general fixed-denominator rational representation, for times and other
>> >>>> purposes? Powers of ten are convenient for humans, but powers of two more
>> >>>> efficient. For some applications, the efficiency of bit operations over
>> >>>> divmod is more useful than an exact representation of integral nanoseconds.
>> >>>>
>> >>>> std::chrono takes this approach. I'll also humbly point you at my own
>> >>>> date/time library, https://github.com/alexhsamuel/cron (incomplete but
>> >>>> basically working), which may provide ideas or useful code. It was intended
>> >>>> for precisely this sort of application.
>> >>>>
>> >>>> Regards,
>> >>>> Alex
>> >>>>
>> >>>>
>> >>>> On Thu, Jul 14, 2016 at 10:27 AM Uwe Korn <uw...@xhochy.com> wrote:
>> >>>>
>> >>>>> I agree that having a Decimal type for timestamps is a nice
>> >>>>> definition. Having your time encoded as seconds or nanoseconds should be
>> >>>>> the same as having a scale of the respective amount.
>> >>>>> But I would rather
>> >>>>> avoid having a separate decimal physical type. Therefore I'd prefer the
>> >>>>> parquet approach where decimal is only a logical type and backed by
>> >>>>> either a bytearray, int32 or int64.
>> >>>>>
>> >>>>> Thus a more general timestamp could look like:
>> >>>>>
>> >>>>> * Decimals are logical types, physical types are the same as defined in
>> >>>>>   Parquet [1]
>> >>>>> * Base unit for timestamps is seconds, you can get milliseconds and
>> >>>>>   nanoseconds by using a different scale. (Note that seconds and so on
>> >>>>>   are all powers of ten, thus matching the specification of decimal scale
>> >>>>>   really well.)
>> >>>>> * Timestamp is just another logical type that is referring to Decimal
>> >>>>>   (and optionally may have a timezone) and signalling that we have a Time
>> >>>>>   and not just a "simple" decimal.
>> >>>>> * For a first iteration, I would assume no timezone or UTC but not
>> >>>>>   include a metadata field. Once we're sure the implementation works, we
>> >>>>>   can add metadata about it.
>> >>>>>
>> >>>>> Timedeltas could be addressed in a similar way, just without the need
>> >>>>> for a timezone.
>> >>>>>
>> >>>>> For my usages, I don't have the use-case for a larger than int64
>> >>>>> timestamp and would like to have it exactly as such in my computation,
>> >>>>> thus my preference for the Parquet way.
>> >>>>>
>> >>>>> Uwe
>> >>>>>
>> >>>>> [1] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
>> >>>>>
>> >>>>> On 13.07.16 03:06, Julian Hyde wrote:
>> >>>>>> I'm talking about a fixed decimal type, not floating decimal. (Oracle
>> >>>>>> numbers are floating decimal. They have a few nice properties, but
>> >>>>>> they are variable width and can get quite large.
>> >>>>>> I've seen one or two
>> >>>>>> systems that started with binary floating point numbers, which are
>> >>>>>> much worse for business computing, and then change to Java BigDecimal,
>> >>>>>> which gives the right answer but is horribly inefficient.)
>> >>>>>>
>> >>>>>> A fixed decimal type has virtually zero computational overhead. It
>> >>>>>> just has a piece of metadata saying something like "every value in
>> >>>>>> this field is multiplied by 1 million" and leaves it to the client
>> >>>>>> program to do that multiplying.
>> >>>>>>
>> >>>>>> My advice is to create a good fixed decimal type and lean on it heavily.
>> >>>>>>
>> >>>>>> Julian
>> >>>>>>
>
> --
> Julien

--
Julien
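
To make the idea discussed in this thread concrete (an int64 UNIX timestamp in UTC, treated as a fixed decimal whose base unit is seconds and whose unit is carried as metadata), here is a minimal Python sketch. The names `TimestampColumn`, `UNIT_SCALE` and `cast` are hypothetical and exist only for this example; they are not part of the Arrow or Parquet APIs, and the eventual Arrow design may differ.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical names for illustration only; not Arrow or Parquet API.
# Decimal scale of each unit, taking seconds as the base unit.
UNIT_SCALE = {"SECOND": 0, "MILLI": 3, "MICRO": 6, "NANO": 9}

INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1


@dataclass(frozen=True)
class TimestampColumn:
    """Integer counts of `unit` since the UNIX epoch; UTC is assumed."""

    values: Tuple[int, ...]
    unit: str

    def cast(self, target_unit: str) -> "TimestampColumn":
        """Logically cast to another unit by rescaling with a power of ten."""
        shift = UNIT_SCALE[target_unit] - UNIT_SCALE[self.unit]
        if shift >= 0:
            factor = 10 ** shift
            rescaled = tuple(v * factor for v in self.values)
        else:
            factor = 10 ** (-shift)
            # Floor division: pre-epoch values round toward the earlier instant.
            rescaled = tuple(v // factor for v in self.values)
        # Emulate the int64 storage limit that a real column would have.
        for v in rescaled:
            if not INT64_MIN <= v <= INT64_MAX:
                raise OverflowError(f"{v} does not fit in int64 as {target_unit}")
        return TimestampColumn(rescaled, target_unit)


millis = TimestampColumn((1475532960000, -1), "MILLI")
nanos = millis.cast("NANO")      # multiplies each value by 10**6
seconds = millis.cast("SECOND")  # floor-divides each value by 10**3
```

The stored integers pass through unchanged until a client asks for a different unit, which is the "metadata only" property Julian describes. The overflow check matters most for the nanosecond unit, since an int64 count of nanoseconds covers only about 292 years on either side of the epoch.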