[Numpy-discussion] NumPy date/time types and the resolution concept
Hi,

Before giving more thought to the new proposal for the NumPy date/time types based on the concept of 'resolution', I'd like to gather more feedback on your opinions about it.

After pondering the opinions on the first proposal, the idea we are incubating is to complement ``datetime64`` with a 'resolution' metainfo. The ``datetime64`` will still be based on an int64 type, but the meaning of the 'ticks' would depend on a 'resolution' property. This is best seen with an example:

In [21]: numpy.arange(3, dtype=numpy.dtype('datetime64', 'sec'))
Out [21]: [1970-01-01T00:00:00, 1970-01-01T00:00:01, 1970-01-01T00:00:02]

In [22]: numpy.arange(3, dtype=numpy.dtype('datetime64', 'hour'))
Out [22]: [1970-01-01T00, 1970-01-01T01, 1970-01-01T02]

i.e. the 'resolution' gives the actual meaning to the 'int64' counter. The advantage of this abstraction is that the user can easily choose the scale of resolution that best fits his needs. I'm thinking of providing the following resolutions:

["femtosec", "picosec", "nanosec", "microsec", "millisec", "sec", "min", "hour", "month", "year"]

Also, together with the absolute ``datetime64`` one can have a relative counterpart, say ``timedelta64``, that also supports the notion of 'resolution'. Between the two, one would cover the needs of most uses while giving the user a lot of flexibility, IMO. We very much prefer this new approach to the one stated in our first proposal.

Now comes the tricky part: how to integrate the notion of 'resolution' with the 'dtype' data type factory of NumPy? Well, we have thought of a couple of possibilities.
1) Using the NumPy 'dtype' factory:

nanoabs = numpy.dtype('datetime64', resolution="nanosec")
nanorel = numpy.dtype('timedelta64', resolution="nanosec")

2) Extending the string notation by using the '[]' square brackets:

nanoabs = numpy.dtype('datetime64[nanosec]')   # long notation
nanoabs = numpy.dtype('T[nanosec]')            # short notation
nanorel = numpy.dtype('timedelta64[nanosec]')  # long notation
nanorel = numpy.dtype('t[nanosec]')            # short notation

With these building blocks, one may obtain more complex dtype structures easily.

Now, the question is: would that proposal conflict with the spirit of the current 'dtype' factory? And another important one: would it complicate the implementation too much? If the answer to both questions is 'no', then we will study this further and provide another proposal based on it.

BTW, I suppose that the best candidate to answer these would be Travis O., but if anybody feels brave enough ;-) please go ahead and give your advice.

Cheers,

-- Francesc Alted
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
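To make the idea of a resolution-dependent tick concrete, here is a minimal pure-Python sketch. It is not the proposed NumPy implementation: the `RESOLUTIONS` table and the `render` helper are hypothetical names, the origin is assumed to be the Unix epoch (as the example outputs in the message suggest), and only three fixed-length resolutions are covered. Calendar units such as "month" and "year" have no fixed length and are left out.

```python
from datetime import datetime, timedelta

# Origin assumed to be the Unix epoch, as in the example outputs in the message.
EPOCH = datetime(1970, 1, 1)

# Hypothetical table: resolution name -> (length of one tick, display format).
RESOLUTIONS = {
    "sec": (timedelta(seconds=1), "%Y-%m-%dT%H:%M:%S"),
    "min": (timedelta(minutes=1), "%Y-%m-%dT%H:%M"),
    "hour": (timedelta(hours=1), "%Y-%m-%dT%H"),
}


def render(tick, resolution):
    """Interpret an integer tick count at the given resolution."""
    step, fmt = RESOLUTIONS[resolution]
    return (EPOCH + tick * step).strftime(fmt)


print([render(i, "sec") for i in range(3)])
print([render(i, "hour") for i in range(3)])
```

The same integers 0, 1, 2 come out as consecutive seconds or consecutive hours depending only on the resolution attached to them, which is the point of the proposal.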
Re: [Numpy-discussion] NumPy date/time types and the resolution concept
On Monday 14 July 2008 09:07:47 Francesc Alted wrote:
> The advantage of this abstraction is that the user can easily choose the scale of resolution that best fits his needs. I'm thinking of providing the following resolutions:
>
> ["femtosec", "picosec", "nanosec", "microsec", "millisec", "sec", "min", "hour", "month", "year"]

In TimeSeries, we don't have anything finer than a second, but we have 'daily', 'business daily', 'weekly' and 'quarterly' resolutions.

A very useful feature that Matt Knox coded is the possibility of specifying starting points when switching from one resolution to another. For example, you can have a series with an 'ANN_MAR' frequency, which corresponds to 1 point a year, the year starting in April. When switching back to a monthly resolution, the points from January to March of the first year will be masked.

Another useful feature would be to allow the user to define his/her own resolution (every 15min, every 12h...). Right now it's a bit clunky in TimeSeries: we have to use the lowest resolution of the series (min, hour) and leave a lot of blanks (TimeSeries don't have to be regularly spaced, but it helps...)

> Now comes the tricky part: how to integrate the notion of 'resolution' with the 'dtype' data type factory of NumPy?

In TimeSeries, the frequency is stored as an integer. For example, a daily frequency is stored as 6000, an annual frequency as 1000, an 'ANN_MAR' frequency as 1003...
Re: [Numpy-discussion] NumPy date/time types and the resolution concept
On Mon, 14 Jul 2008, Francesc Alted apparently wrote:
> Before giving more thought to the new proposal for the NumPy date/time types based on the concept of 'resolution', I'd like to gather more feedback on your opinions about it.

It might be a good idea to run the proposal(s) past Marc-Andre Lemburg, mal (at) egenix (dot) com

Cheers,
Alan Isaac
Re: [Numpy-discussion] NumPy date/time types and the resolution concept
2008/7/14 Francesc Alted <[EMAIL PROTECTED]>:
> After pondering the opinions on the first proposal, the idea we are incubating is to complement ``datetime64`` with a 'resolution' metainfo. The ``datetime64`` will still be based on an int64 type, but the meaning of the 'ticks' would depend on a 'resolution' property.

This is an interesting idea. To be useful, though, you would also need a flexible "offset" defining the zero of time. After all, the reason not to just always use (say) femtosecond accuracy is that 2**64 femtoseconds is only about five hours. So if you're going to use femtosecond steps, you really want to choose your start point carefully. (It's also worth noting that there is little need for more time accuracy than atomic clocks can provide, since anyone looking for more than that is going to be doing some tricky metrology anyway.)

One might take guidance from the FITS format, which represents (arrays of) quantities as (usually) fixed-point numbers, but has a global "scale" and "offset" parameter for each array. This allows one to accurately represent many common arrays with relatively few bits. The FITS libraries transparently convert these quantities. Of course, this isn't so convenient if you don't have basic machine datatypes with enough precision to handle all the quantities of interest.

Anne
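The "about five hours" figure is easy to check. The following is a small sketch of the arithmetic only; the table and the `span_seconds` helper are hypothetical names, and a year is taken as 365.25 days for the purpose of a rough estimate.

```python
# Ticks per second for the sub-second resolutions discussed in the thread.
TICKS_PER_SECOND = {
    "femtosec": 10**15,
    "picosec": 10**12,
    "nanosec": 10**9,
    "microsec": 10**6,
    "millisec": 10**3,
    "sec": 1,
}

SECONDS_PER_HOUR = 3600
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year, good enough for an estimate


def span_seconds(resolution, bits=64):
    """Total time covered by all distinct values of a `bits`-wide counter."""
    return 2**bits / TICKS_PER_SECOND[resolution]


for name in TICKS_PER_SECOND:
    seconds = span_seconds(name)
    print(f"{name:>9}: {seconds / SECONDS_PER_HOUR:.3g} hours, "
          f"{seconds / SECONDS_PER_YEAR:.3g} years")
```

With 64 bits, femtosecond ticks cover a little over five hours and nanosecond ticks a little under 585 years, which is why the choice of origin matters so much at the fine end of the scale.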
Re: [Numpy-discussion] NumPy date/time types and the resolution concept
On Monday 14 July 2008, Pierre GM wrote:
> On Monday 14 July 2008 09:07:47 Francesc Alted wrote:
> > The advantage of this abstraction is that the user can easily choose the scale of resolution that best fits his needs. I'm thinking of providing the following resolutions:
> >
> > ["femtosec", "picosec", "nanosec", "microsec", "millisec", "sec", "min", "hour", "month", "year"]
>
> In TimeSeries, we don't have anything finer than a second, but we have 'daily', 'business daily', 'weekly' and 'quarterly' resolutions.

Yes, I forgot the "day" resolution. I suppose that "weekly" and "quarterly" could be added too. However, if we adopt a new way to specify the resolution (see below), these can be stated as '7d' and '3m' respectively. Mmh, not sure about "business daily"; this may be useful in time series, but I can't find a reasonable meaning for it as a 'time resolution' (which is a different concept from 'time frequency'). So I'd leave it out.

> A very useful feature that Matt Knox coded is the possibility of specifying starting points when switching from one resolution to another. For example, you can have a series with an 'ANN_MAR' frequency, which corresponds to 1 point a year, the year starting in April. When switching back to a monthly resolution, the points from January to March of the first year will be masked.

Ok. Anne was also suggesting that the origin of time should be configurable, but then, you are talking about *masking* values. Mmm, I don't think we should try to incorporate masking capabilities in the NumPy date/time types. At any rate, I hadn't thought about the possibility of a user-defined origin, but if we can add the 'resolution' metainfo, I don't see why we couldn't do the same with an 'origin' metainfo too.

> Another useful feature would be to allow the user to define his/her own resolution (every 15min, every 12h...). Right now it's a bit clunky in TimeSeries: we have to use the lowest resolution of the series (min, hour) and leave a lot of blanks (TimeSeries don't have to be regularly spaced, but it helps...)

Ok. I see the use case for this, but for implementation purposes, we should come up with a more complete way to specify the resolution than I had realized. Hmm, what about the following:

[N]timeunit

where ``timeunit`` can take the values in:

['y', 'm', 'd', 'h', 'm', 's', 'ms', 'us', 'ns', 'fs']

so, for example, '14d' means a resolution of 14 days, and '10ms' means a resolution of one hundredth of a second. Sounds good to me. What do other people think?

> > Now comes the tricky part: how to integrate the notion of 'resolution' with the 'dtype' data type factory of NumPy?
>
> In TimeSeries, the frequency is stored as an integer. For example, a daily frequency is stored as 6000, an annual frequency as 1000, an 'ANN_MAR' frequency as 1003...

Well, I initially planned to keep the resolution as an enumerated type (an int8 would be enough), but if the new way to specify resolutions goes ahead, I'm afraid that we may need a full int64 to store it. But apart from that, this should not be a problem (in general, the metainfo is a very tiny part of the space taken by a dataset).

Cheers,

-- Francesc Alted
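The '[N]timeunit' notation can be parsed in a few lines. The sketch below is only an illustration, with hypothetical names (`UNITS`, `parse_resolution`). Note one assumption made loudly: the unit list in the message contains 'm' twice (month and minute), so here 'm' is read as month, matching the '3m' and '1m' examples in the thread, and a hypothetical 'min' code is used for minutes.

```python
import re

# Assumption: 'm' is read as month and a hypothetical 'min' stands for minute,
# because the list in the message uses 'm' for both.
UNITS = {"y", "m", "d", "h", "min", "s", "ms", "us", "ns", "fs"}

_PATTERN = re.compile(r"(\d*)([a-z]+)")


def parse_resolution(text):
    """Split a '[N]timeunit' string into (count, unit); N defaults to 1."""
    match = _PATTERN.fullmatch(text)
    if match is None or match.group(2) not in UNITS:
        raise ValueError(f"not a valid resolution: {text!r}")
    count = int(match.group(1)) if match.group(1) else 1
    if count < 1:
        raise ValueError(f"count must be positive: {text!r}")
    return count, match.group(2)


print(parse_resolution("14d"))
print(parse_resolution("10ms"))
print(parse_resolution("h"))
```

Storing the result as a (count, unit) pair is one way to see why a plain int8 enumeration is no longer enough once an arbitrary multiplier N is allowed.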
Re: [Numpy-discussion] NumPy date/time types and the resolution concept
On Monday 14 July 2008, Alan G Isaac wrote:
> On Mon, 14 Jul 2008, Francesc Alted apparently wrote:
> > Before giving more thought to the new proposal for the NumPy date/time types based on the concept of 'resolution', I'd like to gather more feedback on your opinions about it.
>
> It might be a good idea to run the proposal(s) past Marc-Andre Lemburg, mal (at) egenix (dot) com

Sure. And maybe also past Fred Drake, the original author of the ``datetime`` module. However, I'd prefer to send them something in a more advanced state of refinement than it is now.

Thanks for the suggestion,

-- Francesc Alted
Re: [Numpy-discussion] NumPy date/time types and the resolution concept
On Monday 14 July 2008 12:50:21 Francesc Alted wrote:
> > A very useful feature that Matt Knox coded is the possibility of specifying starting points when switching from one resolution to another. For example, you can have a series with an 'ANN_MAR' frequency, which corresponds to 1 point a year, the year starting in April. When switching back to a monthly resolution, the points from January to March of the first year will be masked.
>
> Ok. Anne was also suggesting that the origin of time should be configurable, but then, you are talking about *masking* values. Mmm, I don't think we should try to incorporate masking capabilities in the NumPy date/time types.

Francesc,

In scikits.timeseries, we have 2 different objects:
* DateArray, which is basically an ndarray of integers with a given 'frequency' attribute.
* TimeSeries, which is basically the combination of a MaskedArray (the data part) and a DateArray (which keeps track of the date corresponding to each data point).

TimeSeries objects have the resolution/origin of the companion DateArray, and when they're converted from one resolution to another, some masking may occur.

My understanding is that you intend to define an object similar to DateArray. You want to define a new dtype (datetime64 or other); we used yet another class instead, Date. A dtype would be easier to manipulate, but as neither Matt nor I were particularly experienced with that at the time, we followed the simpler approach of a class...

> [N]timeunit
>
> where ``timeunit`` can take the values in:
>
> ['y', 'm', 'd', 'h', 'm', 's', 'ms', 'us', 'ns', 'fs']
>
> so, for example, '14d' means a resolution of 14 days, and '10ms' means a resolution of one hundredth of a second. Sounds good to me. What do other people think?

Sounds pretty cool and intuitive to use. However, writing the conversion rules from one to another will be a lot of fun. Take weekly, for example: that's a period of 7 days, but when does it start? On a Monday? Then, 12/31/2007 was the start of the first week of 2008... OK, we can leave that problem for the moment...
Re: [Numpy-discussion] NumPy date/time types and the resolution concept
On Monday 14 July 2008, Anne Archibald wrote:
> 2008/7/14 Francesc Alted <[EMAIL PROTECTED]>:
> > After pondering the opinions on the first proposal, the idea we are incubating is to complement ``datetime64`` with a 'resolution' metainfo. The ``datetime64`` will still be based on an int64 type, but the meaning of the 'ticks' would depend on a 'resolution' property.
>
> This is an interesting idea. To be useful, though, you would also need a flexible "offset" defining the zero of time. After all, the reason not to just always use (say) femtosecond accuracy is that 2**64 femtoseconds is only about five hours. So if you're going to use femtosecond steps, you really want to choose your start point carefully. (It's also worth noting that there is little need for more time accuracy than atomic clocks can provide, since anyone looking for more than that is going to be doing some tricky metrology anyway.)

That's a good point indeed. Well, to start with, I suppose that picosecond resolution is more than enough for today's precision standards (even when using atomic clocks). However, given that atomic clocks are always improving their precision [1], having a femtosecond resolution is not going to bother people, I think.

[1] http://en.wikipedia.org/wiki/Image:Clock_accurcy.jpg

But the time origin is certainly an issue, yes. See below.

> One might take guidance from the FITS format, which represents (arrays of) quantities as (usually) fixed-point numbers, but has a global "scale" and "offset" parameter for each array. This allows one to accurately represent many common arrays with relatively few bits. The FITS libraries transparently convert these quantities. Of course, this isn't so convenient if you don't have basic machine datatypes with enough precision to handle all the quantities of interest.

That's pretty interesting in that the "scale" is certainly something similar to the "resolution" concept that we want to introduce. And definitely, "offset" would be similar to "origin". So yes, we will try to introduce both concepts. However, one thing that we would try to avoid is using fixed-point arithmetic (we plan to use integer arithmetic only). The rationale is that fixed-point arithmetic is computationally more complex (it has to be implemented in software, while integer arithmetic is implemented in hardware) and that would slow things down too much.

Thanks!

-- Francesc Alted
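As a rough illustration of the scale/offset analogy, FITS recovers a physical value as BZERO + BSCALE * stored. The sketch below applies the same form to timestamps using integers only; the function names and the example numbers are hypothetical, and this is not a description of how the proposed dtype would be implemented.

```python
def decode(stored, scale, offset):
    """Physical value = offset + scale * stored (the FITS BZERO/BSCALE form)."""
    return offset + scale * stored


def encode(value, scale, offset):
    """Inverse of decode; rounds down to a whole number of ticks."""
    return (value - offset) // scale


# Hypothetical example: timestamps in seconds, stored as hour ticks
# counted from an origin 1_200_000_000 seconds after the epoch.
SCALE = 3600            # seconds per stored tick (the "resolution")
OFFSET = 1_200_000_000  # origin, in seconds (the "origin")

stored = encode(1_200_007_200, SCALE, OFFSET)
print(stored)
print(decode(stored, SCALE, OFFSET))
```

Here "scale" plays the role of the resolution and "offset" the role of the origin, and both directions need only integer multiplication, subtraction and floor division.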
Re: [Numpy-discussion] NumPy date/time types and the resolution concept
On Monday 14 July 2008, Pierre GM wrote:
> On Monday 14 July 2008 12:50:21 Francesc Alted wrote:
> > > A very useful feature that Matt Knox coded is the possibility of specifying starting points when switching from one resolution to another. For example, you can have a series with an 'ANN_MAR' frequency, which corresponds to 1 point a year, the year starting in April. When switching back to a monthly resolution, the points from January to March of the first year will be masked.
> >
> > Ok. Anne was also suggesting that the origin of time should be configurable, but then, you are talking about *masking* values. Mmm, I don't think we should try to incorporate masking capabilities in the NumPy date/time types.
>
> Francesc,
> In scikits.timeseries, we have 2 different objects:
> * DateArray, which is basically an ndarray of integers with a given 'frequency' attribute.
> * TimeSeries, which is basically the combination of a MaskedArray (the data part) and a DateArray (which keeps track of the date corresponding to each data point).
> TimeSeries objects have the resolution/origin of the companion DateArray, and when they're converted from one resolution to another, some masking may occur.
>
> My understanding is that you intend to define an object similar to DateArray. You want to define a new dtype (datetime64 or other); we used yet another class instead, Date. A dtype would be easier to manipulate, but as neither Matt nor I were particularly experienced with that at the time, we followed the simpler approach of a class...

Well, what we are after is precisely this: a new dtype. After integrating it into NumPy, I suppose that your DateArray would be similar to a NumPy array with a dtype of ``datetime64`` (bar the conceptual differences between the 'frequency' behind DateArray and the 'resolution' behind the datetime64 dtype).

> > [N]timeunit
> >
> > where ``timeunit`` can take the values in:
> >
> > ['y', 'm', 'd', 'h', 'm', 's', 'ms', 'us', 'ns', 'fs']
> >
> > so, for example, '14d' means a resolution of 14 days, and '10ms' means a resolution of one hundredth of a second. Sounds good to me. What do other people think?
>
> Sounds pretty cool and intuitive to use. However, writing the conversion rules from one to another will be a lot of fun. Take weekly, for example: that's a period of 7 days, but when does it start? On a Monday? Then, 12/31/2007 was the start of the first week of 2008... OK, we can leave that problem for the moment...

It would start when the origin says that it should start. It is important to note that our proposal will not force a '7d' (seven days) 'tick' to start on a Monday, or a '1m' (one month) tick to start on the 1st day of a calendar month, but rather wherever the user decides to set its origin.

Cheers,

-- Francesc Alted
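The point that the origin, not the calendar, decides where a '7d' tick starts can be shown with a short sketch. The `bucket_start` helper is a hypothetical name, and the two origins are chosen only for illustration: the Unix epoch (a Thursday) and Pierre's 12/31/2007 (a Monday).

```python
from datetime import date, timedelta


def bucket_start(day, origin, width_days):
    """Start of the width_days-long tick that contains `day`."""
    # Floor division, so days before the origin are handled as well.
    ticks = (day - origin).days // width_days
    return origin + timedelta(days=ticks * width_days)


day = date(2008, 1, 3)  # a Thursday

# Origin at the Unix epoch (1970-01-01, a Thursday): 7d ticks start on Thursdays.
print(bucket_start(day, date(1970, 1, 1), 7))

# Origin at 2007-12-31 (a Monday): 7d ticks start on Mondays.
print(bucket_start(day, date(2007, 12, 31), 7))
```

The same date lands in a tick starting on 2008-01-03 with the first origin and on 2007-12-31 with the second, so "weekly" carries no weekday convention of its own under this scheme.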
Re: [Numpy-discussion] NumPy date/time types and the resolution concept
On Monday 14 July 2008 14:17:18 Francesc Alted wrote:
> Well, what we are after is precisely this: a new dtype. After integrating it into NumPy, I suppose that your DateArray would be similar to a NumPy array with a dtype of ``datetime64`` (bar the conceptual differences between the 'frequency' behind DateArray and the 'resolution' behind the datetime64 dtype).

Well, you're losing me on this one: could you explain the difference between the two concepts? It might only be a problem of vocabulary...

> It would start when the origin says that it should start. It is important to note that our proposal will not force a '7d' (seven days) 'tick' to start on a Monday, or a '1m' (one month) tick to start on the 1st day of a calendar month, but rather wherever the user decides to set its origin.

OK, so we need 2 flags, one for the resolution, one for the origin. Because there won't be that many possible resolutions, an int8 should be sufficient. What do you have in mind for the origin? When using a resolution coarser than 1d (7d, 1m, 3m, 1a), an origin in days is OK. What about less than a day?
Re: [Numpy-discussion] NumPy date/time types and the resolution concept
On Monday 14 July 2008, Pierre GM wrote:
> On Monday 14 July 2008 14:17:18 Francesc Alted wrote:
> > Well, what we are after is precisely this: a new dtype. After integrating it into NumPy, I suppose that your DateArray would be similar to a NumPy array with a dtype of ``datetime64`` (bar the conceptual differences between the 'frequency' behind DateArray and the 'resolution' behind the datetime64 dtype).
>
> Well, you're losing me on this one: could you explain the difference between the two concepts? It might only be a problem of vocabulary...

Maybe it is only that. But the term 'frequency' makes me think that you expect to have one entry (observation) in your array for each time 'tick' since the start of time. OTOH, the term 'resolution' doesn't have this implication, and only states the precision of the timestamp.

I don't know whether my impression is right or not, but after reading about your TimeSeries package, I still think that this expectation of one observation per 'tick' was what drove you to choose the 'frequency' name.

> > It would start when the origin says that it should start. It is important to note that our proposal will not force a '7d' (seven days) 'tick' to start on a Monday, or a '1m' (one month) tick to start on the 1st day of a calendar month, but rather wherever the user decides to set its origin.
>
> OK, so we need 2 flags, one for the resolution, one for the origin. Because there won't be that many possible resolutions, an int8 should be sufficient. What do you have in mind for the origin? When using a resolution coarser than 1d (7d, 1m, 3m, 1a), an origin in days is OK. What about less than a day?

Well, after reading the mails from Chris and Anne, I think it is best that the origin be kept as an int64 with a resolution of microseconds (for compatibility with the ``datetime`` module, as I've said before).

Cheers,

-- Francesc Alted
Re: [Numpy-discussion] NumPy date/time types and the resolution concept
On Tuesday 15 July 2008 07:30:09 Francesc Alted wrote:
> Maybe it is only that. But the term 'frequency' makes me think that you expect to have one entry (observation) in your array for each time 'tick' since the start of time. OTOH, the term 'resolution' doesn't have this implication, and only states the precision of the timestamp.

OK, now I get it.

> I don't know whether my impression is right or not, but after reading about your TimeSeries package, I still think that this expectation of one observation per 'tick' was what drove you to choose the 'frequency' name.

Well, we do require "one point per tick" for some operations, such as conversion from one frequency to another, but only for TimeSeries. A DateArray doesn't have to be regularly spaced.
Re: [Numpy-discussion] NumPy date/time types and the resolution concept
2008/7/15 Francesc Alted <[EMAIL PROTECTED]>:
> Maybe it is only that. But the term 'frequency' makes me think that you expect to have one entry (observation) in your array for each time 'tick' since the start of time. OTOH, the term 'resolution' doesn't have this implication, and only states the precision of the timestamp.

> Well, after reading the mails from Chris and Anne, I think it is best that the origin be kept as an int64 with a resolution of microseconds (for compatibility with the ``datetime`` module, as I've said before).

A couple of details worth pointing out: we don't need a zillion resolutions. One that's as good as the world time standards, and one that spans an adequate length of time should cover it. After all, the only reason for not using the highest available resolution is if you want to cover a larger range of times. So there is no real need for microseconds and milliseconds and seconds and days and weeks and...

There is also no need for the origin to be kept with a resolution as high as microseconds; seconds would do just fine, since if necessary it can be interpreted as "exactly 7000 seconds after the epoch" even if you are using femtoseconds elsewhere.

Anne
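The remark about a seconds-resolution origin combined with femtosecond ticks can be sketched as follows. The names are hypothetical, and the sketch leans on Python's arbitrary-precision integers to show the exact combined value; a fixed-width int64 could not hold it for a realistic origin, which is the reason for keeping the two parts separate.

```python
FS_PER_SECOND = 10**15
INT64_MAX = 2**63 - 1


def absolute_fs(origin_seconds, ticks_fs):
    """Exact femtoseconds since the epoch for a (seconds origin, fs ticks) pair."""
    return origin_seconds * FS_PER_SECOND + ticks_fs


# The example from the message: an origin exactly 7000 seconds after the epoch.
print(absolute_fs(7000, 5))

# A hypothetical origin in mid-2008, about 1.216e9 seconds after the epoch:
# the combined value no longer fits in a signed 64-bit integer, which is
# why the origin is kept apart from the ticks.
print(absolute_fs(1_216_000_000, 5) > INT64_MAX)
```

An origin stored in whole seconds loses nothing here, because it is interpreted as an exact instant and all the fine detail lives in the tick counter.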
Re: [Numpy-discussion] NumPy date/time types and the resolution concept
On Tuesday 15 July 2008, Anne Archibald wrote:
> 2008/7/15 Francesc Alted <[EMAIL PROTECTED]>:
> > Maybe it is only that. But the term 'frequency' makes me think that you expect to have one entry (observation) in your array for each time 'tick' since the start of time. OTOH, the term 'resolution' doesn't have this implication, and only states the precision of the timestamp.
> >
> > Well, after reading the mails from Chris and Anne, I think it is best that the origin be kept as an int64 with a resolution of microseconds (for compatibility with the ``datetime`` module, as I've said before).
>
> A couple of details worth pointing out: we don't need a zillion resolutions. One that's as good as the world time standards, and one that spans an adequate length of time should cover it. After all, the only reason for not using the highest available resolution is if you want to cover a larger range of times. So there is no real need for microseconds and milliseconds and seconds and days and weeks and...

Maybe you are right, but by providing many resolutions we are trying to cope with the needs of people who use them a lot. In particular, we hope that the authors of the timeseries scikit can find in this new dtype a fair replacement for their Date class (our proposal will not be as full-featured, but...).

> There is also no need for the origin to be kept with a resolution as high as microseconds; seconds would do just fine, since if necessary it can be interpreted as "exactly 7000 seconds after the epoch" even if you are using femtoseconds elsewhere.

Good point. However, in the end we managed not to include the ``origin`` metadata in our new proposal. Have a look at the second proposal, which I'll be posting soon, for details.

Cheers,

-- Francesc Alted
Re: [Numpy-discussion] NumPy date/time types and the resolution concept
On Tuesday 15 July 2008, Pierre GM wrote:
> On Tuesday 15 July 2008 07:30:09 Francesc Alted wrote:
> > Maybe it is only that. But the term 'frequency' makes me think that you expect to have one entry (observation) in your array for each time 'tick' since the start of time. OTOH, the term 'resolution' doesn't have this implication, and only states the precision of the timestamp.
>
> OK, now I get it.
>
> > I don't know whether my impression is right or not, but after reading about your TimeSeries package, I still think that this expectation of one observation per 'tick' was what drove you to choose the 'frequency' name.
>
> Well, we do require "one point per tick" for some operations, such as conversion from one frequency to another, but only for TimeSeries. A DateArray doesn't have to be regularly spaced.

Ok, I see. So it was just the 'frequency' keyword that was misleading me. Thanks for the clarification.

Cheers,

-- Francesc Alted