Hi Andrew, Great question. I'll just have a few small comments:
On 13 August 2010 at 16:26, Andrew Piskorski wrote: | How do folks here prefer to represent date and time stamps on market | tick data, and how have you dealt with time zones? Any lessons | learned? | | The tick data I have to work with comes with a string date and time | stamp, which typically is in the local exchange timezone but has no | explicit timezone information anywhere in the data. | | Now, often I don't care about the timezone anyway, the exchange local | time (or even just elapsed time since market open) is all I'll need. | However, I'll sometimes want to carefully line up timestamps across | different exchanges in different timezones. In those cases, I'd | prefer to completely ignore timezones until THEN when I actually need | them. | | So ideally, I'd like to pick a single date and time format now which | meets all of these (draft) criteria: | | 1. Fast and compact, for use with tick data. | | 2. Relatively simple low-level representation, to ease | interoperability with C and/or other non-R programming languages. | | 3. Able to cleanly and unambiguously represent date-time stamps both | with an unspecified (currently unknown) timezone, and with any valid | timezone. | | Thus, I'm very curious what others have used, and what criteria you | found to actually be important in practice... | | These are all the R date/time options I'm aware of: POSIXct, POSIXlt, | chron, and timeDate. Did I miss any important ones? as.Date is | useful of course, but only for dates, not times. Here's what I think | of each of those options so far: | | POSIXct's date and time representation is exactly that of Unix, a | count of seconds since 1970-01-01 in GMT. Nice and simple. However, It is better than Unix as it is fractional seconds with a microsecond granularity where as standard Unix has (iirc) unsigned long integers. > options("digits.secs"=6) # needed or sub-seconds not shown > Sys.time() [1] "2010-08-13 15:57:37.790672 CDT" > I use this type a lot for the same reasons you list: fast, light, interchangeable. But as you point out, no timezones. | that inherently means that there isn't really any such thing as an | unspecified timezone. SOME timezone must immediately be assumed in | order to get your data into the underlying seconds-since-GMT-epoch | format. | | timeDate appears to be a lot like POSIXct, but it has other features | and I'm not sure whether it's underlying representation is really the | same or not. I've seen some contradictory information out there on | what it does and what it's really good for; should I investigate | further? | | POSIXlt is effectively an R wrapper around the standard Unix libc tm | structure used by strptime(3). Thus, it is timezone agnostic, and can | easily represent a date and time with the timezone unspecified or with | any particular timezone. However, it seems unwieldy and annoying to | work with. But there are converters to and from POSIXct ! So it is not "instead of" but rather "along with" making POSIXct an even stronger candidate (provided you handle TZ). Also, difftime() works well on these types. | chron uses Julian days, presumably in GMT. By default days are | counted since the Unix epoch of 1970-01-01, but if you pass in | origin=0 you get standard Julian day numbers, where 1970-01-01 = | 2440588. Times are fractional days. However, chron claims to have no | timezone support at all. | | as.Date (class "Date"), like chron, uses Julian days, but its origin | is always 1970-01-01, you cannot change it. It deals solely with | dates, not times. Yes you can change the origin. See help(as.Date). Cheers, Dirk | Of the above, POSIXct was my obvious initial candidate, it's only | timezone handling that made me think twice. If I do use POSIXct, | AFAICT my options are either: | | 1. Put in the work upfront to make sure I have all my timezones | correct for all my data. I am leery of actually getting that truly | correct, particularly with daylight savings time differences, possible | arbitrary changes in data vendor practices over time, etc. | | 2. Or, pretend that all my timestamps are actually in GMT, save them | that way on disk, and convert later if/when I need to use the true | exchange-local timezone. | | 3. Something in between the two above approaches, pick a likely | timezone and save the data in that, regardless of how certain I am | that it's really 100% correct. However, that then suggests... | | 4. Build some sort of wrapper around POSIXct which lets me be clearer | about whether I know the correct timezone or not. | | And of course there are plenty of ways to build something from scratch | without directly involving POSIXct at all. (Perhaps even with some | sort of clever flag bits to optionally note timezones embedded within | an otherwise vanilla fractional Julian day or Unix epoch based value.) | | Finally, below is some R output to demonstrate the apparent date/time | representations used by various R functions and classes: | | ------------------------------------------------------------ | | # R's default Julian date origin is 1970-01-01, but if you pass in | # origin=0, you get standard Julian Day Numbers where 1970-01-01 is | # JDN 2440588. That also happens to agree exactly with Oracle 10g. | # | # Oddly, with origin=0 R seems to think that JDN 0 is 4713-11-24 BC | # rather than either the 4712-01-01 BC from Oracle or the 4713-01-01 | # BC given on Wikipedia. However this does not seem to make any | # practical difference. | # http://en.wikipedia.org/wiki/Julian_day | | # chron and its helper functions julian, month.day.year, etc.: | > unlist(month.day.year(0)) | month day year | 1 1 1970 | > unlist(month.day.year(0 ,origin=0)) | month day year | 11 24 -4713 | > chron:::julian.default(1,1,1970) | [1] 0 | > chron:::julian.default(1,1,1970 ,origin=0) | [1] 2440588 | > print(as.numeric(chron(rep("1970-01-01",2) ,c("00:00:00","00:01:00") ,format=c(dates="y-m-d" ,times="h:m:s"))) ,digits=17) | [1] 0.0000000000000000000 0.0006944444444444445 | > print(as.numeric(chron(rep("1970-01-01",2) ,c("00:00:00","00:01:00") ,format=c(dates="y-m-d" ,times="h:m:s") ,origin=0)) ,digits=17) | [1] 2440588.000000000 2440588.000694444 | > 0.0006944444444444445 *24*60 | [1] 1 | | | oracle> select to_char(to_date('4712-01-01 BC','YYYY-MM-DD BC'),'J') as julian from dual; | JULIAN | ------- | 0000001 | oracle> select to_char(to_date('1970-01-01 AD','YYYY-MM-DD BC'),'J') as julian from dual; | JULIAN | ------- | 2440588 | | oracle> select to_char(to_date('1','J') ,'BC YYYY-MM-DD HH24:MI:SS') as origin from dual; | ORIGIN | ---------------------- | BC 4712-01-01 00:00:00 | oracle> select to_char(to_date('2440588','J') ,'BC YYYY-MM-DD HH24:MI:SS') as epoch from dual; | EPOCH | ---------------------- | AD 1970-01-01 00:00:00 | | oracle> select to_char(to_date('4713-11-24 BC','YYYY-MM-DD BC'),'J') as julian from dual; | ERROR at line 1: | ORA-01841: (full) year must be between -4713 and +9999, and not be 0 | | | # as.Date: as.Date.character uses strptime, and origin=0 has no effect. | > as.numeric(as.Date(c("1969-12-31" ,"1970-01-01" ,"1970-01-02"))) | [1] -1 0 1 | > as.numeric(as.Date(c("1969-12-31" ,"1970-01-01" ,"1970-01-02") ,origin=0)) | [1] -1 0 1 | | | # as.POSIXct: Default is local timezone, which happens to be EST -0500: | > as.numeric(as.POSIXct(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="GMT")) | [1] 0 60 | > as.numeric(as.POSIXct(c("1970-01-01 00:00:00","1970-01-01 00:01:00"))) | [1] 18000 18060 | > 18000 /60^2 | [1] 5 | | | posixlt.to.tz.df <- function(xx) { | rr <- as.data.frame(unclass(xx)) | rr[,"tzone"] <- NA | if (!is.na(match("tzone" ,names(attributes(xx))))) | rr[,"tzone"] <- attributes(xx)$tzone | rr | } | | # as.POSIXlt: Is technically capable of expressing "timezone unknown": | > posixlt.to.tz.df(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="GMT")) | sec min hour mday mon year wday yday isdst tzone | 1 0 0 0 1 0 70 4 0 0 GMT | 2 0 1 0 1 0 70 4 0 0 GMT | > posixlt.to.tz.df(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="EST")) | sec min hour mday mon year wday yday isdst tzone | 1 0 0 0 1 0 70 4 0 0 EST | 2 0 1 0 1 0 70 4 0 0 EST | > posixlt.to.tz.df(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00"))) | sec min hour mday mon year wday yday isdst tzone | 1 0 0 0 1 0 70 4 0 0 NA | 2 0 1 0 1 0 70 4 0 0 NA | > posixlt.to.tz.df(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="PST")) | sec min hour mday mon year wday yday isdst tzone | 1 0 0 0 1 0 70 4 0 0 PST | 2 0 1 0 1 0 70 4 0 0 PST | > posixlt.to.tz.df(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="-0800")) | sec min hour mday mon year wday yday isdst tzone | 1 0 0 0 1 0 70 4 0 0 -0800 | 2 0 1 0 1 0 70 4 0 0 -0800 | | # as.POSIXct is in fact implemented by using strptime and as.POSIXlt, | # but here show explicit conversion of lt to ct anyway: | > as.numeric(as.POSIXct(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="GMT"))) | [1] 0 60 | > as.numeric(as.POSIXct(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00")))) | [1] 18000 18060 | > 18000 /60^2 | [1] 5 | | # BUG: Oddly, as.POSIXct() does NOT seem to respect non-GMT non-local | # timzones like PST: | > as.numeric(as.POSIXct(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="PST"))) | [1] 0 60 | > as.numeric(as.POSIXct(as.POSIXlt(c("1970-01-01 00:00:00","1970-01-01 00:01:00") ,tz="-0800"))) | [1] 0 60 | | | # Note the (documented) subtlety due to GMT/UTC when converting | # between class "Date" and "POSIXct": | > as.POSIXct("2010-03-13 19:00:00 EST") | [1] "2010-03-13 19:00:00 EST" | > as.Date(as.POSIXct("2010-03-13 19:00:00 EST")) | [1] "2010-03-14" | > as.Date("2010-03-14")) | [1] "2010-03-14 UTC" | > as.POSIXct(as.Date("2010-03-14")) | [1] "2010-03-13 19:00:00 EST" | | | # timeDate appears to be much like POSIXct, counting seconds since the | # Unix epoch, but perhaps its underlying implementation is actually | # different, given its explicit use of a "units" attribute below: | | > timeDate(0) | GMT | [1] [1970-01-01] | > timeDate("1970-01-01") | GMT | [1] [1970-01-01] | | > as.numeric(timeDate("1970-01-01")) | [1] 0 | attr(,"FinCenter") | [1] "GMT" | attr(,"units") | [1] "secs" | attr(,"origin") | [1] "1970-01-01 00:00:00 GMT" | | > timeDate(2) | GMT | [1] [1970-01-01 00:00:02] | | > as.numeric(timeDate(2)) | [1] 2 | attr(,"FinCenter") | [1] "GMT" | attr(,"units") | [1] "secs" | attr(,"origin") | [1] "1970-01-01 00:00:00 GMT" | | > as.numeric(timeDate(24*60^2)) | [1] 1 | attr(,"FinCenter") | [1] "GMT" | attr(,"units") | [1] "days" | attr(,"origin") | [1] "1970-01-01 GMT" | | ------------------------------------------------------------ | | -- | Andrew Piskorski <a...@piskorski.com> | http://www.piskorski.com/ | | _______________________________________________ | R-SIG-Finance@stat.math.ethz.ch mailing list | https://stat.ethz.ch/mailman/listinfo/r-sig-finance | -- Subscriber-posting only. If you want to post, subscribe first. | -- Also note that this is not the r-help list where general R questions should go. -- Dirk Eddelbuettel | e...@debian.org | http://dirk.eddelbuettel.com _______________________________________________ R-SIG-Finance@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-finance -- Subscriber-posting only. If you want to post, subscribe first. -- Also note that this is not the r-help list where general R questions should go.