On Oct 5, 2009, at 5:14 PM, esp wrote:
Date-Time-Stamp input method to correctly interpret user-specific
formats: coding is 90% there - based on example at
http://tolstoy.newcastle.edu.au/R/help/05/02/12003.html
...anyone got the last 10% please?
CONTEXT:
Data is received where one of the columns is a datetimestamp. At midnight,
the value represented as text in this column consists of just the date part,
e.g. "01/09/2009". At other times, the value in the column contains both
date and time, e.g. "01/09/2009 00:00:01". The goal is to read it into R as
an appropriate data type, where for example date arithmetic can be
performed. As far as I can tell, the most appropriate such data type is
POSIXct. The trick then is to read in the datetimestamps in the data as
this type.
PROBLEM:
POSIXct defaults to a text representation almost but not quite like my
received data. The main difference is that the POSIXct date part is in
reverse order, e.g. "2009-09-01". It is possible to define a different
format where date and time parts look like my data, but when encountering
datetimestamps where only the date part is present (as in the case of my
midnight data) then this is interpreted as NA, i.e. undefined.
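To make the NA behaviour concrete, here is a minimal sketch using two values from the example data. The variable names and the choice of tz = "UTC" are assumptions for illustration only; the real data may well be in UK local time.

```r
# Two stamps as they appear in the data: one date-only (midnight), one full.
stamps <- c("01/09/2009", "01/09/2009 00:00:01")

# Parsing both with the full date-time format: the date-only string is too
# short for the format, so it comes back as NA.
parsed <- as.POSIXct(strptime(stamps, format = "%d/%m/%Y %H:%M:%S", tz = "UTC"))

is.na(parsed)
```

The first element is NA and the second parses, which is the behaviour described above.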
SOLUTION (ALMOST):
There is a workaround (based on example at
http://tolstoy.newcastle.edu.au/R/help/05/02/12003.html). It is possible to
define a class then read the data in as this class. For such a class it is
possible to define a class method, in terms of a function, for translating a
text (character string) representation into a value. In that function, one
can use a conditional expression to treat midnight datetimestamps
differently from those at other times of day. The example below does that.
In order to apply this function over all of the datetimestamp values in the
column, it is necessary to use something like R's 'sapply' function.
SNAG:
The function below implements this approach. A datetimestamp with only the
date part, including leading zeroes, is always length 10 (characters). It
correctly interprets the datetimestamp values, but unfortunately translates
them into what appear to be numeric type. I am actually uncertain precisely
what is happening, as I am very new to R and have most certainly stretched
myself in writing this code. I think perhaps it returns a list and
something associated with this aspect makes it "forget" the data type is
POSIXct or at least how such a type should be displayed as text or what to
do about it.
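The symptom is consistent with how 'sapply' simplifies its result: it flattens the per-element values into a plain vector, and the class attribute is lost along the way. A small sketch, where 'parse_one' is a hypothetical helper and tz = "UTC" is an assumption:

```r
stamps <- c("01/09/2009", "01/09/2009 00:00:01")

# Hypothetical helper: parse a single stamp, choosing the format by length.
parse_one <- function(x) {
  fmt <- if (nchar(x) == 10) "%d/%m/%Y" else "%d/%m/%Y %H:%M:%S"
  as.POSIXct(strptime(x, format = fmt, tz = "UTC"))
}

# A single call keeps the date-time class.
one <- parse_one(stamps[2])
class(one)

# sapply() simplifies to a bare numeric vector (seconds since the epoch)
# and uses the input strings as element names.
simplified <- sapply(stamps, parse_one)
class(simplified)
names(simplified)
```

The element names would also account for a display that shows the original strings above the numbers.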
PLEA:
Please, can anyone give any help whatsoever, however tenuous?
CODE, DATA & RESULTS:
Function to read required data, intended to make the datetime column of the
data (example given further below) into POSIXct values:
<<<
spot_frequency_readin <- function(file, nrows=-1) {
  # create temp class
  setClass("t_class2_", representation("character"))
  setAs("character", "t_class2_", function(from) {
    sapply(from, function(x) {
      if (nchar(x)==10) {
        as.POSIXct(strptime(x,format="%d/%m/%Y"))
      }
      else {
        as.POSIXct(strptime(x,format="%d/%m/%Y %H:%M:%S"))
      }
    })
  })
  #(for format symbols, see "R Reference Card")
  # read the file (TSV)
  file <- read.delim(file, header=TRUE, comment.char = "", nrows=nrows,
                     as.is=FALSE, col.names=c("DATETIME", "FREQ"),
                     colClasses=c("t_class2_", "numeric") )
  # remove it now that we are done with it
  removeClass("t_class2_")
  return(file)
}
This appears to work as regards processing each row of data correctly,
but the values returned look like numeric equivalents of POSIXct, as opposed
to the expected character-based (string) equivalents:
Example Data:
<<<
DATETIME FREQ
01/09/2009 59.036
01/09/2009 00:00:01 58.035
01/09/2009 00:00:02 53.035
01/09/2009 00:00:03 47.033
01/09/2009 00:00:04 52.03
01/09/2009 00:00:05 55.025
Example Function Call:
<<<
spot = spot_frequency_readin("mydatafile.txt",4)
Result of Example Function Call:
<<<
spot[1]
DATETIME
1 1251759600
2 1251759601
3 1251759602
4 1251759603
What I ideally wanted to see (whether or not the time part of the
datetimestamp at midnight was displayed):
<<<
spot[1]
DATETIME
01/09/2009 00:00:00
01/09/2009 00:00:01
01/09/2009 00:00:02
01/09/2009 00:00:03
01/09/2009 00:00:04
For the function as defined above using 'sapply'
spot[,1]
         01/09/2009 01/09/2009 00:00:01 01/09/2009 00:00:02 01/09/2009 00:00:03
         1251759600          1251759601          1251759602          1251759603
This was unexpected - it seems to have displayed the datetimestamp values
both as per my defined character-string representation and as numeric
values.
> as.POSIXct(spot$DATETIME, origin="1970-01-01")
               01/09/2009       01/09/2009 00:00:01       01/09/2009 00:00:02
"2009-09-01 05:00:00 EDT" "2009-09-01 05:00:01 EDT" "2009-09-01 05:00:02 EDT"
      01/09/2009 00:00:03
"2009-09-01 05:00:03 EDT"
If you want to get rid of the somewhat extraneous names:
> unname(as.POSIXct(spot$DATETIME, origin="1970-01-01") )
[1] "2009-09-01 05:00:00 EDT" "2009-09-01 05:00:01 EDT" "2009-09-01 05:00:02 EDT"
[4] "2009-09-01 05:00:03 EDT"
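The displayed clock times depend on the session time zone and on how the origin is interpreted, which is presumably why midnight shows up as 05:00:00 EDT here. A sketch that pins both the conversion and the display to one zone; the numeric values and tz = "UTC" are assumptions for illustration, not the poster's actual data:

```r
# Seconds since 1970-01-01 00:00:00 UTC for two illustrative stamps.
secs <- c(1251763200, 1251763201)

# Convert back to date-times, stating the time zone explicitly.
back <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

# Display in the day/month/year layout used by the incoming data.
shown <- format(back, "%d/%m/%Y %H:%M:%S", tz = "UTC")
shown
```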
If you want a variable that stays that way:
> spot$D2 <- as.POSIXct(spot$DATETIME, origin="1970-01-01")
> spot
DATETIME FREQ D2
1 1251777600 59.036 2009-09-01 05:00:00
2 1251777601 58.035 2009-09-01 05:00:01
3 1251777602 53.035 2009-09-01 05:00:02
4 1251777603 47.033 2009-09-01 05:00:03
Or you could overwrite spot$DATETIME.
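Another option, sketched here under assumptions, is to make the conversion function itself vectorized so that 'sapply' is not needed and the column arrives as POSIXct directly. The class name 'stamp_dmy', the inline sample text and tz = "UTC" are all hypothetical choices for this illustration.

```r
library(methods)

# Hypothetical helper class, used only to hook a converter into colClasses.
setClass("stamp_dmy", representation("character"))

setAs("character", "stamp_dmy", function(from) {
  # Pad date-only stamps (exactly 10 characters) with a midnight time,
  # then parse the whole vector in one call so the class is kept.
  short <- !is.na(from) & nchar(from) == 10
  from[short] <- paste(from[short], "00:00:00")
  as.POSIXct(strptime(from, format = "%d/%m/%Y %H:%M:%S", tz = "UTC"))
})

# Inline tab-separated sample standing in for the real file.
tsv <- "DATETIME\tFREQ\n01/09/2009\t59.036\n01/09/2009 00:00:01\t58.035\n"
spot <- read.delim(text = tsv, colClasses = c("stamp_dmy", "numeric"))

class(spot$DATETIME)
diff(spot$DATETIME)
```

With the column held as POSIXct, date arithmetic such as 'diff' works on it directly, and 'format' can be used when a day/month/year display is wanted.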
Alternatively, if I replace the 'sapply' by a 'lapply' then I get something
closer to what I expect. It is at least what looks like R's default text
representation for POSIXct datetimes, even if it is not in my preferred
format.
<<<
spot[,1]
[[1]]
[1] "2009-09-01 BST"
[[2]]
[1] "2009-09-01 00:00:01 BST"
[[3]]
[1] "2009-09-01 00:00:02 BST"
[[4]]
[1] "2009-09-01 00:00:03 BST"
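A list like the one above can be collapsed back into a single POSIXct vector and then displayed in the preferred layout. A minimal sketch, again assuming tz = "UTC" and using made-up variable names:

```r
stamps <- c("01/09/2009", "01/09/2009 00:00:01")

# lapply() returns a list whose elements each keep their POSIXct class.
as_list <- lapply(stamps, function(x) {
  fmt <- if (nchar(x) == 10) "%d/%m/%Y" else "%d/%m/%Y %H:%M:%S"
  as.POSIXct(strptime(x, format = fmt, tz = "UTC"))
})

# c() has a POSIXct method, so combining the elements keeps the class.
combined <- do.call(c, as_list)
class(combined)

# format() controls only the display; the stored values are unchanged.
shown <- format(combined, format = "%d/%m/%Y %H:%M:%S", tz = "UTC")
shown
```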
--
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.