On Jan 30, 2012, at 8:44 AM, Paul Miller wrote:
Hi Rui, Marc, and Gabor,
Thanks for your replies to my question. All were helpful and it was
interesting to see how different people approach various aspects of
the same problem.
Spent some time this weekend looking at Rui's solution, which is
certainly much clearer than my own. Managed to figure out pretty
much all the details of how it works. Also managed to tweak it
slightly in order to make it do exactly what I wanted. (See revised
code below.)
Still have a couple of questions though. The first concerns the
insertion of the code "Y > 2012" to set year values beyond 2012 to
NA (on line 10 of the function below). When I add this (or use it
in place of "nchar(Y) > 4"), the code successfully finds the problem
date "05/16/2015". After that though, it produces the following
error message:
Error in if (any(is.na(x) & M != "un" & Y != "un")) cat("Warning:
Invalid date values in", : missing value where TRUE/FALSE needed
It's a bit dangerous to use comparison operators on mixed data types.
In your case you are comparing a character value to a numeric value
and may not realize that 2015 is not the same as "2015". Try "123" >
1000 if you want a quick counter-example. You may want to coerce the Y
value to "numeric" mode to be safe.
Also 'any' does not expect the logical connectives. You probably want:
any(is.na(x) , M != "un" , Y != "un")
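A small sketch of the points above, using made-up year values rather than the real data. The names Yn, bad_year and flags are invented for illustration. One plausible reading of the error (an assumption, since string ordering depends on the locale) is that a year of "un" also compares as greater than "2012", is set to NA, and then leaves an NA inside any().

```r
# Mixed comparison: the number is coerced to a string, so this is a
# character comparison ("123" vs "1000"), not a numeric one.
mixed <- "123" > 1000                       # TRUE
numeric_cmp <- as.numeric("123") > 1000     # FALSE

# Coerce the year to numeric before range-checking it.
Y  <- c("21931", "1840", "2015", "2011", "un")
Yn <- suppressWarnings(as.numeric(Y))       # "un" becomes NA
bad_year <- !is.na(Yn) & (Yn > 2012 | Yn < 1900)
# bad_year is TRUE TRUE TRUE FALSE FALSE; the "un" year is left alone

# An NA that reaches any() with no TRUE present makes the result NA,
# and if (NA) stops with "missing value where TRUE/FALSE needed".
flags <- c(FALSE, FALSE, NA)
any(flags)                  # NA
isTRUE(any(flags))          # FALSE, safe inside if()
any(flags, na.rm = TRUE)    # FALSE

# any(a & b) and any(a, b) are different tests.
a <- c(TRUE, FALSE)
b <- c(FALSE, TRUE)
any(a & b)                  # FALSE: no position where both hold
any(a, b)                   # TRUE: some value somewhere is TRUE
```

Note that the comma form and the & form answer different questions, so which one is wanted depends on whether all three conditions must hold for the same row.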
Why is this happening? If the code correctly handles the
date "06/20/1840" without producing an error, why can't it do
likewise with "05/16/2015"?
The second question is why it's necessary to put "x" on line 15
following "cat("Warning ...)". I know that I don't get any date
columns if I don't include this but am not sure why.
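On the second question, a likely explanation is that an R function returns the value of the last expression it evaluates. Without the trailing "x", the last expression is the if() statement, whose value is NULL whether or not cat() runs. A minimal sketch, with hypothetical function names with_x and without_x:

```r
# Returns the vector because "v" is the last expression evaluated.
with_x <- function(v) {
  if (any(is.na(v))) cat("Warning: missing values\n")
  v
}

# Returns NULL: cat() returns NULL, and an if() whose condition is
# FALSE (with no else branch) also evaluates to NULL.
without_x <- function(v) {
  if (any(is.na(v))) cat("Warning: missing values\n")
}

r1 <- with_x(c(1, NA))      # prints the warning, r1 is c(1, NA)
r2 <- without_x(c(1, NA))   # prints the warning, r2 is NULL
r3 <- without_x(c(1, 2))    # prints nothing, r3 is NULL
```

With NULL coming back from every column, sapply() has nothing to assemble into date columns.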
The third question is whether it's possible to change the class of
the date variables without using a for loop. I played around with
this a little but didn't find a vectorized alternative. It may be
that this is not really important. It's just that I've read in
several places that for loops should be avoided wherever possible.
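On the third question, one possible alternative (a sketch, not tested against the real data) is to use lapply() instead of sapply(). sapply() simplifies its result to a numeric matrix, which drops the "Date" class and is why the loop is needed afterwards; lapply() returns a list whose elements keep their class. The data frame raw below is invented for illustration.

```r
raw <- data.frame(a = c("2009-05-23", "2010-02-30"),
                  b = c("2011-03-17", "2011-04-30"),
                  stringsAsFactors = FALSE)

to_date <- function(col) as.Date(col, format = "%Y-%m-%d")

# sapply() collapses the Date vectors into a plain numeric matrix.
m <- sapply(raw, to_date)

# lapply() keeps each column as a Date, so no class-fixing loop is needed.
converted <- data.frame(lapply(raw, to_date))

# Equivalent in-place form that keeps the data frame's other attributes.
in_place <- raw
in_place[] <- lapply(raw, to_date)
```

A for loop over a handful of columns is not a performance problem in itself; the usual advice applies mainly to loops over many individual elements.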
Thanks,
Paul
##########################################
#### Code for detecting invalid dates ####
##########################################
#### Test Data ####
connection <- textConnection("
1 11/23/21931 05/23/2009 un/17/2011
2 06/20/1840 02/30/2010 03/17/2011
3 06/17/1935 12/20/2008 07/un/2011
4 05/31/1937 01/18/2007 04/30/2011
5 06/31/1933 05/16/2015 11/20/un
")
TestDates <- data.frame(scan(connection,
list(Patient=0, birthDT="", diagnosisDT="", metastaticDT="")))
close(connection)
#### Input Data ####
TDSaved <- TestDates
#### List of Date Variables ####
DateNames <- c("birthDT", "diagnosisDT", "metastaticDT")
#### Date Function ####
fun <- function(Dat){
  f <- function(jj, DF){
    x <- as.character(DF[, jj])
    x <- unlist(strsplit(x, "/"))
    n <- length(x)
    M <- x[seq(1, n, 3)]
    D <- x[seq(2, n, 3)]
    Y <- x[seq(3, n, 3)]
    D[D == "un"] <- "15"
    Y <- ifelse(nchar(Y) > 4 | Y > 2012 | Y < 1900, NA, Y)
    x <- as.Date(paste(Y, M, D, sep="-"), format="%Y-%m-%d")
    if(any(is.na(x) & M != "un" & Y != "un"))
      cat("Warning: Invalid date values in", jj, "\n",
          as.character(DF[is.na(x), jj]), "\n")
    x
  }
  Dat <- data.frame(sapply(names(Dat), function(j) f(j, Dat)))
  for(i in names(Dat)) class(Dat[[i]]) <- "Date"
  Dat
}
#### Output Data ####
TD <- TDSaved
#### Read Dates ####
TD[, DateNames] <- fun(TD[, DateNames])
TD
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD
Heritage Laboratories
West Hartford, CT