Paul,

I have a partial solution for you. It is partial in that I have not quite 
figured out the correct incantation to convert a 5-digit year (e.g. 11/23/21931) 
properly using the R date functions. According to various sources (e.g. man 
strptime and man strftime), as well as the R help for both functions, there are 
extended formats available, but I am having a bout of cerebral flatulence in 
getting them to work correctly, and a search has not been fruitful. Perhaps 
someone else can offer some insights.

That being said, the approach below handles everything except that one 
situation, which arguably IS a valid date a long time in the future and which 
as.Date() would otherwise parse with a truncated year (first four digits only):

> as.Date("11/23/21931", format = "%m/%d/%Y")
[1] "2193-11-23"

Here is one approach:

# Check the date. If as.Date() fails or the input is > 10 characters, return it
checkDate <- function(x) {
  x <- as.character(x)
  x[is.na(as.Date(x, format = "%m/%d/%Y")) | nchar(x) > 10]
}

> lapply(TestDates[, -1], checkDate)
$birthDT
[1] "11/23/21931" "06/31/1933" 

$diagnosisDT
[1] "02/30/2010"

$metastaticDT
[1] "un/17/2011" "07/un/2011" "11/20/un"  
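
If you want something closer to the report layout in your original post, the 
list returned by lapply() can be walked with cat(), along the lines Rui 
suggested. This is only a sketch: the small TestDates data frame below is a 
stand-in I made up so that the example runs on its own, and the loop variable 
names are arbitrary.

```r
# Small stand-in for TestDates so the sketch is self-contained
TestDates <- data.frame(
  Patient     = 1:3,
  birthDT     = c("11/23/21931", "06/20/1940", "06/31/1933"),
  diagnosisDT = c("05/23/2009", "02/30/2010", "12/20/2008"),
  stringsAsFactors = FALSE
)

# Return the entries that as.Date() cannot parse or that exceed 10 characters
checkDate <- function(x) {
  x <- as.character(x)
  x[is.na(as.Date(x, format = "%m/%d/%Y")) | nchar(x) > 10]
}

bad <- lapply(TestDates[, -1], checkDate)

# Print one block per date variable that has at least one flagged value
for (nm in names(bad)) {
  if (length(bad[[nm]]) > 0) {
    cat("Error: Invalid date values in", nm, "\n")
    cat(bad[[nm]], sep = "\n")
    cat("\n")
  }
}
```

Because lapply() keeps the column names, the variable name comes along for 
free and no paste() gymnastics are needed.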


You could fine tune the checkDate() function to handle other formats, etc.
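
As one possible direction for that fine tuning, here is a sketch that takes 
the format and the maximum length as arguments and skips entries containing 
"un", on the assumption that those should count as missing rather than 
invalid. The name checkDate2 and its arguments are hypothetical, and treating 
every "un" entry as missing is my assumption, not something your code does for 
unknown days.

```r
# Hypothetical variant of checkDate(); name and arguments are illustrative only
checkDate2 <- function(x, format = "%m/%d/%Y", max_chars = 10, unknown = "un") {
  x <- as.character(x)
  # Assumption: entries containing the 'unknown' token are missing, not invalid
  is_unknown <- grepl(unknown, x, fixed = TRUE)
  # as.Date() returns NA for impossible dates such as 02/30/2010
  unparsable <- is.na(as.Date(x, format = format))
  # Over-long strings catch years with more than four digits
  too_long <- nchar(x) > max_chars
  x[!is_unknown & (unparsable | too_long)]
}

dates <- c("11/23/21931", "06/20/1940", "06/31/1933", "un/17/2011")
checkDate2(dates)

# The same check against ISO-style input
checkDate2(c("2010-02-30", "2010-02-28"), format = "%Y-%m-%d")
```

The length check is still doing the work for the 5-digit year, so max_chars 
has to be set to match whatever format you pass in.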

HTH,

Marc Schwartz


On Jan 26, 2012, at 9:54 AM, Paul Miller wrote:

> Sorry, sent this earlier but forgot to add an informative subject line. Am 
> resending, in the hopes of getting further replies. My apologies. Hope this 
> is OK.
> 
> Paul
> 
> 
> Hi Rui,
> 
> Thanks for your reply to my post. My code still has various shortcomings but 
> at least now it is fully functional.
> 
> It may be that, as I transition to using R, I'll have to live with some less 
> than ideal code, at least at the outset. I'll just have to write and re-write 
> my code as I improve.
> 
> Appreciate your help.
> 
> Paul
> 
> 
> Message: 66
> Date: Tue, 24 Jan 2012 09:54:57 -0800 (PST)
> From: Rui Barradas <ruipbarra...@sapo.pt>
> To: r-help@r-project.org
> Subject: Re: [R] Checking for invalid dates: Code works but needs
>    improvement
> Message-ID: <1327427697928-4324533.p...@n4.nabble.com>
> Content-Type: text/plain; charset=us-ascii
> 
> Hello,
> 
> Point 3 is very simple: instead of 'print', use 'cat'.
> Unlike 'print', it allows for several arguments and (very) simple formatting.
> 
>  { cat("Error: Invalid date values in", DateNames[[i]], "\n",
>               TestDates[DateNames][[i]][TestDates$Invalid==1], "\n") }
> 
> Rui Barradas
> 
> Message: 53
> Date: Tue, 24 Jan 2012 08:54:49 -0800 (PST)
> From: Paul Miller <pjmiller...@yahoo.com>
> To: r-help@r-project.org
> Subject: [R] Checking for invalid dates: Code works but needs
>    improvement
> Message-ID:
>    <1327424089.1149.yahoomailclas...@web161604.mail.bf1.yahoo.com>
> Content-Type: text/plain; charset=us-ascii
> 
> Hello Everyone,
> 
> Still new to R. Wrote some code that finds and prints invalid dates (see 
> below). This code works but I suspect it's not very good. If someone could 
> show me a better way, I'd greatly appreciate it.
> 
> Here is some information about what I'm trying to accomplish. My sense is 
> that the R date functions are best at identifying invalid dates when fed 
> character data in their default format. So my code converts the input dates 
> to character, breaks them apart using strsplit, and then reformats them. It 
> then identifies which dates are "missing" in the sense that the month or year 
> is unknown and prints out any remaining invalid date values. 
> 
> As I see it, the code has at least 4 shortcomings.
> 
> 1. It's too long. My understanding is that skilled programmers can usually or 
> often complete tasks like this in a few lines.
> 
> 2. It's not vectorized. I started out trying to do something that was 
> vectorized but ran into problems with the strsplit function. I looked at the 
> help file and it appears this function will only accept a single character 
> vector.
> 
> 3. It prints out the incorrect dates but doesn't indicate which date variable 
> they belong to. I tried various things with paste but never came up with 
> anything that worked. Ideally, I'd like to get something that looks roughly 
> like:
> 
> Error: Invalid date values in birthDT
> 
> "21931-11-23" 
> "1933-06-31"
> 
> Error: Invalid date values in diagnosisDT
> 
> "2010-02-30"
> 
> 4. There's no way to specify names for input and output data. I imagine it 
> would be fairly easy to specify these in the arguments to a function but am 
> not sure how to incorporate that into a for loop.
> 
> Thanks,
> 
> Paul  
> 
> ##########################################
> #### Code for detecting invalid dates ####
> ##########################################
> 
> #### Test Data ####
> 
> connection <- textConnection("
> 1 11/23/21931 05/23/2009 un/17/2011
> 2 06/20/1940  02/30/2010 03/17/2011
> 3 06/17/1935  12/20/2008 07/un/2011
> 4 05/31/1937  01/18/2007 04/30/2011
> 5 06/31/1933  05/16/2009 11/20/un
> ")
> 
> TestDates <- data.frame(scan(connection, 
>         list(Patient=0, birthDT="", diagnosisDT="", metastaticDT="")))
> 
> close(connection)
> 
> TestDates
> 
> class(TestDates$birthDT)
> class(TestDates$diagnosisDT)
> class(TestDates$metastaticDT)
> 
> #### List of Date Variables ####
> 
> DateNames <- c("birthDT", "diagnosisDT", "metastaticDT")
> 
> #### Read Dates ####
> 
> for (i in seq_along(DateNames)) {
>   TestDates[DateNames][[i]] <- as.character(TestDates[DateNames][[i]])
>   TestDates$ParsedDT <- strsplit(TestDates[DateNames][[i]], "/")
>   TestDates$Month <- sapply(TestDates$ParsedDT, function(x) x[1])
>   TestDates$Day <- sapply(TestDates$ParsedDT, function(x) x[2])
>   TestDates$Year <- sapply(TestDates$ParsedDT, function(x) x[3])
>   # Impute mid-month when only the day is unknown
>   TestDates$Day[TestDates$Day == "un"] <- "15"
>   TestDates[DateNames][[i]] <- with(TestDates, paste(Year, Month, Day, sep = "-"))
>   # Unknown month or year: treat the date as missing rather than invalid
>   is.na(TestDates[DateNames][[i]][TestDates$Month == "un"]) <- TRUE
>   is.na(TestDates[DateNames][[i]][TestDates$Year == "un"]) <- TRUE
>   TestDates$Date <- as.Date(TestDates[DateNames][[i]], format = "%Y-%m-%d")
>   # Invalid = present in the input but not parseable as a date
>   TestDates$Invalid <- ifelse(is.na(TestDates$Date) &
>                               !is.na(TestDates[DateNames][[i]]), 1, 0)
>   if (sum(TestDates$Invalid) == 0) {
>     TestDates[DateNames][[i]] <- TestDates$Date
>   } else {
>     print(TestDates[DateNames][[i]][TestDates$Invalid == 1])
>   }
>   TestDates <- subset(TestDates, select = -c(ParsedDT, Month, Day, Year, Date,
>                                              Invalid))
> }
> 
> TestDates
> 
> class(TestDates$birthDT)
> class(TestDates$diagnosisDT)
> class(TestDates$metastaticDT)

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
