Re: [R] Converting SAS Data code to R.

David Winsemius Sun, 27 Sep 2009 19:02:27 -0700


On Sep 27, 2009, at 6:01 PM, David Winsemius wrote:

On Sep 27, 2009, at 12:10 PM, David Winsemius wrote:
On Sep 27, 2009, at 11:49 AM, Douglas Bates wrote:
On Sat, Sep 26, 2009 at 11:33 PM, David Winsemius
<dwinsem...@comcast.net> wrote:
I am contemplating bringing in and merging three NHANES-IIIdatasets fromthe National Center for Health Statistics that are fixed formatwith recordlength=3348, line counts around 20,000 and described by SAS DATAsteps. Ihave downloaded and linked similar datasets from the ContinuousNHANESpublic data releases, but never ones with this many variables atonce. Inthe prior effort I managed the task by some cut-paste-editingfrom the SAScode file into a corresponding read.fwf R call, but the earlierNHANES-IIIdata is far more voluminous than the more recent "Continuous"version. I amwondering if anyone has experience with such a process and wouldbe willing
to share some advice? The SAS code can be seen here:
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/adult.sas
The main code file Data step starts out...
 FILENAME ADULT "D:\Questionnaire\DAT\ADULT.DAT" LRECL=3348;
 *** LRECL includes 2 positions for CRLF, assuming use of PC SAS;
 DATA WORK;
   INFILE ADULT MISSOVER;
   LENGTH
     SEQN      7
     DMPFSEQ   5
     DMPSTAT   3
     DMARETHN  3
     DMARACER  3
     DMAETHNR  3
     HSSEX     3
The corresponding positions in the INPUT section are
  INPUT
     SEQN     1-5
     DMPFSEQ  6-10
     DMPSTAT  11
     DMARETHN 12
     DMARACER 13
     DMAETHNR 14
     HSSEX    15
The note about CRLF appears to be implying that those charactersare beingcounted as part of the length of the first variable, SEQN, butthat thereare only 5 meaningful positions. I suppose I can find out bytrial and errorhow to read such files, but it would save me some time if anyonein the
audience has worked through this on this data before.
One thought would be to import the data with the SAS work-alikeprogram,WKS, (which I have not used before) and then to read in withread.xport fromthe foreign library. That would obviate the need to understandthe characterposition issue, but probably has a time commitment to get it upand running
and learn how to use it.
Another thought would be to parse the fixed width SAS Data stepcode intopieces and build a data.frame from which I then extract therow.names,
col.names, and colClasses from that centralized structure.
Are the data available to the public somewhere or could just a few
records be made available?
Yes. Just trim the file name and the CDC ftp server accepts thepath specification:
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/

The file that goes with that SAS code is adult.dat

ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/adult.dat
The reason I ask is because I imagine there are a lot of missingdata
in each record (the data are arranged in the "wide" format for
longitudinal data and includes follow-up questions that will notapply
to most respondents).  The missing data indicator, if any, and the
format of the other fields will be important in deciding how tosplit
the data.
Thanks for that. It was not designed as a longitudinal study, butrather as cross-sectional study that was spaced over several years.They did a re-exam of some sort, but that was not the primarypurpose, nor will it be my particular interest. I have tried todetermine by examination whether "." or " " is the missing valueindicator and it appears that both may used although there are manymore spaces. Most of the input suggests to my 15-year-old memoriesof SAS that the data is numeric but there are 17 variables whereinput spec is "$nn"
> varLines[grep("[[:punct:]]", varLines)]
[1] " HAX11AG $6" " HAX11AH $6" " HAX11AI$6"[4] " HAX11AJ $6" " HAX11AK $6" " HAX11AL$6"[7] " HAX11AM $6" " HAX11AN $6" " HAX11AO$6"[10] " HAX11AP $6" " HAX11AQ $6" "HAX11AR $6"[13] " HAX11AS $6" " HAX11AT $6" "HAX11AU $6"
[16] "        HAX11AV  $6"  "        HAZA1CC  $30"
My progress on this effort so far consists of having figured out howto extract the variable names and their associated lengths so I canset up a call to read.fwf(). This is waht I did on hte section ofthe SAS code following INPUT that contains those elements:
trim.ws <- function(x) gsub("^[[:space:]]+|[[:space:]]+$", "",x)
# courtesy of a Grothendieck r-help posting of a couple or threeyears ago.
adult.var <- data.frame(varnames =sapply( strsplit(trim.ws(varLines) , " +") , "[", 1:2)[1,], varlen=sapply( strsplit(trim.ws(varLines) , " +") , "[", 1:2)[2,])#so that I can split the trimmed strings on an arbitrary number ofspaces.
> adult.var[,][1:5,]
 varnames varlen
1     SEQN      7
2  DMPFSEQ      5
3  DMPSTAT      3
4 DMARETHN      3
5 DMARACER      3

As it turned out the "LENGTH" numbers in the SAS code are not thenumber of characters. I needed to calculate the number of charactersby subtracting the starting and ending positions following the INPUTstatements (pasted them in from my editor) :


varlen.ss <- scan(textConnection("SEQN     1-5
        DMPFSEQ  6-10
        DMPSTAT  11
        DMARETHN 12
snipped hundreds of varialbe names and postion...
       HAZMNK1R 3337-3339
        HAZNOK1R 3340-3341
        HAZMNK5R 3342-3344
        HAZNOK5R 3345-3346") ,what=c("character", "character") )

varlen.ssm <- matrix(varlen.ss, ncol=2, byrow=TRUE)
adult.var$vname2 <- varlen.ssm[ ,1]
adult.var$vlen2 <- varlen.ssm[ ,2]

adult.var$end <- unlist(lapply( strsplit( adult.var[,"vlen2"], "-"),tail, 1))adult.var$start <- unlist(lapply( strsplit( adult.var[,"vlen2"], "-"),head, 1))adult.var$varlen3 <- as.numeric(adult.var$end)-as.numeric(adult.var$start) + 1

> adult.var[grep("\\$", adult.var$varlen),][1:5,]
    varnames varlen
1064  HAX11AG     $6
1069  HAX11AH     $6
1074  HAX11AI     $6
1079  HAX11AJ     $6
1084  HAX11AK     $6
I still have a small number of "varlen" which have the form "$nn"but I suspect that won't be much of a challenge to substitute "" for"$". I think I will first create a column that is "numeric" for allthe rows without "$" and "character" for all the ones with "$".


adult.var$charvar <- NA
 adult.var$charvar[grep("\\$", adult.var$varlen)] <- "character"
adult.var$charvar[-grep("\\$", adult.var$varlen)] <- "numeric"

So the successful read operation then followed:

adult.read <- with( adult.var, read.fwf( "/Users/davidwinsemius/Documents/Mortality/NHANES3/NH3.adult.dat", widths=varlen3,colClasses=charvar, col.names=varnames) )


> str(adult.read)
'data.frame':   20050 obs. of  1238 variables:
 $ SEQN    : num  3 4 9 10 11 19 34 40 44 45 ...
 $ DMPFSEQ : num  3872 4115 4064 5386 8142 ...
 $ DMPSTAT : num  2 2 2 2 2 2 2 2 3 3 ...
 $ DMARETHN: num  3 3 1 1 3 2 2 3 1 2 ...
 $ DMARACER: num  1 1 1 1 1 2 2 1 1 2 ...
 $ DMAETHNR: num  1 1 3 3 1 3 3 1 3 3 ...
 $ HSSEX   : num  1 2 2 1 1 1 2 2 2 2 ...
.....snipped the rest of the variables....

--

David Winsemius, MD
Heritage Laboratories
West Hartford, CT

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Converting SAS Data code to R.

Reply via email to