On Sep 27, 2009, at 11:49 AM, Douglas Bates wrote:
On Sat, Sep 26, 2009 at 11:33 PM, David Winsemius
<dwinsem...@comcast.net> wrote:
I am contemplating bringing in and merging three NHANES-III datasets
from the National Center for Health Statistics that are fixed format
with record length=3348, line counts around 20,000 and described by SAS
DATA steps. I have downloaded and linked similar datasets from the
Continuous NHANES public data releases, but never ones with this many
variables at once. In the prior effort I managed the task by some
cut-paste-editing from the SAS code file into a corresponding read.fwf
R call, but the earlier NHANES-III data is far more voluminous than the
more recent "Continuous" version. I am wondering if anyone has
experience with such a process and would be willing to share some
advice? The SAS code can be seen here:
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/adult.sas
The main code file Data step starts out...
FILENAME ADULT "D:\Questionnaire\DAT\ADULT.DAT" LRECL=3348;
*** LRECL includes 2 positions for CRLF, assuming use of PC SAS;
DATA WORK;
INFILE ADULT MISSOVER;
LENGTH
SEQN 7
DMPFSEQ 5
DMPSTAT 3
DMARETHN 3
DMARACER 3
DMAETHNR 3
HSSEX 3
The corresponding positions in the INPUT section are
INPUT
SEQN 1-5
DMPFSEQ 6-10
DMPSTAT 11
DMARETHN 12
DMARACER 13
DMAETHNR 14
HSSEX 15
The note about CRLF appears to be implying that those characters are
being counted as part of the length of the first variable, SEQN, but
that there are only 5 meaningful positions. I suppose I can find out by
trial and error how to read such files, but it would save me some time
if anyone in the audience has worked through this on this data before.
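One way to settle the record-length question by trial is to compare the
characters per line with the bytes per record on disk. The sketch below
is only an illustration: it writes two made-up 15-character records with
CRLF endings to a temporary file (the records and the width of 15 are
hypothetical, not taken from adult.dat) and shows that each record
occupies 2 more bytes than it has characters. If adult.dat behaves the
same way, readLines should report 3346 characters per line against
LRECL=3348, which would suggest the CRLF is counted at the end of each
record rather than inside SEQN.

```r
# Hypothetical records (not from adult.dat), each 15 characters wide
records <- c("    3    111211", "    4    212322")

# Write them with explicit CRLF line endings, as a PC-formatted file has
tmp <- tempfile(fileext = ".dat")
writeBin(charToRaw(paste0(records, "\r\n", collapse = "")), tmp)

# readLines drops the line terminator, so nchar counts data columns only
line_chars <- nchar(readLines(tmp))

# Bytes per record on disk, terminator included
bytes_per_record <- file.info(tmp)$size / length(records)

print(line_chars)                     # characters per line
print(bytes_per_record - line_chars)  # terminator bytes per record
unlink(tmp)
```

For the real file, the same two quantities can be computed after
downloading adult.dat, reading only the first few lines with the n
argument of readLines.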
One thought would be to import the data with the SAS work-alike
program, WPS (which I have not used before), and then to read it in
with read.xport from the foreign package. That would obviate the need
to understand the character position issue, but would probably take
some time to get up and running and to learn.
Another thought would be to parse the fixed-width SAS DATA step code
into pieces and build a data.frame from which I could then extract the
row.names, col.names, and colClasses.
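The parsing idea can be sketched roughly as follows, using only the
seven INPUT lines quoted above. This is a minimal sketch, not a tested
reader for the full adult.sas: parse_sas_input() is a hypothetical
helper, the assumed line shape (NAME, an optional "$" for character
variables, START, optional -END) has only been checked against those
seven lines, and the two data records are made up rather than taken
from adult.dat.

```r
# INPUT lines as quoted from adult.sas earlier in this message
input_lines <- c("SEQN 1-5", "DMPFSEQ 6-10", "DMPSTAT 11", "DMARETHN 12",
                 "DMARACER 13", "DMAETHNR 14", "HSSEX 15")

# Hypothetical helper: turn column-style INPUT lines into a layout table.
# Assumed shape: NAME, optional "$" (character), START, optional -END
parse_sas_input <- function(lines) {
  pat <- paste0("^([A-Za-z_][A-Za-z0-9_]*)[[:space:]]+",
                "(\\$[[:space:]]*)?([0-9]+)(-([0-9]+))?$")
  lines <- trimws(lines)
  lines <- lines[grepl(pat, lines)]       # drop lines that do not fit
  start <- as.integer(sub(pat, "\\3", lines))
  end_txt <- sub(pat, "\\5", lines)
  end <- start                            # default: single-column field
  has_end <- nzchar(end_txt)
  end[has_end] <- as.integer(end_txt[has_end])
  is_char <- nzchar(sub(pat, "\\2", lines))
  data.frame(name = sub(pat, "\\1", lines), start = start, end = end,
             width = end - start + 1L,
             class = ifelse(is_char, "character", "numeric"),
             stringsAsFactors = FALSE)
}

spec <- parse_sas_input(input_lines)

# read.fwf takes widths, so this sketch requires contiguous fields
stopifnot(spec$start[1] == 1,
          spec$start[-1] == spec$end[-nrow(spec)] + 1L)

# Two made-up records (hypothetical, not from adult.dat)
tmp <- tempfile(fileext = ".dat")
writeLines(c("    3    111211", "    4    212322"), tmp)
dat <- read.fwf(tmp, widths = spec$width, col.names = spec$name,
                colClasses = spec$class)
unlink(tmp)

print(spec)
print(dat)
```

Keeping the layout in one data.frame means the widths, names and classes
all come from the same rows, so they cannot drift out of step. If the
real INPUT section leaves gaps between fields, the gaps would need to be
passed to read.fwf as negative widths, which this sketch does not do.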
Are the data available to the public somewhere or could just a few
records be made available?
Yes. Just trim the file name and the CDC ftp server accepts the path
specification:
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/
The file that goes with that SAS code is adult.dat
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/adult.dat
The reason I ask is because I imagine there are a lot of missing data
in each record (the data are arranged in the "wide" format for
longitudinal data and include follow-up questions that will not apply
to most respondents). The missing data indicator, if any, and the
format of the other fields will be important in deciding how to split
the data.
Thanks for that. It was not designed as a longitudinal study, but
rather as a cross-sectional study that was spaced over several years.
They did a re-exam of some sort, but that was not the primary purpose,
nor will it be my particular interest. I have tried to determine by
examination whether "." or " " is the missing value indicator, and it
appears that both may be used, although there are many more spaces.
Most of the input suggests to my 15-year-old memories of SAS that the
data are numeric, but there are 17 variables where the input spec is
"$nn":
> varLines[grep("[[:punct:]]", varLines)]
 [1] " HAX11AG $6"  " HAX11AH $6"  " HAX11AI $6"
 [4] " HAX11AJ $6"  " HAX11AK $6"  " HAX11AL $6"
 [7] " HAX11AM $6"  " HAX11AN $6"  " HAX11AO $6"
[10] " HAX11AP $6"  " HAX11AQ $6"  " HAX11AR $6"
[13] " HAX11AS $6"  " HAX11AT $6"  " HAX11AU $6"
[16] " HAX11AV $6"  " HAZA1CC $30"
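The question of which missing-value marker is in use can be checked by
tabulating the raw fields before any numeric conversion. The following
is a minimal sketch under stated assumptions: the three records and the
field positions (widths 5, 1 and 3) are hypothetical and not taken from
adult.dat, and both "." and all-blank fields are treated as missing.

```r
# Hypothetical records and layout (not from adult.dat): widths 5, 1, 3
records <- c("    1.  7", "    22   ", "    3   .")
first <- c(1, 6, 7)
last  <- c(5, 6, 9)

# One row per record, one column per field, still as raw text
raw <- t(sapply(records, substring, first, last, USE.NAMES = FALSE))
trimmed <- raw
trimmed[] <- trimws(raw)   # assign into [] so the matrix shape is kept

# How often each candidate missing-value marker occurs in each field
blank <- colSums(trimmed == "")
dot   <- colSums(trimmed == ".")
print(rbind(blank, dot))

# Treat both markers as NA, then convert every column to numeric
trimmed[trimmed %in% c("", ".")] <- NA
dat <- as.data.frame(trimmed, stringsAsFactors = FALSE)
dat[] <- lapply(dat, as.numeric)
print(dat)
```

Since read.fwf passes extra arguments on to read.table, supplying
na.strings there may be a shorter route, but I have not checked how it
treats the blank and "." fields in adult.dat. The character variables
listed above would need to be left out of the numeric conversion.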
--
David Winsemius, MD
Heritage Laboratories
West Hartford, CT