Re: [R] Converting SAS Data code to R.

2009-09-27 Thread David Winsemius


My progress on this effort so far consists of having figured out how
to extract the variable names and their associated lengths so I can
set up a call to read.fwf(). This is what I did on the section of the
SAS code following INPUT that contains those elements:


trim.ws <- function(x) gsub("^[[:space:]]+|[[:space:]]+$", "", x)
# courtesy of a Grothendieck r-help posting of a couple or three years ago.


# Split the trimmed strings on an arbitrary number of spaces, once,
# then take element 1 (name) and element 2 (length) of each piece.
varParts <- strsplit(trim.ws(varLines), " +")
adult.var <- data.frame(varnames = sapply(varParts, "[", 1),
                        varlen   = sapply(varParts, "[", 2))


> adult.var[,][1:5,]
  varnames varlen
1     SEQN      7
2  DMPFSEQ      5
3  DMPSTAT      3
4 DMA
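
A small sketch of one possible next step, assuming the varlen column holds strings such as "7" or "$6" as in the listing of "$" variables. The vector `varlen` below is a made-up stand-in, and `is_char`, `len` and `colClasses` are hypothetical names. The sketch only separates the character flag from the number; whether those numbers can serve as read.fwf() widths is a separate question.

```r
# Hypothetical stand-in for adult.var$varlen (as character strings).
varlen <- c("7", "5", "3", "$6", "$30")

# A leading "$" marks a SAS character variable.
is_char <- grepl("^\\$", varlen)

# Drop the "$" and keep the numeric part of the declaration.
len <- as.integer(sub("^\\$", "", varlen))

# One class per variable, in a form read.fwf()/read.table() accepts.
colClasses <- ifelse(is_char, "character", "numeric")
```

If adult.var was built with factors rather than strings, as.character(adult.var$varlen) would be needed before the grepl() and sub() calls.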

Re: [R] Converting SAS Data code to R.

2009-09-27 Thread David Winsemius


On Sep 27, 2009, at 11:49 AM, Douglas Bates wrote:


> Are the data available to the public somewhere or could just a few
> records be made available?


Yes. Just trim the file name and the CDC ftp server accepts the path  
specification:


ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/

The file that goes with that SAS code is adult.dat

ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/adult.dat



> The reason I ask is because I imagine there are a lot of missing data
> in each record (the data are arranged in the "wide" format for
> longitudinal data and includes follow-up questions that will not apply
> to most respondents).  The missing data indicator, if any, and the
> format of the other fields will be important in deciding how to split
> the data.


Thanks for that. It was not designed as a longitudinal study, but
rather as a cross-sectional study that was spaced over several years.
They did a re-exam of some sort, but that was not the primary purpose,
nor will it be my particular interest. I have tried to determine by
examination whether "." or " " is the missing value indicator, and it
appears that both may be used, although there are many more spaces. Most
of the input suggests to my 15-year-old memories of SAS that the data
is numeric, but there are 17 variables where the input spec is "$nn":


> varLines[grep("[[:punct:]]", varLines)]
 [1] "HAX11AG  $6"  "HAX11AH  $6"  "HAX11AI  $6"
 [4] "HAX11AJ  $6"  "HAX11AK  $6"  "HAX11AL  $6"
 [7] "HAX11AM  $6"  "HAX11AN  $6"  "HAX11AO  $6"
[10] "HAX11AP  $6"  "HAX11AQ  $6"  "HAX11AR  $6"
[13] "HAX11AS  $6"  "HAX11AT  $6"  "HAX11AU  $6"
[16] "HAX11AV  $6"  "HAZA1CC  $30"
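
A hedged sketch of how both missing-value markers and the "$" variables might be handled when reading, assuming read.fwf() is used. The three records and the widths are made up for illustration and are not taken from adult.dat; only the variable names come from the SAS code. read.fwf() passes colClasses, na.strings and strip.white on to read.table(), where blank numeric fields are read as NA and "." can be declared as NA.

```r
# Made-up records: a 5-column id, a 1-column numeric code, a 6-column text field.
sample_records <- c("000031ABC   ",
                    "00007.      ",
                    "00009       ")

dat <- read.fwf(textConnection(sample_records),
                widths      = c(5, 1, 6),
                col.names   = c("SEQN", "HSSEX", "HAX11AG"),
                colClasses  = c("numeric", "numeric", "character"),
                na.strings  = ".",    # "." is treated as missing
                strip.white = TRUE)   # trim padding in character fields

# HSSEX is NA in record 2 (".") and in record 3 (blank).
is.na(dat$HSSEX)
```

Declaring the "$" variables as "character" keeps read.table() from guessing their type from the first few rows.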

--
David Winsemius, MD
Heritage Laboratories
West Hartford, CT

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Converting SAS Data code to R.

2009-09-27 Thread Douglas Bates
On Sat, Sep 26, 2009 at 11:33 PM, David Winsemius
 wrote:
> I am contemplating bringing in and merging three NHANES-III datasets from
> the National Center for Health Statistics that are fixed format with record
> length=3348, line counts around 20,000 and described by SAS DATA steps. I
> have downloaded and linked similar datasets from the Continuous NHANES
> public data releases, but never ones with this many variables at once. In
> the prior effort I managed the task by some cut-paste-editing from the SAS
> code file into a corresponding read.fwf R call, but the earlier NHANES-III
> data is far more voluminous than the more recent "Continuous" version. I am
> wondering if anyone has experience with such a process and would be willing
> to share some advice? The SAS code can be seen here:

> ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/adult.sas

> The main code file Data step starts out...
>    FILENAME ADULT "D:\Questionnaire\DAT\ADULT.DAT" LRECL=3348;
>    *** LRECL includes 2 positions for CRLF, assuming use of PC SAS;
>    DATA WORK;
>      INFILE ADULT MISSOVER;
>      LENGTH
>        SEQN      7
>        DMPFSEQ   5
>        DMPSTAT   3
>        DMARETHN  3
>        DMARACER  3
>        DMAETHNR  3
>        HSSEX     3
> The corresponding positions in the INPUT section are
>     INPUT
>        SEQN     1-5
>        DMPFSEQ  6-10
>        DMPSTAT  11
>        DMARETHN 12
>        DMARACER 13
>        DMAETHNR 14
>        HSSEX    15
> The note about CRLF appears to be implying that those characters are being
> counted as part of the length of the first variable, SEQN, but that there
> are only 5 meaningful positions. I suppose I can find out by trial and error
> how to read such files, but it would save me some time if anyone in the
> audience has worked through this on this data before.
> One thought would be to import the data with the SAS work-alike program,
> WKS, (which I have not used before) and then to read in with read.xport from
> the foreign library. That would obviate the need to understand the character
> position issue, but probably has a time commitment to get it up and running
> and learn how to use it.
> Another thought would be to parse the fixed width SAS Data step code into
> pieces and build a data.frame from which I then extract the row.names,
> col.names, and colClasses from that centralized structure.

Are the data available to the public somewhere or could just a few
records be made available?

The reason I ask is because I imagine there are a lot of missing data
in each record (the data are arranged in the "wide" format for
longitudinal data and includes follow-up questions that will not apply
to most respondents).  The missing data indicator, if any, and the
format of the other fields will be important in deciding how to split
the data.



[R] Converting SAS Data code to R.

2009-09-26 Thread David Winsemius
I am contemplating bringing in and merging three NHANES-III datasets
from the National Center for Health Statistics that are fixed format
with record length=3348, line counts around 20,000 and described by
SAS DATA steps. I have downloaded and linked similar datasets from the
Continuous NHANES public data releases, but never ones with this many
variables at once. In the prior effort I managed the task by some
cut-paste-editing from the SAS code file into a corresponding read.fwf R
call, but the earlier NHANES-III data is far more voluminous than the
more recent "Continuous" version. I am wondering if anyone has
experience with such a process and would be willing to share some
advice? The SAS code can be seen here:


ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/adult.sas

The main code file Data step starts out...
   FILENAME ADULT "D:\Questionnaire\DAT\ADULT.DAT" LRECL=3348;
   *** LRECL includes 2 positions for CRLF, assuming use of PC SAS;
   DATA WORK;
     INFILE ADULT MISSOVER;
     LENGTH
       SEQN      7
       DMPFSEQ   5
       DMPSTAT   3
       DMARETHN  3
       DMARACER  3
       DMAETHNR  3
       HSSEX     3
The corresponding positions in the INPUT section are
    INPUT
       SEQN     1-5
       DMPFSEQ  6-10
       DMPSTAT  11
       DMARETHN 12
       DMARACER 13
       DMAETHNR 14
       HSSEX    15
The note about CRLF appears to be implying that those characters are  
being counted as part of the length of the first variable, SEQN, but  
that there are only 5 meaningful positions. I suppose I can find out  
by trial and error how to read such files, but it would save me some  
time if anyone in the audience has worked through this on this data  
before.
One thought would be to import the data with the SAS work-alike  
program, WKS, (which I have not used before) and then to read in with  
read.xport from the foreign library. That would obviate the need to  
understand the character position issue, but probably has a time  
commitment to get it up and running and learn how to use it.
Another thought would be to parse the fixed width SAS Data step code
into pieces and build a data.frame, a centralized structure from which
I then extract the row.names, col.names, and colClasses.
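
A minimal sketch of that second idea, assuming the INPUT lines have already been pulled out of adult.sas into a character vector. The names `inputLines`, `spec` and `sample_records` are hypothetical, and the two records are made up rather than taken from adult.dat. As far as I know, the numbers in a SAS LENGTH statement are storage sizes in bytes rather than field widths, so the sketch takes the widths from the INPUT column ranges instead.

```r
# Hypothetical excerpt of the INPUT section; the real file has many more lines.
inputLines <- c("SEQN     1-5", "DMPFSEQ  6-10", "DMPSTAT  11",
                "DMARETHN 12", "DMARACER 13", "DMAETHNR 14", "HSSEX    15")

# Split each line into a variable name and a column specification.
parts <- strsplit(trimws(inputLines), " +")
spec  <- data.frame(name = sapply(parts, "[", 1),
                    cols = sapply(parts, "[", 2),
                    stringsAsFactors = FALSE)

# "1-5" gives start 1 and end 5; a single number such as "11" is both.
rng        <- strsplit(spec$cols, "-", fixed = TRUE)
spec$start <- as.integer(sapply(rng, "[", 1))
spec$end   <- as.integer(sapply(rng, function(r) r[length(r)]))
spec$width <- spec$end - spec$start + 1L

# Two made-up 15-character records, standing in for adult.dat.
sample_records <- c("000031234511221", "000071234621112")
adult <- read.fwf(textConnection(sample_records),
                  widths = spec$width, col.names = spec$name)
```

On the full file one would expect sum(spec$width) to come to at most 3346, that is, LRECL less the two CRLF positions mentioned in the SAS comment; checking this before reading seems a cheap way to catch a parsing slip.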


David Winsemius, MD
Heritage Laboratories
West Hartford, CT
