[R] reading fixed width format data with 2 types of lines

2010-08-12 Thread Denis Chabot
Hi,

I know how to read fixed width format data with read.fwf, but suddenly I need 
to read in a large number of old fwf files with 2 types of lines. Lines that 
begin with 3 in the first column carry one set of variables, and lines that 
begin with 4 carry another set, like this:

…
3A00206546L07004901609004599  1015002  001001008010004002004007003   001
3A00206546L07004900609003099  1029001002001001006014002 
3A00206546L07004900229000499  1015001001
3A00206546L070049001692559049033  1015 018036024
3A00206546L07004900229000499  1001   002
4A00176546L06804709001011100060651640015001001501063   065914   
4A00176546L068047090010111000407616 1092   095614   
4A00196546L098000100010111001706214450151062   065914   
4A00176546L068047090010111000505913 1062   065914   
4A00196546L09800010001011100260472140002001000201042   046114   
4A00196546L0980001000101110025042214501200051042   046114   
4A00196546L09800010001011100290372140005001220501032   036214   
…

I have searched for tricks to do this, but I must not have used the right 
keywords: I found nothing.

I suppose I could read the entire file as a single character variable for each 
line, then subset the lines that begin with 3 and save them in an ASCII file 
that will then be reopened with a read.fwf call, and do the same with lines 
that begin with 4. But this appears to me to be neither very elegant nor 
efficient… Is there a better method?

Thanks in advance,


Denis Chabot
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] reading fixed width format data with 2 types of lines

2010-08-12 Thread Tim Gruene
I don't know if it's elegant enough for you, but you could split the file into
two files with 'grep ^3 file > file_3' and 'grep ^4 file > file_4'
and then read them in separately.
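
A minimal sketch of this two-file approach driven from R, assuming a Unix-like system with grep on the PATH. The function name split_and_read and the width arguments are hypothetical; the real widths have to come from the record layouts, which are not given in this thread.

```r
# Split a mixed fixed-width file with grep, then read each part separately.
# Assumes 'grep' is available on the PATH (Unix-like systems).
# widths.3 and widths.4 are placeholders for the real record layouts.
split_and_read <- function(infile, widths.3, widths.4) {
  file.3 <- tempfile(fileext = ".txt")
  file.4 <- tempfile(fileext = ".txt")
  on.exit(unlink(c(file.3, file.4)))

  # Equivalent to: grep '^3' infile > file.3
  system2("grep", c(shQuote("^3"), shQuote(infile)), stdout = file.3)
  # Equivalent to: grep '^4' infile > file.4
  system2("grep", c(shQuote("^4"), shQuote(infile)), stdout = file.4)

  # Read everything as character so leading zeros and blanks are preserved.
  list(
    type3 = read.fwf(file.3, widths = widths.3, colClasses = "character"),
    type4 = read.fwf(file.4, widths = widths.4, colClasses = "character")
  )
}
```

If one of the two line types is absent from a file, the corresponding temporary file is empty and read.fwf will stop with an error, so a check on the file size may be worth adding.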

Tim


-- 
--
Tim Gruene
Institut fuer anorganische Chemie
Tammannstr. 4
D-37077 Goettingen

GPG Key ID = A46BEE1A





Re: [R] reading fixed width format data with 2 types of lines

2010-08-12 Thread Charles C. Berry

On Thu, 12 Aug 2010, Tim Gruene wrote:

> I don't know if it's elegant enough for you, but you could split the file into
> two files with 'grep ^3 file > file_3' and 'grep ^4 file > file_4'
> and then read them in separately.



Along the same lines, but all in R (untested):

original.lines <- readLines( filename )

# 'etc' below stands for widths= and any other read.fwf arguments
tcon.3 <- textConnection( grep( "^3", original.lines, value=TRUE ))
res.3 <- read.fwf( tcon.3, etc )
close(tcon.3)

tcon.4 <- textConnection( grep( "^4", original.lines, value=TRUE ))
res.4 <- read.fwf( tcon.4, etc )
close(tcon.4)

rm( original.lines )
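
For illustration, here is a small self-contained version of the same idea that can be run as is. The toy lines and their widths are invented for the example and are not the layout of the real files; read_by_type is a hypothetical helper name.

```r
# Toy lines standing in for the real file; the layouts here are invented.
toy.lines <- c("3AB12", "4CD345", "3EF67", "4GH890")

# Read only the lines that start with 'prefix' as fixed-width records.
# Extra arguments in '...' are passed on to read.fwf.
read_by_type <- function(lines, prefix, widths, ...) {
  keep <- lines[startsWith(lines, prefix)]
  tcon <- textConnection(keep)
  on.exit(close(tcon))   # textConnection is opened on creation, so close it here
  read.fwf(tcon, widths = widths, ...)
}

res.3 <- read_by_type(toy.lines, "3", widths = c(1, 2, 2))
res.4 <- read_by_type(toy.lines, "4", widths = c(1, 2, 3))
```

Nothing is written to disk by the caller here, which addresses the original worry about saving intermediate ASCII files (read.fwf itself still uses a temporary file internally).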

Or skip the readLines() step and use

tcon.3 <- pipe(paste("grep '^3'", filename))

...

I think you can use 'findstr.exe' on windows in lieu of grep.
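
A rough sketch of the pipe() variant that picks the filter command by platform. The helper name filter_cmd is hypothetical, and the findstr branch is an untested assumption: it relies on findstr's /B switch (match at the beginning of a line) and /L switch (literal search string).

```r
# Build a shell command that prints only the lines starting with 'prefix'.
# On Windows this assumes findstr.exe is on the PATH (untested); elsewhere
# it assumes grep. For grep the prefix is used inside a regular expression,
# so it should not contain regex metacharacters.
filter_cmd <- function(prefix, filename) {
  if (.Platform$OS.type == "windows") {
    paste("findstr /B /L",
          shQuote(prefix, type = "cmd"),
          shQuote(filename, type = "cmd"))
  } else {
    paste("grep", shQuote(paste0("^", prefix)), shQuote(filename))
  }
}

# Usage, with widths.3 as a placeholder for the real layout:
# res.3 <- read.fwf(pipe(filter_cmd("3", filename)), widths = widths.3)
```

read.fwf opens an unopened connection itself and closes it when done, so no explicit close() is needed in this form.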

HTH,

Chuck






Charles C. Berry   (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cbe...@tajo.ucsd.edu   UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
