[R] reading fixed width format data with 2 types of lines
Hi,

I know how to read fixed width format data with read.fwf, but suddenly I need to read in a large number of old fwf files with 2 types of lines. Lines that begin with 3 in first column carry one set of variables, and lines that begin with 4 carry another set, like this:

…
3A00206546L07004901609004599 1015002 001001008010004002004007003 001
3A00206546L07004900609003099 1029001002001001006014002
3A00206546L07004900229000499 1015001001
3A00206546L070049001692559049033 1015 018036024
3A00206546L07004900229000499 1001 002
4A00176546L06804709001011100060651640015001001501063 065914
4A00176546L068047090010111000407616 1092 095614
4A00196546L098000100010111001706214450151062 065914
4A00176546L068047090010111000505913 1062 065914
4A00196546L09800010001011100260472140002001000201042 046114
4A00196546L0980001000101110025042214501200051042 046114
4A00196546L09800010001011100290372140005001220501032 036214
…

I have searched for tricks to do this but I must not have used the right keywords, I found nothing.

I suppose I could read the entire file as a single character variable for each line, then subset for lines that begin with 3 and save this in an ascii file that will then be reopened with a read.fwf call, and do the same with lines that begin with 4. But this does not appear to me to be very elegant nor efficient… Is there a better method?

Thanks in advance,

Denis Chabot

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] reading fixed width format data with 2 types of lines
I don't know if it's elegant enough for you, but you could split the file into two files with 'grep ^3 file > file_3' and 'grep ^4 file > file_4' and then read them in separately.

Tim

On Thu, Aug 12, 2010 at 01:57:19PM -0400, Denis Chabot wrote:
> Hi, I know how to read fixed width format data with read.fwf, but suddenly I need to read in a large number of old fwf files with 2 types of lines. [...]

--
Tim Gruene
Institut fuer anorganische Chemie
Tammannstr. 4
D-37077 Goettingen
GPG Key ID = A46BEE1A
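Where grep is not at hand, the same split into one file per record type can be sketched from within R. This is only an illustration: the input file is a temporary file holding a few records from the original post, and the output names (file_3, file_4 in the temporary directory) are hypothetical.

```r
# Hypothetical input: a few records from the original post, written to a temp file.
infile <- tempfile(fileext = ".txt")
writeLines(c(
  "3A00206546L07004900229000499 1015001001",
  "4A00176546L068047090010111000407616 1092 095614",
  "3A00206546L07004900229000499 1001 002",
  "4A00196546L098000100010111001706214450151062 065914"
), infile)

# Read every record as one string, then group the records by their first character.
all_lines <- readLines(infile)
by_type <- split(all_lines, substr(all_lines, 1, 1))

# Write one file per record type (hypothetical names file_3 and file_4).
out_files <- file.path(tempdir(), paste0("file_", names(by_type)))
for (i in seq_along(by_type)) {
  writeLines(by_type[[i]], out_files[i])
}
```

Each output file can then be passed to read.fwf with the widths that apply to its record type.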
Re: [R] reading fixed width format data with 2 types of lines
On Thu, 12 Aug 2010, Tim Gruene wrote:
> I don't know if it's elegant enough for you, but you could split the file into two files with 'grep ^3 file > file_3' and 'grep ^4 file > file_4' and then read them in separately.

Along the same lines, but all in R (untested):

    # read every record as a single string
    original.lines <- readLines( filename )
    # records starting with 3; 'etc' stands for the widths and other read.fwf arguments
    tcon.3 <- textConnection( grep( "^3", original.lines, value=TRUE ))
    res.3 <- read.fwf( tcon.3, etc )
    close(tcon.3)
    # records starting with 4
    tcon.4 <- textConnection( grep( "^4", original.lines, value=TRUE ))
    res.4 <- read.fwf( tcon.4, etc )
    close(tcon.4)
    rm( original.lines )

Or skip the readLines() step and use

    tcon.3 <- pipe(paste("grep '^3'", filename))
    ...

I think you can use 'findstr.exe' on Windows in lieu of grep.

HTH,

Chuck

Charles C. Berry, (858) 534-2098
Dept of Family/Preventive Medicine, UC San Diego
E mailto:cbe...@tajo.ucsd.edu
http://famprevmed.ucsd.edu/faculty/cberry/
La Jolla, San Diego 92093-0901
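A self-contained sketch of the textConnection approach, using a few records from the original post. The column widths and column names below are assumptions made up for illustration (the thread never gives the real layout), and read_type is a hypothetical helper, not part of R.

```r
# Sample records taken from the original post.
original.lines <- c(
  "3A00206546L07004900609003099 1029001002001001006014002",
  "3A00206546L07004900229000499 1015001001",
  "4A00176546L068047090010111000407616 1092 095614",
  "4A00196546L098000100010111001706214450151062 065914"
)

# ASSUMED layout: record type (1 char), a code (1 char), an 8-character id.
# The real widths must come from the documentation of the files.
widths.3 <- c(1, 1, 8)
widths.4 <- c(1, 1, 8)
col_names <- c("rectype", "code", "id")

# Hypothetical helper: keep the lines of one record type and parse them
# through a text connection, so no intermediate file is written by the user.
read_type <- function(lines, type, widths, ...) {
  keep <- grep(paste0("^", type), lines, value = TRUE)
  con <- textConnection(keep)
  on.exit(close(con))
  # colClasses = "character" preserves leading zeros in the fields
  read.fwf(con, widths = widths, colClasses = "character", ...)
}

res.3 <- read_type(original.lines, "3", widths.3, col.names = col_names)
res.4 <- read_type(original.lines, "4", widths.4, col.names = col_names)
```

For a large number of files, the helper could be applied to readLines() of each file inside lapply(), and the per-type results combined with rbind.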