On 08-Mar-07 jim holtman wrote:
> How do you define a carriage return in the middle of a line
> if a carriage return is also used to delimit a line? One of the
> things you can do is to use 'count.fields' to determine the
> number of fields in each line. For those lines that are not the
> right length, you could combine them together with a 'paste'
> command when you write them out.
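Incidentally, a quick way to actually see (and, if that is all that is wrong, remove) stray CR characters, on Unix/Linux at least -- this is only a sketch, using a made-up one-line sample file:

```shell
# A made-up sample: one record with a stray CR (\r, octal \015) mid-field
printf 'id,name,add\rress\n' > sample.csv

# 'od -c' shows every byte, so any stray \r stands out at once
od -c sample.csv

# 'tr -d' deletes every CR wherever it occurs; harmless if the genuine
# line endings are plain LF, and on a DOS-format file it also turns
# CRLF endings into bare LF
tr -d '\r' < sample.csv > cleaned.csv
cat cleaned.csv
```

If the strays really are bare CRs while the genuine line endings are LF, that one 'tr' pass may be the whole cure. If instead the strays are complete EOLs (so the lines themselves are broken), then something along the lines of Jim's field-counting suggestion is needed.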
For a moment I thought, Walter, that you were simply suffering from repeat carriage-returns (CR) and getting blank lines, but then I saw that you refer to CRs "in the middle of input lines".

The "blank line" situation is quite easily handled by pre-processing the file (before R sees it) if you have the tools available. For example, the file "test.txt"

Line 1
Line 2

Line 3


Line 4
Line 5
Line 6

(with 1 and then 2 extraneous blank lines), after being fed through awk '!/^$/', becomes

awk '!/^$/' < test.txt
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6

since '!/^$/' causes awk to output those lines which do not ("!") match ("/.../") the pattern "beginning of line" ("^") immediately followed by "end of line" ("$") -- i.e. it filters out blank lines.

However, if your problem is extraneous CRs (or, more generally, extraneous End of Line [EOL] markers -- which in DOS/Windows is CRLF [CR followed by Line Feed], in Unix/Linux is LF, and on a Mac is CR), then probably Jim Holtman's suggestion, or something like it, is the approach to take.

But, as he implies, there may be more to your problem than meets the eye. Clearly extraneous EOLs indicate that the path between the original data and the preparation of the CSV file went wrong somewhere, and the extra EOLs are a symptom; but they may not be the only symptom if you look carefully. Once people are careless about preparing data (and it's very easy to be careless when entering data into spreadsheets, for instance), then all sorts of things can go wrong (and in the case of Excel, I've known Excel itself to invent content when the user accidentally strays outside the boundaries of the spreadsheet). Hence Jim's suggestion of counting fields is a useful start.

However, supposing that extraneous EOLs are your only problem, the following strategy may solve it. Assume you have read in the file, and can count the number of fields (NF) in each line as read in. I'm assuming that (as described) you should have 3 fields per line.
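Getting that per-line field count is itself a one-liner. The following is just a sketch, using a made-up x.csv in which record 2 has been split in two by a stray EOL:

```shell
# A made-up 3-field CSV whose second record is split across two lines
printf '1,alice,main st\n2,bob\n,oak ave\n3,carol,elm rd\n' > x.csv

# -F, sets the field separator; NR is the line number, NF the field count
awk -F, '{print NR": "NF}' x.csv
```

Here lines 2 and 3 each show up with only 2 fields, flagging the split record.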
Then loop through the result, line by line:

1. Read a line and count fields (NF)
2. If NF = 0, skip the line
3. If NF > 3, warn "Too many fields in line XXX" and abort
4. If NF = 3, transfer the line to the "good" version
5. If NF < 3, make a temporary copy "tcp" of the line and then
   5.1 Read the next line and count fields (NF1)
   5.2 If NF1 = 0, skip the line
   5.3 If NF + NF1 > 3, then warn and abort as above
   5.4 If NF + NF1 = 3, then paste the line to "tcp" and transfer to "good"
   5.5 If NF + NF1 < 3, paste to "tcp" and then
       5.5.1 Read the next line and count fields (NF2)
       [and so on]

I've put the "warn and abort" cases in there because if "too many fields" occurs then you have indeed got more problems than simply extraneous EOLs, and you have no option but to go into the file by hand, see what is happening, and try to mend it.

Likewise, if you simply count fields for each line, you can see how many instances there are of lines with too few fields. This (if you are lucky) will indicate the magnitude of the task of opening the file and editing out the errors by hand.

Again, this is the sort of thing with problem files which I have usually approached using awk, since (in Linux/Unix at least) you can do

cat funnyfile.csv | awk 'BEGIN{FS=","} {nfields = NF; print nfields}' | sort -u

which (a) sets the Field Separator (FS) to "," (for CSV); (b) counts the number of fields in each line (NF) and outputs it; (c) pipes the output to "sort -u", which removes duplicates and sorts the result. If the file is "clean" you will get only one number in the result -- the number of fields that should be in each line (in your case, 3).
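To actually mend the file, the numbered loop above can be sketched directly in awk. This is only a sketch (assuming comma-separated fields, 3 per good record, and a made-up funnyfile.csv); note that instead of keeping running NF + NF1 totals it re-counts the pasted line, which also copes with a stray EOL that split a non-final field in half (in which case the fragments' counts sum to one more than the true count):

```shell
# A file whose second record was split by a stray EOL, plus a blank line
printf '1,alice,main st\n2,bob\n,oak ave\n\n3,carol,elm rd\n' > funnyfile.csv

awk '
/^$/ { next }                            # steps 2 / 5.2: skip blank lines
{
    acc = (acc == "" ? $0 : acc $0)      # paste the fragment onto the record
    n = split(acc, f, ",")               # re-count fields of the pasted text
    if (n > 3) {                         # steps 3 / 5.3: warn and abort
        print "Too many fields by line " NR ": " acc > "/dev/stderr"
        exit 1
    }
    if (n == 3) { print acc; acc = "" }  # steps 4 / 5.4: record complete
}
END { if (acc != "")                     # a fragment left dangling
          print "Incomplete final record: " acc > "/dev/stderr" }
' funnyfile.csv > good.csv

cat good.csv
```

On the sample above this writes the three mended 3-field records to good.csv, the split second record having been pasted back together.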
If you get more than one number, then you can identify the lines with problems along the lines of

cat funnyfile.csv | awk 'BEGIN{FS=","} {nline = NR; nfields = NF} {if(nfields != 3){print nline " " nfields}}'

which tracks the line count (Number of Records [NR]), counts the number of fields, checks whether this is wrong, and if so outputs the number of the offending input line and the number of fields it has.

One potential problem which would not be identified in this way is the case where fields get omitted altogether and part of the following record gets wrapped into the deficient record, so that line breaks get out of step with the starts of records. (I've known this happen too ...) You can then get the situation where records appear to have the right numbers of fields, but they are the wrong fields.

I find I've been induced to go on at some length, because your problem has reminded me of so many dreadfully mangled data files I've met in the past, and have had to pick through in the kind of way described above. You have my sympathy!

Good luck!
Ted.

> On 3/7/07, Walter R. Paczkowski <[EMAIL PROTECTED]> wrote:
>>
>> Hi,
>> I'm hoping someone has a suggestion for handling a simple problem.
>> A client gave me a comma separated value file (call it x.csv)
>> that has an id and name and address for about 25,000 people
>> (25,000 records).
>> I used read.table to read it, but then discovered that there are
>> stray carriage returns on several records. This plays havoc with
>> read.table since it starts a new input line when it sees the
>> carriage return.
>> In short, the read is all wrong.
>> I thought I could write a simple function to parse a line and
>> write it back out, character by character. If a carriage
>> return is found, it would simply be ignored on the writing
>> back out part. But how do I identify a carriage return? What is
>> the code or symbol? Is there any easier way to rid the file
>> of carriage returns in the middle of the input lines?
>> Any help is appreciated.
>> Walt Paczkowski

--------------------------------------------------------------------
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 08-Mar-07
Time: 10:10:55
------------------------------ XFMail ------------------------------

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.