Re: [R] Drop matching lines from readLines
I guess invert does the trick. For the record, an example:

file <- grep("Repurchase Price", file, fixed = TRUE, invert = TRUE, value = TRUE)

-----Original Message-----
From: Santosh Srinivas [mailto:santosh.srini...@gmail.com]
Sent: 14 October 2010 11:28
To: 'r-help'
Subject: Drop matching lines from readLines

Dear R-group,

I have some noise in my text file (coding issues!) ... I imported a 200 MB
text file using readLines and used grep to find the lines with the error.
What is the easiest way to drop those lines? I plan to write the cleaned
data set back to my base file.

Thanks.
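For completeness, a minimal end-to-end sketch of that approach; the file name and pattern below are placeholders, not from the original post:

## read the raw file, keep only the lines that do NOT contain the offending
## text, then write the cleaned lines back to the base file
lines <- readLines("base_file.txt")
clean <- grep("Repurchase Price", lines, fixed = TRUE, invert = TRUE, value = TRUE)
writeLines(clean, "base_file.txt")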
Re: [R] Drop matching lines from readLines
> From: santosh.srini...@gmail.com
> To: r-help@r-project.org
> Date: Thu, 14 Oct 2010 11:27:57 +0530
> Subject: [R] Drop matching lines from readLines
>
> Dear R-group,
> I have some noise in my text file (coding issues!) ... I imported a 200 MB
> text file using readLines. [...]

Generally for text processing I've been using utilities external to R, although there may be R alternatives that work better for you. You mention grep; I've suggested sed before as a general way to fix formatting problems, and there is also a utility called uniq on Linux or Cygwin. I have gotten into the habit of using these for a variety of data manipulation tasks and only feeding clean data into R.

$ echo -e a bc\\na bc
a bc
a bc

$ echo -e a bc\\na bc | uniq
a bc

$ uniq --help
Usage: uniq [OPTION]... [INPUT [OUTPUT]]
Filter adjacent matching lines from INPUT (or standard input),
writing to OUTPUT (or standard output).

With no options, matching lines are merged to the first occurrence.

Mandatory arguments to long options are mandatory for short options too.
  -c, --count           prefix lines by the number of occurrences
  -d, --repeated        only print duplicate lines
  -D, --all-repeated[=delimit-method]  print all duplicate lines
                        delimit-method={none(default),prepend,separate}
                        Delimiting is done with blank lines
  -f, --skip-fields=N   avoid comparing the first N fields
  -i, --ignore-case     ignore differences in case when comparing
  -s, --skip-chars=N    avoid comparing the first N characters
  -u, --unique          only print unique lines
  -z, --zero-terminated end lines with 0 byte, not newline
  -w, --check-chars=N   compare no more than N characters in lines
      --help            display this help and exit
      --version         output version information and exit

A field is a run of blanks (usually spaces and/or TABs), then non-blank
characters. Fields are skipped before chars.

Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use 'sort -u' without 'uniq'.
Also, comparisons honor the rules specified by 'LC_COLLATE'.
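If you would rather stay inside R, rough in-R equivalents of those utilities look something like this (just a sketch; the pattern and file names are invented):

lines <- readLines("base_file.txt")
lines <- unique(lines)                        # like uniq, but also catches non-adjacent duplicates
lines <- lines[!grepl("bad pattern", lines)]  # like grep -v: drop lines matching the error pattern
lines <- gsub("wrong", "right", lines)        # like sed 's/wrong/right/g': fix text in place
writeLines(lines, "base_file_clean.txt")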
Re: [R] Drop matching lines from readLines
If I understand correctly, the poster knows what regex error pattern to look for, in which case (mod memory capacity -- but 200 MB should not be a problem, I think) is not merely

cleanData <- dirtyData[!grepl("errorPatternregex", dirtyData)]

sufficient?

Cheers,
Bert

On Thu, Oct 14, 2010 at 4:05 AM, Mike Marchywka <marchy...@hotmail.com> wrote:
> Generally for text processing I've been using utilities external to R,
> although there may be R alternatives that work better for you. You mention
> grep; I've suggested sed before as a general way to fix formatting
> problems, and there is also a utility called uniq on Linux or Cygwin. [...]

--
Bert Gunter
Genentech Nonclinical Biostatistics
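A small worked example of that subsetting idea (the data and pattern here are invented for illustration):

dirtyData <- c("good line 1", "b@d l!ne", "good line 2")
cleanData <- dirtyData[!grepl("[@!]", dirtyData)]  # keep only the lines with no match
cleanData
## [1] "good line 1" "good line 2"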
Re: [R] Drop matching lines from readLines
Yes, thanks ... that works.

-----Original Message-----
From: Bert Gunter [mailto:gunter.ber...@gene.com]
Sent: 14 October 2010 21:26
To: Mike Marchywka
Cc: santosh.srini...@gmail.com; r-help@r-project.org
Subject: Re: [R] Drop matching lines from readLines

If I understand correctly, the poster knows what regex error pattern to look for, in which case (mod memory capacity -- but 200 MB should not be a problem, I think) is not merely

cleanData <- dirtyData[!grepl("errorPatternregex", dirtyData)]

sufficient?

Cheers,
Bert
[...]
[R] Drop matching lines from readLines
Dear R-group,

I have some noise in my text file (coding issues!) ... I imported a 200 MB text file using readLines and used grep to find the lines with the error. What is the easiest way to drop those lines? I plan to write the cleaned data set back to my base file.

Thanks.
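Should memory ever become a concern with a file of this size, the same filtering can be done without holding the whole file in memory by streaming it through a connection. A rough sketch, with invented file names, pattern, and chunk size:

## stream the file in chunks, drop lines matching the error pattern,
## and write the surviving lines to a new cleaned file
infile  <- file("base_file.txt", open = "r")
outfile <- file("base_file_clean.txt", open = "w")
repeat {
  chunk <- readLines(infile, n = 100000)  # read up to 100,000 lines at a time
  if (length(chunk) == 0) break           # stop at end of file
  writeLines(chunk[!grepl("bad pattern", chunk)], outfile)  # successive writes continue where the last one left off
}
close(infile)
close(outfile)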