Re: [R] Drop matching lines from readLines

2010-10-14 Thread Santosh Srinivas
I guess invert does the trick.
For the record, an example (note that value = TRUE is needed so grep
returns the lines themselves rather than their indices):

file <- grep("Repurchase Price", file, fixed = TRUE, invert = TRUE, value = TRUE)
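
For completeness, the full round trip would look something like this (the
file name below is a placeholder):

raw <- readLines("base_file.txt")

# drop every line containing the literal marker; fixed = TRUE treats
# the pattern as plain text, value = TRUE returns the lines themselves
clean <- grep("Repurchase Price", raw, fixed = TRUE,
              invert = TRUE, value = TRUE)

# write the cleaned lines back to the base file
writeLines(clean, "base_file.txt")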


-----Original Message-----
From: Santosh Srinivas [mailto:santosh.srini...@gmail.com] 
Sent: 14 October 2010 11:28
To: 'r-help'
Subject: Drop matching lines from readLines

Dear R-group,

I have some noise in my text file (encoding issues!) ... I imported a 200 MB
text file using readLines and used grep to find the lines with the error.

What is the easiest way to drop those lines? I plan to write the
cleaned data set back to my base file.

Thanks.


Re: [R] Drop matching lines from readLines

2010-10-14 Thread Mike Marchywka

> From: santosh.srini...@gmail.com
> To: r-help@r-project.org
> Date: Thu, 14 Oct 2010 11:27:57 +0530
> Subject: [R] Drop matching lines from readLines
>
> Dear R-group,
>
> I have some noise in my text file (encoding issues!) ... I imported a 200 MB
> text file using readLines and used grep to find the lines with the error.
>
> What is the easiest way to drop those lines? I plan to write the
> cleaned data set back to my base file.

Generally for text processing I've been using utilities external to R,
although there may be R alternatives that work better for you. You
mention grep; I've suggested sed as a general way to fix formatting
things, and there is also a utility called uniq on Linux or Cygwin.
I have gotten into the habit of using these for a variety of data
manipulation tasks, and only feed clean data into R.

$ echo -e "a bc\na bc"
a bc
a bc

$ echo -e "a bc\na bc" | uniq
a bc

$ uniq --help
Usage: uniq [OPTION]... [INPUT [OUTPUT]]
Filter adjacent matching lines from INPUT (or standard input),
writing to OUTPUT (or standard output).

With no options, matching lines are merged to the first occurrence.

Mandatory arguments to long options are mandatory for short options too.
  -c, --count           prefix lines by the number of occurrences
  -d, --repeated        only print duplicate lines
  -D, --all-repeated[=delimit-method]  print all duplicate lines
                        delimit-method={none(default),prepend,separate}
                        Delimiting is done with blank lines
  -f, --skip-fields=N   avoid comparing the first N fields
  -i, --ignore-case     ignore differences in case when comparing
  -s, --skip-chars=N    avoid comparing the first N characters
  -u, --unique          only print unique lines
  -z, --zero-terminated end lines with 0 byte, not newline
  -w, --check-chars=N   compare no more than N characters in lines
      --help            display this help and exit
      --version         output version information and exit

A field is a run of blanks (usually spaces and/or TABs), then non-blank
characters.  Fields are skipped before chars.

Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use `sort -u' without `uniq'.
Also, comparisons honor the rules specified by `LC_COLLATE'.
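
If you'd rather stay inside R, a one-liner along these lines reproduces
uniq's adjacent-duplicate filtering (a sketch only, not the full option
set):

x <- c("a bc", "a bc", "d", "a bc")
# keep the first element plus every element that differs from its
# predecessor; like uniq, this only collapses *adjacent* duplicates
x[c(TRUE, x[-1] != x[-length(x)])]
# [1] "a bc" "d"    "a bc"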


Re: [R] Drop matching lines from readLines

2010-10-14 Thread Bert Gunter
If I understand correctly, the poster knows what regex error pattern
to look for, in which case (modulo memory capacity -- but 200 MB should
not be a problem, I think) is not merely

cleanData <- dirtyData[!grepl("errorPatternregex", dirtyData)]

sufficient?
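
For the record, a minimal self-contained illustration (the marker string
and sample lines below are made up):

# hypothetical dirty input: two good lines around one bad one
dirtyData <- c("good line 1", "Repurchase Price -- noise", "good line 2")

# grepl() returns one logical per element; negate it to keep the
# lines that do NOT match the error pattern
cleanData <- dirtyData[!grepl("Repurchase Price", dirtyData)]

cleanData
# [1] "good line 1" "good line 2"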

Cheers,
Bert

On Thu, Oct 14, 2010 at 4:05 AM, Mike Marchywka <marchy...@hotmail.com> wrote:

> Generally for text processing I've been using utilities external to R,
> although there may be R alternatives that work better for you. [...]

-- 
Bert Gunter
Genentech Nonclinical Biostatistics


Re: [R] Drop matching lines from readLines

2010-10-14 Thread Santosh Srinivas
Yes, thanks ... that works.

-----Original Message-----
From: Bert Gunter [mailto:gunter.ber...@gene.com] 
Sent: 14 October 2010 21:26
To: Mike Marchywka
Cc: santosh.srini...@gmail.com; r-help@r-project.org
Subject: Re: [R] Drop matching lines from readLines

If I understand correctly, the poster knows what regex error pattern
to look for, in which case (modulo memory capacity -- but 200 MB should
not be a problem, I think) is not merely

cleanData <- dirtyData[!grepl("errorPatternregex", dirtyData)]

sufficient?

Cheers,
Bert

On Thu, Oct 14, 2010 at 4:05 AM, Mike Marchywka <marchy...@hotmail.com> wrote:

> Generally for text processing I've been using utilities external to R,
> although there may be R alternatives that work better for you. [...]

--
Bert Gunter
Genentech Nonclinical Biostatistics


[R] Drop matching lines from readLines

2010-10-13 Thread Santosh Srinivas
Dear R-group,

I have some noise in my text file (encoding issues!) ... I imported a 200 MB
text file using readLines and used grep to find the lines with the error.

What is the easiest way to drop those lines? I plan to write the
cleaned data set back to my base file.

Thanks.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.