[R] grep triggering error on unicode character

Dennis Fisher Mon, 11 Oct 2010 12:37:20 -0700

Colleagues,

[R 2.11; OS X]


I am processing a file on the fly that contains the following text:
        XXXáá 
[email clients may display this differently -- the string is three X's followed 
by two instances of the letter a with an acute accent]
I read the file with:
        X       <- readLines(FILENAME)
In this instance, the text of interest is on line 213.  When I examine line 
213, it reads:
        XXX\xe1\xe1
This makes sense because the Unicode code point for á [a-acute] is U+00E1, which is also its single-byte Latin-1 encoding (0xE1).
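
For reference, a minimal sketch of how the bytes in such a line could be inspected. It assumes the file is Latin-1 encoded and that the session runs in a UTF-8 locale. `line` is a hypothetical stand-in for X[213], and validUTF8() requires R 3.3.0 or later, so it is not available in R 2.11.

```r
# Hypothetical stand-in for X[213]: "XXX" followed by two 0xE1 bytes.
line <- rawToChar(as.raw(c(0x58, 0x58, 0x58, 0xe1, 0xe1)))

# Show the raw bytes behind the string.
line_bytes <- charToRaw(line)
print(line_bytes)

# A lone 0xE1 byte is not a valid UTF-8 sequence, which would explain why
# the string is rejected in a UTF-8 locale (validUTF8() needs R >= 3.3.0).
is_valid <- validUTF8(line)
print(is_valid)
```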

The problem arises when I attempt to manipulate the text in the file.  For 
example:
        > grep("XXX", X[213])
        integer(0)
        Warning message:
        In grep("XXX", X[213]) : input string 1 is invalid in this locale
Worse yet:
        > tolower(X[213]) 
        Error in tolower(X[213]) : invalid multibyte string 1 

I am focussing on resolving the first problem, i.e., identifying a line 
containing XXX.  If I can do so, I can remove the offending lines before I 
execute the tolower command.
However, I am stumped as to how to resolve either problem.
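
One possible approach, sketched under the assumption that the file is Latin-1 encoded: match on raw bytes with useBytes = TRUE, or convert the text to UTF-8 with iconv() before manipulating it. The variable names here (`line`, `hits`, `fixed`, `lowered`) are hypothetical, and `line` stands in for X[213].

```r
# Hypothetical stand-in for X[213]: "XXX" followed by two 0xE1 bytes.
line <- rawToChar(as.raw(c(0x58, 0x58, 0x58, 0xe1, 0xe1)))

# Byte-wise matching avoids validating the string against the locale,
# so an ASCII pattern such as "XXX" can still be found.
hits <- grep("XXX", line, useBytes = TRUE)

# Re-encode from Latin-1 (an assumption about the file) to UTF-8,
# after which string functions such as tolower() accept the input.
fixed <- iconv(line, from = "latin1", to = "UTF-8")
lowered <- tolower(fixed)
```

If the whole file really is Latin-1, declaring that when reading, e.g. readLines(FILENAME, encoding = "latin1"), may make the per-line conversion unnecessary.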

Any help would be appreciated.

Thanks.

Dennis

Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
