On 11/10/2010 3:36 PM, Dennis Fisher wrote:
Colleagues,
[R 2.11; OS X]
I am processing a file on the fly that contains the following text:
XXXáá
[email clients may display this differently -- the string is three X's followed
by two instances of the letter a with an acute accent]
I read the file with:
X <- readLines(FILENAME)
In this instance, the text of interest is on line 213. When I examine line
213, it reads:
XXX\xe1\xe1
This makes sense because the Unicode code point for á [a-acute] is U+00E1.
That's not what it's saying: it's saying you have three X's followed by
two unrecognized characters, each with hex code E1. I imagine the original
file is encoded in Latin1, because E1 is how á is encoded there.
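To make the byte-level point concrete, here is a minimal sketch that rebuilds the five bytes from the message and compares the Latin-1 and UTF-8 encodings of á. The variable names are illustrative only, and validUTF8() assumes a more recent R than the 2.11 mentioned above (it was added in R 3.3.0).

```r
# Rebuild the bytes seen on line 213: three X's (0x58) and two 0xE1 bytes.
bytes <- as.raw(c(0x58, 0x58, 0x58, 0xe1, 0xe1))
x <- rawToChar(bytes)

# In Latin-1, a-acute is the single byte E1 ...
latin1_a <- iconv("\u00e1", from = "UTF-8", to = "latin1")
latin1_bytes <- charToRaw(latin1_a)

# ... whereas in UTF-8 the same character is the two bytes C3 A1.
utf8_bytes <- charToRaw(enc2utf8("\u00e1"))

print(latin1_bytes)
print(utf8_bytes)

# The raw line is not valid UTF-8, which is why a UTF-8 locale rejects it.
print(validUTF8(x))
```

So U+00E1 and the byte E1 coincide only because Latin-1 maps its bytes directly onto the first 256 Unicode code points; a UTF-8 file would contain C3 A1 instead.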
The problem arises when I attempt to manipulate the text in the file. For
example:
> grep("XXX", X[213])
integer(0)
Warning message:
In grep("XXX", X[213]) : input string 1 is invalid in this locale
Worse yet:
> tolower(X[213])
Error in tolower(X[213]) : invalid multibyte string 1
I am focussing on resolving the first problem, i.e., identifying a line
containing XXX. If I can do so, I can remove the offending lines before I
execute the tolower command.
However, I am stumped as to how to resolve either problem.
Any help would be appreciated.
You need to declare the encoding of the file when you read it if it's
not in the default encoding for your locale, or re-encode it. See
?readLines.
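As a rough sketch of those two options, assuming the file really is Latin-1: the temporary file below is hypothetical test data standing in for FILENAME, and X1 and X2 are illustrative names, not anything from the original session.

```r
# Write a one-line Latin-1 file: "XXX", two 0xE1 bytes, then a newline.
path <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x58, 0x58, 0x58, 0xe1, 0xe1, 0x0a)), path)

# Option 1: declare the encoding on input. This only marks the strings
# as latin1; the bytes themselves are left unchanged.
X1 <- readLines(path, encoding = "latin1")
print(Encoding(X1))

# Option 2: read the bytes as they are, then re-encode to UTF-8.
X2 <- iconv(readLines(path), from = "latin1", to = "UTF-8")

# With a known encoding, pattern matching and case conversion can proceed.
print(grep("XXX", X2))
print(tolower(X2))

unlink(path)
```

Note that the encoding argument of readLines() does not convert anything; it only tells R how to interpret the bytes. If the guess about Latin-1 is wrong, iconv() may return NA for lines it cannot convert, so it is worth checking the result before dropping or editing lines.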
Duncan Murdoch
Thanks.
Dennis
Dennis Fisher MD
P< (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.