Thank you for the answer. I was about to ask why I should avoid text connections, but I just noticed that with a binary connection for the read, the problem disappears (that is, I replace "rt" with "rb" in the file open). R is even clever enough that, when fed the latin1 file after options(encoding="UTF-8") and with no encoding given to readLines, it correctly returns a string with encoding "unknown" and byte 0xff in the raw representation. I would have expected at least a warning, but it seems to silently read the invalid UTF-8 bytes simply as raw bytes.
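To make the workaround concrete, here is a minimal, self-contained sketch of the binary-connection read described above, using the test file from the thread (exact warnings, if any, may depend on the R version and locale):

```r
# Write the latin1 test file (byte 0xFF is latin1 'ÿ') without translation:
f <- file("test.txt", "wb")
writeBin(as.integer(c(65, 13, 10, 66, 255, 67, 13, 10, 68, 13, 10)),
         f, size = 1)
close(f)

# Reading through a binary ("rb") connection bypasses the text-mode
# re-encoding layer, so all three lines should come back intact,
# the second one containing the 0xFF byte:
f <- file("test.txt", "rb")
x <- readLines(f, encoding = "latin1")
close(f)

x
Encoding(x)
```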
Thus the text connection does something more that causes a problem. Maybe it tries to translate characters twice? The problem remains with read.table, which is not surprising: inspecting the source, I see it uses open(file, "rt").

Jean-Claude Arbaut

2016-03-15 21:24 GMT+01:00 Duncan Murdoch <murdoch.dun...@gmail.com>:
> I think you've identified a bug (or more than one) here, but your message
> is so long, I haven't had time to go through it all. I'd suggest that you
> write up a shorter version for the bug list. The shorter version would:
>
> 1. Write the latin1 file using writeBin.
> 2. Set options(encoding = "") and read it without error.
> 3. Set options(encoding = "UTF-8") and get an error even if you
>    explicitly set encoding when reading.
> 4. Set options(encoding = "latin1") and also get an error with or without
>    explicitly setting the encoding.
>
> I would limit the tests to readLines; read.table is much more complicated
> and isn't necessary to illustrate the problem. Bringing it into the
> discussion just confuses things.
>
> You should also avoid bringing text-mode connections into the discussion
> unless they are necessary.
>
> Duncan Murdoch
>
> On 15/03/2016 3:05 PM, Jean-Claude Arbaut wrote:
>>
>> Hello R users,
>>
>> I am having problems reading a CSV file that contains names with the
>> character ÿ. In case it doesn't print correctly, it's Unicode character
>> 00FF, LATIN SMALL LETTER Y WITH DIAERESIS.
>> My computer runs Windows 7 and R 3.2.4.
>>
>> Initially, I configured my computer to run options(encoding="UTF-8")
>> in my .Rprofile, since I prefer this encoding for portability. A good,
>> modern standard, I thought.
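Duncan's four steps can be sketched with readLines only. A hedged sketch, not a definitive test: whether steps 3 and 4 warn, truncate, or raise a real error may depend on the R version and locale (the thread used R 3.2.4), so the risky reads are wrapped in try():

```r
bytes <- as.integer(c(65, 13, 10, 66, 255, 67, 13, 10, 68, 13, 10))

# 1. Write the latin1 file using writeBin.
f <- file("test.txt", "wb")
writeBin(bytes, f, size = 1)
close(f)

# 2. With the native default, the file reads without error.
options(encoding = "")
readLines("test.txt")

# 3. With options(encoding = "UTF-8"), the thread reports truncation
#    even when an encoding is given explicitly.
options(encoding = "UTF-8")
try(readLines("test.txt", encoding = "latin1"))

# 4. Same with options(encoding = "latin1"), with or without encoding=.
options(encoding = "latin1")
try(readLines("test.txt", encoding = "latin1"))
try(readLines("test.txt"))

options(encoding = "")  # restore the default
```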
>> Rather than sending a large file, here is how to reproduce my problem:
>>
>> options(encoding="UTF-8")
>>
>> f <- file("test.txt", "wb")
>> writeBin(as.integer(c(65, 13, 10, 66, 255, 67, 13, 10, 68, 13, 10)),
>>          f, size=1)
>> close(f)
>> read.table("test.txt", encoding="latin1")
>> f <- file("test.txt", "rt")
>> readLines(f, encoding="latin1")
>> close(f)
>>
>> I write a file with three lines, in binary to avoid any translation:
>> A
>> B\xffC
>> D
>>
>> Upon reading I get only:
>>
>> > read.table("test.txt", encoding="latin1")
>>   V1
>> 1  A
>> 2  B
>> Warning messages:
>> 1: In read.table("test.txt", encoding = "latin1") :
>>   invalid input found on input connection 'test.txt'
>> 2: In read.table("test.txt", encoding = "latin1") :
>>   incomplete final line found by readTableHeader on 'test.txt'
>> > readLines(f, encoding="latin1")
>> [1] "A" "B"
>> Warning messages:
>> 1: In readLines(f, encoding = "latin1") :
>>   invalid input found on input connection 'test.txt'
>> 2: In readLines(f, encoding = "latin1") :
>>   incomplete final line found on 'test.txt'
>>
>> Hence the file is truncated. However, \xff is a valid latin1 character,
>> as one can check for instance at
>> https://en.wikipedia.org/wiki/ISO/IEC_8859-1
>> I tried with a UTF-8 version of this file:
>>
>> f <- file("test.txt", "wb")
>> writeBin(as.integer(c(65, 13, 10, 66, 195, 191, 67, 13, 10, 68, 13,
>>          10)), f, size=1)
>> close(f)
>> read.table("test.txt", encoding="UTF-8")
>> f <- file("test.txt", "rt")
>> readLines(f, encoding="UTF-8")
>> close(f)
>>
>> Since the character ÿ is encoded as the two bytes 195, 191 in UTF-8, I
>> would expect to get my complete file. But I don't.
>> Instead, I get:
>>
>> > read.table("test.txt", encoding="UTF-8")
>>   V1
>> 1  A
>> 2  B
>> 3  C
>> 4  D
>> Warning message:
>> In read.table("test.txt", encoding = "UTF-8") :
>>   incomplete final line found by readTableHeader on 'test.txt'
>>
>> > readLines(f, encoding="UTF-8")
>> [1] "A" "B"
>> Warning message:
>> In readLines(f, encoding = "UTF-8") :
>>   incomplete final line found on 'test.txt'
>>
>> I tried all the preceding but with options(encoding="latin1") at the
>> beginning. For the first attempt, with byte 255, I get:
>>
>> > read.table("test.txt", encoding="latin1")
>>   V1
>> 1  A
>> 2  B
>> 3  C
>> 4  D
>> Warning message:
>> In read.table("test.txt", encoding = "latin1") :
>>   incomplete final line found by readTableHeader on 'test.txt'
>> >
>> > f <- file("test.txt", "rt")
>> > readLines(f, encoding="latin1")
>>
>> For the other attempt, with 195, 191:
>>
>> > read.table("test.txt", encoding="UTF-8")
>>    V1
>> 1   A
>> 2 BÿC
>> 3   D
>> >
>> > f <- file("test.txt", "rt")
>> > readLines(f, encoding="UTF-8")
>> [1] "A"   "BÿC" "D"
>> > close(f)
>>
>> Thus the second one does indeed seem to work. Just a check:
>>
>> > a <- read.table("test.txt", encoding="UTF-8")
>> > Encoding(a$V1)
>> [1] "unknown" "UTF-8"   "unknown"
>>
>> At last, I figured out that with the default encoding in R, both
>> attempts work, with or without giving the encoding as a parameter of
>> read.table or readLines.
>> However, I don't understand what happens:
>>
>> f <- file("test.txt", "wb")
>> writeBin(as.integer(c(65, 13, 10, 66, 255, 67, 13, 10, 68, 13, 10)),
>>          f, size=1)
>> close(f)
>> a <- read.table("test.txt", encoding="latin1")$V1
>> Encoding(a)
>> iconv(a[2], toRaw=T)
>> a
>> a <- read.table("test.txt")$V1
>> Encoding(a)
>> iconv(a[2], toRaw=T)
>> a
>>
>> This will yield:
>>
>> > a <- read.table("test.txt", encoding="latin1")$V1
>> > Encoding(a)
>> [1] "unknown" "latin1"  "unknown"
>> > iconv(a[2], toRaw=T)
>> [[1]]
>> [1] 42 ff 43
>> > a
>> [1] "A"   "BÿC" "D"
>> >
>> > a <- read.table("test.txt")$V1
>> > Encoding(a)
>> [1] "unknown" "unknown" "unknown"
>> > iconv(a[2], toRaw=T)
>> [[1]]
>> [1] 42 ff 43
>> > a
>> [1] "A"   "BÿC" "D"
>>
>> The second line is correctly encoded, but the encoding is just not
>> "marked" in one case.
>> With the UTF-8 bytes:
>>
>> f <- file("test.txt", "wb")
>> writeBin(as.integer(c(65, 13, 10, 66, 195, 191, 67, 13, 10, 68, 13,
>>          10)), f, size=1)
>> close(f)
>> a <- read.table("test.txt", encoding="UTF-8")$V1
>> Encoding(a)
>> iconv(a[2], toRaw=T)
>> a
>> a <- read.table("test.txt")$V1
>> Encoding(a)
>> iconv(a[2], toRaw=T)
>> a
>>
>> This will yield:
>>
>> > a <- read.table("test.txt", encoding="UTF-8")$V1
>> > Encoding(a)
>> [1] "unknown" "UTF-8"   "unknown"
>> > iconv(a[2], toRaw=T)
>> [[1]]
>> [1] 42 c3 bf 43
>> > a
>> [1] "A"   "BÿC" "D"
>> > a <- read.table("test.txt")$V1
>> > Encoding(a)
>> [1] "unknown" "unknown" "unknown"
>> > iconv(a[2], toRaw=T)
>> [[1]]
>> [1] 42 c3 bf 43
>> > a
>> [1] "A"   "BÿC" "D"
>>
>> Both are correctly read (the raw bytes are ok), but the second one
>> doesn't print correctly because the encoding is not "marked".
>>
>> My thoughts:
>> With options(encoding="native.enc"), the characters read are not
>> translated, and are read as raw bytes, which can get an encoding mark
>> to print correctly (otherwise they print as native, that is mostly
>> latin1).
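The "marking" distinction above can be seen without any file I/O. A small sketch: the same three bytes, first unmarked, then marked as latin1 (how the final value prints depends on the locale, so only the encoding mark and byte-level conversion are shown):

```r
# Build the string from the raw bytes 42 ff 43 ("B", 0xFF, "C"):
x <- rawToChar(as.raw(c(0x42, 0xff, 0x43)))
Encoding(x)              # "unknown": R has not been told what the bytes mean

Encoding(x) <- "latin1"  # same bytes, now carrying an encoding mark
Encoding(x)              # "latin1"
nchar(x)                 # 3 characters once R knows how to interpret them
charToRaw(enc2utf8(x))   # converting to UTF-8 yields bytes 42 c3 bf 43
```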
>> With options(encoding="latin1"), and reading the UTF-8 file, I guess
>> it's mostly like the preceding: the characters are read as raw, and
>> marked as UTF-8, which works.
>> With options(encoding="latin1"), and reading the latin1 file (with the
>> 0xFF byte), I don't understand what happens. The file gets truncated
>> almost as if 0xFF were an EOF character - which is perplexing, since I
>> think that in C, 0xFF is sometimes (wrongly) confused with EOF.
>> And with options(encoding="UTF-8"), I am not sure what happens.
>>
>> Questions:
>> * What's wrong with options(encoding="latin1")?
>> * Is it unsafe to use another options(encoding) than the default
>>   native.enc on Windows?
>> * Is it safe to assume that with native.enc R reads raw characters
>>   and, only when requested, marks an encoding afterwards? (That is, I
>>   get "unknown" by default, which is printed as latin1 on Windows, and
>>   if I enforce another encoding, it will be used whatever the bytes
>>   really are.)
>> * What really happens with another options(encoding), especially UTF-8?
>> * If I save a character variable to an Rdata file, is the file usable
>>   on another OS, or on the same OS with another default encoding (by
>>   changing options())? Does it depend on whether the character string
>>   has an "unknown" encoding or an explicit one?
>> * Is there a way (preferably an options()) to tell R to read text
>>   files as UTF-8 by default? Would it work with any of read.table(),
>>   readLines(), or even source()? I thought options(encoding="UTF-8")
>>   would do, but it fails on the examples above.
>>
>> Best regards,
>>
>> Jean-Claude Arbaut
>>
>> ______________________________________________
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
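One avenue the thread itself does not try, added here only as a hedged suggestion: instead of the global options(encoding=), declare the encoding on the connection, via the encoding argument of file(). That argument states the encoding of the input, and R re-encodes as it reads; whether it also avoids the truncation on the Windows setup above would need testing there.

```r
# Write the UTF-8 version of the test file (c3 bf is UTF-8 'ÿ'):
f <- file("test.txt", "wb")
writeBin(as.integer(c(65, 13, 10, 66, 195, 191, 67, 13, 10, 68, 13, 10)),
         f, size = 1)
close(f)

# Declare the input encoding per connection rather than globally:
con <- file("test.txt", encoding = "UTF-8")
x <- readLines(con)
close(con)

x   # expect the three lines A, BÿC, D
```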