Hi Duncan,

I've put an example file online at
https://docs.google.com/file/d/0B73Ve8vxnjR6QnRESXBQTHRUME0/edit?usp=sharing
with a screenshot showing the expected contents of the file at
https://docs.google.com/file/d/0B73Ve8vxnjR6b1ZSQmtsRXdadVU/edit?usp=sharing

Hopefully you'll find this easy and the rest of us can feel dumb for
not having figured it out...

Thanks,
Ista

On Mon, Sep 16, 2013 at 1:39 PM, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:
> On 16/09/2013 12:04 PM, Maxim Linchits wrote:
>>
>> Here is that old post:
>>
>> http://r.789695.n4.nabble.com/read-csv-and-FileEncoding-in-Windows-version-of-R-2-13-0-td3567177.html
>
> In that post, you'll see I asked for a sample file. I never received any
> reply; presumably some spam filter didn't like what Alexander sent me, and
> Nabble doesn't archive attachments.
>
> Similarly, the Stack Overflow thread contains no sample data.
>
> Could someone who is having this problem please put a small sample online
> for download? As I told Alexander last time, my experiments with files I
> constructed myself showed no errors.
>
> Duncan Murdoch
>
>>
>> A taste: "Again, the issue is that opening this UTF-8 encoded file
>> under R 2.13.0 yields an error, but opening it under R 2.12.2 works
>> without any issues. (...)"
>>
>> On Mon, Sep 16, 2013 at 6:38 PM, Milan Bouchet-Valat <nalimi...@club.fr> wrote:
>> > On Monday 16 September 2013 at 10:40 +0200, Milan Bouchet-Valat wrote:
>> >> On Friday 13 September 2013 at 23:38 +0400, Maxim Linchits wrote:
>> >> > This is a condensed version of the same question on Stack Overflow
>> >> > here:
>> >> >
>> >> > http://stackoverflow.com/questions/18789330/r-on-windows-character-encoding-hell
>> >> > If you've already stumbled upon it feel free to ignore.
>> >> >
>> >> > My problem is that R on US Windows does not read *any* text file that
>> >> > contains *any* foreign characters.
>> >> > It simply reads the first n consecutive ASCII characters and then
>> >> > throws a warning once it reaches a foreign character:
>> >> >
>> >> > > test <- read.table("test.txt", sep=";", dec=",", quote="",
>> >> >                      fileEncoding="UTF-8")
>> >> > Warning messages:
>> >> > 1: In read.table("test.txt", sep = ";", dec = ",", quote = "",
>> >> >    fileEncoding = "UTF-8") :
>> >> >   invalid input found on input connection 'test.txt'
>> >> > 2: In read.table("test.txt", sep = ";", dec = ",", quote = "",
>> >> >    fileEncoding = "UTF-8") :
>> >> >   incomplete final line found by readTableHeader on 'test.txt'
>> >> > > print(test)
>> >> >        V1
>> >> > 1 english
>> >> >
>> >> > > Sys.getlocale()
>> >> > [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
>> >> >
>> >> >
>> >> > It is important to note that R on Linux will read UTF-8 as well as
>> >> > exotic character sets without a problem. I've tried it with the exact
>> >> > same files (one was UTF-8 and another was OEM866 Cyrillic).
>> >> >
>> >> > If I do not include the fileEncoding parameter, read.table will read
>> >> > the whole CSV file. But naturally it will read it wrong because it
>> >> > does not know the encoding. Whenever I do specify the fileEncoding, R
>> >> > throws the warnings and stops once it reaches a foreign character.
>> >> > It's the same story with all international character encodings.
>> >> > Other users on Stack Overflow have reported exactly the same issue.
>> >> >
>> >> >
>> >> > Is anyone here who is on a US version of Windows able to import files
>> >> > with foreign characters? Please let me know.
>> >> A reproducible example would have helped, as requested by the posting
>> >> guide.
>> >>
>> >> That said, I am experiencing the same problem after saving the data
>> >> below to a CSV file encoded in UTF-8 (you can do this even with
>> >> Notepad):
>> >> "Ա","Բ"
>> >> 1,10
>> >> 2,20
>> >>
>> >> This is on a Windows 7 box using a French locale, but with the same
>> >> codepage 1252 as yours. What is interesting is that reading the file using
>> >> readLines(file("myFile.csv", encoding="UTF-8"))
>> >> gives no invalid characters. So there must be a bug in read.table().
>> >>
>> >>
>> >> But I must note I do not experience issues with French accented
>> >> characters like "é" ("\Ue9"). By contrast, reading Armenian
>> >> characters like "Ա" ("\U531") gives weird results: the character
>> >> appears as <U+0531> instead of Ա.
>> >>
>> >> Self-contained example, writing the file and reading it back from R:
>> >> tmpfile <- tempfile()
>> >> writeLines("\U531", file(tmpfile, "w", encoding="UTF-8"))
>> >> readLines(file(tmpfile, encoding="UTF-8"))
>> >> # "<U+0531>"
>> >>
>> >> The same phenomenon happens when creating a data frame from this
>> >> character (as noted on Stack Overflow):
>> >> data.frame("\U531")
>> >>
>> >> So my conclusion is that maybe Windows does not really support Unicode
>> >> characters that are not "relevant" for your current locale, and that
>> >> this may have created bugs in the way R handles them in read.table().
>> >> R developers can probably tell us more about it.
>> > After some more investigation, one part of the problem can be traced
>> > back to scan() (with myFile.csv filled as described above):
>> > scan("myFile.csv", encoding="UTF-8", sep=",", nlines=1)
>> > # Read 2 items
>> > # [1] "Ա" "Բ"
>> >
>> > Equivalent, but nonsensical to me:
>> > scan("myFile.csv", fileEncoding="CP1252", encoding="UTF-8", sep=",",
>> >      nlines=1)
>> > # Read 2 items
>> > # [1] "Ա" "Բ"
>> >
>> > scan("myFile.csv", fileEncoding="UTF-8", sep=",", nlines=1)
>> > # Read 0 items
>> > # character(0)
>> > # Warning message:
>> > # In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
>> > #   invalid input found on input connection 'myFile.csv'
>> >
>> >
>> > So there seems to be one part of the issue in scan(), which for some
>> > reason does not work when passed fileEncoding="UTF-8"; and another part
>> > in read.table(), which transforms "Ա" ("\U531") into "X.U.0531.",
>> > probably via make.names(), since:
>> > make.names("\U531")
>> > # "X.U.0531."
>> >
>> >
>> > Does this make sense to R-core members?
>> >
>> >
>> > Regards
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
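
The two symptoms reported in the thread (scan() failing when given
fileEncoding="UTF-8", and read.table() apparently passing the header
through make.names()) suggest one possible experiment: read the lines
without any re-encoding, mark them as UTF-8, split the header by hand,
and let read.table() parse only the ASCII body. The sketch below is an
illustration under assumptions, not a confirmed fix. The temporary file,
the variable names, and the unquoted two-column header are made up to
mirror Milan's example, and it has not been checked on the affected
Windows builds.

```r
# Recreate a two-column example as raw UTF-8 bytes, so the file
# contents do not depend on the current locale.
csv_text <- "\u0531,\u0532\n1,10\n2,20\n"
tmp <- tempfile(fileext = ".csv")   # hypothetical stand-in for myFile.csv
writeBin(charToRaw(csv_text), tmp)

# Read the lines without re-encoding. encoding = "UTF-8" only marks the
# resulting strings as UTF-8; it does not convert them.
con <- file(tmp, open = "rb")
lines <- readLines(con, encoding = "UTF-8")
close(con)

# Split the header by hand so it never goes through make.names().
header <- strsplit(lines[1], ",", fixed = TRUE)[[1]]

# The body is plain ASCII here, so read.table() needs no fileEncoding.
dat <- read.table(text = lines[-1], sep = ",", header = FALSE)
names(dat) <- header

# Inspect code points rather than printed glyphs, since what gets
# printed depends on what the console can display.
utf8ToInt(names(dat)[1])
```

Checking code points with utf8ToInt() separates "the string is stored
correctly" from "the console cannot display it", which is the
distinction the <U+0531> output above leaves open. If the body itself
contained non-ASCII text, this approach would need more care, because
read.table(text = ...) may still convert to the native encoding.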