Re: [R] Non-ASCII characters in R on Windows
On 13-09-17 9:01 AM, Duncan Murdoch wrote:
> On 13-09-17 8:15 AM, Milan Bouchet-Valat wrote:
>> On Monday 16 September 2013 at 20:04 +0400, Maxim Linchits wrote:
>>> Here is that old post:
>>> http://r.789695.n4.nabble.com/read-csv-and-FileEncoding-in-Windows-version-of-R-2-13-0-td3567177.html
>>> A taste: "Again, the issue is that opening this UTF-8 encoded file under R 2.13.0 yields an error, but opening it under R 2.12.2 works without any issues." (...)
>> I have tried R 2.12.2, both 32- and 64-bit, on Windows Server 2008 with the French (CP1252) locale, and I still get an error with the test case I provided in previous messages. So it does not sound like it is the same issue.
> I can reproduce the error with a file sent to me by Maxim. From a quick look, I suspect that changes will be needed to read.table to handle this, and they'll be large enough that they won't make it into 3.0.2, but hopefully they will go into R-patched after the release.

This took a lot longer than expected, but some changes are now in R-devel (as of r64831). Files that can't be displayed in the local encoding will not necessarily display well, but they should be stored properly. Please let me know of any problems. I think the display issues can be improved; I'm not sure they can be solved.

Duncan Murdoch

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
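Duncan's distinction between how a string is *stored* and how it is *displayed* can be checked directly. A minimal sketch (not from the thread, an editor's illustration): a character entered with a `\U` escape is marked and stored as UTF-8 even when the console cannot render it.

```r
# The stored bytes are what matters, not what the console shows.
x <- "\U0531"          # Armenian capital letter Ayb
enc <- Encoding(x)     # how R has marked the string
bytes <- charToRaw(x)  # the underlying bytes: U+0531 is 0xd4 0xb1 in UTF-8
print(enc)
print(bytes)
```

Even when the console renders x as <U+0531>, the underlying bytes are intact, which is what "stored properly" means here.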
Re: [R] Non-ASCII characters in R on Windows
On 13-09-19 5:06 AM, Maxim Linchits wrote:
> Have any of the thread participants sent a bug report to R? If not, let me know if you intend to do so. Otherwise, I'll send a report myself.

There's no bug, as far as I know. The issue is that various functions (by design) convert strings to the local encoding; in the example you were trying, the local encoding can't represent all the characters, so they are shown using hex codes, and things get messed up. I'm currently looking into changing the design so that UTF-8 is used more internally. This is likely to have side effects, which need to be investigated carefully.

Duncan Murdoch

> thanks
>
> On Tue, Sep 17, 2013 at 5:01 PM, Duncan Murdoch murdoch.dun...@gmail.com wrote:
>> (...) I can reproduce the error with a file sent to me by Maxim. From a quick look, I suspect that changes will be needed to read.table to handle this, and they'll be large enough that they won't make it into 3.0.2, but hopefully they will go into R-patched after the release.
>> Duncan Murdoch
Re: [R] Non-ASCII characters in R on Windows
On Monday 16 September 2013 at 20:04 +0400, Maxim Linchits wrote:
> Here is that old post:
> http://r.789695.n4.nabble.com/read-csv-and-FileEncoding-in-Windows-version-of-R-2-13-0-td3567177.html
> A taste: "Again, the issue is that opening this UTF-8 encoded file under R 2.13.0 yields an error, but opening it under R 2.12.2 works without any issues." (...)

I have tried R 2.12.2, both 32- and 64-bit, on Windows Server 2008 with the French (CP1252) locale, and I still get an error with the test case I provided in previous messages. So it does not sound like it is the same issue.

Regards

> On Mon, Sep 16, 2013 at 6:38 PM, Milan Bouchet-Valat nalimi...@club.fr wrote:
>> (...)
Re: [R] Non-ASCII characters in R on Windows
On 13-09-17 8:15 AM, Milan Bouchet-Valat wrote:
> On Monday 16 September 2013 at 20:04 +0400, Maxim Linchits wrote:
>> Here is that old post:
>> http://r.789695.n4.nabble.com/read-csv-and-FileEncoding-in-Windows-version-of-R-2-13-0-td3567177.html
>> A taste: "Again, the issue is that opening this UTF-8 encoded file under R 2.13.0 yields an error, but opening it under R 2.12.2 works without any issues." (...)
> I have tried R 2.12.2, both 32- and 64-bit, on Windows Server 2008 with the French (CP1252) locale, and I still get an error with the test case I provided in previous messages. So it does not sound like it is the same issue.

I can reproduce the error with a file sent to me by Maxim. From a quick look, I suspect that changes will be needed to read.table to handle this, and they'll be large enough that they won't make it into 3.0.2, but hopefully they will go into R-patched after the release.

Duncan Murdoch
Re: [R] Non-ASCII characters in R on Windows
On Friday 13 September 2013 at 23:38 +0400, Maxim Linchits wrote:
> This is a condensed version of the same question on StackExchange:
> http://stackoverflow.com/questions/18789330/r-on-windows-character-encoding-hell
> If you've already stumbled upon it, feel free to ignore this.
>
> My problem is that R on US Windows does not read *any* text file that contains *any* foreign characters. It simply reads the first n consecutive ASCII characters and then throws a warning once it reaches a foreign character:
>
> test <- read.table("test.txt", sep=";", dec=",", quote="", fileEncoding="UTF-8")
> Warning messages:
> 1: In read.table("test.txt", sep = ";", dec = ",", quote = "", fileEncoding = "UTF-8") :
>   invalid input found on input connection 'test.txt'
> 2: In read.table("test.txt", sep = ";", dec = ",", quote = "", fileEncoding = "UTF-8") :
>   incomplete final line found by readTableHeader on 'test.txt'
> print(test)
>        V1
> 1 english
> Sys.getlocale()
> [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
>
> It is important to note that R on Linux reads UTF-8 as well as exotic character sets without a problem. I've tried it with the exact same files (one was UTF-8 and another was OEM866 Cyrillic). If I do not include the fileEncoding parameter, read.table will read the whole CSV file, but naturally it will read it wrong because it does not know the encoding. Whenever I try to specify fileEncoding, R throws the warnings and stops once it reaches a foreign character. It's the same story with all international character encodings. Other users on StackExchange have reported exactly the same issue. Is anyone here who is on a US version of Windows able to import files with foreign characters? Please let me know.

A reproducible example would have helped, as requested by the posting guide. That said, I am also experiencing the same problem after saving the data below to a CSV file encoded in UTF-8 (you can do this even with Notepad):

Ա,Բ
1,10
2,20

This is on a Windows 7 box using the French locale, but the same codepage 1252 as yours. What is interesting is that reading the file using readLines(file("myFile.csv", encoding="UTF-8")) gives no invalid characters. So there must be a bug in read.table(). But I must note that I do not experience issues with French accented characters like é (\Ue9). On the contrary, reading Armenian characters like Ա (\U531) gives weird results: the character appears as <U+0531> instead of Ա. Self-contained example, writing the file and reading it back from R:

tmpfile <- tempfile()
writeLines("\U531", file(tmpfile, "w", encoding="UTF-8"))
readLines(file(tmpfile, encoding="UTF-8"))
# [1] "<U+0531>"

The same phenomenon happens when creating a data frame from this character (as noted on StackExchange):

data.frame("\U531")

So my conclusion is that maybe Windows does not really support Unicode characters that are not relevant for your current locale, and that may have created bugs in the way R handles them in read.table(). R developers can probably tell us more about it.

Regards
Re: [R] Non-ASCII characters in R on Windows
On Monday 16 September 2013 at 10:40 +0200, Milan Bouchet-Valat wrote:
> On Friday 13 September 2013 at 23:38 +0400, Maxim Linchits wrote:
>> This is a condensed version of the same question on StackExchange:
>> http://stackoverflow.com/questions/18789330/r-on-windows-character-encoding-hell
>> (...)
> A reproducible example would have helped, as requested by the posting guide. That said, I am also experiencing the same problem after saving the data below to a CSV file encoded in UTF-8. (...) So my conclusion is that maybe Windows does not really support Unicode characters that are not relevant for your current locale, and that may have created bugs in the way R handles them in read.table(). R developers can probably tell us more about it.

After some more investigation, one part of the problem can be traced back to scan() (with myFile.csv filled as described above):

scan("myFile.csv", what="character", encoding="UTF-8", sep=",", nlines=1)
# Read 2 items
# [1] "Ա" "Բ"

Equivalent, but nonsensical to me:

scan("myFile.csv", what="character", fileEncoding="CP1252", encoding="UTF-8", sep=",", nlines=1)
# Read 2 items
# [1] "Ա" "Բ"

scan("myFile.csv", what="character", fileEncoding="UTF-8", sep=",", nlines=1)
# Read 0 items
# character(0)
# Warning message:
# In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
#   invalid input found on input connection 'myFile.csv'

So there seems to be one part of the issue in scan(), which for some reason does not work when passed fileEncoding="UTF-8"; and another part in read.table(), which transforms Ա (\U531) into "X.U.0531.", probably via make.names(), since:

make.names("\U531")
# [1] "X.U.0531."

Does this make sense to R-core members?

Regards
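Milan's make.names() diagnosis suggests a partial workaround for the column-name mangling. A hedged sketch (editor's addition, not a fix from the thread; check.names is a documented read.table argument, and the result of make.names() on non-ASCII input is locale-dependent):

```r
# make.names() rewrites names it considers syntactically invalid; in a
# Windows CP1252 session this turned the Armenian header into "X.U.0531.",
# while in a UTF-8 locale the character may pass through unchanged.
nm <- make.names("\U0531")
print(nm)

# Passing check.names = FALSE to read.table() skips make.names() entirely,
# so the original (non-ASCII) header is kept as-is, assuming the file
# contents themselves can be read:
# read.table("myFile.csv", header = TRUE, sep = ",",
#            fileEncoding = "UTF-8", check.names = FALSE)
```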
Re: [R] Non-ASCII characters in R on Windows
UTF-8 on Windows is a huge pain; this bites me often. Usually I give up and do the analysis on a Linux server. In previous struggles with this I've found this blog post enlightening: https://tomizonor.wordpress.com/2013/04/17/file-utf8-windows/

Best,
Ista

On Mon, Sep 16, 2013 at 10:38 AM, Milan Bouchet-Valat nalimi...@club.fr wrote:
> (...) After some more investigation, one part of the problem can be traced back to scan(), which for some reason does not work when passed fileEncoding="UTF-8"; and another part in read.table(), which transforms Ա (\U531) into "X.U.0531.", probably via make.names(). Does this make sense to R-core members?
> Regards
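Milan's observation that readLines() handles the file correctly where scan()'s fileEncoding path fails suggests a practical workaround: read the raw lines with the connection-level encoding and split the fields yourself. A self-contained sketch (an editor's assumption, not a fix proposed in the thread):

```r
# Write a one-line UTF-8 "CSV" and read it back without read.table().
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, "w", encoding = "UTF-8")
writeLines("\U0531,\U0532", con)   # Armenian Ayb and Ben, comma-separated
close(con)

# encoding= here marks the input as UTF-8 rather than re-encoding it,
# which is exactly the path the thread found to work.
lines <- readLines(tmp, encoding = "UTF-8")
fields <- strsplit(lines, ",", fixed = TRUE)[[1]]
print(fields)
```

The resulting character vector can then be assembled into a data frame by hand, sidestepping the scan()/make.names() steps discussed above.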
Re: [R] Non-ASCII characters in R on Windows
Hi Duncan,

I've put an example file online at https://docs.google.com/file/d/0B73Ve8vxnjR6QnRESXBQTHRUME0/edit?usp=sharing, with a screenshot showing the expected contents of the file at https://docs.google.com/file/d/0B73Ve8vxnjR6b1ZSQmtsRXdadVU/edit?usp=sharing

Hopefully you'll find this easy and the rest of us can feel dumb for not having figured it out...

Thanks,
Ista

On Mon, Sep 16, 2013 at 1:39 PM, Duncan Murdoch murdoch.dun...@gmail.com wrote:
> On 16/09/2013 12:04 PM, Maxim Linchits wrote:
>> Here is that old post:
>> http://r.789695.n4.nabble.com/read-csv-and-FileEncoding-in-Windows-version-of-R-2-13-0-td3567177.html
> In that post, you'll see I asked for a sample file. I never received any reply; presumably some spam filter didn't like what Alexander sent me, and Nabble doesn't archive attachments. Similarly, the StackOverflow thread contains no sample data. Could someone who is having this problem please put a small sample online for download? As I told Alexander last time, my experiments with files I constructed myself showed no errors.
> Duncan Murdoch
> (...)
Re: [R] Non-ASCII characters in R on Windows
On Monday 16 September 2013 at 13:39 -0400, Duncan Murdoch wrote:
> On 16/09/2013 12:04 PM, Maxim Linchits wrote:
>> Here is that old post:
>> http://r.789695.n4.nabble.com/read-csv-and-FileEncoding-in-Windows-version-of-R-2-13-0-td3567177.html
> In that post, you'll see I asked for a sample file. I never received any reply. (...) Could someone who is having this problem please put a small sample online for download? As I told Alexander last time, my experiments with files I constructed myself showed no errors.

Yes, this was my first reaction too, and then I saw the link to a second thread on StackOverflow with such an example. This is the one I used in my previous posts in this thread. If you want to get the file directly instead of pasting the contents by hand, here is a version that should be enough: http://nalimilan.perso.neuf.fr/transfert/utf8.csv

Regards

> (...)
Re: [R] Non-ASCII characters in R on Windows
On 16/09/2013 12:04 PM, Maxim Linchits wrote:
> Here is that old post:
> http://r.789695.n4.nabble.com/read-csv-and-FileEncoding-in-Windows-version-of-R-2-13-0-td3567177.html
> A taste: "Again, the issue is that opening this UTF-8 encoded file under R 2.13.0 yields an error, but opening it under R 2.12.2 works without any issues." (...)

In that post, you'll see I asked for a sample file. I never received any reply; presumably some spam filter didn't like what Alexander sent me, and Nabble doesn't archive attachments. Similarly, the StackOverflow thread contains no sample data. Could someone who is having this problem please put a small sample online for download? As I told Alexander last time, my experiments with files I constructed myself showed no errors.

Duncan Murdoch

> On Mon, Sep 16, 2013 at 6:38 PM, Milan Bouchet-Valat nalimi...@club.fr wrote:
>> (...) After some more investigation, one part of the problem can be traced back to scan(), which for some reason does not work when passed fileEncoding="UTF-8"; and another part in read.table(), which transforms Ա (\U531) into "X.U.0531.", probably via make.names(). (...)