Re: [R] Non-ASCII characters in R on Windows

Duncan Murdoch Mon, 16 Sep 2013 16:57:13 -0700

On 16/09/2013 12:04 PM, Maxim Linchits wrote:

Here is that old post:
http://r.789695.n4.nabble.com/read-csv-and-FileEncoding-in-Windows-version-of-R-2-13-0-td3567177.html

In that post, you'll see I asked for a sample file. I never received any reply; presumably some spam filter didn't like what Alexander sent me, and Nabble doesn't archive attachments.


Similarly, the Stackoverflow thread contains no sample data.

Could someone who is having this problem please put a small sample online for download? As I told Alexander last time, my experiments with files I constructed myself showed no errors.


Duncan Murdoch
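
One way to produce such a sample without depending on a text editor or on the session locale is to write the bytes explicitly. The following is only a minimal sketch, assuming the two-column Armenian example quoted further down in this thread; the file name sample_utf8.csv is hypothetical.

```r
# UTF-8 byte sequences for the two Armenian letters used in the thread:
# "Ա" is U+0531 (D4 B1) and "Բ" is U+0532 (D4 B2).
a <- as.raw(c(0xD4, 0xB1))
b <- as.raw(c(0xD4, 0xB2))
q <- charToRaw('"')

# Header line "Ա","Բ" followed by two data rows, as raw bytes.
bytes <- c(q, a, q, charToRaw(","), q, b, q, charToRaw("\n1,10\n2,20\n"))

# Hypothetical file name; written in binary mode so nothing is re-encoded.
sample_path <- file.path(tempdir(), "sample_utf8.csv")
writeBin(bytes, sample_path)

# Read the file back as raw bytes to confirm what is actually on disk.
check <- readBin(sample_path, what = "raw", n = file.info(sample_path)$size)
identical(check, bytes)
```

A file written this way has known contents, so any difference between platforms can be attributed to how R reads it rather than to how it was saved.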


A taste: "Again, the issue is that opening this UTF-8 encoded file
under R 2.13.0 yields an error, but opening it under R 2.12.2 works
without any issues. (...)"

On Mon, Sep 16, 2013 at 6:38 PM, Milan Bouchet-Valat <nalimi...@club.fr> wrote:
> Le lundi 16 septembre 2013 à 10:40 +0200, Milan Bouchet-Valat a écrit :
>> Le vendredi 13 septembre 2013 à 23:38 +0400, Maxim Linchits a écrit :
>> > This is a condensed version of the same question on stackexchange here:
>> > http://stackoverflow.com/questions/18789330/r-on-windows-character-encoding-hell
>> > If you've already stumbled upon it feel free to ignore.
>> >
>> > My problem is that R on US Windows does not read *any* text file that
>> > contains *any* foreign characters. It simply reads the first consecutive n
>> > ASCII characters and then throws a warning once it reaches a foreign
>> > character:
>> >
>> > > test <- read.table("test.txt", sep=";", dec=",", quote="",
>> > fileEncoding="UTF-8")
>> > Warning messages:
>> > 1: In read.table("test.txt", sep = ";", dec = ",", quote = "", fileEncoding
>> > = "UTF-8") :
>> >   invalid input found on input connection 'test.txt'
>> > 2: In read.table("test.txt", sep = ";", dec = ",", quote = "", fileEncoding
>> > = "UTF-8") :
>> >   incomplete final line found by readTableHeader on 'test.txt'
>> > > print(test)
>> >        V1
>> > 1 english
>> >
>> > > Sys.getlocale()
>> >    [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
>> > States.1252;
>> >      LC_MONETARY=English_United
>> > States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
>> >
>> >
>> > It is important to note that R on Linux will read UTF-8 as well as
>> > exotic character sets without a problem. I've tried it with the exact same
>> > files (one was UTF-8 and another was OEM866 Cyrillic).
>> >
>> > If I do not include the fileEncoding parameter, read.table will read the
>> > whole CSV file. But naturally it will read it wrong because it does not
>> > know the encoding. So whenever I try to specify the fileEncoding, R will
>> > throw the warnings and stop once it reaches a foreign character. It's the
>> > same story with all international character encodings.
>> > Other users on stackexchange have reported exactly the same issue.
>> >
>> >
>> > Is anyone here who is on a US version of Windows able to import files with
>> > foreign characters? Please let me know.
>> A reproducible example would have helped, as requested by the posting
>> guide.
>>
>> Though I am also experiencing the same problem after saving the data
>> below to a CSV file encoded in UTF-8 (you can do this using even the
>> Notepad):
>> "Ա","Բ"
>> 1,10
>> 2,20
>>
>> This is on a Windows 7 box using French locale, but same codepage 1252
>> as yours. What is interesting is that reading the file using
>> readLines(file("myFile.csv", encoding="UTF-8"))
>> gives no invalid characters. So there must be a bug in read.table().
>>
>>
>> But I must note I do not experience issues with French accented
>> characters like "é" ("\Ue9"). On the contrary, reading Armenian
>> characters like "Ա" ("\U531") gives weird results: the character appears
>> as <U+0531> instead of Ա.
>>
>> Self-contained example, writing the file and reading it back from R:
>> tmpfile <- tempfile()
>> writeLines("\U531", file(tmpfile, "w", encoding="UTF-8"))
>> readLines(file(tmpfile, encoding="UTF-8"))
>> # "<U+0531>"
>>
>> The same phenomenon happens when creating a data frame from this
>> character (as noted on StackExchange):
>> data.frame("\U531")
>>
>> So my conclusion is that maybe Windows does not really support Unicode
>> characters that are not "relevant" for your current locale. And that may
>> have created bugs in the way R handles them in read.table(). R
>> developers can probably tell us more about it.
> After some more investigation, one part of the problem can be traced
> back to scan() (with myFile.csv filled as described above):
> scan("myFile.csv", encoding="UTF-8", sep=",", nlines=1)
> # Read 2 items
> # [1] "Ա" "Բ"
>
> Equivalent, but nonsensical to me:
> scan("myFile.csv", fileEncoding="CP1252", encoding="UTF-8", sep=",", nlines=1)
> # Read 2 items
> # [1] "Ա" "Բ"
>
> scan("myFile.csv", fileEncoding="UTF-8", sep=",", nlines=1)
> # Read 0 items
> # character(0)
> # Warning message:
> # In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
> #  invalid input found on input connection 'myFile.csv'
>
>
> So there seems to be one part of the issue in scan(), which for some
> reason does not work when passed fileEncoding="UTF-8"; and another part
> in read.table(), which transforms "Ա" ("\U531") into "X.U.0531.",
> probably via make.names(), since:
> make.names("\U531")
> # "X.U.0531."
>
>
> Does this make sense to R-core members?
>
>
> Regards
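
The two observations above (scan() succeeds with encoding="UTF-8", and the header is altered by make.names()) suggest a possible workaround, sketched below under those assumptions. It uses the documented encoding and check.names arguments of read.table() and recreates the sample file from raw bytes so that it is self-contained. It is not a confirmed fix for the behaviour reported in this thread, and the result may still depend on the R version and locale.

```r
# Recreate the sample file as raw UTF-8 bytes ("Ա" = D4 B1, "Բ" = D4 B2),
# so the content on disk does not depend on the current locale.
path <- tempfile(fileext = ".csv")
writeBin(c(charToRaw('"'), as.raw(c(0xD4, 0xB1)), charToRaw('","'),
           as.raw(c(0xD4, 0xB2)), charToRaw('"\n1,10\n2,20\n')), path)

# encoding = "UTF-8" only marks the strings that are read as UTF-8;
# fileEncoding = "UTF-8" would instead re-encode the input, which is the
# code path that failed in the examples above.
# check.names = FALSE skips make.names(), so the header is kept as read.
dat <- read.table(path, sep = ",", header = TRUE,
                  encoding = "UTF-8", check.names = FALSE)

# Inspect the column names and their declared encoding.
names(dat)
Encoding(names(dat))
```

Whether the names then display as Armenian letters or as <U+0531> escapes is a separate matter of how the console renders characters outside the current code page.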

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.