You seem confused about Latin-1: those characters are not in Latin-1. (Microsoft code pages are proprietary encodings; some of them, such as CP1252, are extensions of Latin-1.)

You have not given the 'at a minimum' information asked for in the posting guide, so we have no way to reproduce this, and as you did not show us the output on your system, we have no idea what you saw.

[As a convenience to Windows users, R does in some cases assume that they are using Latin-1 encodings. If they use extensions to Latin-1 then there are no guarantees that code written for strict Latin-1 will work.]

On 01/08/2017 10:19, Daniel Possenriede wrote:
Upon further inspection, I think these are at least two problems.
First the issue with printing latin1/cp1252 characters in the "80" to "9F"
code range.

x <- c("€", "–", "‰")
Encoding(x)
print(x)

I assume that these are Unicode escapes!? (Given that Encoding(x) shows
"latin1" I'd rather expect latin1/cp1252 escapes here, but these would be
e.g. "\x80", right? My locale is LC_COLLATE=German_Germany.1252 btw.)
Now I don't know why print tries to convert to Unicode, but if these indeed
are Unicode escapes, then there is something wrong with the conversion from
cp1252 to Unicode.
In general, most cp1252 char codes map to the Unicode code point with the
same value, e.g. cp1252 "00" -> Unicode "0000", "01" -> "0001", "02" ->
"0002", etc.; see http://www.cp1252.com/.
The exception is the cp1252 "80" to "9F" code range. E.g. the Euro sign is
"80" in cp1252 but "20AC" in Unicode, and the en dash is "96" in cp1252 but
"2013" in Unicode.
The same error seems to happen with

enc2utf8(x)

Now with iconv() the result is as expected.

iconv(x, to = "UTF-8")
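
The mapping of the "80" to "9F" range described above can be checked without relying on the session locale. The following is only a sketch: it assumes the platform's iconv knows the encoding name "CP1252" (common, but not guaranteed), and the variable names are illustrative.

```r
# Build single-byte strings from raw bytes so that nothing depends on how
# the source file or the console is encoded.
# In CP1252: 0x80 = euro sign, 0x96 = en dash, 0x89 = per-mille sign.
x_raw <- c(rawToChar(as.raw(0x80)),
           rawToChar(as.raw(0x96)),
           rawToChar(as.raw(0x89)))

# Name the source encoding explicitly instead of relying on the locale.
# Assumption: "CP1252" is a name this platform's iconv accepts.
x_utf8 <- iconv(x_raw, from = "CP1252", to = "UTF-8")

# Unicode code points of the converted characters.
code_points <- vapply(x_utf8, utf8ToInt, integer(1), USE.NAMES = FALSE)
sprintf("U+%04X", code_points)
# [1] "U+20AC" "U+2013" "U+2030"
```

None of the three code points lies in U+0080 to U+009F, which is where a strict Latin-1 reading would put bytes in that range.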


The second problem IMO is that encoding markers get lost with the enc2*
functions.

As you are changing encodings, you do not want to preserve encoding!

x_utf8 <- enc2utf8(x)
Encoding(x_utf8)
x_nat <- enc2native(x_utf8)
Encoding(x_nat)

In an actual Latin-1 locale on Linux

> x_utf8 <- c("éè", "\u20ac", "\u2013")
> Encoding(x_utf8)
[1] "latin1" "UTF-8"  "UTF-8"
> enc2native(x_utf8)
[1] "éè"     "<U+20AC>" "<U+2013>"
> Encoding(.Last.value)
[1] "latin1"  "unknown" "unknown"

as expected.
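
A small sketch of how the declared encoding follows the conversion rather than being preserved. It uses a byte that has the same meaning in Latin-1 and CP1252, so it should not depend on which of the two the platform assumes; the result of enc2native() depends on the locale and is therefore shown without expected output.

```r
# 0xE9 is e-acute in both Latin-1 and CP1252.
x <- rawToChar(as.raw(0xE9))
Encoding(x) <- "latin1"          # declare what the byte means

x_utf8 <- enc2utf8(x)            # re-encode to UTF-8
Encoding(x_utf8)                 # the mark now describes the new bytes
# [1] "UTF-8"
utf8ToInt(x_utf8)
# [1] 233

# Pure ASCII strings are never marked, before or after conversion.
Encoding(enc2utf8("abc"))
# [1] "unknown"

# Locale-dependent: the mark (and possibly the content) of the result
# varies with the native encoding of the session.
x_nat <- enc2native(x_utf8)
Encoding(x_nat)
```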

Again, this is not the case with iconv()

x_iutf8 <- iconv(x, to = "UTF-8")
Encoding(x_iutf8)
x_inat <- iconv(x_iutf8, from = "UTF-8")
Encoding(x_inat)

iconv() is converting from/to the current locale's encoding, presumably CP1252, not from the marked encoding (as the help page states explicitly).
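
One way to avoid depending on the locale default is to pass both 'from' and 'to' explicitly. This is a sketch only, again assuming the platform's iconv accepts the name "CP1252"; the variable names are illustrative.

```r
# A single byte 0x80, declared as "latin1" (as in the report above).
x <- rawToChar(as.raw(0x80))
Encoding(x) <- "latin1"

# iconv() takes its source encoding from 'from', not from the mark on x,
# so state it explicitly rather than leaving the locale default.
y <- iconv(x, from = "CP1252", to = "UTF-8")
Encoding(y)
# [1] "UTF-8"
utf8ToInt(y)
# [1] 8364

# Round trip back to CP1252, again naming both encodings.
z <- iconv(y, from = "UTF-8", to = "CP1252")
identical(charToRaw(z), as.raw(0x80))
# [1] TRUE
```

With both arguments given, the conversion behaves the same regardless of the session locale.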

--
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
