On 13-11-10 2:41 PM, Sverre Stausland wrote:
With respect to your comment (sorry, the e-mail you wrote that in
didn't get to my inbox):

I don't think so.  In general, functions that convert to the native
encoding break UTF-8 on Windows, because the native encoding is often
Latin1 or some other encoding that doesn't cover all the characters in
UTF-8.

As I understand it, the native encoding in Windows is UTF-16, not Latin1:
http://msdn.microsoft.com/en-us/library/dd374081.aspx

And UTF-16 is a superset of UTF-8, isn't it?

UTF-16 is not an 8-bit compatible encoding. You can't store a UTF-16 string in a C null-terminated char* string; it needs a different type of storage.
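
To make the storage point concrete, here is a minimal sketch in Python (not R, and not code from this thread). The helper c_strlen is hypothetical: it only imitates how C's strlen() treats the first zero byte as the end of a string. Assuming little-endian UTF-16, every ASCII character carries a zero byte, so a char*-style reader stops almost immediately, while the UTF-8 bytes contain no zero byte at all.

```python
# Illustrative sketch: why UTF-16 text does not fit in a NUL-terminated char*.
text = "Résumé"

utf16 = text.encode("utf-16-le")  # 2 bytes per character here, many are 0x00
utf8 = text.encode("utf-8")       # 1 or 2 bytes per character, none are 0x00


def c_strlen(buf: bytes) -> int:
    """Hypothetical helper mimicking C strlen(): count bytes before the first NUL."""
    idx = buf.find(b"\x00")
    return len(buf) if idx == -1 else idx


utf16_len = c_strlen(utf16)  # the second byte of "R" in UTF-16LE is already NUL
utf8_len = c_strlen(utf8)    # the whole UTF-8 byte sequence is seen

print(utf16_len, utf8_len)
```

The UTF-16 string appears truncated after a single byte, which is why wide-character storage is needed for it.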

Because of this, Windows uses wide 16 bit characters in many cases internally, but converts to 8 bit characters with an encoding depending on the locale. At the time they invented this convention, UTF-8 didn't exist (it was invented in 1993, the year Microsoft released Windows NT using UCS-2 encodings), but there's no real reason they couldn't have added a new locale that defaults to UTF-8 as the 8 bit encoding.
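
The effect of converting to a locale-dependent 8 bit encoding can be sketched as follows. This is a Python analogy only, with Latin-1 standing in for whatever the native encoding happens to be; R's own conversion routines may behave differently (for example, by substituting escapes rather than "?"), so treat the details as assumptions.

```python
# Illustrative sketch: converting to an 8-bit native encoding loses characters
# that the encoding cannot represent. Latin-1 is used as a stand-in here.
text = "naïve Ω"

# Strict conversion fails: Ω (U+03A9) has no Latin-1 code point.
try:
    text.encode("latin-1")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Lossy conversion replaces anything unrepresentable with "?".
lossy = text.encode("latin-1", errors="replace").decode("latin-1")

# UTF-8 covers all of Unicode, so the round trip is exact.
round_trip = text.encode("utf-8").decode("utf-8")

print(strict_ok, lossy, round_trip)
```

The ï survives because Latin-1 happens to include it, while the Ω is lost; which characters survive depends entirely on the 8 bit encoding in use.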

R on the other hand didn't choose to use UTF-16 as its internal encoding because at the time it was written, Unix systems didn't properly support it. They generally used 8 bit encodings, and because of its roots, that made sense for R too.

Duncan Murdoch


Sverre

On Sun, Nov 10, 2013 at 1:49 PM, Duncan Murdoch
<murdoch.dun...@gmail.com> wrote:
On 13-11-10 7:31 AM, Sverre Stausland wrote:

My e-mail was intended as a typical "feature request", and I couldn't
find any more suitable place for that than the r-devel mailing list. I
am not a programmer, so I don't have the skills to write this into R's
source code myself.

The incentive is nevertheless clear enough. I believe a software
program in 2013 which imports, manipulates, and exports text in
various formats (text files, picture files, postscript files, etc.)
would normally be expected to support UTF-8. It might not be trivial
to implement as R is written now, but the expectation will still be
there. So I still believe it would be a good idea if R soon would be
able to support UTF-8.


R does support UTF-8.  It all works smoothly in a UTF-8 locale, not so
smoothly if you have your computer set up to use a different 8 bit encoding.


I'm not quite able to piece together from the information you gave
what the underlying issues are. What I read is:
(1) Some R functions convert characters to the native encoding.
(2) Windows did not support UTF-8 when R was first written.
(3) Unix did not support UCS-2 when R was first written.

I'm guessing here that the implications are:
(1) R's write.table() converts characters to a native encoding.
(2) The native encoding in Windows 7 is not UTF-8.
(3) The native encoding in Unix systems is UTF-8.


You got it right for the first 4.  Regarding (2) in your second list, that's
right, and in fact Windows does not support UTF-8 as a native encoding.
And point (3) is optional, though UTF-8 is the dominant encoding on Unix
systems nowadays.

The easiest solution is for you to switch to a Unix variant and set it up to
use UTF-8 as the native encoding.

Next easiest would be for Microsoft to add UTF-8 as a code page.

Most difficult would be for R to handle UTF-8 properly on systems with
limited support for it.

We probably will add small changes that let you work around the Windows
problems, but they won't be very satisfactory to anyone.  I don't think we
will make the big changes that would make R look like "a software program in
2013", since it would be so much work, and there's such an easy workaround.

Duncan Murdoch


But this is just guesswork.




PS. A related issue:

http://stackoverflow.com/questions/19881553/using-unicode-inside-rs-expression-command

Sverre



______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
