Re: [R-pkg-devel] Package Encoding and Literal Strings

Tomas Kalibera Thu, 17 Dec 2020 01:47:50 -0800

On 12/16/20 11:07 PM, jo...@jorisgoosen.nl wrote:

David,


Thanks for the response!

So the problem is a bit worse than just setting `encoding="UTF-8"` on
functions like readLines.
I'll describe our setup a bit:
So we run R embedded in a separate executable and through a whole bunch of
C(++) magic get that to the main executable that runs the actual interface.
All the code that isn't R basically uses UTF-8. This works well and we've
made sure that all of our source code is encoded properly and I've verified
that for this particular problem at least my source file is definitely
encoded in UTF-8 (I've checked a hexdump).

The simplest solution, that we initially took, to get R+Windows to
cooperate with everything is to simply set the locale to "C" before
starting R. That way R simply assumes UTF-8 is native and everything worked
splendidly. Until of course a file needs to be opened in R that contains
some non-ASCII characters. I noticed the problem because a Korean user had
hangul in his username and that broke everything. This was because R was
trying to convert to a different locale than Windows was using.

Setting locale to "C" does not make R assume UTF-8 is the native
encoding; there is no way to make UTF-8 the current native encoding in R
on the current builds of R on Windows. This is an old limitation of
Windows, only recently fixed by Microsoft in recent Windows 10 and with
UCRT Windows runtime (see my blog post [1] for more - to make R support
this we need a new toolchain to build R).

If you set the locale to C encoding, you are telling R the native
encoding is C/POSIX (essentially ASCII), not UTF-8. Encoding-sensitive
operations, including conversions, including those conversions that
happen without user control e.g. for interacting with Windows, will
produce incorrect results (garbage) or, in the better case, errors,
warnings, or omitted, substituted or transliterated characters.
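As a rough illustration of what "omitted or substituted characters" can look like, the sketch below converts a UTF-8 string to ASCII with `iconv`. It calls `iconv` directly rather than switching the session locale, so it only mimics the kind of conversion described above; the sample string is borrowed from later in this thread.

```r
# A UTF-8 string with one non-ASCII character (U+00F4), written with an
# escape so that this snippet itself stays pure ASCII.
x <- "Math\u00f4t!"

# ASCII cannot represent U+00F4, so a strict conversion gives NA.
strict <- iconv(x, from = "UTF-8", to = "ASCII")

# With `sub`, each non-convertible byte is replaced instead of failing.
substituted <- iconv(x, from = "UTF-8", to = "ASCII", sub = "?")
as_bytes <- iconv(x, from = "UTF-8", to = "ASCII", sub = "byte")

print(strict)
print(substituted)
print(as_bytes)
```

The exact fallback behaviour may differ between iconv implementations, so treat the output on your platform as the reference.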

In principle setting the encoding via locale is dangerous on Windows,
because Windows has two current encodings, not just one. By setting
locale you set the one used in the C runtime, but not the other one used
by the system calls. If all code (in R, packages, external libraries)
was perfect, this would still work as long as all strings used were
representable in both encodings. For other strings it won't work, and
then code is not perfect in this regard: it is usually written assuming
there is one current encoding, which common sense dictates should be the
case. With the recent UTF-8 support ([1]), one can switch both of these
to UTF-8.
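A minimal sketch for checking what R currently believes about its encodings. `Sys.getlocale` and `l10n_info` are base R; the code page fields are assumed to exist only on Windows, and which of them are reported depends on the R version, so the code selects only the fields that are actually there.

```r
# Locale category that determines the C runtime's character encoding.
ctype <- Sys.getlocale("LC_CTYPE")

# l10n_info() reports whether the native encoding is multi-byte,
# UTF-8 or Latin-1.
info <- l10n_info()

# Code page fields are Windows-only and version dependent (assumption),
# so keep only those present in this session.
codepage_fields <- intersect(c("codepage", "system.codepage"), names(info))

print(ctype)
print(info[c("MBCS", "UTF-8", "Latin-1")])
print(info[codepage_fields])
```

If two code pages are reported and they differ, that would be one visible sign of the "two current encodings" situation described above.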

The solution I've now been working on is:
I took the source code of R 4.0.3 and changed the backend of "gettext" to
add an `encoding="something something"` option. And a bit of extra stuff
like `bind_textdomain_codeset` in case I need to tweak the codeset/charset
that gettext uses.
I think I've got that working properly now and once I solve the problem of
the encoding in a pkg I will open a bug report/feature request and I'll add
a patch that implements it.

A number of similar "shortcuts" have been added to R in the past, but
they make the code more complex, harder to maintain and use, and can't
realistically solve all of these problems, anyway. Strings will
eventually be assumed to be in what is the current native encoding by
the C library, whether in R itself, in any external code R uses, or in
code R packages use. Now that Microsoft finally is supporting UTF-8, the
way to get out of this is switching to UTF-8. This needs only small
changes to R source code compared to those "shortcuts" (or to using
UTF-16LE). I'd be against polluting the code with any more "shortcuts".

The problem I'm stuck with now is simply this:
I have an R pkg here that I want to test the translations with and the code
is definitely saved as UTF-8, the package has "Encoding: UTF-8" in the
DESCRIPTION and it all loads and works. The particular problem I have is
that the R code contains literally: `mathotString <- "Mathôt!"`
The actual file contains the hexadecimal representation of ô as proper
utf-8: "0xC3 0xB4" but R turns it into: "0xf4".
Seemingly on loading the package, because I haven't done anything with it
except put it in my debug c-function to print its contents as
hexadecimals...
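For reference, the byte 0xF4 is how ô is written in Latin-1 (and CP-1252), and it is also the character's code point, U+00F4. The sketch below shows both representations side by side. That the session's native encoding was one of those single-byte encodings is an assumption; the thread does not state it.

```r
# U+00F4 written with an escape so this snippet stays pure ASCII.
o_circ <- "\u00f4"

# Bytes of the same character in UTF-8 and in Latin-1.
utf8_bytes <- charToRaw(enc2utf8(o_circ))
latin1_bytes <- charToRaw(iconv(o_circ, from = "UTF-8", to = "latin1"))

# The Unicode code point as an integer.
code_point <- utf8ToInt(enc2utf8(o_circ))

print(utf8_bytes)
print(latin1_bytes)
print(sprintf("U+%04X", code_point))
```

Seeing a single 0xF4 byte would therefore be consistent with the string having been converted to a single-byte native encoding, not with a raw code point being stored.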

The only thing I want to achieve here is that when R loads the package it
keeps those strings in their original UTF-8 encoding, without converting it
to "native" or the strange unicode codepoint it seemingly placed in there
instead. Because otherwise I cannot get gettext to work fully in UTF-8 mode.

Is this already possible in R?

In principle, working with strings not representable in the current
encoding is not reliable (and never will be). It can still work in some
specific cases and uses. Parsing a UTF-8 string literal from a file,
with correctly declared encoding as documented in WRE, should work at
least in single-byte encodings. But what happens after that string is
parsed is another thing. The parsing is based internally on using these
"shortcuts", that is lying to a part of the parser about the encoding,
and telling the rest of the parser that it is really something else (not
native, but UTF-8). The part that is being "lied to" may get confused or
not. It would not when the real native encoding is say latin1, a common
case in the past for which the hack was created, but it might when it is
a double-byte encoding that conflicts with the text being parsed in
dangerous ways. This is also why this hack only makes sense for string
literals (and comments), and still only to a limit as the strings may be
misinterpreted later after parsing.

So a really short summary is: you can only reliably use strings
representable in the current encoding in R, and that encoding cannot be
UTF-8 on Windows in released versions of R. There is an experimental
version, see [1]. If you could experiment with it, see whether it might
work for your applications, and try to find and report bugs there (e.g.
to me directly), that would be useful.

If you find behavior re encodings in released versions of R that
contradicts the current documentation, please report with a minimal
reproducible example; such cases should be fixed (even though sometimes
the "fix" would be just changing the documentation, the effort really
should be now for supporting UTF-8 for real). Specifically with
"mathotString", you might try creating an example that does not include
any package (just calls to parse with encoding options set), only then
gradually adding more of package loading if that does not reproduce. It
would be important to know the current encoding (sessionInfo, l10n_info).
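One possible shape for such a package-free example is sketched below. It writes the source line as raw bytes so the file is UTF-8 on disk whatever the session locale is, parses it with the encoding declared, and prints what a report would need. The temporary file and the scratch environment are illustrative choices, not part of any prescribed procedure.

```r
# One assignment as raw bytes; C3 B4 is the UTF-8 encoding of U+00F4.
src <- c(charToRaw('mathotString <- "Math'),
         as.raw(c(0xC3, 0xB4)),
         charToRaw('t!"\n'))
path <- tempfile(fileext = ".R")
writeBin(src, path)

# Parse with the encoding declared, then evaluate in a scratch environment.
exprs <- parse(path, keep.source = FALSE, encoding = "UTF-8")
env <- new.env()
for (e in exprs) eval(e, envir = env)
parsed <- get("mathotString", envir = env)

# What to include in a report: encoding mark, actual bytes, and locale.
print(Encoding(parsed))
print(charToRaw(parsed))
print(l10n_info())
print(utils::sessionInfo())
unlink(path)
```

If the bytes printed here are still C3 B4 but differ once the same line is loaded from an installed package, that would narrow the problem down to package loading.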


Best,
Tomas

[1] https://developer.r-project.org/Blog/public/2020/07/30/windows/utf-8-build-of-r-and-cran-packages/index.html


Cheers,
Joris



On Wed, 16 Dec 2020 at 20:15, David Bosak <dbosa...@gmail.com> wrote:

Joris:



I’ve fought with encoding problems on Windows a lot.  Here are some
general suggestions.



    1. Put `@encoding UTF-8` on any Roxygen comments.
    2. Put `encoding = "UTF-8"` on any functions like writeLines or
    readLines that read/write to a text file.
    3. This post:
    https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/
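A small round-trip sketch for suggestion 2, under one caveat: `readLines` accepts an `encoding` argument, but `writeLines` does not. This sketch therefore writes the UTF-8 bytes unchanged with `useBytes = TRUE`. The file name and sample string are placeholders.

```r
x <- enc2utf8("Math\u00f4t!")
path <- tempfile(fileext = ".txt")

# useBytes = TRUE writes the UTF-8 bytes as they are, instead of
# re-encoding the string to the native encoding first.
writeLines(x, path, useBytes = TRUE)

# encoding = "UTF-8" marks the strings read as UTF-8; it does not
# convert them.
back <- readLines(path, encoding = "UTF-8")

print(Encoding(back))
print(identical(charToRaw(back), charToRaw(x)))
unlink(path)
```

An alternative is to open a connection with `file(path, encoding = "UTF-8")`, which re-encodes on the fly; that route goes through the native encoding, so it is subject to the limitations discussed elsewhere in this thread.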



If you have a more specific problem, please describe and we can try to
help.



David






*From: *jo...@jorisgoosen.nl
*Sent: *Wednesday, December 16, 2020 1:52 PM
*To: *r-package-devel@r-project.org
*Subject: *[R-pkg-devel] Package Encoding and Literal Strings



Hello All,

Some context: I am one of the programmers of a software pkg
(https://jasp-stats.org/) that uses an embedded instance of R to do
statistics, and makes that a bit easier for people who are intimidated by R
or like to have something more GUI oriented.

We have been working on translating the interface but ran into several
problems related to encoding of strings. We prefer to use UTF-8 for
everything and this works wonderfully on unix systems, as is to be expected.

Windows however is a different matter. Currently I am working on some local
changes to "do_gettext" and some related internal functions of R to be able
to get UTF-8 encoded output from there.

But I ran into a bit of a problem and I think this mailing list is probably
the best place to start.

It seems that if I have an R package that specifies "Encoding: UTF-8" in
DESCRIPTION the literal strings inside the package are converted to the
local codeset/codepage regardless of what I want.

Is it possible to keep the strings in UTF-8 internally in such a pkg
somehow?

Best regards,
Joris Goosen
University of Amsterdam




______________________________________________
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel
