Hello Tomas,

I have made a minimal example that demonstrates my problem:
https://github.com/JorisGoosen/utf8StringsPkg
This package is encoded in UTF-8, as is Test.R. There is a little Rcpp
function in there I wrote that displays the bytes straight from R's CHAR,
to be sure no conversion is happening. I would expect mathotString to
contain "C3 B4" for "ô", but instead it gets "F4", as you can see when you
run `utf8StringsPkg::testutf8_in_locale()`.

Cheers,
Joris

On Fri, 18 Dec 2020 at 11:48, Tomas Kalibera <tomas.kalib...@gmail.com> wrote:

> On 12/17/20 6:43 PM, jo...@jorisgoosen.nl wrote:
>
> On Thu, 17 Dec 2020 at 18:22, Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
>
>> On 12/17/20 5:17 PM, jo...@jorisgoosen.nl wrote:
>>
>> On Thu, 17 Dec 2020 at 10:46, Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
>>
>>> On 12/16/20 11:07 PM, jo...@jorisgoosen.nl wrote:
>>> > David,
>>> >
>>> > Thanks for the response!
>>> >
>>> > So the problem is a bit worse than just setting `encoding="UTF-8"` on
>>> > functions like readLines.
>>> >
>>> > I'll describe our setup a bit: we run R embedded in a separate
>>> > executable and, through a whole bunch of C(++) magic, get that to the
>>> > main executable that runs the actual interface. All the code that
>>> > isn't R basically uses UTF-8. This works well, and we've made sure
>>> > that all of our source code is encoded properly; I've verified that,
>>> > for this particular problem at least, my source file is definitely
>>> > encoded in UTF-8 (I've checked a hexdump).
>>> >
>>> > The simplest solution, which we initially took, to get R+Windows to
>>> > cooperate with everything is to simply set the locale to "C" before
>>> > starting R. That way R simply assumes UTF-8 is native, and everything
>>> > worked splendidly. Until, of course, a file needs to be opened in R
>>> > that contains some non-ASCII characters. I noticed the problem because
>>> > a Korean user had Hangul in his username and that broke everything,
>>> > because R was trying to convert to a different locale than the one
>>> > Windows was using.
>>>
>>> Setting the locale to "C" does not make R assume UTF-8 is the native
>>> encoding; there is no way to make UTF-8 the current native encoding in
>>> R on the current builds of R on Windows. This is an old limitation of
>>> Windows, only recently fixed by Microsoft in recent Windows 10 and with
>>> the UCRT Windows runtime (see my blog post [1] for more; to make R
>>> support this we need a new toolchain to build R).
>>>
>>> If you set the locale to the C encoding, you are telling R that the
>>> native encoding is C/POSIX (essentially ASCII), not UTF-8.
>>> Encoding-sensitive operations, including conversions (also those that
>>> happen without user control, e.g. for interacting with Windows), will
>>> produce incorrect results (garbage) or, in the better case, errors,
>>> warnings, or omitted, substituted or transliterated characters.
>>>
>>> In principle, setting the encoding via the locale is dangerous on
>>> Windows, because Windows has two current encodings, not just one. By
>>> setting the locale you set the one used by the C runtime, but not the
>>> other one used by the system calls. If all code (in R, packages,
>>> external libraries) were perfect, this would still work as long as all
>>> strings used were representable in both encodings. For other strings
>>> it won't work, and code is not perfect in this regard anyway: it is
>>> usually written assuming there is one current encoding, which common
>>> sense dictates should be the case. With the recent UTF-8 support
>>> ([1]), one can switch both of these to UTF-8.
>>>
>>
>> Well, this is exactly why I want to get rid of this situation. But it
>> messes up the output, because everything else expects UTF-8, which is
>> why I'm looking for some kind of solution.
>>
>>
>>> > The solution I've now been working on is this: I took the source code
>>> > of R 4.0.3 and changed the backend of "gettext" to add an
>>> > `encoding="something something"` option.
>>> > And a bit of extra stuff, like `bind_textdomain_codeset`, in case I
>>> > need to tweak the codeset/charset that gettext uses. I think I've got
>>> > that working properly now, and once I solve the problem of the
>>> > encoding in a pkg I will open a bug report / feature request and add
>>> > a patch that implements it.
>>>
>>> A number of similar "shortcuts" have been added to R in the past, but
>>> they make the code more complex, harder to maintain and use, and can't
>>> realistically solve all of these problems anyway. Strings will
>>> eventually be assumed to be in what is the current native encoding: by
>>> the C library, in R, in any external code R uses, or in code R
>>> packages use. Now that Microsoft is finally supporting UTF-8, the way
>>> to get out of this is switching to UTF-8. This needs only small
>>> changes to the R source code compared to those "shortcuts" (or to
>>> using UTF-16LE). I'd be against polluting the code with any more
>>> "shortcuts".
>>>
>>
>> I think the addition of "bind_textdomain_codeset" is not strictly
>> necessary and can be left out, because I think setting an environment
>> variable such as "OUTPUT_CHARSET=UTF-8" gives the same result for us.
>> The addition of the "encoding" option to the internal "do_gettext" is
>> just a few lines of code, and I also undid some duplication between
>> do_gettext and do_ngettext, which should make it easier to maintain.
>> But all of that is moot if there is no way to keep the literal strings
>> from sources in UTF-8 anyhow.
>>
>> Before starting on this I did actually read your blog post about UTF-8
>> several times, and it seems like the best way forward. Not to mention
>> it would make my life easier, and me happier, when I can stop worrying
>> about Windows/DOS codepages! Thank you for your work on it indeed!
>>
>> But my problem with that is that a number of people still use an older
>> version of Windows, and your solution won't work there.
>> Which would mean that we either drop support for them, or they would
>> have to live with weird-looking translations, or I have to go back to
>> the suboptimal solution of the "C" locale, which I really do want to
>> avoid because, as you said, it breaks other stuff in unpredictable
>> ways.
>>
>> The number of people using a too-old version of Windows should be
>> small by the time this could become ready for production. Windows 8.1
>> is still supported, but there is the free upgrade to Windows 10 (also
>> from the no-longer-supported Windows 7), so this should not be a
>> problem for desktop machines. It will be a problem for servers.
>>
> Well, I would not expect anyone to use a GUI-heavy application meant for
> researchers on a server anyway, so that would be fine.
>
>>
>>> > The problem I'm stuck with now is simply this: I have an R pkg here
>>> > that I want to test the translations with, and the code is
>>> > definitely saved as UTF-8; the package has "Encoding: UTF-8" in the
>>> > DESCRIPTION, and it all loads and works. The particular problem I
>>> > have is that the R code literally contains:
>>> > `mathotString <- "Mathôt!"`
>>> > The actual file contains the hexadecimal representation of ô as
>>> > proper UTF-8, "0xC3 0xB4", but R turns it into "0xF4", seemingly on
>>> > loading the package, because I haven't done anything with it except
>>> > put it into my debug C function to print its contents as
>>> > hexadecimals...
>>> >
>>> > The only thing I want to achieve here is that when R loads the
>>> > package it keeps those strings in their original UTF-8 encoding,
>>> > without converting them to "native" or to the strange Unicode
>>> > codepoint it seemingly placed in there instead, because otherwise I
>>> > cannot get gettext to work fully in UTF-8 mode.
>>> >
>>> > Is this already possible in R?
>>>
>>> In principle, working with strings not representable in the current
>>> encoding is not reliable (and never will be).
>>> It can still work in some specific cases and uses. Parsing a UTF-8
>>> string literal from a file, with a correctly declared encoding as
>>> documented in WRE, should work at least in single-byte encodings. But
>>> what happens after that string is parsed is another matter. The
>>> parsing is based internally on using these "shortcuts", that is, lying
>>> to a part of the parser about the encoding and telling the rest of the
>>> parser that it is really something else (not native, but UTF-8).
>>
>> So the reason the string literals are turned into the local encoding is
>> that setting the "Encoding" on a package is essentially a hack?
>>
>> String literals may be turned into the local encoding because that is
>> how R/packages/external software is written: it needs the native
>> encoding. The hacks come in when such code is given a string not in
>> the local encoding, assuming that under some conditions such code will
>> still work. This includes a part of the parser, and a hack to
>> implement the "encoding" argument of "parse()", which allows parsing
>> (non-representable) UTF-8 strings when running in a single-byte locale
>> such as Latin-1 (see ?parse).
>>
> So the same `parse` function is used for loading a package?
>
> Parsing for usual packages is done at build time, when they are
> serialized ("prepared for lazy loading"). I would have to look for the
> details in the code, but either way, if the input is in UTF-8 and the
> native encoding is different, either the input has to be converted to
> the native encoding for the parser, or we get that hack where part of
> the parser is being lied to about the encoding (either via "parse()" or
> another way). If you have a minimal reproducible example, I can help
> you find out whether the behavior seen is expected/documented/a bug.
>
> Because in that case I wonder whether the "Encoding" option in
> "DESCRIPTION" is handled the same as `encoding=` in parse.
>
> ?parse states:
>
>     Character strings in the result will have a declared encoding if
>     encoding is "latin1" or "UTF-8", or if text is supplied with every
>     element of known encoding in a Latin-1 or UTF-8 locale.
>
> The sentence is a bit hard for me personally to parse, but I interpret
> the first part to mean that if "encoding" is specified as "UTF-8", all
> the character strings in the result will also have that encoding. Is
> that a correct interpretation? Because if so, I do believe I have found
> a problem, and I will try to make a minimal reproducible example.
>
> Please look first at this part of "?parse":
>
>     "encoding: encoding to be assumed for input strings. If the value
>     is ‘"latin1"’ or ‘"UTF-8"’ it is used to mark character strings as
>     known to be in Latin-1 or UTF-8: it is not used to re-encode the
>     input. To do the latter, specify the encoding as part of the
>     connection ‘con’ or _via_ ‘options(encoding=)’: see the example
>     under ‘file’. Arguments ‘encoding = "latin1"’ and ‘encoding =
>     "UTF-8"’ are ignored with a warning when running in a MBCS locale."
>
> Together with the one you cite:
>
>     "Character strings in the result will have a declared encoding if
>     ‘encoding’ is ‘"latin1"’ or ‘"UTF-8"’, or if ‘text’ is supplied
>     with every element of known encoding in a Latin-1 or UTF-8 locale."
>
> There are two things: which encoding the strings are really encoded in,
> and which encoding they are declared to be in. Normally these should
> always be the same encoding (UTF-8, Latin-1, or the concrete known
> native encoding), but the "encoding=" argument allows you to play with
> this. Strings declared to be in "native" encoding are for a while
> treated as being of (single-byte) unknown encoding, and eventually they
> are declared to be of the encoding from the "encoding=" argument. This
> only applies to strings declared as "native".
> When strings are declared as UTF-8 or Latin-1, they must really be in
> that encoding, and are believed to be in it; the "encoding=" argument
> does not affect them.
>
> So, when your inputs are declared as UTF-8, the "encoding=" hack should
> not apply to them. Also note that ASCII strings are never declared to
> be UTF-8 or Latin-1; they are always declared "native" (ASCII is
> assumed to be a subset of all encodings). But your inputs probably are
> not declared to be in UTF-8 (note this is "declared" with respect to
> the Encoding() R function, the encoding flag that character objects in
> R have), because you are probably parsing from a file. I'd really need
> a reproducible example to be able to explain what you are seeing.
>
> Best
> Tomas
>
>>
>>> The part that is being "lied to" may get confused or not. It would
>>> not when the real native encoding is, say, Latin-1, a common case in
>>> the past for which the hack was created, but it might when it is a
>>> double-byte encoding that conflicts with the text being parsed in
>>> dangerous ways. This is also why this hack only makes sense for
>>> string literals (and comments), and even then only up to a limit, as
>>> the strings may be misinterpreted later, after parsing.
>>
>> Well, our case is entirely limited to string literals that are
>> presented to the user through an all-UTF-8 interface, so I would
>> assume none of the edge cases would come into play. Any system paths
>> and things like that would still be in the local encoding.
>>
>>> So a really short summary is: you can only reliably use strings
>>> representable in the current encoding in R, and that encoding cannot
>>> be UTF-8 on Windows in released versions of R. There is an
>>> experimental version, see [1]; if you could experiment with that, see
>>> whether it might work for your applications, and try to find and
>>> report bugs there (e.g. to me directly), that would be useful.
>>>
>>
>> So when I read in certain R documentation that a string can have a
>> "UTF-8" encoding in R, this is not true? When I read documentation
>> such as
>> https://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html,
>> it really seems to indicate to me that UTF-8 is in fact supported in R
>> on Windows. My assumption was that R uses `translateChar` internally
>> to make sure a string is in the right encoding before interfacing with
>> the OS and in other places where this might matter.
>>
>> UTF-8 is supported in R on Windows in many ways, as documented. As
>> long as you are using UTF-8 strings representable in the current
>> encoding, so that they can be converted to the native encoding and
>> back without problems, you are fine: R will do the conversions as
>> needed. The trouble comes when such conversion is not possible. In
>> the example of the parser, without the "encoding=" argument to
>> "parse()", the parser will just work on any text you give it, even
>> when the text is in UTF-8: it works by first converting to the native
>> encoding and then doing the parsing, no hacks involved. When
>> interacting with external software, you'd just tell R to provide the
>> strings in the encoding needed by that external software, so possibly
>> UTF-8, so possibly convert, and all would work fine. The problem is
>> characters not representable in the native encoding.
>>
> Exactly, I want to be able to support Chinese etc. as well while
> running in a West-European locale. This is also what misled me: I
> thought R was actually reading it like that, but the character is part
> of my local locale, so I didn't notice. Especially as it was being
> printed correctly; I only noticed after printing the literal values.
>
>
>>
>>> If you find behavior regarding encodings in released versions of R
>>> that contradicts the current documentation, please report it with a
>>> minimal reproducible example; such cases should be fixed (even though
>>> sometimes the "fix" would just be changing the documentation, the
>>> effort really should now go into supporting UTF-8 for real).
>>> Specifically with "mathotString", you might try creating an example
>>> that does not involve any package (just calls to parse with encoding
>>> options set), and only then gradually add more of the package loading
>>> if that does not reproduce it. It would be important to know the
>>> current encoding (sessionInfo, l10n_info).
>>
>> Well, the reason I mailed the mailing list was that I couldn't for the
>> life of me find any documentation that told me anything in particular
>> about how literal strings are supposed to be stored in memory. But it
>> just seems logical to me that if R already supports parsing and
>> loading a package encoded in UTF-8, and it supports having UTF-8
>> strings in memory next to strings in the native encoding, the most
>> straightforward way of loading these literal strings would be in
>> UTF-8.
>>
>> You mean the memory representation? For that there would be R
>> Internals and the sources; essentially there are CHARSXP objects,
>> which include an encoding tag (UTF-8, Latin-1 or native) and the raw
>> bytes. But you would not access these objects directly; instead, use
>> translateChar() if you need strings in the native encoding, or
>> translateCharUTF8() if in UTF-8, and this is documented in Writing R
>> Extensions.
>>
> Exactly: because gettext operates in C and the source files for that
> are also in UTF-8, the actual memory representation of the string in R
> needs to be identical, otherwise it won't work.
>
>> I think it would be really good if you could provide a complete,
>> minimal reproducible example of your problem.
>> It may be there is some misunderstanding; especially if you are
>> working with characters representable in the current encoding, there
>> should be no problem.
>>
> It depends on whether I now understand ?parse correctly, i.e. whether a
> package parsed with a specified encoding should have its strings in
> that encoding or not, as I wondered above.
>
>> I would love to use the new version of R that supports properly
>> interfacing with Windows 10. And given that the only other supported
>> version of Windows is 8.1, and barely anyone uses it, it might be
>> worth dropping support for that. I just hoped I could find a workable
>> solution without such a step.
>>
>> I understand; also, it may take a bit of time before this becomes
>> stable.
>>
> Of course. Hopefully I can still use my current workaround for the time
> being and then switch over to the UTF-8-ready version if it becomes
> production-ready at some point.
>
> Cheers,
> Joris
>
> Best
>> Tomas
>>
>>
>> Cheers,
>> Joris
>>
>>
>>>
>>> Best,
>>> Tomas
>>>
>>> [1]
>>> https://developer.r-project.org/Blog/public/2020/07/30/windows/utf-8-build-of-r-and-cran-packages/index.html
>>>
>>> >
>>> > Cheers,
>>> > Joris
>>> >
>>> >
>>> > On Wed, 16 Dec 2020 at 20:15, David Bosak <dbosa...@gmail.com> wrote:
>>> >
>>> >> Joris:
>>> >>
>>> >> I’ve fought with encoding problems on Windows a lot. Here are some
>>> >> general suggestions:
>>> >>
>>> >> 1. Put “@encoding UTF-8” on any Roxygen comments.
>>> >> 2. Put encoding = "UTF-8" on any functions like writeLines or
>>> >>    readLines that read/write to a text file.
>>> >> 3. This post:
>>> >>    https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/
>>> >>
>>> >> If you have a more specific problem, please describe it and we can
>>> >> try to help.
>>> >>
>>> >> David
>>> >>
>>> >> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for
>>> >> Windows 10
>>> >>
>>> >> *From:* jo...@jorisgoosen.nl
>>> >> *Sent:* Wednesday, December 16, 2020 1:52 PM
>>> >> *To:* r-package-devel@r-project.org
>>> >> *Subject:* [R-pkg-devel] Package Encoding and Literal Strings
>>> >>
>>> >> Hello All,
>>> >>
>>> >> Some context: I am one of the programmers of a software pkg
>>> >> (https://jasp-stats.org/) that uses an embedded instance of R to do
>>> >> statistics, and makes that a bit easier for people who are
>>> >> intimidated by R or who like to have something more GUI-oriented.
>>> >>
>>> >> We have been working on translating the interface but ran into
>>> >> several problems related to the encoding of strings. We prefer to
>>> >> use UTF-8 for everything, and this works wonderfully on Unix
>>> >> systems, as is to be expected.
>>> >>
>>> >> Windows, however, is a different matter. Currently I am working on
>>> >> some local changes to "do_gettext" and some related internal
>>> >> functions of R to be able to get UTF-8 encoded output from there.
>>> >>
>>> >> But I ran into a bit of a problem, and I think this mailing list is
>>> >> probably the best place to start.
>>> >>
>>> >> It seems that if I have an R package that specifies "Encoding:
>>> >> UTF-8" in DESCRIPTION, the literal strings inside the package are
>>> >> converted to the local codeset/codepage regardless of what I want.
>>> >>
>>> >> Is it possible to keep the strings in UTF-8 internally in such a
>>> >> pkg somehow?
>>> >>
>>> >> Best regards,
>>> >>
>>> >> Joris Goosen
>>> >> University of Amsterdam
>>> >>
>>> >> ______________________________________________
>>> >> R-package-devel@r-project.org mailing list
>>> >> https://stat.ethz.ch/mailman/listinfo/r-package-devel
>>> >>
>>>
>>
>

______________________________________________
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel