Hello Tomas, On Mon, 21 Dec 2020 at 21:21, Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
> Hi Joris,
>
> On 12/21/20 7:33 PM, jo...@jorisgoosen.nl wrote:
> > Hello Tomas,
> >
> > Thank you for the feedback; your summary of how things now work, and of
> > what goes wrong for the tao and mathot strings, confirms all of my
> > suspicions. It also describes my exact problem fairly well.
> >
> > It seems it does come down to R not keeping the UTF-8 encoding of the
> > literal strings on Windows with a "typical codepage" when loading a
> > package, despite reading them from file in that particular encoding and
> > also specifying the same in DESCRIPTION.
> > Meanwhile, `eval(parse(..., encoding="UTF-8"))` *does* keep the encoding
> > on the literal strings, which means there is some discrepancy between the
> > two. Does that mean loading a package uses a different path than
> > `eval(parse(..., encoding="UTF-8"))`?
>
> Yes, it must be a different path. The DESCRIPTION field defines what
> encoding the input is in, so that R can read it. It does not tell R how it
> should represent the strings internally. The behavior is ok, well, except
> for non-representable characters.
>
> > You mention:
> >
> > > Strings that cannot be represented in the native encoding like tao will
> > > get the escapes, and so cannot be converted back to UTF-8. This is not
> > > great, but I see it was the case already in 3.6 (so not a recent
> > > regression) and I don't think it would be worth the time trying to fix
> > > that - as discussed earlier, only switching to UTF-8 would fix all of
> > > these translations, not just one.
> >
> > Does "not a recent regression" mean it used to work the same for both and
> > keep the UTF-8 encoding?
> > I've tried R 3 and it already doesn't work there; I also tried 2.8 but
> > couldn't get my test package (simplified to use "charToRaw" instead of a
> > C call) to install there.
> > However, having this work would already be quite useful, as our custom
> > GUI on top of R is fully UTF-8 anyhow.
>
> By "not a recent regression" I meant it wasn't broken recently.
> It probably never worked the way you (and me, and probably everyone else)
> would like it to work; that is, it probably always translated to native
> encoding, because that was the only option short of rewriting all of our
> code, packages and external libraries to use UTF-16LE (as discussed
> before).

Too bad, but that was what I was afraid of in the first place.

> > And I would certainly be up for figuring out how to fix the regression
> > so that we can use this until your work on the UTF-8 version with UCRT
> > is released.
> > On the other hand, maybe this would not be the wisest investment of my
> > time.
>
> I bet your applications do more than just load a package and then access
> string literals in the code. And as soon as you do anything with those
> strings, R may translate them to native encoding (well, unless we document
> this does not happen, typically some code around connections, file paths,
> etc.). So, providing a shortcut for this case I am afraid wouldn't help
> you much. If the problem was just parsing, you could also use "\u" escapes
> in the literals as a workaround. Remember, parse(,encoding="UTF-8") could
> only work in single-byte encodings.

Ah yeah, the original problem with that was that the `xgettext` parsing
script doesn't know how to handle those escapes. But that means we will
just have to fix that then.

> > I've tried using the installer and toolchain you linked to in
> > https://developer.r-project.org/Blog/public/2020/07/30/windows/utf-8-build-of-r-and-cran-packages/index.html
> > and used that to compile our software.
> > This normally works with the Rtools toolchain, but it seems that "make"
> > is missing from your toolchain. When I build (our project with RInside
> > in it) using your toolchain at the beginning of PATH, and using
> > mingw32-make from rtools40, I run into problems with a missing
> > "cc1plus".
>
> Sorry, building native code is still involved with that demo.
> You would have to set PATHs, and maybe alter the installation or build
> from source, as described in
>
> https://svn.r-project.org/R-dev-web/trunk/WindowsBuilds/winutf8/winutf8.html
>
> What might actually be easier: you could try a current development
> version, I will send you a link.

I got the link and will have a go at that, and will reply there with any
remarks or questions.

Cheers,
Joris

> > If I read https://mxe.cc/ it seems it is meant for cross-compiling, not
> > locally on Windows?
> > Maybe that is what is going wrong.
> > But despite trying for quite a bit I couldn't get our software to
> > compile in such a way that it could link with R.
> > Which means I couldn't test if it solves our problem...
>
> You can compile native code locally on Windows; the toolchain includes a
> native compiler, and I build R packages natively as well.
> Cross-compilation is used to build the compiler toolchain and external
> libraries for packages.
>
> Cheers
> Tomas
>
> > Cheers,
> > Joris
> >
> > On Fri, 18 Dec 2020 at 18:05, Tomas Kalibera <tomas.kalib...@gmail.com>
> > wrote:
>
>> Hi Joris,
>>
>> thanks for the example. You can actually simply have Test.R assign the
>> two variables and then run
>>
>> Encoding(utf8StringsPkg1::mathotString)
>> charToRaw(utf8StringsPkg1::mathotString)
>> Encoding(utf8StringsPkg1::tao)
>> charToRaw(utf8StringsPkg1::tao)
>>
>> I tried on Linux, Windows/UTF-8 (the experimental version) and
>> Windows/latin-1 (released version). In all cases, both strings are
>> converted to native encoding. The mathotString is converted to latin-1
>> fine, because it is representable there. The tao string, when running in
>> a latin-1 locale, gets the escapes <xx>:
>>
>> "<e9><99><b6><e5><be><b7><e5><ba><86>"
>>
>> Btw, the parse(,encoding="UTF-8") hack works: when you parse the
>> modified Test.R file (with the two assignments) and eval the output, you
>> will get those strings in UTF-8.
>> But when you don't eval and print the parse tree in Rgui, it will not be
>> printed correctly (again a limitation of these hacks; they could only do
>> so much).
>>
>> When accessing strings from C, you should always be prepared for any
>> encoding in a CHARSXP, so when you want UTF-8, use "translateCharUTF8()"
>> instead of "CHAR()". That will work fine on representable strings like
>> mathotString, and that is conceptually the correct way to access them.
>>
>> Strings that cannot be represented in the native encoding, like tao,
>> will get the escapes, and so cannot be converted back to UTF-8. This is
>> not great, but I see it was the case already in 3.6 (so not a recent
>> regression) and I don't think it would be worth the time trying to fix
>> that - as discussed earlier, only switching to UTF-8 would fix all of
>> these translations, not just one. Btw, the example works fine on the
>> experimental UTF-8 build on Windows.
>>
>> I am sorry there is not a simple fix for non-representable characters.
>>
>> Best
>> Tomas
>>
>> On 12/18/20 1:53 PM, jo...@jorisgoosen.nl wrote:
>>
>> Hello Tomas,
>>
>> I have made a minimal example that demonstrates my problem:
>> https://github.com/JorisGoosen/utf8StringsPkg
>>
>> This package is encoded in UTF-8, as is Test.R. There is a little Rcpp
>> function in there I wrote that displays the bytes straight from R's CHAR
>> to be sure no conversion is happening.
>> I would expect mathotString to have "C3 B4" for "ô", but instead it gets
>> "F4", as you can see when you run
>> `utf8StringsPkg::testutf8_in_locale()`.
>> Cheers,
>> Joris
>>
>> On Fri, 18 Dec 2020 at 11:48, Tomas Kalibera <tomas.kalib...@gmail.com>
>> wrote:
>>
>>> On 12/17/20 6:43 PM, jo...@jorisgoosen.nl wrote:
>>>
>>> On Thu, 17 Dec 2020 at 18:22, Tomas Kalibera <tomas.kalib...@gmail.com>
>>> wrote:
>>>
>>>> On 12/17/20 5:17 PM, jo...@jorisgoosen.nl wrote:
>>>>
>>>> On Thu, 17 Dec 2020 at 10:46, Tomas Kalibera <tomas.kalib...@gmail.com>
>>>> wrote:
>>>>
>>>>> On 12/16/20 11:07 PM, jo...@jorisgoosen.nl wrote:
>>>>> > David,
>>>>> >
>>>>> > Thanks for the response!
>>>>> >
>>>>> > So the problem is a bit worse than just setting `encoding="UTF-8"` on
>>>>> > functions like readLines.
>>>>> > I'll describe our setup a bit:
>>>>> > We run R embedded in a separate executable and, through a whole bunch
>>>>> > of C(++) magic, get that to the main executable that runs the actual
>>>>> > interface. All the code that isn't R basically uses UTF-8. This works
>>>>> > well, and we've made sure that all of our source code is encoded
>>>>> > properly; I've verified that, for this particular problem at least,
>>>>> > my source file is definitely encoded in UTF-8 (I've checked a
>>>>> > hexdump).
>>>>> >
>>>>> > The simplest solution, which we initially took, to get R+Windows to
>>>>> > cooperate with everything is to simply set the locale to "C" before
>>>>> > starting R. That way R simply assumes UTF-8 is native, and everything
>>>>> > worked splendidly. Until, of course, a file needs to be opened in R
>>>>> > that contains some non-ASCII characters. I noticed the problem
>>>>> > because a Korean user had Hangul in his username and that broke
>>>>> > everything, because R was trying to convert to a different locale
>>>>> > than Windows was using.
>>>>>
>>>>> Setting the locale to "C" does not make R assume UTF-8 is the native
>>>>> encoding; there is no way to make UTF-8 the current native encoding in
>>>>> R on the current builds of R on Windows.
>>>>> This is an old limitation of Windows, only recently fixed by Microsoft
>>>>> in recent Windows 10 and with the UCRT Windows runtime (see my blog
>>>>> post [1] for more - to make R support this we need a new toolchain to
>>>>> build R).
>>>>>
>>>>> If you set the locale to the C encoding, you are telling R the native
>>>>> encoding is C/POSIX (essentially ASCII), not UTF-8. Encoding-sensitive
>>>>> operations, including conversions - including those conversions that
>>>>> happen without user control, e.g. for interacting with Windows - will
>>>>> produce incorrect results (garbage) or, in better cases, errors,
>>>>> warnings, or omitted, substituted or transliterated characters.
>>>>>
>>>>> In principle, setting the encoding via locale is dangerous on Windows,
>>>>> because Windows has two current encodings, not just one. By setting
>>>>> the locale you set the one used in the C runtime, but not the other
>>>>> one used by the system calls. If all code (in R, packages, external
>>>>> libraries) were perfect, this would still work as long as all strings
>>>>> used were representable in both encodings. For other strings it won't
>>>>> work, and the code is not perfect in this regard; it is usually
>>>>> written assuming there is one current encoding, which common sense
>>>>> dictates should be the case. With the recent UTF-8 support ([1]), one
>>>>> can switch both of these to UTF-8.
>>>>
>>>> Well, this is exactly why I want to get rid of the situation. But this
>>>> messes up the output, because everything else expects UTF-8, which is
>>>> why I'm looking for some kind of solution.
>>>>
>>>>> > The solution I've now been working on is:
>>>>> > I took the source code of R 4.0.3 and changed the backend of
>>>>> > "gettext" to add an `encoding="something something"` option. And a
>>>>> > bit of extra stuff like `bind_textdomain_codeset` in case I need to
>>>>> > tweak the codeset/charset that gettext uses.
>>>>> > I think I've got that working properly now, and once I solve the
>>>>> > problem of the encoding in a pkg I will open a bug report/feature
>>>>> > request and I'll add a patch that implements it.
>>>>>
>>>>> A number of similar "shortcuts" have been added to R in the past, but
>>>>> they make the code more complex, harder to maintain and use, and can't
>>>>> realistically solve all of these problems anyway. Strings will
>>>>> eventually be assumed to be in what is the current native encoding -
>>>>> by the C library, in R, in any external code R uses, or in code R
>>>>> packages use. Now that Microsoft is finally supporting UTF-8, the way
>>>>> to get out of this is switching to UTF-8. This needs only small
>>>>> changes to R source code compared to those "shortcuts" (or to using
>>>>> UTF-16LE). I'd be against polluting the code with any more
>>>>> "shortcuts".
>>>>
>>>> I think the addition of "bind_textdomain_codeset" is not strictly
>>>> necessary and can be left out, because I think setting an environment
>>>> variable such as "OUTPUT_CHARSET=UTF-8" gives the same result for us.
>>>> The addition of the "encoding" option to the internal "do_gettext" is
>>>> just a few lines of code, and I also undid some duplication between
>>>> do_gettext and do_ngettext, which should make it easier to maintain.
>>>> But all of that is moot if there is no way to keep the literal strings
>>>> from sources in UTF-8 anyhow.
>>>>
>>>> Before starting on this I did actually read your blog post about UTF-8
>>>> several times, and it seems like the best way forward. Not to mention
>>>> it would make my life easier, and me happier, when I can stop worrying
>>>> about Windows/DOS codepages! Thank you for your work on it indeed!
>>>>
>>>> But my problem with that is that a number of people still use an older
>>>> version of Windows, and your solution won't work there.
>>>> That would mean that we either drop support for them, or they would
>>>> have to live with weird-looking translations, or I have to go back to
>>>> the suboptimal solution of the "C" locale, which I really do want to
>>>> avoid because, as you said, it breaks other stuff in unpredictable
>>>> ways.
>>>>
>>>> The number of people using too old a version of Windows should be small
>>>> by the time this could become ready for production. Windows 8.1 is
>>>> still supported, but there is the free upgrade to Windows 10 (also from
>>>> the no longer supported Windows 7), so this should not be a problem for
>>>> desktop machines. It will be a problem for servers.
>>>
>>> Well, I would not expect anyone to use a GUI-heavy application meant for
>>> researchers on a server anyway, so that would be fine.
>>>
>>>>> > The problem I'm stuck with now is simply this:
>>>>> > I have an R pkg here that I want to test the translations with, and
>>>>> > the code is definitely saved as UTF-8, the package has "Encoding:
>>>>> > UTF-8" in the DESCRIPTION, and it all loads and works. The
>>>>> > particular problem I have is that the R code contains literally:
>>>>> > `mathotString <- "Mathôt!"`
>>>>> > The actual file contains the hexadecimal representation of ô as
>>>>> > proper UTF-8, "0xC3 0xB4", but R turns it into "0xF4".
>>>>> > Seemingly on loading the package, because I haven't done anything
>>>>> > with it except put it in my debug C function to print its contents
>>>>> > as hexadecimals...
>>>>> >
>>>>> > The only thing I want to achieve here is that when R loads the
>>>>> > package it keeps those strings in their original UTF-8 encoding,
>>>>> > without converting them to "native" or the strange Unicode code
>>>>> > point it seemingly placed in there instead. Because otherwise I
>>>>> > cannot get gettext to work fully in UTF-8 mode.
>>>>> >
>>>>> > Is this already possible in R?
>>>>> In principle, working with strings not representable in the current
>>>>> encoding is not reliable (and never will be). It can still work in
>>>>> some specific cases and uses. Parsing a UTF-8 string literal from a
>>>>> file, with correctly declared encoding as documented in WRE, should
>>>>> work at least in single-byte encodings. But what happens after that
>>>>> string is parsed is another thing. The parsing is based internally on
>>>>> using these "shortcuts", that is, lying to a part of the parser about
>>>>> the encoding and telling the rest of the parser that it is really
>>>>> something else (not native, but UTF-8).
>>>>
>>>> So the reason the string literals are turned into the local encoding is
>>>> because setting the "Encoding" on a package is essentially a hack?
>>>>
>>>> String literals may be turned into the local encoding because that is
>>>> how R/packages/external software is written - it needs the native
>>>> encoding. Hacks here come in when such code is given a string not in
>>>> the local encoding, assuming that under some conditions such code will
>>>> work. This includes a part of the parser, and a hack to implement the
>>>> "encoding" argument of "parse()", which allows parsing
>>>> (non-representable) UTF-8 strings when running in a single-byte locale
>>>> such as Latin-1 (see ?parse).
>>>
>>> So the same `parse` function is used for loading a package?
>>>
>>> Parsing for usual packages is done at build time, when they are
>>> serialized ("prepared for lazy loading"). I would have to look for the
>>> details in the code, but either way, if the input is in UTF-8 but the
>>> native encoding is different, either the input has to be converted to
>>> native encoding for the parser, or that hack is used where part of the
>>> parser is lied to about the encoding (either via "parse()" or another
>>> way). If you have a minimal reproducible example, I can help you find
>>> out whether the behavior seen is expected/documented/a bug.
>>> Because in that case I wonder if the "Encoding" option in "DESCRIPTION"
>>> is handled the same as `encoding=` in parse.
>>>
>>> ?parse states:
>>> > Character strings in the result will have a declared encoding if
>>> > encoding is "latin1" or "UTF-8", or if text is supplied with every
>>> > element of known encoding in a Latin-1 or UTF-8 locale.
>>>
>>> The sentence is a bit hard for me personally to parse, but I interpret
>>> the first part to mean that if "encoding" is specified as "UTF-8", all
>>> the character strings in the result will also have that encoding.
>>> Is that a correct interpretation?
>>> Because if so, I do believe I found a problem, and I will try to make a
>>> minimal reproducible example.
>>>
>>> Please look first at this part of "?parse":
>>>
>>> "encoding: encoding to be assumed for input strings. If the value is
>>> ‘"latin1"’ or ‘"UTF-8"’ it is used to mark character strings as known to
>>> be in Latin-1 or UTF-8: it is not used to re-encode the input. To do the
>>> latter, specify the encoding as part of the connection ‘con’ or _via_
>>> ‘options(encoding=)’: see the example under ‘file’. Arguments ‘encoding
>>> = "latin1"’ and ‘encoding = "UTF-8"’ are ignored with a warning when
>>> running in a MBCS locale."
>>>
>>> Together with the one you cite:
>>>
>>> "Character strings in the result will have a declared encoding if
>>> ‘encoding’ is ‘"latin1"’ or ‘"UTF-8"’, or if ‘text’ is supplied with
>>> every element of known encoding in a Latin-1 or UTF-8 locale."
>>>
>>> There are two things: which encoding strings are really encoded in, and
>>> which encoding they are declared to be in. Normally this should always
>>> be the same encoding (UTF-8, Latin-1, or the concrete known native
>>> encoding), but the "encoding=" argument allows playing with this.
>>> Strings declared to be in "native" encoding are for a while treated as
>>> being of (single-byte) unknown encoding, and eventually they are
>>> declared to be of the encoding from the "encoding=" argument. This only
>>> applies to strings declared as "native". When strings are declared as
>>> UTF-8 or Latin-1, they must be in that encoding, and are believed to be
>>> in it; the "encoding=" argument does not affect those.
>>>
>>> So, when your inputs are declared as UTF-8, the "encoding=" hack should
>>> not apply to them. Also note that ASCII strings are never declared to be
>>> UTF-8 or Latin-1; they are always declared as "native" (and ASCII is
>>> assumed to be a subset of all encodings). But your inputs probably are
>>> not declared to be in UTF-8 (note this is "declared" wrt the Encoding()
>>> R function, the encoding flag that character objects in R have), because
>>> you are probably parsing from a file. I'd really need a reproducible
>>> example to be able to explain what you are seeing.
>>>
>>> Best
>>> Tomas
>>>
>>>>> The part that is being "lied to" may get confused or not. It would not
>>>>> when the real native encoding is, say, Latin-1, a common case in the
>>>>> past for which the hack was created, but it might when it is a
>>>>> double-byte encoding that conflicts with the text being parsed in
>>>>> dangerous ways. This is also why this hack only makes sense for string
>>>>> literals (and comments), and still only to a limit, as the strings may
>>>>> be misinterpreted later after parsing.
>>>>
>>>> Well, our case is entirely limited to string literals that are
>>>> presented to the user through an all-UTF-8 interface, so I would assume
>>>> none of the edge cases would come into play. Any system paths and
>>>> things like that would still be in the local encoding.
>>>>> So a really short summary is: you can only reliably use strings
>>>>> representable in the current encoding in R, and that encoding cannot
>>>>> be UTF-8 on Windows in released versions of R. There is an
>>>>> experimental version, see [1]; if you could experiment with that, see
>>>>> whether it might work for your applications, and try to find and
>>>>> report bugs there (e.g. to me directly), that would be useful.
>>>>
>>>> So when I read in certain R documentation that strings can have a
>>>> "UTF-8" encoding in R, is this not true?
>>>> As in, when I read documentation such as
>>>> https://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html
>>>> it really seems to indicate to me that UTF-8 is in fact supported in R
>>>> on Windows.
>>>> My assumption was that R uses `translateChar` internally to make sure
>>>> it is in the right encoding before interfacing with the OS and other
>>>> places where this might matter.
>>>>
>>>> UTF-8 is supported in R on Windows in many ways, as documented. As long
>>>> as you are using UTF-8 strings representable in the current encoding,
>>>> so that they can be converted to native encoding and back without
>>>> problems, you are fine; R will do the conversions as needed. The
>>>> trouble comes when such a conversion is not possible. In the example of
>>>> the parser, without the "encoding=" argument to "parse()", the parser
>>>> will just work on any text you give to it, even when the text is in
>>>> UTF-8: it will work by first converting to native encoding and then
>>>> doing the parsing, no hacks involved. When interacting with external
>>>> software, you'd just tell R to provide the strings in the encoding
>>>> needed by that external software, so possibly UTF-8, so possibly
>>>> convert, but all would work fine. The problem is characters not
>>>> representable in the native encoding.
>>> Exactly, I want to be able to support Chinese etc. as well while
>>> running in a West European locale.
>>> This is also what misled me, because I thought it was actually reading
>>> it like that, but the character is part of my local locale so I didn't
>>> notice it, especially as it was being printed correctly. I only noticed
>>> after printing the literal values.
>>>
>>>>> If you find behavior re encodings in released versions of R that
>>>>> contradicts the current documentation, please report it with a minimal
>>>>> reproducible example; such cases should be fixed (even though
>>>>> sometimes the "fix" would be just changing the documentation - the
>>>>> effort really should now be on supporting UTF-8 for real).
>>>>> Specifically with "mathotString", you might try creating an example
>>>>> that does not include any package (just calls to parse with encoding
>>>>> options set), only then gradually adding more of the package loading
>>>>> if that does not reproduce. It would be important to know the current
>>>>> encoding (sessionInfo, l10n_info).
>>>>
>>>> Well, the reason I mailed the mailing list was because I couldn't for
>>>> the life of me find any documentation that told me anything in
>>>> particular about how literal strings are supposed to be stored in
>>>> memory. But it just seems logical to me that if R already supports
>>>> parsing and loading a package encoded with UTF-8, and it supports
>>>> having UTF-8 strings in memory next to strings in native encoding, the
>>>> most straightforward way of loading these literal strings would be in
>>>> UTF-8.
>>>>
>>>> You mean the memory representation? For that there would be R Internals
>>>> and the sources; essentially there are CHARSXP objects which include an
>>>> encoding tag (UTF-8, Latin-1 or native) and the raw bytes.
>>>> But you would not access these objects directly; instead use
>>>> translateChar() if you need them in native encoding, or
>>>> translateCharUTF8() if in UTF-8, and this is documented in Writing R
>>>> Extensions.
>>>
>>> Exactly, because gettext operates in C and the source files for that
>>> are also in UTF-8, the actual memory representation of the string in R
>>> needs to be identical, otherwise it won't work.
>>>
>>>> I think it would be really good if you could provide a complete,
>>>> minimal reproducible example of your problem. It may be there is some
>>>> misunderstanding; especially if you are working with characters
>>>> representable in the current encoding, there should be no problem.
>>>
>>> It depends on whether I now understand ?parse correctly, in that a
>>> package parsed with the specified encoding should have its strings in
>>> that encoding or not, as I wondered above.
>>>
>>>> I would love to use the new version of R that supports properly
>>>> interfacing with Windows 10. And given that the only other supported
>>>> version of Windows is 8.1, which barely anyone uses, it might be worth
>>>> dropping support for that. I just hoped I could find a workable
>>>> solution without such a step.
>>>>
>>>> I understand; also it may take a bit of time before this would become
>>>> stable.
>>>
>>> Of course.
>>> Hopefully I can still use my current workaround for the time being and
>>> then switch over to the UTF-8-ready version if it becomes
>>> production-ready at some point.
>>> >>> Cheers, >>> Joris >>> >>> Best >>>> Tomas >>>> >>>> >>>> Cheers, >>>> Joris >>>> >>>> >>>>> >>>>> Best, >>>>> Tomas >>>>> >>>>> [1] >>>>> >>>>> https://developer.r-project.org/Blog/public/2020/07/30/windows/utf-8-build-of-r-and-cran-packages/index.html >>>>> >>>>> > >>>>> > Cheers, >>>>> > Joris >>>>> >>>>> > >>>>> > >>>>> > On Wed, 16 Dec 2020 at 20:15, David Bosak <dbosa...@gmail.com> >>>>> wrote: >>>>> > >>>>> >> Joris: >>>>> >> >>>>> >> >>>>> >> >>>>> >> I’ve fought with encoding problems on Windows a lot. Here are some >>>>> >> general suggestions. >>>>> >> >>>>> >> >>>>> >> >>>>> >> 1. Put “@encoding UTF-8” on any Roxygen comments. >>>>> >> 2. Put “encoding = “UTF-8” on any functions like writeLines or >>>>> >> readLines that read/write to a text file. >>>>> >> 3. This post: >>>>> >> >>>>> https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/ >>>>> >> >>>>> >> >>>>> >> >>>>> >> If you have a more specific problem, please describe and we can try >>>>> to >>>>> >> help. >>>>> >> >>>>> >> >>>>> >> >>>>> >> David >>>>> >> >>>>> >> >>>>> >> >>>>> >> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for >>>>> >> Windows 10 >>>>> >> >>>>> >> >>>>> >> >>>>> >> *From: *jo...@jorisgoosen.nl >>>>> >> *Sent: *Wednesday, December 16, 2020 1:52 PM >>>>> >> *To: *r-package-devel@r-project.org >>>>> >> *Subject: *[R-pkg-devel] Package Encoding and Literal Strings >>>>> >> >>>>> >> >>>>> >> >>>>> >> Hello All, >>>>> >> >>>>> >> >>>>> >> >>>>> >> Some context, I am one of the programmers of a software pkg ( >>>>> >> >>>>> >> https://jasp-stats.org/) that uses an embedded instance of R to do >>>>> >> >>>>> >> statistics. And make that a bit easier for people who are >>>>> intimidated by R >>>>> >> >>>>> >> or like to have something more GUI oriented. >>>>> >> >>>>> >> >>>>> >> >>>>> >> >>>>> >> >>>>> >> We have been working on translating the interface but ran into >>>>> several >>>>> >> >>>>> >> problems related to encoding of strings. 
We prefer to use UTF-8 for >>>>> >> >>>>> >> everything and this works wonderful on unix systems, as is to be >>>>> expected. >>>>> >> >>>>> >> >>>>> >> >>>>> >> Windows however is a different matter. Currently I am working on >>>>> some local >>>>> >> >>>>> >> changes to "do_gettext" and some related internal functions of R to >>>>> be able >>>>> >> >>>>> >> to get UTF-8 encoded output from there. >>>>> >> >>>>> >> >>>>> >> >>>>> >> But I ran into a bit of a problem and I think this mailinglist is >>>>> probably >>>>> >> >>>>> >> the best place to start. >>>>> >> >>>>> >> >>>>> >> >>>>> >> It seems that if I have an R package that specifies "Encoding: >>>>> UTF-8" in >>>>> >> >>>>> >> DESCRIPTION the literal strings inside the package are converted to >>>>> the >>>>> >> >>>>> >> local codeset/codepage regardless of what I want. >>>>> >> >>>>> >> >>>>> >> >>>>> >> Is it possible to keep the strings in UTF-8 internally in such a pkg >>>>> >> >>>>> >> somehow? >>>>> >> >>>>> >> >>>>> >> >>>>> >> Best regards, >>>>> >> >>>>> >> Joris Goosen >>>>> >> >>>>> >> University of Amsterdam >>>>> >> >>>>> >> >>>>> >> >>>>> >> [[alternative HTML version deleted]] >>>>> >> >>>>> >> >>>>> >> >>>>> >> ______________________________________________ >>>>> >> >>>>> >> R-package-devel@r-project.org mailing list >>>>> >> >>>>> >> https://stat.ethz.ch/mailman/listinfo/r-package-devel >>>>> >> >>>>> >> >>>>> >> >>>>> > [[alternative HTML version deleted]] >>>>> > >>>>> > ______________________________________________ >>>>> > R-package-devel@r-project.org mailing list >>>>> > https://stat.ethz.ch/mailman/listinfo/r-package-devel >>>>> >>>>> >>>>> >>>> >>> >> > [[alternative HTML version deleted]] ______________________________________________ R-package-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-package-devel