Hello Tomas, On Mon, 21 Dec 2020 at 21:21, Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
> Hi Joris,
>
> On 12/21/20 7:33 PM, jo...@jorisgoosen.nl wrote:
> > Hello Tomas,
> >
> > Thank you for the feedback; your summary of how things now work, and of
> > what goes wrong for the tao and mathot strings, confirms all of my
> > suspicions. It also describes my exact problem fairly well.
> >
> > It seems it does come down to R not keeping the UTF-8 encoding of the
> > literal strings on Windows with a "typical codepage" when loading a
> > package, despite reading them from file in that particular encoding and
> > also specifying the same in DESCRIPTION.
> > Meanwhile, `eval(parse(..., encoding="UTF-8"))` *does* keep the encoding
> > on the literal strings, which means there is some discrepancy between the
> > two. Does that mean loading a package uses a different path than
> > `eval(parse(..., encoding="UTF-8"))`?
>
> Yes, it must be a different path. The DESCRIPTION field defines what
> encoding the input is in, so that R can read it. It does not tell R how it
> should represent the strings internally. The behavior is ok, well, except
> for non-representable characters.
>
> > You mention:
> >
> > > Strings that cannot be represented in the native encoding like tao will
> > > get the escapes, and so cannot be converted back to UTF-8. This is not
> > > great, but I see it was the case already in 3.6 (so not a recent
> > > regression) and I don't think it would be worth the time trying to fix
> > > that - as discussed earlier, only switching to UTF-8 would fix all of
> > > these translations, not just one.
> >
> > Does "not a recent regression" mean it used to work the same for both and
> > keep the UTF-8 encoding?
> > I've tried R 3 and it already doesn't work there; I also tried 2.8 but
> > couldn't get my test package (simplified to use "charToRaw" instead of a
> > C call) to install there.
> > However, having this work would already be quite useful, as our custom
> > GUI on top of R is fully UTF-8 anyhow.
>
> By "not a recent regression" I meant it wasn't broken recently.
> It probably never worked the way you (and me, and probably everyone else)
> would like it to work; that is, it probably always translated to native
> encoding, because that was the only option short of rewriting all of our
> code, packages and external libraries to use UTF-16LE (as discussed
> before).

Too bad, but that was what I was afraid of in the first place.

> > And I would certainly be up for figuring out how to fix the regression
> > so that we can use this until your work on the UTF-8 version with UCRT
> > is released.
> > On the other hand, maybe this would not be the wisest investment of my
> > time.
>
> I bet your applications do more than just load a package and then access
> string literals in the code. And as soon as you do anything with those
> strings, R may translate them to native encoding (well, unless we document
> this does not happen, typically some code around connections, file paths,
> etc.). So, providing a shortcut for this case I am afraid wouldn't help
> you much. If the problem was just parsing, you could also use "\u" escapes
> in the literals as a workaround. Remember, parse(,encoding="UTF-8") could
> only work in single-byte encodings.

Ah yeah, the original problem with that was that the `xgettext` parsing
script doesn't know how to handle those escapes. But that means we will
just have to fix that then.

> > I've tried using the installer and toolchain you linked to in
> > https://developer.r-project.org/Blog/public/2020/07/30/windows/utf-8-build-of-r-and-cran-packages/index.html
> > and used that to compile our software.
> > This normally works with the Rtools toolchain, but it seems that "make"
> > is missing from your toolchain. When I build (our project with RInside
> > in it) using your toolchain at the beginning of PATH, and using
> > mingw32-make from rtools40, I run into problems with a missing
> > "cc1plus".
>
> Sorry, building native code is still involved with that demo.
> You would have to set PATHs, and maybe alter the installation or build
> from source, as described in
>
> https://svn.r-project.org/R-dev-web/trunk/WindowsBuilds/winutf8/winutf8.html
>
> What might actually be easier: you could try a current development
> version, I will send you a link.

I got the link and will have a go at that, and will reply there with any
remarks or questions.

Cheers,
Joris

> > If I read https://mxe.cc/ it seems it is meant for cross-compiling, not
> > locally on Windows?
> > Maybe that is what is going wrong.
> > But despite trying for quite a bit I couldn't get our software to
> > compile in such a way that it could link with R.
> > Which means I couldn't test if it solves our problem...
>
> You can compile native code locally on Windows; the toolchain includes a
> native compiler, and I build R packages natively as well.
> Cross-compilation is used to build the compiler toolchain and external
> libraries for packages.
>
> Cheers
> Tomas
>
> > Cheers,
> > Joris
> >
> > On Fri, 18 Dec 2020 at 18:05, Tomas Kalibera <tomas.kalib...@gmail.com>
> > wrote:
>
>> Hi Joris,
>>
>> thanks for the example. You can actually simply have Test.R assign the
>> two variables and then run
>>
>> Encoding(utf8StringsPkg1::mathotString)
>> charToRaw(utf8StringsPkg1::mathotString)
>> Encoding(utf8StringsPkg1::tao)
>> charToRaw(utf8StringsPkg1::tao)
>>
>> I tried on Linux, Windows/UTF-8 (the experimental version) and
>> Windows/latin-1 (released version). In all cases, both strings are
>> converted to native encoding. The mathotString is converted to latin-1
>> fine, because it is representable there. The tao string, when running in
>> a latin-1 locale, gets the escapes <xx>:
>>
>> "<e9><99><b6><e5><be><b7><e5><ba><86>"
>>
>> Btw, the parse(,encoding="UTF-8") hack works: when you parse the
>> modified Test.R file (with the two assignments) and eval the output, you
>> will get those strings in UTF-8.
>> But when you don't eval and print the parse tree in Rgui, it will not be
>> printed correctly (again a limitation of these hacks; they could only do
>> so much).
>>
>> When accessing strings from C, you should always be prepared for any
>> encoding in a CHARSXP, so when you want UTF-8, use "translateCharUTF8()"
>> instead of "CHAR()". That will work fine on representable strings like
>> mathotString, and that is conceptually the correct way to access them.
>>
>> Strings that cannot be represented in the native encoding, like tao,
>> will get the escapes, and so cannot be converted back to UTF-8. This is
>> not great, but I see it was the case already in 3.6 (so not a recent
>> regression) and I don't think it would be worth the time trying to fix
>> that - as discussed earlier, only switching to UTF-8 would fix all of
>> these translations, not just one. Btw, the example works fine on the
>> experimental UTF-8 build on Windows.
>>
>> I am sorry there is not a simple fix for non-representable characters.
>>
>> Best
>> Tomas
>>
>> On 12/18/20 1:53 PM, jo...@jorisgoosen.nl wrote:
>>
>> Hello Tomas,
>>
>> I have made a minimal example that demonstrates my problem:
>> https://github.com/JorisGoosen/utf8StringsPkg
>>
>> This package is encoded in UTF-8, as is Test.R. There is a little Rcpp
>> function in there I wrote that displays the bytes straight from R's CHAR
>> to be sure no conversion is happening.
>> I would expect mathotString to have "C3 B4" for "ô", but instead it gets
>> "F4", as you can see when you run
>> `utf8StringsPkg::testutf8_in_locale()`.
>> Cheers,
>> Joris
>>
>> On Fri, 18 Dec 2020 at 11:48, Tomas Kalibera <tomas.kalib...@gmail.com>
>> wrote:
>>
>>> On 12/17/20 6:43 PM, jo...@jorisgoosen.nl wrote:
>>>
>>> On Thu, 17 Dec 2020 at 18:22, Tomas Kalibera <tomas.kalib...@gmail.com>
>>> wrote:
>>>
>>>> On 12/17/20 5:17 PM, jo...@jorisgoosen.nl wrote:
>>>>
>>>> On Thu, 17 Dec 2020 at 10:46, Tomas Kalibera <tomas.kalib...@gmail.com>
>>>> wrote:
>>>>
>>>>> On 12/16/20 11:07 PM, jo...@jorisgoosen.nl wrote:
>>>>> > David,
>>>>> >
>>>>> > Thanks for the response!
>>>>> >
>>>>> > So the problem is a bit worse than just setting `encoding="UTF-8"` on
>>>>> > functions like readLines.
>>>>> > I'll describe our setup a bit:
>>>>> > We run R embedded in a separate executable and, through a whole bunch
>>>>> > of C(++) magic, get that to the main executable that runs the actual
>>>>> > interface. All the code that isn't R basically uses UTF-8. This works
>>>>> > well, and we've made sure that all of our source code is encoded
>>>>> > properly; I've verified that, for this particular problem at least,
>>>>> > my source file is definitely encoded in UTF-8 (I've checked a
>>>>> > hexdump).
>>>>> >
>>>>> > The simplest solution, which we initially took, to get R+Windows to
>>>>> > cooperate with everything is to simply set the locale to "C" before
>>>>> > starting R. That way R simply assumes UTF-8 is native, and everything
>>>>> > worked splendidly. Until, of course, a file needs to be opened in R
>>>>> > that contains some non-ASCII characters. I noticed the problem
>>>>> > because a Korean user had Hangul in his username and that broke
>>>>> > everything, because R was trying to convert to a different locale
>>>>> > than Windows was using.
>>>>>
>>>>> Setting the locale to "C" does not make R assume UTF-8 is the native
>>>>> encoding; there is no way to make UTF-8 the current native encoding in
>>>>> R on the current builds of R on Windows.
>>>>> This is an old limitation of Windows, only recently fixed by Microsoft
>>>>> in recent Windows 10 and with the UCRT Windows runtime (see my blog
>>>>> post [1] for more - to make R support this we need a new toolchain to
>>>>> build R).
>>>>>
>>>>> If you set the locale to the C encoding, you are telling R the native
>>>>> encoding is C/POSIX (essentially ASCII), not UTF-8. Encoding-sensitive
>>>>> operations, including conversions - including those conversions that
>>>>> happen without user control, e.g. for interacting with Windows - will
>>>>> produce incorrect results (garbage) or, in better cases, errors,
>>>>> warnings, or omitted, substituted or transliterated characters.
>>>>>
>>>>> In principle, setting the encoding via locale is dangerous on Windows,
>>>>> because Windows has two current encodings, not just one. By setting
>>>>> the locale you set the one used in the C runtime, but not the other
>>>>> one used by the system calls. If all code (in R, packages, external
>>>>> libraries) were perfect, this would still work as long as all strings
>>>>> used were representable in both encodings. For other strings it won't
>>>>> work, and the code is not perfect in this regard; it is usually
>>>>> written assuming there is one current encoding, which common sense
>>>>> dictates should be the case. With the recent UTF-8 support ([1]), one
>>>>> can switch both of these to UTF-8.
>>>>
>>>> Well, this is exactly why I want to get rid of the situation. But this
>>>> messes up the output, because everything else expects UTF-8, which is
>>>> why I'm looking for some kind of solution.
>>>>
>>>>> > The solution I've now been working on is:
>>>>> > I took the source code of R 4.0.3 and changed the backend of
>>>>> > "gettext" to add an `encoding="something something"` option. And a
>>>>> > bit of extra stuff like `bind_textdomain_codeset` in case I need to
>>>>> > tweak the codeset/charset that gettext uses.
>>>>> > I think I've got that working properly now, and once I solve the
>>>>> > problem of the encoding in a pkg I will open a bug report/feature
>>>>> > request and I'll add a patch that implements it.
>>>>>
>>>>> A number of similar "shortcuts" have been added to R in the past, but
>>>>> they make the code more complex, harder to maintain and use, and can't
>>>>> realistically solve all of these problems anyway. Strings will
>>>>> eventually be assumed to be in what is the current native encoding -
>>>>> by the C library, in R, in any external code R uses, or in code R
>>>>> packages use. Now that Microsoft is finally supporting UTF-8, the way
>>>>> to get out of this is switching to UTF-8. This needs only small
>>>>> changes to R source code compared to those "shortcuts" (or to using
>>>>> UTF-16LE). I'd be against polluting the code with any more
>>>>> "shortcuts".
>>>>
>>>> I think the addition of "bind_textdomain_codeset" is not strictly
>>>> necessary and can be left out, because I think setting an environment
>>>> variable such as "OUTPUT_CHARSET=UTF-8" gives the same result for us.
>>>> The addition of the "encoding" option to the internal "do_gettext" is
>>>> just a few lines of code, and I also undid some duplication between
>>>> do_gettext and do_ngettext, which should make it easier to maintain.
>>>> But all of that is moot if there is no way to keep the literal strings
>>>> from sources in UTF-8 anyhow.
>>>>
>>>> Before starting on this I did actually read your blog post about UTF-8
>>>> several times, and it seems like the best way forward. Not to mention
>>>> it would make my life easier, and me happier, when I can stop worrying
>>>> about Windows/DOS codepages! Thank you for your work on it indeed!
>>>>
>>>> But my problem with that is that a number of people still use an older
>>>> version of Windows, and your solution won't work there.
>>>> That would mean that we either drop support for them, or they would
>>>> have to live with weird-looking translations, or I have to go back to
>>>> the suboptimal solution of the "C" locale, which I really do want to
>>>> avoid because, as you said, it breaks other stuff in unpredictable
>>>> ways.
>>>>
>>>> The number of people using too old a version of Windows should be small
>>>> by the time this could become ready for production. Windows 8.1 is
>>>> still supported, but there is the free upgrade to Windows 10 (also from
>>>> the no longer supported Windows 7), so this should not be a problem for
>>>> desktop machines. It will be a problem for servers.
>>>
>>> Well, I would not expect anyone to use a GUI-heavy application meant for
>>> researchers on a server anyway, so that would be fine.
>>>
>>>>> > The problem I'm stuck with now is simply this:
>>>>> > I have an R pkg here that I want to test the translations with, and
>>>>> > the code is definitely saved as UTF-8, the package has "Encoding:
>>>>> > UTF-8" in the DESCRIPTION, and it all loads and works. The
>>>>> > particular problem I have is that the R code contains literally:
>>>>> > `mathotString <- "Mathôt!"`
>>>>> > The actual file contains the hexadecimal representation of ô as
>>>>> > proper UTF-8, "0xC3 0xB4", but R turns it into "0xF4".
>>>>> > Seemingly on loading the package, because I haven't done anything
>>>>> > with it except put it in my debug C function to print its contents
>>>>> > as hexadecimals...
>>>>> >
>>>>> > The only thing I want to achieve here is that when R loads the
>>>>> > package it keeps those strings in their original UTF-8 encoding,
>>>>> > without converting them to "native" or the strange Unicode code
>>>>> > point it seemingly placed in there instead. Because otherwise I
>>>>> > cannot get gettext to work fully in UTF-8 mode.
>>>>> >
>>>>> > Is this already possible in R?
>>>>> In principle, working with strings not representable in the current
>>>>> encoding is not reliable (and never will be). It can still work in
>>>>> some specific cases and uses. Parsing a UTF-8 string literal from a
>>>>> file, with correctly declared encoding as documented in WRE, should
>>>>> work at least in single-byte encodings. But what happens after that
>>>>> string is parsed is another thing. The parsing is based internally on
>>>>> using these "shortcuts", that is, lying to a part of the parser about
>>>>> the encoding and telling the rest of the parser that it is really
>>>>> something else (not native, but UTF-8).
>>>>
>>>> So the reason the string literals are turned into the local encoding is
>>>> because setting the "Encoding" on a package is essentially a hack?
>>>>
>>>> String literals may be turned into the local encoding because that is
>>>> how R/packages/external software is written - it needs the native
>>>> encoding. Hacks here come in when such code is given a string not in
>>>> the local encoding, assuming that under some conditions such code will
>>>> work. This includes a part of the parser, and a hack to implement the
>>>> "encoding" argument of "parse()", which allows parsing
>>>> (non-representable) UTF-8 strings when running in a single-byte locale
>>>> such as Latin-1 (see ?parse).
>>>
>>> So the same `parse` function is used for loading a package?
>>>
>>> Parsing for usual packages is done at build time, when they are
>>> serialized ("prepared for lazy loading"). I would have to look for the
>>> details in the code, but either way, if the input is in UTF-8 but the
>>> native encoding is different, either the input has to be converted to
>>> native encoding for the parser, or that hack is used where part of the
>>> parser is lied to about the encoding (either via "parse()" or another
>>> way). If you have a minimal reproducible example, I can help you find
>>> out whether the behavior seen is expected/documented/a bug.
>>> Because in that case I wonder if the "Encoding" option in "DESCRIPTION"
>>> is handled the same as `encoding=` in parse.
>>>
>>> ?parse states:
>>> > Character strings in the result will have a declared encoding if
>>> > encoding is "latin1" or "UTF-8", or if text is supplied with every
>>> > element of known encoding in a Latin-1 or UTF-8 locale.
>>>
>>> The sentence is a bit hard for me personally to parse, but I interpret
>>> the first part to mean that if "encoding" is specified as "UTF-8", all
>>> the character strings in the result will also have that encoding.
>>> Is that a correct interpretation?
>>> Because if so, I do believe I found a problem, and I will try to make a
>>> minimal reproducible example.
>>>
>>> Please look first at this part of "?parse":
>>>
>>> "encoding: encoding to be assumed for input strings. If the value is
>>> ‘"latin1"’ or ‘"UTF-8"’ it is used to mark character strings as known to
>>> be in Latin-1 or UTF-8: it is not used to re-encode the input. To do the
>>> latter, specify the encoding as part of the connection ‘con’ or _via_
>>> ‘options(encoding=)’: see the example under ‘file’. Arguments ‘encoding
>>> = "latin1"’ and ‘encoding = "UTF-8"’ are ignored with a warning when
>>> running in a MBCS locale."
>>>
>>> Together with the one you cite:
>>>
>>> "Character strings in the result will have a declared encoding if
>>> ‘encoding’ is ‘"latin1"’ or ‘"UTF-8"’, or if ‘text’ is supplied with
>>> every element of known encoding in a Latin-1 or UTF-8 locale."
>>>
>>> There are two things: which encoding strings are really encoded in, and
>>> which encoding they are declared to be in. Normally this should always
>>> be the same encoding (UTF-8, Latin-1, or the concrete known native
>>> encoding), but the "encoding=" argument allows playing with this.
>>> Strings declared to be in "native" encoding are for a while treated as
>>> being of (single-byte) unknown encoding, and eventually they are
>>> declared to be of the encoding from the "encoding=" argument. This only
>>> applies to strings declared as "native". When strings are declared as
>>> UTF-8 or Latin-1, they must be in that encoding, and are believed to be
>>> in it; the "encoding=" argument does not affect those.
>>>
>>> So, when your inputs are declared as UTF-8, the "encoding=" hack should
>>> not apply to them. Also note that ASCII strings are never declared to be
>>> UTF-8 or Latin-1; they are always declared as "native" (and ASCII is
>>> assumed to be a subset of all encodings). But your inputs probably are
>>> not declared to be in UTF-8 (note this is "declared" wrt the Encoding()
>>> R function, the encoding flag that character objects in R have), because
>>> you are probably parsing from a file. I'd really need a reproducible
>>> example to be able to explain what you are seeing.
>>>
>>> Best
>>> Tomas
>>>
>>>>> The part that is being "lied to" may get confused or not. It would not
>>>>> when the real native encoding is, say, Latin-1, a common case in the
>>>>> past for which the hack was created, but it might when it is a
>>>>> double-byte encoding that conflicts with the text being parsed in
>>>>> dangerous ways. This is also why this hack only makes sense for string
>>>>> literals (and comments), and still only to a limit, as the strings may
>>>>> be misinterpreted later after parsing.
>>>>
>>>> Well, our case is entirely limited to string literals that are
>>>> presented to the user through an all-UTF-8 interface, so I would assume
>>>> none of the edge cases would come into play. Any system paths and
>>>> things like that would still be in the local encoding.
>>>>> So a really short summary is: you can only reliably use strings
>>>>> representable in the current encoding in R, and that encoding cannot
>>>>> be UTF-8 on Windows in released versions of R. There is an
>>>>> experimental version, see [1]; if you could experiment with that, see
>>>>> whether it might work for your applications, and try to find and
>>>>> report bugs there (e.g. to me directly), that would be useful.
>>>>
>>>> So when I read in certain R documentation that strings can have a
>>>> "UTF-8" encoding in R, is this not true?
>>>> As in, when I read documentation such as
>>>> https://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html
>>>> it really seems to indicate to me that UTF-8 is in fact supported in R
>>>> on Windows.
>>>> My assumption was that R uses `translateChar` internally to make sure
>>>> it is in the right encoding before interfacing with the OS and other
>>>> places where this might matter.
>>>>
>>>> UTF-8 is supported in R on Windows in many ways, as documented. As long
>>>> as you are using UTF-8 strings representable in the current encoding,
>>>> so that they can be converted to native encoding and back without
>>>> problems, you are fine; R will do the conversions as needed. The
>>>> trouble comes when such a conversion is not possible. In the example of
>>>> the parser, without the "encoding=" argument to "parse()", the parser
>>>> will just work on any text you give to it, even when the text is in
>>>> UTF-8: it will work by first converting to native encoding and then
>>>> doing the parsing, no hacks involved. When interacting with external
>>>> software, you'd just tell R to provide the strings in the encoding
>>>> needed by that external software, so possibly UTF-8, so possibly
>>>> convert, but all would work fine. The problem is characters not
>>>> representable in the native encoding.
>>> Exactly, I want to be able to support Chinese etc. as well while
>>> running in a West European locale.
>>> This is also what misled me, because I thought it was actually reading
>>> it like that, but the character is part of my local locale so I didn't
>>> notice it, especially as it was being printed correctly. I only noticed
>>> after printing the literal values.
>>>
>>>>> If you find behavior re encodings in released versions of R that
>>>>> contradicts the current documentation, please report it with a minimal
>>>>> reproducible example; such cases should be fixed (even though
>>>>> sometimes the "fix" would be just changing the documentation - the
>>>>> effort really should now be on supporting UTF-8 for real).
>>>>> Specifically with "mathotString", you might try creating an example
>>>>> that does not include any package (just calls to parse with encoding
>>>>> options set), only then gradually adding more of the package loading
>>>>> if that does not reproduce. It would be important to know the current
>>>>> encoding (sessionInfo, l10n_info).
>>>>
>>>> Well, the reason I mailed the mailing list was because I couldn't for
>>>> the life of me find any documentation that told me anything in
>>>> particular about how literal strings are supposed to be stored in
>>>> memory. But it just seems logical to me that if R already supports
>>>> parsing and loading a package encoded with UTF-8, and it supports
>>>> having UTF-8 strings in memory next to strings in native encoding, the
>>>> most straightforward way of loading these literal strings would be in
>>>> UTF-8.
>>>>
>>>> You mean the memory representation? For that there would be R Internals
>>>> and the sources; essentially there are CHARSXP objects which include an
>>>> encoding tag (UTF-8, Latin-1 or native) and the raw bytes.
>>>> But you would not access these objects directly; instead use
>>>> translateChar() if you need them in native encoding, or
>>>> translateCharUTF8() if in UTF-8, and this is documented in Writing R
>>>> Extensions.
>>>
>>> Exactly, because gettext operates in C and the source files for that
>>> are also in UTF-8, the actual memory representation of the string in R
>>> needs to be identical, otherwise it won't work.
>>>
>>>> I think it would be really good if you could provide a complete,
>>>> minimal reproducible example of your problem. It may be there is some
>>>> misunderstanding; especially if you are working with characters
>>>> representable in the current encoding, there should be no problem.
>>>
>>> It depends on whether I now understand ?parse correctly, in that a
>>> package parsed with the specified encoding should have its strings in
>>> that encoding or not, as I wondered above.
>>>
>>>> I would love to use the new version of R that supports properly
>>>> interfacing with Windows 10. And given that the only other supported
>>>> version of Windows is 8.1, which barely anyone uses, it might be worth
>>>> dropping support for that. I just hoped I could find a workable
>>>> solution without such a step.
>>>>
>>>> I understand; also it may take a bit of time before this would become
>>>> stable.
>>>
>>> Of course.
>>> Hopefully I can still use my current workaround for the time being and
>>> then switch over to the UTF-8-ready version if it becomes
>>> production-ready at some point.
>>> >>> Cheers, >>> Joris >>> >>> Best >>>> Tomas >>>> >>>> >>>> Cheers, >>>> Joris >>>> >>>> >>>>> >>>>> Best, >>>>> Tomas >>>>> >>>>> [1] >>>>> >>>>> https://developer.r-project.org/Blog/public/2020/07/30/windows/utf-8-build-of-r-and-cran-packages/index.html >>>>> >>>>> > >>>>> > Cheers, >>>>> > Joris >>>>> >>>>> > >>>>> > >>>>> > On Wed, 16 Dec 2020 at 20:15, David Bosak <dbosa...@gmail.com> >>>>> wrote: >>>>> > >>>>> >> Joris: >>>>> >> >>>>> >> >>>>> >> >>>>> >> I’ve fought with encoding problems on Windows a lot. Here are some >>>>> >> general suggestions. >>>>> >> >>>>> >> >>>>> >> >>>>> >> 1. Put “@encoding UTF-8” on any Roxygen comments. >>>>> >> 2. Put “encoding = “UTF-8” on any functions like writeLines or >>>>> >> readLines that read/write to a text file. >>>>> >> 3. This post: >>>>> >> >>>>> https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/ >>>>> >> >>>>> >> >>>>> >> >>>>> >> If you have a more specific problem, please describe and we can try >>>>> to >>>>> >> help. >>>>> >> >>>>> >> >>>>> >> >>>>> >> David >>>>> >> >>>>> >> >>>>> >> >>>>> >> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for >>>>> >> Windows 10 >>>>> >> >>>>> >> >>>>> >> >>>>> >> *From: *jo...@jorisgoosen.nl >>>>> >> *Sent: *Wednesday, December 16, 2020 1:52 PM >>>>> >> *To: *r-package-devel@r-project.org >>>>> >> *Subject: *[R-pkg-devel] Package Encoding and Literal Strings >>>>> >> >>>>> >> >>>>> >> >>>>> >> Hello All, >>>>> >> >>>>> >> >>>>> >> >>>>> >> Some context, I am one of the programmers of a software pkg ( >>>>> >> >>>>> >> https://jasp-stats.org/) that uses an embedded instance of R to do >>>>> >> >>>>> >> statistics. And make that a bit easier for people who are >>>>> intimidated by R >>>>> >> >>>>> >> or like to have something more GUI oriented. >>>>> >> >>>>> >> >>>>> >> >>>>> >> >>>>> >> >>>>> >> We have been working on translating the interface but ran into >>>>> several >>>>> >> >>>>> >> problems related to encoding of strings. 
We prefer to use UTF-8 for >>>>> >> >>>>> >> everything and this works wonderful on unix systems, as is to be >>>>> expected. >>>>> >> >>>>> >> >>>>> >> >>>>> >> Windows however is a different matter. Currently I am working on >>>>> some local >>>>> >> >>>>> >> changes to "do_gettext" and some related internal functions of R to >>>>> be able >>>>> >> >>>>> >> to get UTF-8 encoded output from there. >>>>> >> >>>>> >> >>>>> >> >>>>> >> But I ran into a bit of a problem and I think this mailinglist is >>>>> probably >>>>> >> >>>>> >> the best place to start. >>>>> >> >>>>> >> >>>>> >> >>>>> >> It seems that if I have an R package that specifies "Encoding: >>>>> UTF-8" in >>>>> >> >>>>> >> DESCRIPTION the literal strings inside the package are converted to >>>>> the >>>>> >> >>>>> >> local codeset/codepage regardless of what I want. >>>>> >> >>>>> >> >>>>> >> >>>>> >> Is it possible to keep the strings in UTF-8 internally in such a pkg >>>>> >> >>>>> >> somehow? >>>>> >> >>>>> >> >>>>> >> >>>>> >> Best regards, >>>>> >> >>>>> >> Joris Goosen >>>>> >> >>>>> >> University of Amsterdam >>>>> >> >>>>> >> >>>>> >> >>>>> >> [[alternative HTML version deleted]] >>>>> >> >>>>> >> >>>>> >> >>>>> >> ______________________________________________ >>>>> >> >>>>> >> R-package-devel@r-project.org mailing list >>>>> >> >>>>> >> https://stat.ethz.ch/mailman/listinfo/r-package-devel >>>>> >> >>>>> >> >>>>> >> >>>>> > [[alternative HTML version deleted]] >>>>> > >>>>> > ______________________________________________ >>>>> > R-package-devel@r-project.org mailing list >>>>> > https://stat.ethz.ch/mailman/listinfo/r-package-devel >>>>> >>>>> >>>>> >>>> >>> >> > [[alternative HTML version deleted]] ______________________________________________ R-package-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-package-devel