Re: [Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones

Tomas Kalibera Thu, 11 Apr 2019 00:56:01 -0700

On 4/11/19 9:10 AM, Tomáš Bořil wrote:

Or, if this cannot be done easily, please, disable the "utf-8" value
in source(..., ) function on Windows R.
source(..., encoding = "utf-8")
-> error: "utf-8" does not work right on Windows.
-> (or, at least) warning: "utf-8" is handled by "best fit" on Windows
and some characters in string literals may be automatically changed.


Because, at this state, the UTF-8 encoding of R source files on
Windows is a fake Unicode as it can handle only 256 different ANSI
characters in reality.

This is not a fair statement. source(,encoding="UTF-8") works as
documented. It translates from (full) UTF-8 to current native encoding,
which is documented. I believe the authors who made these design
decisions over a decade ago, under different circumstances, and
carefully implemented the code, tested, and documented for you to use
for free, deserve to be addressed with some respect. It is not their
responsibility to read the documentation for you, and if you had read
and understood it, you would not have used source(,encoding="UTF-8")
with characters not representable in current native encoding on Windows.
The authors should not be blamed for the fact that the design _today_
does not seem perfect for _today's_ systems (and how could they have
guessed at that time that Windows would still not support UTF-8 as
native encoding today).


Tomas

Thanks,
Tomas


On Thu, Apr 11, 2019 at 8:53 AM Tomáš Bořil <bor...@gmail.com> wrote:

For me, this would be a perfect solution.

I.e., do not use the “best” fit and leave it to user’s competence:
a) in some functions, utf-8 works
b) in others -> error is thrown (e.g., incomplete string, NA, etc.)
=> user has to change the code with his/her intentional “best fit string 
literal substitute” or use another function that can handle utf-8.

Making R code work correctly only on some platforms / trying to keep
backward compatibility in the sense of “the code does not do what you want
and the behaviour differs depending on each and every locale but at least it
does not throw an error” is generally not a good idea - it is dangerous.
Users / coders should know that there is something wrong with their strings
and that some characters are “eaten alive”.
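
A minimal sketch of how a user could get that kind of warning today, before
calling source(): scan the file for lines with non-ASCII bytes. The helper
name find_non_ascii_lines is hypothetical (it is not part of R), and the
check is deliberately conservative: it flags every non-ASCII character, not
only those that cannot be represented in the current native encoding.

```r
# Hypothetical helper (not part of R): return the numbers of the lines in a
# UTF-8 encoded file that contain at least one non-ASCII byte.
find_non_ascii_lines <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  has_high_byte <- vapply(
    lines,
    # Any byte above 127 belongs to a multi-byte (non-ASCII) UTF-8 sequence.
    function(l) any(as.integer(charToRaw(l)) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
  which(has_high_byte)
}

# Usage: write a small UTF-8 script and check it.
# "\u0159" is used instead of a literal character so this example is ASCII.
tmp <- tempfile(fileext = ".R")
writeLines(c('x <- "plain"', 'y <- "\u0159"'), tmp, useBytes = TRUE)
suspect <- find_non_ascii_lines(tmp)
if (length(suspect) > 0L) {
  warning("non-ASCII characters on line(s): ", paste(suspect, collapse = ", "))
}
```

Flagged lines could then be rewritten by hand, for example with \uxxxx
escapes in the string literals.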

Tomas

On Thu, Apr 11, 2019 at 8:26 AM Tomas Kalibera <tomas.kalib...@gmail.com>
wrote:

On 4/10/19 6:32 PM, Jeroen Ooms wrote:

On Wed, Apr 10, 2019 at 5:45 PM Duncan Murdoch <murdoch.dun...@gmail.com> wrote:

On 10/04/2019 10:29 a.m., Yihui Xie wrote:

Since it is "technically easy" to disable the best fit conversion and
the best fit is rarely good, how about providing an option for
code/package authors to disable it? I'm asking because this is one of
the most painful issues in packages that may need to source() code
containing UTF-8 characters that are not representable in the Windows
native encoding. Examples include knitr/rmarkdown and shiny. Basically
users won't be able to knit documents or run Shiny apps correctly when
the code contains characters that cannot be represented in the native
encoding.

Wouldn't things be worse with it disabled than currently?  I'd expect
the line containing the "ř" to end up as NA instead of converting to "r".

I don't think it would be worse, because in this case R would not
implicitly convert strings to (best fit) latin1 on Windows, but
instead keep the (correct) string in its UTF-8 encoding. The NA only
appears if the user explicitly forces a conversion to latin1, which is
not the problem here I think.
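
To illustrate the distinction, here is a small sketch of an explicit
conversion of a UTF-8 string to latin1 with iconv(). The exact result depends
on the platform's iconv implementation, so no output is asserted here: a
strict implementation is expected to give NA for the first call, and
substituted byte codes for the second.

```r
# "\u0159" is the letter r with caron; the escape keeps this example ASCII.
x <- "\u0159"

# The string itself is kept in UTF-8; nothing is lost until a conversion.
utf8ToInt(x)

# Explicit conversion to latin1, which has no such character.
# With the default sub = NA, an unconvertible input is expected to yield NA.
strict <- iconv(x, from = "UTF-8", to = "latin1")

# With sub = "byte", unconvertible bytes are replaced by their hex codes
# instead of the whole element becoming NA.
marked <- iconv(x, from = "UTF-8", to = "latin1", sub = "byte")

print(strict)
print(marked)
```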

The original problem that I can reproduce in RGui is that if you enter
"ř" in RGui, R opportunistically converts this to latin1, because it
can. However if you enter text which can definitely not be represented
in latin1, R encodes the string correctly in UTF-8 form.

Rgui is a "Windows Unicode" application (uses UTF16-LE) but it needs to
convert the input to native encoding before passing it to R, which is
based on locales. However, that string is passed by R to the parser,
which Rgui takes advantage of and converts non-representable characters
to their \uxxxx escapes which are understood by the parser. Using this
trick, Unicode characters can get to the parser from Rgui (but of course
then still at risk of conversion later when the program runs). Rgui only
escapes characters that cannot be represented; unfortunately, the
standard C99 API for that, as implemented on Windows, does the best fit. This
could be fixed in Rgui by calling a special Windows API function and
could be done, but with the mentioned risk that it would break existing
uses that capture the existing behavior.

This is the only place I know of where removing best fit would lead to
correct representation of UTF-8 characters. Other places will give NA,
some other escapes, code will fail to parse (e.g. "incomplete string",
one can get that easily with source()).
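
The escaping trick described above can be sketched in plain R. This is only
an illustration, not Rgui's actual code: the function name escape_non_ascii
is hypothetical, and for simplicity it escapes every non-ASCII character,
whereas Rgui escapes only those that cannot be represented in the native
encoding. It is meant for the contents of a string literal, since the parser
understands these escapes only inside strings.

```r
# Hypothetical helper (not part of R): replace each non-ASCII character of a
# single string by a \uXXXX escape (or \UXXXXXXXX above the BMP).
escape_non_ascii <- function(x) {
  stopifnot(is.character(x), length(x) == 1L, !is.na(x))
  cps <- utf8ToInt(enc2utf8(x))          # Unicode code points of x
  if (length(cps) == 0L) return("")
  pieces <- ifelse(
    cps < 128L,
    intToUtf8(cps, multiple = TRUE),     # ASCII characters stay as they are
    ifelse(cps > 0xFFFF,
           sprintf("\\U%08x", cps),      # 8 hex digits for large code points
           sprintf("\\u%04x", cps))      # 4 hex digits otherwise
  )
  paste(pieces, collapse = "")
}

# Usage: the escaped text is pure ASCII, so it survives any native encoding,
# and the parser turns it back into the original UTF-8 string.
original <- "p\u0159\u00edklad"
escaped <- escape_non_ascii(original)
restored <- eval(parse(text = paste0('"', escaped, '"')))
```

As noted above, the restored string is still subject to conversion to the
native encoding later, when the program runs.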

Tomas

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

