Re: [Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones

Tomáš Bořil Thu, 11 Apr 2019 01:12:01 -0700

I do not blame anybody and I have huge respect for all the authors of
R. Actually, I like R very much and I would like to thank everyone
who contributes to it. I use R regularly in my work (I moved from Java,
C# and Matlab), I have created the package rPraat for phonetic analyses,
and I think R is a very well designed language that will survive for
decades. I am trying to bring new users (my students at a non-technical
university) to use programming for their everyday problems
(statistics, phonetic analyses, text processing) and they enjoy R. I
am really positive about this (it is hard to express emotions in e-mails
without using emoticons in every sentence). And that is why I would
like to make it even better.


I only suggest adding one line of code (metaphorically) to the source()
function in R for Windows, to make it even better and to warn all users
who do not read the whole documentation of each function thoroughly and
carefully.
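
A minimal sketch of such a check, written as a user-level wrapper rather
than a change to source() itself. The names unrepresentable_lines() and
source_checked() are hypothetical and not part of R. The sketch assumes
that iconv() returns NA for text it cannot convert, instead of applying
a best-fit substitution.

```r
# Hypothetical helper (not part of base R): report which lines of a UTF-8
# file contain characters that cannot be converted to the target encoding.
# to = "" means the native encoding of the current locale.
unrepresentable_lines <- function(file, to = "") {
  lines <- readLines(file, encoding = "UTF-8", warn = FALSE)
  # Assumption: iconv() yields NA for elements it cannot convert
  # (sub = NA is the default) rather than a best-fit replacement.
  converted <- iconv(lines, from = "UTF-8", to = to)
  which(is.na(converted))
}

# Hypothetical wrapper around source(): warn before the file is translated
# to the native encoding, then source it as usual.
source_checked <- function(file, ...) {
  bad <- unrepresentable_lines(file)
  if (length(bad) > 0L) {
    warning("characters not representable in the native encoding on line(s) ",
            paste(bad, collapse = ", "), " of ", file, call. = FALSE)
  }
  source(file, encoding = "UTF-8", ...)
}
```

In a UTF-8 locale the check finds nothing, because every character is
representable, so the wrapper behaves like a plain source() call there.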

Tomas


On Thu, Apr 11, 2019 at 9:54 AM Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
>
> On 4/11/19 9:10 AM, Tomáš Bořil wrote:
> > Or, if this cannot be done easily, please, disable the "utf-8" value
> > in source(..., ) function on Windows R.
> > source(..., encoding = "utf-8")
> > -> error: "utf-8" does not work right on Windows.
> > -> (or, at least) warning: "utf-8" is handled by "best fit" on Windows
> > and some characters in string literals may be automatically changed.
> >
> > Because, as it stands, the UTF-8 encoding of R source files on
> > Windows is fake Unicode: in reality it can handle only the 256
> > characters of the ANSI code page.
>
> This is not a fair statement. source(,encoding="UTF-8") works as
> documented. It translates from (full) UTF-8 to current native encoding,
> which is documented. I believe the authors who made these design
> decisions over a decade ago, under different circumstances, and
> carefully implemented the code, tested, and documented for you to use
> for free, deserve to be addressed with some respect. It is not their
> responsibility to read the documentation for you, and if you had read
> and understood it, you would not have used source(,encoding="UTF-8")
> with characters not representable in current native encoding on Windows.
> The authors should not be blamed because the design _today_ does not
> seem perfect for _today's_ systems (and how could they have guessed at
> that time that Windows would still not support UTF-8 as a native
> encoding today).
>
> Tomas
> > Thanks,
> > Tomas
> >
> >
> > On Thu, Apr 11, 2019 at 8:53 AM Tomáš Bořil <bor...@gmail.com> wrote:
> >> For me, this would be a perfect solution.
> >>
> >> I.e., do not use the “best” fit and leave it to the user’s competence:
> >> a) in some functions, UTF-8 works
> >> b) in others, an error is thrown (e.g., incomplete string, NA, etc.)
> >> => the user has to change the code with his/her intentional “best fit
> >> string literal substitute” or use another function that can handle UTF-8.
> >>
> >> Making R code work correctly only on some platforms, or trying to keep
> >> backward compatibility in the sense of “the code does not do what you
> >> want and the behaviour differs depending on the locale, but at least it
> >> does not throw an error”, is generally not a good idea; it is dangerous.
> >> Users / coders should know that there is something wrong with their
> >> strings and that some characters are “eaten alive”.
> >>
> >> Tomas
> >>
> >> On Thu, 11 Apr 2019 at 8:26, Tomas Kalibera <tomas.kalib...@gmail.com>
> >> wrote:
> >>> On 4/10/19 6:32 PM, Jeroen Ooms wrote:
> >>>> On Wed, Apr 10, 2019 at 5:45 PM Duncan Murdoch 
> >>>> <murdoch.dun...@gmail.com> wrote:
> >>>>> On 10/04/2019 10:29 a.m., Yihui Xie wrote:
> >>>>>> Since it is "technically easy" to disable the best fit conversion and
> >>>>>> the best fit is rarely good, how about providing an option for
> >>>>>> code/package authors to disable it? I'm asking because this is one of
> >>>>>> the most painful issues in packages that may need to source() code
> >>>>>> containing UTF-8 characters that are not representable in the Windows
> >>>>>> native encoding. Examples include knitr/rmarkdown and shiny. Basically
> >>>>>> users won't be able to knit documents or run Shiny apps correctly when
> >>>>>> the code contains characters that cannot be represented in the native
> >>>>>> encoding.
> >>>>> Wouldn't things be worse with it disabled than currently?  I'd expect
> >>>>> the line containing the "ř" to end up as NA instead of converting to 
> >>>>> "r".
> >>>> I don't think it would be worse, because in this case R would not
> >>>> implicitly convert strings to (best fit) latin1 on Windows, but
> >>>> instead keep the (correct) string in its UTF-8 encoding. The NA only
> >>>> appears if the user explicitly forces a conversion to latin1, which is
> >>>> not the problem here I think.
> >>>>
> >>>> The original problem that I can reproduce in RGui is that if you enter
> >>>>    "ř" in RGui, R opportunistically converts this to latin1, because it
> >>>> can. However if you enter text which can definitely not be represented
> >>>> in latin1, R encodes the string correctly in UTF-8 form.
> >>> Rgui is a "Windows Unicode" application (uses UTF-16LE) but it needs to
> >>> convert the input to native encoding before passing it to R, which is
> >>> based on locales. However, that string is passed by R to the parser,
> >>> which Rgui takes advantage of and converts non-representable characters
> >>> to their \uxxxx escapes which are understood by the parser. Using this
> >>> trick, Unicode characters can get to the parser from Rgui (but of course
> >>> they are then still at risk of conversion later when the program runs).
> >>> Rgui only escapes characters that cannot be represented; unfortunately,
> >>> the standard C99 API for that, as implemented on Windows, does the best
> >>> fit. This could be fixed in Rgui by calling a special Windows API
> >>> function, but with the mentioned risk that it would break existing
> >>> uses that rely on the existing behavior.
> >>>
> >>> This is the only place I know of where removing best fit would lead to
> >>> correct representation of UTF-8 characters. In other places one will
> >>> get NA or some other escapes, or the code will fail to parse (e.g.
> >>> "incomplete string", one can get that easily with source()).
> >>>
> >>> Tomas
> >>>
> >>> ______________________________________________
> >>> R-devel@r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>
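
The \uxxxx escape trick described in the quoted explanation of Rgui can
be illustrated at the R level. This is only a sketch: escape_non_ascii()
is a hypothetical helper, it escapes every non-ASCII character (Rgui,
per the message above, escapes only the non-representable ones), and it
is not the code Rgui actually uses.

```r
# Hypothetical helper: replace every non-ASCII character with a \uxxxx
# (or \Uxxxxxxxx) escape, so the resulting text is pure ASCII and survives
# any conversion to a native encoding unchanged.
escape_non_ascii <- function(x) {
  escape_one <- function(s) {
    cp <- utf8ToInt(s)                     # Unicode code points of s
    pieces <- ifelse(
      cp < 128L,
      intToUtf8(cp, multiple = TRUE),      # ASCII characters stay as they are
      ifelse(cp <= 0xFFFF,
             sprintf("\\u%04x", cp),       # 4-digit escape
             sprintf("\\U%08x", cp))       # 8-digit escape beyond U+FFFF
    )
    paste(pieces, collapse = "")
  }
  vapply(enc2utf8(x), escape_one, character(1), USE.NAMES = FALSE)
}

# The escaped text is ASCII, and the parser turns the escape inside the
# string literal back into the original character.
code <- escape_non_ascii('x <- "\u0159"')
eval(parse(text = code))
```

The escapes are understood by the parser only inside string literals, so
this approach does not help with non-ASCII characters in identifiers.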

