Yes, again in a script sourced by source(encoding = ...). But it also
happens when typing the code directly into the R console.

Most of the time, I use RStudio as a front-end. For this experiment, I
also verified it in Rgui. Both front-ends behave in exactly the same
way.

An optional parameter to the source() function which would translate all
UTF-8 characters in string literals to their "\Uxxxx" escapes sounds
like a great idea (and I hope it would fix 99.9% of the problems I have,
because that is how I work around them nowadays) - and the same
behaviour on the command line would be welcome...
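
For illustration, here is a minimal sketch of that workaround as a
helper function (the name source_utf8() is hypothetical, not an existing
R API; escaping every non-ASCII character like this is only safe as long
as such characters occur in string literals and comments, not in
identifiers):

    source_utf8 <- function(file, ...) {
      lines <- readLines(file, encoding = "UTF-8")
      escape_line <- function(line) {
        cps <- utf8ToInt(enc2utf8(line))
        if (length(cps) == 0L) return("")
        # replace every code point above ASCII with its \U{xxxx} escape
        paste(ifelse(cps > 127L,
                     sprintf("\\U{%X}", cps),
                     intToUtf8(cps, multiple = TRUE)),
              collapse = "")
      }
      tmp <- tempfile(fileext = ".R")
      writeLines(vapply(lines, escape_line, character(1)), tmp)
      source(tmp, ...)
    }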

Tomas

> What do you mean by "it is converted before"? In what context? Again in
> a script sourced by source(encoding=)?
>
> And are you using Rgui as the front-end?

>>   The only problem is that I
>> cannot simply use enc2utf8("œ") - it is converted to "o" before
>> executing the function. Instead, I have to explicitly type
>> "\U00159" throughout my code.
On Wed, Apr 10, 2019 at 5:29 PM Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
>
> On 4/10/19 3:02 PM, Tomáš Bořil wrote:
> > The thing is, I would rather prefer R (on those rare occasions where an
> > old function does not support anything but ANSI encoding) to throw an
> > error:
> > "Unicode encoding not supported, please change the string in your
> > code" instead of silently converting some characters to different ones
> > without any warning.
> In principle it could probably be made optional, as Yihui Xie asks on
> R-devel; we will discuss that internally. If the Windows "best fit" is a
> big problem on its own, this is something that could be done quickly, if
> optional. Indeed, we could turn into errors only those conversions that
> we have control of (inside R code), but that should be most of them.
> > I understand that there are some functions which are not
> > Unicode-compatible yet, but according to the Stack Overflow discussion
> > I cited before, in many cases (90% or more?) everything works right with
> > Encoding("\U00159") == "UTF-8" (in my scripts, I have not found any
> > problem with explicit UTF-8 escapes yet).
>
> Well, there has been a lot of effort invested to make that possible, so
> that many internal string functions do not unnecessarily convert out of
> UTF-8, mostly by Duncan Murdoch, but much more needs to be done and
> there is the problem with packages. Of course, if you find a concrete R
> function that unnecessarily converts (other than source(), which is
> debatable and which I already know about), you are welcome to report it,
> and I or someone else can fix it. A common problem is I/O (connections),
> and there the fix won't be easy; it would have to be redesigned. The
> problem is that when we have something typed "char *" inside R, it
> always needs to be in the native encoding; any mix would lead to total
> chaos.
>
> The full solution, however, would be to switch fully to UTF-8
> internally on Windows (and then char * would always mean UTF-8). We
> have discussed this many times inside R Core (and many times before I
> joined); I am sure it will be discussed again at some point, and we are
> of course aware of the problem. Please trust us that it is hard to do -
> we know the code, as we (collectively) have written it. People
> contributing to SO are users and package developers, not developers of
> the core. You can get more accurate information from people on R-devel
> (package developers and sometimes core developers).
>
> >   The only problem is that I
> > cannot simply use enc2utf8("œ") - it is converted to "o" before
> > executing the function. Instead, I have to explicitly type
> > "\U00159" throughout my code.
>
> What do you mean by "it is converted before"? In what context? Again in
> a script sourced by source(encoding=)?
>
> And are you using Rgui as the front-end?
>
> > In my lectures, I have Czech, Russian and English students, and it is
> > also impossible to create a script that works for everyone. In fact, I
> > know that Czech "ř" can be translated to my native (Czech) encoding. I
> > have just chosen the example because it is reproducible in an English
> > locale.
>
>
> > Originally, I had a problem with the IPA character (phonetic symbol)
> > "œ", i.e. "\U00153". In a Czech locale, it is translated to "o". In an
> > English locale, it is not converted - it remains "œ". But if I use
> > "\U00153" in a Czech locale, nothing is converted and everything works
> > right.
>
> Yes, I hear that \u* escape sequences are commonly used to represent
> UTF-8 string literals in source code that is not UTF-8 itself. Note
> that if you have a package, you can have R source files with
> UTF-8-encoded string literals if you declare Encoding: UTF-8 in the
> DESCRIPTION file (see Writing R Extensions for details), even though
> people sometimes run into trouble/bugs there as well.
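>
> For example (as documented in Writing R Extensions), the DESCRIPTION
> file would carry the declaration
>
>      Encoding: UTF-8
>
> and the package's R source files can then contain UTF-8 string literals.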
>
> You probably know that none of these problems exist on Linux or macOS,
> where UTF-8 is the native encoding.
>
> Tomas
>
> >
> > Tomas
> >
> >
> >
> > On Wed, Apr 10, 2019 at 2:37 PM Tomas Kalibera <tomas.kalib...@gmail.com> 
> > wrote:
> >> On 4/10/19 2:06 PM, Tomáš Bořil wrote:
> >>
> >> Thank you for the explanation, but I just do not understand one thing -
> >> why would R need to be rewritten from scratch to work with Unicode
> >> internally?
> >>
> >> If I call the script with
> >> eval(parse("script.R", encoding = "UTF-8"))
> >> it works perfectly - it looks like R functions already support Unicode. 
> >> When I type "\U00159", R also has no problem with that.
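> >>
> >> That is, comparing the two ways of reading the same UTF-8-saved script
> >> on Windows (a sketch; "script.R" is a placeholder file name):
> >>
> >>     source("script.R", encoding = "UTF-8")       # literals may get best-fitted
> >>     eval(parse("script.R", encoding = "UTF-8"))  # literals stay flagged as UTF-8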
> >>
> >> Well, there is support for Unicode, but the problem is that at some
> >> point a translation to the native encoding is needed. The parser does
> >> not do that, and nothing you call in your example script does it, but
> >> many other functions do. Note that you can use UTF-8 without problems
> >> as long as you only have characters that can also be represented in
> >> the current native encoding. So, if you run in a Czech locale, Czech
> >> characters in UTF-8 will work fine; they will just sometimes be
> >> translated to the corresponding Czech characters in your native
> >> encoding.
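> >>
> >> For instance (a sketch assuming a Windows session in a Czech CP1250
> >> locale; exact results depend on the locale and the iconv
> >> implementation):
> >>
> >>     iconv("\u0159", "UTF-8", "CP1250")  # "ř" exists in CP1250: converted cleanly
> >>     iconv("\u0153", "UTF-8", "CP1250")  # "œ" does not: Windows best-fits it to "o"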
> >>
> >> If you want to learn more about encodings in R, look at ?Encoding,
> >> Writing R Extensions, etc. In principle, every R object representing a
> >> string has a flag saying whether the string is in UTF-8, in latin1, or
> >> in the current native encoding. But C structures typed "char *" are
> >> almost always in the current native encoding; any mixture would lead
> >> to chaos. Most functions operating on strings have to specially handle
> >> UTF-8, MBCS encodings, ASCII, etc. All of that would have to be
> >> rewritten. Many Windows API calls still use the native-encoding
> >> version (some can use UTF-16LE via conversion from UTF-8 or other
> >> encodings).
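> >>
> >> A small illustration of those per-string flags (a sketch; the result
> >> of the native conversion depends on your locale):
> >>
> >>     x <- "\u0159"
> >>     Encoding(x)         # "UTF-8": \u escapes produce UTF-8-flagged strings
> >>     y <- enc2native(x)  # translate to the current native encoding; on Windows
> >>     Encoding(y)         # this is where a best-fit substitution can occur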
> >>
> >> In principle, it should work to have UTF-8-encoded string constants in
> >> R programs, and definitely so if you use \uxxxx escapes (see Writing R
> >> Extensions for details). But you should always run in a native
> >> encoding where these characters can be represented; otherwise it may
> >> or may not work, depending on which functions you call.
> >>
> >> Tomas
> >>
> >>
> >> Thanks,
> >> Tomas
> >>
> >> On Wed, Apr 10, 2019 at 1:52 PM Tomas Kalibera
> >> <tomas.kalib...@gmail.com> wrote:
> >>> On 4/10/19 1:35 PM, Tomáš Bořil wrote:
> >>>> Which users make their code depend on an automatic conversion that
> >>>> behaves differently in each European country, and only on Windows?
> >>> I meant the "best fit". The same R scripts on the same data sets
> >>> would return different results; people capture existing behavior
> >>> without necessarily knowing about it. Removing the "best fit" would
> >>> not remove the translation to the native encoding; you would get NA
> >>> or some escape sequence/character code number instead of the "best
> >>> fit" character. It would not solve the problem.
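> >>>
> >>> Roughly the difference between these two iconv() calls (see ?iconv;
> >>> "//TRANSLIT" support and its results are platform-dependent):
> >>>
> >>>     iconv("\u0159", "UTF-8", "ASCII")            # NA: no substitute character
> >>>     iconv("\u0159", "UTF-8", "ASCII//TRANSLIT")  # "r" where transliteration works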
> >>>
> >>> The real problem is that the conversion to the native encoding
> >>> happens at all. This question has been discussed many times before,
> >>> but in short, it would probably take many thousands of hours of
> >>> developer time to rewrite R to use UTF-8 internally and convert to
> >>> UTF-16LE in all Windows API calls. It would cause changes to
> >>> documented behavior. What may not be obvious is that there is a
> >>> problem with package code written in C/C++ that ignores encoding
> >>> flags (that is almost all native code in packages). That code would
> >>> stop working, and there would be no way to test for it, because the
> >>> input data in the contributed examples/tests are ASCII.
> >>>
> >>> If Windows starts supporting UTF-8 as the native encoding, the fix
> >>> will be a lot easier (I hope ~100 hours) and without the
> >>> compatibility problems - only users who wish to use UTF-8 as the
> >>> native encoding will be affected, and things will probably work for
> >>> them even with poorly written packages.
> >>>
> >>> Tomas
> >>>
> >>>
> >>>> If someone needs the explicit conversion, they can call the iconv()
> >>>> function.
> >>>>
> >>>> Many more people who use R for text processing are frustrated that
> >>>> they can code only in their 8-bit native code page (0-255), even
> >>>> though their code is saved in Unicode.
> >>>>
> >>>> Tomas
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Apr 10, 2019 at 1:26 PM Tomas Kalibera 
> >>>> <tomas.kalib...@gmail.com> wrote:
> >>>>> On 4/10/19 1:14 PM, Jeroen Ooms wrote:
> >>>>>> On Wed, Apr 10, 2019 at 12:19 PM Tomáš Bořil <bor...@gmail.com> wrote:
> >>>>>>> Minimalistic example:
> >>>>>>> Let's type "ř" (LATIN SMALL LETTER R WITH CARON) in RGui console:
> >>>>>>>> "ř"
> >>>>>>> [1] "r"
> >>>>>>>
> >>>>>>> Although the script is in UTF-8, the characters are replaced by
> >>>>>>> "simplified" substitutes uncontrollably (depending on the OS
> >>>>>>> locale). The same goes for simply entering the statements in the
> >>>>>>> R console.
> >>>>>>>
> >>>>>>> The problem does not occur on OSes with a UTF-8 locale (macOS, Linux, ...)
> >>>>>> I think this is a "feature" of win_iconv, which is bundled with
> >>>>>> base R on Windows (./src/extra/win_iconv). The character from your
> >>>>>> example is not part of the latin1 (iso-8859-1) set; however,
> >>>>>> win_iconv seems to convert it anyway:
> >>>>>>
> >>>>>>> x <- "\U00159"
> >>>>>>> print(x)
> >>>>>> [1] "ř"
> >>>>>>> iconv(x, 'UTF-8', 'iso-8859-1')
> >>>>>> [1] "r"
> >>>>>>
> >>>>>> On MacOS, iconv tells us this character cannot be represented as 
> >>>>>> latin1:
> >>>>>>
> >>>>>>> x <- "\U00159"
> >>>>>>> print(x)
> >>>>>> [1] "ř"
> >>>>>>> iconv(x, 'UTF-8', 'iso-8859-1')
> >>>>>> [1] NA
> >>>>>>
> >>>>>> I'm actually not sure why base-R needs win_iconv (but I'm not an
> >>>>>> encoding expert at all). Perhaps we could try to unbundle it and use
> >>>>>> the standard libiconv provided by the Rtools toolchain bundle to get
> >>>>>> more consistent results.
> >>>>> win_iconv just calls into the Windows API to do the conversion, so
> >>>>> it is technically easy to disable the "best fit" conversion, but I
> >>>>> don't think it would be a good idea. In some cases, perhaps rare,
> >>>>> the best fit is good, including the conversion from "ř" to "r",
> >>>>> which makes perfect sense. But more importantly, changing the
> >>>>> behavior could affect users who expect the substitution to happen
> >>>>> because it has been happening for many years, and it wouldn't help
> >>>>> others much.
> >>>>>
> >>>>> Tomas
> >>>>>

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
