Yes, again in a script sourced by source(encoding = ...). But also when
typing it directly in the R console.

Most of the time, I use RStudio as a front-end. For this experiment, I
also verified it in Rgui. Both front-ends behave in exactly the same way.

An optional parameter to the source() function that would translate all
UTF-8 characters in string literals to their "\Uxxxx" codes sounds like a
great idea (and I hope it would fix 99.9% of the problems I have, because
that is how I work around these problems nowadays), as would the same
behaviour on the command line...

Tomas

> What do you mean it is "converted before"? Under what context? Again a
> script sourced by source(encoding=) ?
>
> And, are you using Rgui as front-end?

>> The only problem is that I
>> cannot simply use enc2utf8("œ") - it is converted to "o" before
>> executing the function. Instead of that, I have to explicitly type
>> "\U00153" throughout my code.

On Wed, Apr 10, 2019 at 5:29 PM Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
>
> On 4/10/19 3:02 PM, Tomáš Bořil wrote:
> > The thing is, I would rather prefer R (on those rare occasions where an
> > old function does not support anything but ANSI encoding) throwing an
> > error:
> > "Unicode encoding not supported, please change the string in your
> > code" instead of silently converting some characters to different ones
> > without any warning.
> In principle it probably could be optional as Yihui Xie asks on R-devel,
> we will discuss that internally. If the Windows "best fit" is a big
> problem on its own, this is something that could be done quickly, if
> optional. We could turn into errors only the conversions that we have
> control of (inside R code), indeed, but that should be most of them.
> > I understand that there are some functions which are not
> > Unicode-compatible yet but according to the Stack Overflow discussion I
> > cited before, in many cases (90% or more?) everything works right with
> > Encoding("\U00159") == "UTF-8" (in my scripts, I have not found any
> > problem with explicit UTF-8 coding yet).
>
> Well there has been a lot of effort invested to make that possible, so
> that many internal string functions do not convert unnecessarily into
> UTF-8, mostly by Duncan Murdoch, but much more needs to be done and
> there is the problem with packages. Of course if you find a concrete R
> function that unnecessarily converts (source() is debatable, I know
> about it, so some other), you are welcome to report it, I or someone
> else can fix it. A common problem is I/O (connections) and there the fix
> won't be easy, it would have to be re-designed. The problem is that when
> we have something typed "char *" inside R, it needs to be always in
> native encoding, any mix would lead to total chaos.
>
> The full solution would however only be fully switching to UTF-8
> internally on Windows (and then char * would always mean UTF-8), we have
> discussed this many times inside R Core (and many times before I
> joined), I am sure it will be discussed again at some point and we are
> of course aware of the problem. Please trust us it is hard to do - we
> know the code as we (collectively) have written it. People contributing
> to SO are users and package developers, not developers of the core. You
> can get more correct information from people on R-devel (package
> developers and sometimes core developers).
>
> > The only problem is that I
> > cannot simply use enc2utf8("œ") - it is converted to "o" before
> > executing the function. Instead of that, I have to explicitly type
> > "\U00153" throughout my code.
>
> What do you mean it is "converted before"? Under what context? Again a
> script sourced by source(encoding=) ?
>
> And, are you using Rgui as front-end?
>
> > In my lectures, I have Czech, Russian and English students and it is
> > also impossible to create a script that works for everyone. In fact, I
> > know that Czech "ř" can be translated to my native (Czech) encoding. I
> > have just chosen the example as it is reproducible in English locale.
> >
> > Originally, I had a problem with an IPA character (phonetic symbol) "œ",
> > i.e. "\U00153". In Czech locale, it is translated to "o". In English,
> > it is not converted - it remains "œ". But if I use "\U00153" in Czech
> > locale, nothing is converted and everything works right.
>
> Yes, the \u* sequence I hear is commonly used to represent UTF-8 string
> literals in something that is not UTF-8 itself. Note if you have a
> package, you can have R source files with UTF-8 encoded literal strings
> if you declare Encoding: UTF-8 in the DESCRIPTION file (see Writing R
> Extensions for details), even though sometimes people run into
> trouble/bugs as well.
>
> You probably know none of these problems exist on Linux or macOS, where
> UTF-8 is the native encoding.
>
> Tomas
>
> > Tomas
> >
> >
> > On Wed, Apr 10, 2019 at 2:37 PM Tomas Kalibera <tomas.kalib...@gmail.com>
> > wrote:
> >> On 4/10/19 2:06 PM, Tomáš Bořil wrote:
> >>
> >> Thank you for the explanation but I just do not understand one thing -
> >> why would R need to be recreated from scratch to work with Unicode
> >> internally?
> >>
> >> If I call the script with
> >> eval(parse("script.R", encoding = "UTF-8"))
> >> it works perfectly - it looks like R functions already support Unicode.
> >> When I type "\U00159", R also has no problem with that.
> >>
> >> Well there is support for Unicode, but the problem is that at some point
> >> translation to native encoding is needed. The parser does not do that,
> >> nothing you call in your example script does it, but many other functions
> >> do. Note that you can use UTF-8 without problems as long as you only have
> >> characters that can be represented also in the current native encoding.
> >> So, if you run in a Czech locale, Czech characters in UTF-8 will work
> >> fine, just they will sometimes be translated to corresponding Czech
> >> characters in your native encoding.
> >>
> >> If you want to learn more about encodings in R, look at ?Encoding, Writing
> >> R Extensions, etc. In principle, every R object representing a string has
> >> a flag whether the string is in UTF-8, in latin1, or in current native
> >> encoding. But C structures typed "char *" almost always are in current
> >> native encoding, any mixture would lead to chaos. Most functions operating
> >> on strings have to specially handle UTF-8, MBCS encodings, ASCII, etc. All
> >> of that would have to be rewritten. Many Windows API calls are still using
> >> the native encoding version (some can use UTF-16LE via conversion from
> >> UTF-8 or other encodings).
> >>
> >> In principle, it should work to have UTF-8 coded string constants in R
> >> programs, and definitely so if you use \uxxxx (see Writing R Extensions
> >> for details). But you should always run in a native encoding where these
> >> characters can be represented, otherwise it may or may not work, depending
> >> on which functions you call.
> >>
> >> Tomas
> >>
> >>
> >> Thanks,
> >> Tomas
> >>
> >> On Wed, Apr 10, 2019 at 13:52, Tomas Kalibera
> >> <tomas.kalib...@gmail.com> wrote:
> >>> On 4/10/19 1:35 PM, Tomáš Bořil wrote:
> >>>> Which users make their code depend on an automatic conversion which
> >>>> behaves differently in each European country, but only on Windows?
> >>> I meant the "best fit". The same R scripts for the same data sets would
> >>> be returning different results, people capture existing behavior without
> >>> necessarily knowing about it. Removing the "best fit" would not remove
> >>> the translation to native encoding, you would get NA or some escape
> >>> sequence/character code number instead of the "best fit" character. It
> >>> would not solve the problem.
> >>>
> >>> The real problem is that the conversion to native encoding happens. This
> >>> question has been discussed many times before, but in short, it would
> >>> take probably many 1000s of hours of developer time to rewrite R to use
> >>> UTF-8 internally, but convert to UTF-16LE in all Windows API calls. It
> >>> will cause changes to documented behavior. What may not be obvious,
> >>> there is a problem with package code written in C/C++ that ignores
> >>> encoding flags (that is almost all native code in packages). That code
> >>> will stop working and there will be no way to test - because the input
> >>> data in the contributed examples/tests are ASCII.
> >>>
> >>> If Windows starts supporting UTF-8 as native encoding, the fix will be a
> >>> lot easier (I hope ~100 hours), and without the compatibility problems -
> >>> just users who would wish to use UTF-8 as native encoding will be
> >>> affected, and things will probably work for them even with poorly
> >>> written packages.
> >>>
> >>> Tomas
> >>>
> >>>
> >>>> If someone needs the explicit conversion, he can call the iconv()
> >>>> function.
> >>>>
> >>>> Many more people using R for text processing are frustrated they can
> >>>> code only in ASCII (0-255), even though their code is saved in
> >>>> Unicode.
> >>>>
> >>>> Tomas
> >>>>
> >>>>
> >>>> On Wed, Apr 10, 2019 at 1:26 PM Tomas Kalibera
> >>>> <tomas.kalib...@gmail.com> wrote:
> >>>>> On 4/10/19 1:14 PM, Jeroen Ooms wrote:
> >>>>>> On Wed, Apr 10, 2019 at 12:19 PM Tomáš Bořil <bor...@gmail.com> wrote:
> >>>>>>> Minimalistic example:
> >>>>>>> Let's type "ř" (LATIN SMALL LETTER R WITH CARON) in RGui console:
> >>>>>>> > "ř"
> >>>>>>> [1] "r"
> >>>>>>>
> >>>>>>> Although the script is in UTF-8, the characters are replaced by
> >>>>>>> "simplified" substitutes uncontrollably (depending on OS locale). The
> >>>>>>> same goes for simply entering the code statements in the R console.
> >>>>>>>
> >>>>>>> The problem does not occur on OS with UTF-8 locale (Mac OS, Linux...)
> >>>>>> I think this is a "feature" of win_iconv that is bundled with base R
> >>>>>> on Windows (./src/extra/win_iconv). The character from your example is
> >>>>>> not part of the latin1 (iso-8859-1) set, however, win_iconv seems to
> >>>>>> convert it anyway:
> >>>>>>
> >>>>>> > x <- "\U00159"
> >>>>>> > print(x)
> >>>>>> [1] "ř"
> >>>>>> > iconv(x, 'UTF-8', 'iso-8859-1')
> >>>>>> [1] "r"
> >>>>>>
> >>>>>> On macOS, iconv tells us this character cannot be represented as
> >>>>>> latin1:
> >>>>>>
> >>>>>> > x <- "\U00159"
> >>>>>> > print(x)
> >>>>>> [1] "ř"
> >>>>>> > iconv(x, 'UTF-8', 'iso-8859-1')
> >>>>>> [1] NA
> >>>>>>
> >>>>>> I'm actually not sure why base-R needs win_iconv (but I'm not an
> >>>>>> encoding expert at all). Perhaps we could try to unbundle it and use
> >>>>>> the standard libiconv provided by the Rtools toolchain bundle to get
> >>>>>> more consistent results.
> >>>>> win_iconv just calls into the Windows API to do the conversion, it is
> >>>>> technically easy to disable the "best fit" conversion, but I think it
> >>>>> won't be a good idea. In some cases, perhaps rare, the best fit is good,
> >>>>> actually including the conversion from "ř" to "r" which makes perfect
> >>>>> sense. But more importantly, changing the behavior could affect users
> >>>>> who expect the substitution to happen because it has been happening for
> >>>>> many years, and it won't help others much.
> >>>>>
> >>>>> Tomas
> >>>>>
> >>>>>> ______________________________________________
> >>>>>> R-devel@r-project.org mailing list
> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
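
For reference, the workaround mentioned at the top of this thread (writing
non-ASCII characters in string literals as "\uxxxx" escapes) could be
automated along these lines. This is only a minimal sketch:
escape_non_ascii() is a hypothetical helper name, not an existing R
function, and it assumes a single non-NA string as input. It uses only
base R (enc2utf8(), utf8ToInt(), intToUtf8(), sprintf()).

```r
# Hypothetical helper (not part of base R): rewrite every non-ASCII
# character of a single string as an R Unicode escape, so that the
# result can be pasted into a string literal in an ASCII-only script.
# ASCII characters (including quotes and backslashes) are left untouched.
escape_non_ascii <- function(x) {
  stopifnot(is.character(x), length(x) == 1L, !is.na(x))
  code_points <- utf8ToInt(enc2utf8(x))  # integer Unicode code points
  pieces <- vapply(code_points, function(cp) {
    if (cp < 128L) {
      intToUtf8(cp)                      # plain ASCII: keep as is
    } else if (cp <= 0xFFFFL) {
      sprintf("\\u%04x", cp)             # up to U+FFFF: \uxxxx
    } else {
      sprintf("\\U%08x", cp)             # above U+FFFF: \Uxxxxxxxx
    }
  }, character(1))
  paste(pieces, collapse = "")
}

# The input is itself written with escapes so this file stays ASCII-only.
escaped <- escape_non_ascii("\u0159e\u0153")
cat(escaped, "\n")   # prints: \u0159e\u0153

# Parsing the escaped literal gives back the original code points.
restored <- eval(parse(text = paste0('"', escaped, '"')))
stopifnot(identical(utf8ToInt(enc2utf8(restored)), c(345L, 101L, 339L)))
```

The escapes are zero-padded to the maximum width (4 digits for \u, 8 for
\U) on purpose: otherwise a following character that happens to be a hex
digit, such as the "e" in the example, could be read as part of the escape.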