I do not blame anybody, and I have huge respect for all the authors of R. Actually, I like R very much and I would like to thank everyone who contributes to it. I use R regularly in my work (having moved from Java, C# and Matlab), I have created the package rPraat for phonetic analyses, and I think R is a very well designed language which will survive for decades. I am trying to get new users (my students at a non-technical university) to use programming for their everyday problems (statistics, phonetic analyses, text processing), and they enjoy R. I am really positive about this (it is hard to express emotions in e-mails without using emoticons in every sentence). And that is why I would like it to be even better.

I only suggest adding one line of code (metaphorically) to the source() function in R for Windows, to make it even better and to warn all users who do not read the whole documentation for each function thoroughly and carefully.

Tomas

On Thu, Apr 11, 2019 at 9:54 AM Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
>
> On 4/11/19 9:10 AM, Tomáš Bořil wrote:
> > Or, if this cannot be done easily, please, disable the "utf-8" value
> > in source(..., ) function on Windows R.
> > source(..., encoding = "utf-8")
> > -> error: "utf-8" does not work right on Windows.
> > -> (or, at least) warning: "utf-8" is handled by "best fit" on Windows
> > and some characters in string literals may be automatically changed.
> >
> > Because, at this state, the UTF-8 encoding of R source files on
> > Windows is a fake Unicode as it can handle only 256 different ANSI
> > characters in reality.
>
> This is not a fair statement. source(,encoding="UTF-8") works as
> documented. It translates from (full) UTF-8 to current native encoding,
> which is documented. I believe the authors who made these design
> decisions over a decade ago, under different circumstances, and
> carefully implemented the code, tested, and documented for you to use
> for free, deserve to be addressed with some respect. It is not their
> responsibility to read the documentation for you, and if you had read
> and understood it, you would not have used source(,encoding="UTF-8")
> with characters not representable in current native encoding on Windows.
> The authors should not be blamed for that the design _today_ does not
> seem perfect for _todays_ systems (and how could they have guessed at
> that time Windows will still not support UTF-8 as native encoding today).
>
> Tomas
> > Thanks,
> > Tomas
> >
> >
> > On Thu, Apr 11, 2019 at 8:53 AM Tomáš Bořil <bor...@gmail.com> wrote:
> >> For me, this would be a perfect solution.
> >>
> >> I.e., do not use the “best” fit and leave it to user’s competence:
> >> a) in some functions, utf-8 works
> >> b) in others -> error is thrown (e.g., incomplete string, NA, etc.)
> >> => user has to change the code with his/her intentional “best fit string
> >> literal substitute” or use another function that can handle utf-8.
> >>
> >> Making R code work right only on some platforms / trying to keep a
> >> back-compatibility meaning “the code does not do what you want and the
> >> behaviour differs depending on each and every locale but at least, it does not
> >> throw an error” is generally not a good idea - it is dangerous. Users /
> >> coders should know that there is something wrong with their strings and
> >> some characters are “eaten alive”.
> >>
> >> Tomas
> >>
> >> On Thu, Apr 11, 2019 at 8:26 AM Tomas Kalibera <tomas.kalib...@gmail.com>
> >> wrote:
> >>> On 4/10/19 6:32 PM, Jeroen Ooms wrote:
> >>>> On Wed, Apr 10, 2019 at 5:45 PM Duncan Murdoch
> >>>> <murdoch.dun...@gmail.com> wrote:
> >>>>> On 10/04/2019 10:29 a.m., Yihui Xie wrote:
> >>>>>> Since it is "technically easy" to disable the best fit conversion and
> >>>>>> the best fit is rarely good, how about providing an option for
> >>>>>> code/package authors to disable it? I'm asking because this is one of
> >>>>>> the most painful issues in packages that may need to source() code
> >>>>>> containing UTF-8 characters that are not representable in the Windows
> >>>>>> native encoding. Examples include knitr/rmarkdown and shiny. Basically
> >>>>>> users won't be able to knit documents or run Shiny apps correctly when
> >>>>>> the code contains characters that cannot be represented in the native
> >>>>>> encoding.
> >>>>> Wouldn't things be worse with it disabled than currently? I'd expect
> >>>>> the line containing the "ř" to end up as NA instead of converting to
> >>>>> "r".
> >>>> I don't think it would be worse, because in this case R would not
> >>>> implicitly convert strings to (best fit) latin1 on Windows, but
> >>>> instead keep the (correct) string in its UTF-8 encoding. The NA only
> >>>> appears if the user explicitly forces a conversion to latin1, which is
> >>>> not the problem here I think.
> >>>>
> >>>> The original problem that I can reproduce in RGui is that if you enter
> >>>> "ř" in RGui, R opportunistically converts this to latin1, because it
> >>>> can. However if you enter text which can definitely not be represented
> >>>> in latin1, R encodes the string correctly in UTF-8 form.
> >>> Rgui is a "Windows Unicode" application (uses UTF16-LE) but it needs to
> >>> convert the input to native encoding before passing it to R, which is
> >>> based on locales. However, that string is passed by R to the parser,
> >>> which Rgui takes advantage of and converts non-representable characters
> >>> to their \uxxxx escapes which are understood by the parser. Using this
> >>> trick, Unicode characters can get to the parser from Rgui (but of course
> >>> then still in risk of conversion later when the program runs). Rgui only
> >>> escapes characters that cannot be represented, unfortunately, the
> >>> standard C99 API for that implemented on Windows does the best fit. This
> >>> could be fixed in Rgui by calling a special Windows API function and
> >>> could be done, but with the mentioned risk that it would break existing
> >>> uses that capture the existing behavior.
> >>>
> >>> This is the only place I know of where removing best fit would lead to
> >>> correct representation of UTF-8 characters. Other places will give NA,
> >>> some other escapes, code will fail to parse (e.g. "incomplete string",
> >>> one can get that easily with source()).
> >>>
> >>> Tomas
> >>>
> >>> ______________________________________________
> >>> R-devel@r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-devel
> >

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
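
For readers who want to check a script before calling source(..., encoding = "UTF-8"), the thread's point about characters that are not representable in the target encoding can be sketched in a few lines of R. This is only an illustration, not something proposed in the thread: find_unrepresentable() is a hypothetical helper, "latin1" stands in for whatever the native encoding happens to be, and the sketch assumes the platform's iconv() returns NA for unconvertible input (its documented behaviour with the default sub = NA) rather than substituting a look-alike character.

```r
# Hypothetical helper, not part of base R: return the numbers of the lines
# in a UTF-8 file that cannot be converted to the target encoding.
find_unrepresentable <- function(path, to = "latin1") {
  # Read the file and mark the strings as UTF-8 without re-encoding them.
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  # With the default sub = NA, iconv() is documented to return NA for
  # elements it cannot convert (assumption: no silent transliteration).
  converted <- iconv(lines, from = "UTF-8", to = to)
  which(is.na(converted))
}

# Usage sketch: a two-line script whose second line contains "ř" (U+0159).
script <- tempfile(fileext = ".R")
writeLines(c('x <- "abc"', 'y <- "\u0159"'), script, useBytes = TRUE)
bad <- find_unrepresentable(script)
print(bad)
```

A non-empty result means that at least those lines contain characters that a conversion to the target encoding would have to drop or change, which is the situation the requested warning is meant to flag.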